1
One of the minor peculiarities of writing about systems software in the present era is the Tour of the Sandbox. One reads a paper, reads some source, builds a small toy of one’s own, runs a few benchmarks — and then, improbably, writes down one’s impressions as if one had in fact been somewhere. I mention this at the outset because the reader deserves fair warning. What follows is a dispatch from such a tour, conducted over the past few weeks in and around gVisor, the userspace kernel Google has deployed to make the running of untrusted code a less dangerous proposition than it had been. I have not been to Mountain View, and nobody in particular invited me. The only credential I possess is that I have, in an earlier post or two, built a small and imperfect mini-sentry of my own, which is to gVisor approximately what a scale model of a Dreadnought is to a Dreadnought. My tour was conducted with a notebook, a Docker daemon, a bench harness of my own devising, and the disposition of a man who hopes to be useful to whoever comes up the road behind him.
What I bring back is, first, a collection of specimens — numbers, mostly — and, second, a set of impressions that the numbers by themselves do not entirely explain. I am going to attempt both, and I am going to try hard not to let the second tyrannize the first.
2
It is difficult, after a few weeks, to remember that when I first set out to understand gVisor I believed it was another container runtime. It is not. A container runtime, in the ordinary sense, is a tool for using the Linux kernel’s isolation primitives — namespaces, cgroups, seccomp — to persuade several processes to ignore each other’s existence. gVisor does something altogether different, and more audacious. gVisor pretends to be Linux. It sits in userspace, as a binary written in Go, and when a sandboxed process calls read or openat or socket, gVisor catches the call before the real kernel learns of it, does whatever the call is supposed to do from within its own quiet reimplementation of the relevant subsystem, and hands the result back. The sandboxed process lives out its whole life in the earnest belief that it is in conversation with Linux. It is, in fact, in conversation with something on the order of two hundred thousand lines of Go.
That is, one is obliged to admit, a different kind of system software. Namespaces and seccomp, however cleverly composed, in the end all let the guest’s syscalls land on the host kernel, and rely on the kernel to be both correct and trustworthy. gVisor declines to take that bet. The host kernel, in gVisor’s conception of things, is precisely the piece whose correctness one does not wish to have to depend upon. The insolence of the idea grows on one the longer one thinks about it.
3
A rough accounting may be helpful. Linux exposes something on the order of four hundred system calls — the exact figure depends upon how one counts the many subcalls of ioctl, prctl, fcntl, and their relations — but four hundred is a fair round figure. Each of these is, from the point of view of a guest process that wishes the host ill, a potential avenue into the most privileged region of the machine. A kernel bug lodged anywhere along that frontier, if reachable from the guest, is a way out.
The Sentry — which is the piece of gVisor that does the pretending-to-be-Linux — answers almost every one of those calls itself, in Go, without troubling the host kernel at all. The Sentry’s own interaction with the host, that is to say, what it itself needs in order to do its job, has been pared down with visible deliberation to about sixty-eight distinct syscalls; a seccomp filter nails it to precisely that set. The arithmetic is nearly the whole pitch: four hundred down to sixty-eight, a sixfold contraction of the frontier. It is the single compact argument that justifies the existence of the thing. What that contraction costs, in the running of a program, is the subject of the rest of this dispatch.
4
There are, or have been, three platforms in gVisor — three mechanisms by which the Sentry catches the guest’s syscalls — and one must visit all three, however briefly, to have any real picture of the landscape.
The first is ptrace, which uses the same mechanism a debugger does. The parent process stops the child at every syscall, inspects the registers, and either emulates the call or lets it pass to the host. It is the most honest of the three implementations in that it hides nothing; it is also ruinously slow, because every single syscall costs a context switch to the tracer and a context switch back, whether or not the Sentry has anything to do with it. Tight syscall benchmarks report a tax on the order of forty-to-one over native. This is the platform gVisor shipped with; it has since been, effectively, retired.
The second, and current default, is systrap. Systrap uses a seccomp filter that returns SECCOMP_RET_TRAP, which delivers each trapped syscall back to the Sentry as a SIGSYS signal handled in-process, with no tracer involved, and pairs it with a shared-memory region for argument passing. The whole business becomes less a stop-the-world interrupt and more a function call. The tax falls from ruinous to acceptable, and one can run gVisor in production on systrap without feeling foolish, which is not a trivial thing.
The third is KVM, which runs the Sentry inside a minimal hardware-backed virtual machine and turns the guest’s syscalls into vmexits rather than signals. It has the best raw performance on many workloads and the most restrictive deployment prerequisites. Most of what one reads about gVisor in the wild is about systrap, for the very good reason that systrap is what ordinary installations default to. My bench harness was systrap throughout.
5
So I built a bench harness. The harness is small — five workloads, each chosen to exercise a different part of the stack — and I ran each one first under plain runc, the unremarkable OCI runtime, and then under runsc, which is gVisor’s. The ratio between the two is the specimen I came back with.
The first workload is coldstart: the time it takes a container to come up and print a message. The second is sha256sum of a 256-megabyte blob, which is almost entirely CPU with a read call every thirty-two kilobytes — a compute-bound workload with, as it were, a modest syscall habit. The third is a pipelined Redis PING loop run over TCP with gVisor configured to use the host’s real network stack; the socket syscalls are dispatched by the Sentry but serviced, ultimately, by the host kernel’s TCP implementation. The fourth is the same loop, but in gVisor’s default network mode, in which TCP rides through Netstack — a full TCP/IP stack written in Go and carried about by the Sentry wherever it goes. The fifth is wrk driving ten seconds of HTTP against nginx.
The ratios were these. Coldstart, 4.07x: runsc containers take about four times as long to boot as runc containers, which, in absolute terms, is about five hundred milliseconds against a hundred and twenty. SHA-256, 1.41x: the Sentry-dispatch tax on roughly eight thousand read calls, amortized over the hashing work, is modest but visible. Redis-ping on host-network, 10.61x: a pipelined PING is the syscall-dense extreme — two sendto/recvfrom pairs per round trip, with almost nothing happening in between — and this ratio pins the upper bound of what Sentry dispatch, alone, can cost. Redis-ping through Netstack, 12.40x: the same workload, but with the userspace TCP/IP stack on the path; the seventeen per cent delta from the previous figure is Netstack’s own share, and one remarks that it is smaller than one might have guessed. Nginx under wrk, 5.79x: this is, I suspect, about the shape of a typical HTTP microservice running under gVisor, and it is the number I would quote to anybody who wanted one round figure.
Reading down the ratios in the order given, one watches the tax rise as the workload becomes more syscall-dense. Hashing is mostly computation and pays little. Bouncing small requests through a socket is almost all syscalls and pays nearly an order of magnitude. nginx sits between them, because nginx does a great deal of socket work but it also does parsing, response assembly, routing, and logging, some of which is comparatively cheap under gVisor. The per-syscall tax is diluted by whatever computation happens between the syscalls. This is not a surprising pattern, but there is a useful clarity in having seen it confirmed in particular numbers.
6
If the performance picture is workload-shaped, the compatibility picture is less reassuring still. A userspace kernel that reimplements Linux reimplements a great deal of it — the Sentry’s syscall table runs to several hundred entries, and the coverage of ordinary-server syscalls is very good indeed — but it does not, and in some cases cannot, reimplement all of it. perf_event_open returns ENODEV; Linux’s hardware-performance-counter infrastructure is, inside a gVisor container, simply not present. io_uring, the newer asynchronous I/O mechanism, is partially supported and, depending on version and policy, usually turned off. bpf, the system call by which the extended BPF programs that have become the ordinary way of writing Linux tracing and networking extensions are loaded and attached, is unsupported. Most of the tools one reaches for when chasing a production problem in depth — perf record, the bcc suite, bpftrace — simply fail to run inside a gVisor container.
This is not, I think, a defect so much as a shape. A userspace reimplementation of a kernel can reimplement the parts that are mostly state machines — virtual file systems, sockets, pipes — but cannot really reimplement the parts that are hardware interfaces (perf counters, hardware breakpoints) or host-global facilities (BPF program attachment, cgroup BPF). What one gets, in exchange for the sixfold attack-surface contraction, is a container that politely refuses to answer some of the most powerful questions one might want to ask of it. Whether the bargain is a good one depends entirely on whether the workload inside had ever cared about those questions. For most Web services, it did not. For a data-plane workload of any ambition, it very much did.
It is worth pausing over the picture one is usually shown of gVisor, which is the version that gets repeated on conference slide-decks and vendor blogs. In that version, gVisor is a secure-by-default default, substitutable wherever one runs containers, and the tax is described in an averaged ratio — “about 5x” — that concedes nothing about where the tax actually falls. The real version is less tidy. gVisor is not a drop-in. gVisor is a fit — suitable for workloads of a particular shape, unsuitable for workloads of other shapes, and the shape at issue is not a matter of opinion but of which syscalls the workload happens to depend upon. The averaged ratio is an effigy of the truth, not the truth itself.
7
I come back from the tour with less a conclusion than a set of impressions that do not obviously cohere, and I am going to offer them rather than pretend they do.
The first impression is that the attack-surface arithmetic is the real argument, and that the whole edifice of benchmarks and compatibility tables exists to let one determine whether one can afford to accept the argument for any particular workload. Four hundred host syscalls down to sixty-eight is not a marketing figure but a structural one, and it is what makes the project interesting even when the benchmarks are unflattering. One can always write faster code; one cannot so easily rewrite a threat model.
The second is that the performance tax is not uniform and that treating it as uniform produces bad decisions. A 1.4x penalty on hashing is, in any practical sense, nothing. A 12x penalty on localhost TCP is disqualifying for some workloads and irrelevant for others. The useful question is never how fast is gVisor; it is how syscall-dense is this particular workload, and how much of that density is socket work.
The third is that the compatibility holes are the quiet determiner. The workloads that fail under gVisor fail most often not because the tax is too high but because something they need — perf, bpf, io_uring — is simply not there. The benchmark harness was useful; the compatibility harness — a workload actually launched under runsc and watched carefully, to see whether it did or did not do what it had been meant to do — was more useful still.
And there is, finally, a fourth impression, which I offer less confidently but will nevertheless record. Having built the small and dreadful toy version first — the one described in my earlier posts — and only afterwards gone to meet the large and serious version on its own ground, one comes back from the trip with a kind of respect for gVisor that no amount of reading the source could have produced. The ratios are legible because one has, in however reduced a form, written the code that produces them. The Tour, as it turned out, was worth the going.
