The shell one-liner, which twenty years ago required a trip to Usenet and the confidence of a stranger, is now produced by a language model the moment one asks for an installation or a fix. I type curl | sh more often than I ought to, and I read what sh is about to receive less often than I should. A Go tool’s install instructions arrive as a pipe through a shell. A forum answer suggests a bash -c the length of a paragraph. The LLM, asked for a build script, cheerfully produces one and expects no audit. The terminal, being an instrument of considerable trust, obliges one to extend that trust to whatever it is about to consume. Most of the time this is fine. Some of the time it is not, and one has a poor morning of it afterwards.

What I wanted was a small third thing, between the two choices the situation presently offers. The first choice is to run the script as oneself, which is the default and, in the ordinary run of things, what actually happens. The second is to spin up a virtual machine, which involves a disk image, a kernel, a network bridge, and a ten-minute detour before the original question — “will this command do what it claims?” — can be asked. I wanted a command one could paste in front of any other command and be done with it — something that would turn the whole affair into a brief, constrained, untrusted episode, cleared away at the end.

The result is gvisor-exec, and it does this:

plain
$ gvisor-exec -- uname -a
Linux gvisor-exec 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux

The 4.4.0 is the tell. gVisor ships with a spoofed kernel version baked into its Sentry, and every program that calls uname() is answered with 4.4.0 regardless of the kernel the host is actually running. The syscalls the guest makes never reach the host kernel directly; they are intercepted and re-implemented in a userspace kernel written in Go. What arrives at the host is not the guest’s syscalls but the Sentry’s requests on the guest’s behalf, which are themselves run under a seccomp filter that refuses most of the interesting ones.

No Docker is involved. No containerd. No root. The binary wraps runsc in rootless mode, produces an OCI bundle in a temp directory, runs the command, and tears everything down.

A Command Between Bash and a Hypervisor

gvisor-exec is one Go binary of perhaps five hundred lines, half of which is the OCI spec builder and its validation. It wraps runsc --rootless behind a CLI tuned for one-shot sandboxing. The defaults encode a stance on what it means to “just run this thing”: the host / is visible read-only, writes land in an ephemeral overlay that is discarded on exit, network is off, every capability is dropped, and the sandbox receives its own PID, mount, IPC, UTS, and network namespaces. The process is isolated along every dimension the kernel offers and a few it does not.

In the ordinary idiom, the uses look like this:

shell
# A script from a forum, piped in.
cat sus.sh | gvisor-exec -- /bin/sh

# A build that reads from the current directory.
gvisor-exec -bind "$PWD:/mnt" -cwd /mnt -- make

# The same build, returning its output as a tarball on stdout.
gvisor-exec -ro-bind "$PWD:/mnt" -- sh -c 'cd /mnt && make && tar c build/' > out.tar

The implementation, taken in isolation, is embarrassingly boring. gvisor-exec composes a Config struct into an OCI runtime-spec JSON, writes the JSON to /tmp/gvisor-exec-XXXX/config.json, execs runsc --rootless ... run --bundle $dir $id, forwards stdio, waits, returns the exit code, and removes the bundle. The Go source is a templating engine in one hand and an exec.Cmd in the other. What makes the tool useful is not the code at all but the set of defaults the code encodes — and arriving at those defaults was the work. The rest of this is a record of how I arrived at them, in the order in which the arriving actually happened.
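The lifecycle just described can be sketched in a few dozen lines. What follows is a minimal sketch under stated assumptions and not the tool’s actual source: the names runscArgs, commandLine, and runOnce are mine, invented for illustration, and the flag values are simply the defaults this piece goes on to justify.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// runscArgs builds the argument vector for a one-shot rootless run.
// Global flags precede the "run" subcommand. The values mirror the
// defaults discussed in the text; the function itself is hypothetical.
func runscArgs(bundleDir, id string) []string {
	return []string{
		"--rootless",
		"--network=none",
		"--overlay2=all:self",
		"run",
		"--bundle", bundleDir,
		id,
	}
}

// commandLine renders the invocation as a single string, for logging.
func commandLine(bundleDir, id string) string {
	return "runsc " + strings.Join(runscArgs(bundleDir, id), " ")
}

// runOnce writes specJSON into a fresh bundle directory, runs runsc
// against it with stdio forwarded, and removes the bundle afterwards.
// It returns the sandboxed command's exit code.
func runOnce(specJSON []byte, id string) (int, error) {
	dir, err := os.MkdirTemp("", "gvisor-exec-")
	if err != nil {
		return -1, err
	}
	defer os.RemoveAll(dir) // the bundle never outlives the command

	cfg := filepath.Join(dir, "config.json")
	if err := os.WriteFile(cfg, specJSON, 0o600); err != nil {
		return -1, err
	}

	cmd := exec.Command("runsc", runscArgs(dir, id)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr

	err = cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The command ran and exited non-zero: report its code.
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		// runsc could not be started at all.
		return -1, err
	}
	return 0, nil
}

func main() {
	// Print the invocation only; nothing is executed here.
	fmt.Println(commandLine("/tmp/gvisor-exec-XXXX", "demo"))
}
```

The main function here only prints the command line it would run. A real caller would hand runOnce a marshalled spec and pass the returned code to os.Exit.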

The Library That Will Not Be a Library

I had begun with the fond expectation that I would import gvisor.dev/gvisor/pkg/sentry/... and drive the Sentry directly from my own Go program. This is not on offer. The gVisor packages are not structured for external consumers — their interfaces assume internal coordination with the rest of gVisor’s machinery, the build system is Bazel and not easily persuaded otherwise, and the glue between the Sentry and its platform is not publicly exposed at all. One can read the code with profit; one cannot link against it with any pleasure.

The practical path is to treat runsc as a black-box binary with a stable CLI and shell out to it. Every other OCI-layer integration I have examined takes the same route: containerd’s containerd-shim-runsc-v1, Kata’s optional gVisor mode, the various systemd experiments, the CI runners that need a per-job sandbox. runsc is the supported entry point. The Sentry is the implementation detail. That division is not incidental, and it is worth accepting without argument.

The Arithmetic of Being One’s Own Root

runsc --rootless installs a user namespace of one entry: host uid 1000 is mapped to sandbox uid 0. The sandbox’s root, in other words, is me. There is no escalation in any meaningful sense; it is a spelling of my own identity, in a namespace that permits only that one spelling and no other.

I had initially configured the sandbox process to run as uid 1000, on the theory that a sandbox process should not appear, even to itself, to be root. The sandbox started. The files in my bind-mounted directory could be read. Writing to the same files returned EACCES. I spent longer than I care to confess working out why.

The arithmetic, once one sets it down, is not subtle. A file on the host is owned by uid 1000 (myself). Inside the user namespace, that same file appears owned by uid 0 — the only uid the namespace knows how to spell. The sandbox process is running as uid 1000, which is not the file’s apparent owner and not in any group that owns it, and so for the purposes of permission checks it is “other”. Mode 0755 on a file whose owner is someone else leaves “other” with r-x. No write bit. EACCES.

The remedy is to default the sandbox uid and gid to 0. The sandbox process is then, from its own point of view, the owner of the files it wants to write to, which it was all along — the namespace having quietly relabelled my identity at the door. The mapping is still the same single-entry identity map; the process is simply standing where the mapping expects it to stand.
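The arithmetic is small enough to write down as code. This is a deliberately simplified model of the classic Unix mode check, not gVisor’s implementation: it ignores supplementary groups, ACLs, and capabilities (all of which are dropped in this sandbox anyway), and the names permClass and canWrite are mine.

```go
package main

import "fmt"

// permClass reports which permission triad applies to a process,
// comparing ids as they appear inside the user namespace.
func permClass(procUID, procGID, fileUID, fileGID uint32) string {
	switch {
	case procUID == fileUID:
		return "owner"
	case procGID == fileGID:
		return "group"
	default:
		return "other"
	}
}

// canWrite applies the write bit of whichever triad permClass selects.
func canWrite(mode, procUID, procGID, fileUID, fileGID uint32) bool {
	switch permClass(procUID, procGID, fileUID, fileGID) {
	case "owner":
		return mode&0o200 != 0
	case "group":
		return mode&0o020 != 0
	default:
		return mode&0o002 != 0
	}
}

func main() {
	// Host uid/gid 1000 is the only id mapped, and it appears as 0
	// inside the namespace, so the file looks owned by 0:0.
	const fileUID, fileGID = 0, 0

	// First attempt: the sandbox process runs as 1000:1000.
	fmt.Println(permClass(1000, 1000, fileUID, fileGID))       // other
	fmt.Println(canWrite(0o755, 1000, 1000, fileUID, fileGID)) // false

	// The remedy: run as 0:0, which is the file's apparent owner.
	fmt.Println(permClass(0, 0, fileUID, fileGID))       // owner
	fmt.Println(canWrite(0o755, 0, 0, fileUID, fileGID)) // true
}
```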

An Overlay Wider Than Advertised

runsc’s overlay behaviour is controlled by an --overlay2 flag that takes a {mount}:{medium} pair. The first half is either root, meaning an overlay on the rootfs only, or all, meaning an overlay on every mount in the bundle. The second selects where the overlay’s upper layer is stored: memory keeps it in the sandbox’s memory, self backs it with a scratch file kept alongside the mount’s own source on the host, and dir=/path names a host directory for that file.

I had started with root:self under the assumption that the rootfs overlay was what I wanted and that the bind mounts would quietly behave. This was wrong in a way that presented itself to the user as a bewildering failure. Writes to the bind-mounted directory returned EINVAL on openat(..., O_CREAT, ...). The strace was unambiguous: the syscall was legal, the target directory existed, and the flags I had passed were plausible. And yet the Sentry, in the end, returned EINVAL. One can recognise the shape of the problem without knowing anything about gVisor: the kernel has said the request is legal, and something after the kernel has disagreed.

After some time spent in runsc’s debug logs I switched to all:self. The failures stopped, every mount received its own overlay, and writes landed in temp storage as intended. I did not pursue the original defect to its source. root:self plus writable bind mounts is, I suspect, an edge case in rootless mode that the maintainers have not been handed a reproducer for, and my workaround cost nothing. I did not file an issue. I did, however, write the decision down, which is why I am writing it down again now.

The Gofer Cannot Make Your Directories

gVisor’s filesystem is served to the sandbox by a separate process called the Gofer, which speaks a file-serving protocol over a socket to the Sentry (9P in older releases, LISAFS in current ones) and holds the only handles into the host filesystem. If the rootfs is the host /, as the defaults of gvisor-exec arrange, the Gofer, running as an unprivileged user, has no permission to create directories anywhere on it. The host filesystem is to be read, not written.

The practical consequence is that bind-mount destinations must already exist on the host. gvisor-exec -bind "$PWD:/work" fails, because /work is a directory I have never created and which the Gofer cannot create on my behalf. The error runsc returns, in this case, is not a diagnostic but a condolence:

plain
cannot read client sync file: waiting for sandbox to start: EOF

The chain of events, reconstructed after the fact, is that the Gofer died attempting mkdir("/proc/fs/root/work"), the Sentry could not complete its handshake with a Gofer that was no longer on the line, and the handshake pipe accordingly reported EOF to whichever process was still reading it. The operator is told that something has gone wrong, and that is all.

I now validate this case ahead of time. If the rootfs is /, every bind-mount destination is os.Stat’d before the spec is handed off to runsc, and a missing destination produces an error of my own devising that names the missing path. The operator is told that bind destination "/work" does not exist on host, which is, unlike an EOF, a problem an operator can act on. The general lesson I should like to set down here is a habit rather than a fix: when a subprocess’s error messages are an entreaty rather than a diagnostic, catch the precondition in one’s own code and produce a diagnostic of one’s own.
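A precondition check of this kind is only a few lines. The sketch below is an illustration and not the tool’s own code: checkBindDests is a hypothetical name, and it takes the rootfs as a parameter so that it can be exercised against any directory, where the text describes the check only for a rootfs of /.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// checkBindDests stats every bind-mount destination under rootfs and
// returns an error naming the first one that does not exist. It is
// meant to run before the spec is handed to runsc.
func checkBindDests(rootfs string, dests []string) error {
	for _, d := range dests {
		_, err := os.Stat(filepath.Join(rootfs, d))
		if errors.Is(err, fs.ErrNotExist) {
			return fmt.Errorf("bind destination %q does not exist on host", d)
		}
		if err != nil {
			// Some other failure (permissions, I/O): still name the path.
			return fmt.Errorf("bind destination %q: %w", d, err)
		}
	}
	return nil
}

func main() {
	// Whether /work exists depends on the host this runs on.
	if err := checkBindDests("/", []string{"/work"}); err != nil {
		fmt.Fprintln(os.Stderr, "gvisor-exec:", err)
		os.Exit(1)
	}
	fmt.Println("all bind destinations exist")
}
```

The design choice is to fail before runsc is ever started, so that the message the operator sees is the wrapper’s and not the handshake pipe’s.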

A Network In Name Only

runsc’s default network mode, sandbox, creates a veth pair and installs iptables rules to route packets through the Sentry’s userspace network stack. Neither operation is available to a rootless user, and so the attempt announces itself plainly enough:

plain
failed to run /usr/bin/ip link add ve-runsc-...: exit status 2

--network=host is the obvious alternative, and on paper it places the sandbox in the host’s network namespace. In rootless mode this does not produce a working network so much as a skeleton of one. The routing table inside the sandbox is effectively empty; /proc/net/route showed me only its header row. DNS failed to resolve a single query. Direct connections to a literal IP address might succeed if one were fortunate in the configuration, and I did not dig further to establish the conditions under which they did. For a one-shot sandbox whose chief virtue is containment, the correct default is none, and that is what gvisor-exec picks.

Every serious tool that wants real network behaviour under a rootless-ish gVisor — the GitHub-hosted runners, modal.com’s runtime — has an out-of-band arrangement for the network. There is typically a privileged host process that pre-creates the veth, or a proxy the sandbox punches through to. For gvisor-exec I kept -network host available, documented its limitations, and moved on. A sandbox that cannot speak to the internet is, for my purposes, a feature.

Writes One Does Not Mean to Keep

I had, on arriving at the overlay question, held the view that “writes should persist to the source directory” was the obvious default. Between the EINVAL quirk and the user-namespace permission model, making this work cleanly was a project in itself, and as the yak shave lengthened I began to suspect that I had misread the requirement.

For the nine commands out of ten one would plausibly run under a tool of this kind, ephemeral writes are in fact the correct semantic. The script one does not trust should be permitted to believe that it has written its output. The output should not arrive on the host. If the operator wants some bytes back, the sandbox is given an instrument — tar to stdout, a file descriptor redirected on the way out, an explicit volume mount meant for the purpose — and the rest is discarded without ceremony. The defaults, I realised, were being asked to serve paranoia and not convenience, and paranoia is better served by forgetting than by remembering.

Having committed to the ephemeral default, the tool’s CLI fell out very naturally. The idiom matches, without any conscious borrowing, the way people use firejail, bubblewrap, and every container-based “scratch pad” the industry has produced. One writes into a thing that will evaporate. If one wants bytes out, one reaches in for them deliberately.

A Spec Smaller Than Its Reputation

I had been quietly dreading the OCI runtime spec. I had an image in my head of committee meetings and a thousand optional fields, a document composed by twelve organizations in mutual suspicion.

The spec one actually needs for a single sandboxed command is about thirty fields. Most of them are set to constants: an empty capability list, one rlimit, five namespaces, four default mounts, a process argv, a working directory, a terminal flag. It is one Go struct with some nested structs, a json.MarshalIndent, and a unit test that round-trips the result through json.Unmarshal to confirm that one has not produced JSON that one cannot oneself read back.
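To give a sense of the scale, here is a cut-down sketch of such a struct and its round-trip check. It is an illustration under assumptions and not the contents of spec.go: the JSON field names follow the OCI runtime spec, but the Go type names, the ociVersion value, the choice of RLIMIT_NOFILE with a limit of 1024, and the single /proc mount standing in for the four default mounts are all mine.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type Spec struct {
	Version string  `json:"ociVersion"`
	Process Process `json:"process"`
	Root    Root    `json:"root"`
	Mounts  []Mount `json:"mounts,omitempty"`
	Linux   Linux   `json:"linux"`
}

type Process struct {
	Terminal     bool         `json:"terminal"`
	User         User         `json:"user"`
	Args         []string     `json:"args"`
	Cwd          string       `json:"cwd"`
	Capabilities Capabilities `json:"capabilities"`
	Rlimits      []Rlimit     `json:"rlimits,omitempty"`
}

type User struct {
	UID uint32 `json:"uid"`
	GID uint32 `json:"gid"`
}

// Capabilities left empty marshals to {}, i.e. every capability dropped.
type Capabilities struct {
	Bounding  []string `json:"bounding,omitempty"`
	Effective []string `json:"effective,omitempty"`
	Permitted []string `json:"permitted,omitempty"`
}

type Rlimit struct {
	Type string `json:"type"`
	Hard uint64 `json:"hard"`
	Soft uint64 `json:"soft"`
}

type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly"`
}

type Mount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Source      string   `json:"source"`
	Options     []string `json:"options,omitempty"`
}

type Namespace struct {
	Type string `json:"type"`
}

type Linux struct {
	Namespaces []Namespace `json:"namespaces"`
}

// defaultSpec fills in the constants; only argv and cwd vary per run.
func defaultSpec(args []string, cwd string) Spec {
	return Spec{
		Version: "1.0.2", // assumed version string
		Process: Process{
			User:    User{UID: 0, GID: 0},
			Args:    args,
			Cwd:     cwd,
			Rlimits: []Rlimit{{Type: "RLIMIT_NOFILE", Hard: 1024, Soft: 1024}},
		},
		Root:   Root{Path: "/", Readonly: true},
		Mounts: []Mount{{Destination: "/proc", Type: "proc", Source: "proc"}},
		Linux: Linux{Namespaces: []Namespace{
			{"pid"}, {"mount"}, {"ipc"}, {"uts"}, {"network"},
		}},
	}
}

// roundTrip marshals, unmarshals, and marshals again, then compares bytes.
func roundTrip(s Spec) error {
	first, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	var back Spec
	if err := json.Unmarshal(first, &back); err != nil {
		return err
	}
	second, err := json.MarshalIndent(back, "", "  ")
	if err != nil {
		return err
	}
	if !bytes.Equal(first, second) {
		return fmt.Errorf("spec changed across a JSON round trip")
	}
	return nil
}

func main() {
	fmt.Println(roundTrip(defaultSpec([]string{"uname", "-a"}, "/")) == nil)
}
```

Comparing the two marshalled forms, and not the structs themselves, sidesteps the difference between nil and empty slices that a direct struct comparison would trip over.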

One now understands, in a concrete way, why every container runtime that is not runc or runsc — Kata, Youki, crun — is implementable in a reasonable amount of code. The runtime spec is small. The hard part is the material that surrounds it: the image-unpacking and filesystem-layering, the CRI integration, the platform syscall interception. gvisor-exec gets to skip every last one of those, because runsc does them.

Afterwards

The project is a Go binary of the sort that lives in a single directory, a handful of test files, a Makefile, and five runnable examples. It is at github.com/mtclinton/gvisor-exec, and it requires runsc on the host; the static binary from gvisor.dev installs in a single curl. It is, by any honest metric, a wrapper. What I took from building it, though, is not the wrapper but the set of postures the wrapper encodes.

The first is that gVisor’s threat model is a model of interception, not of permission. The interesting thing about a sandboxed command is not the long list of capabilities it does not have; it is the fact that its syscalls do not reach the host kernel directly. A kernel vulnerability inside write() is, as my earlier gVisor writeups keep wanting to say, of very little consequence to a sandbox whose guest can never invoke the host’s write() itself. This reframes the defaults. One turns off the network not because the command might try to exfiltrate, but because the small surface of the Sentry’s network stack is still larger than the surface one wants the untrusted command to have access to. One runs with no capabilities not to prevent the command from doing anything with them, but to shrink the number of behaviours the Sentry has to correctly re-implement.

The second is that ephemeral state is a sharper tool than permission state. A file the command wrote but which does not exist after the command has exited is more strictly controlled than a file the command wrote to a location one had tried to permission correctly. The former cannot be left behind; the latter can, and one has to reason about the manner in which it might be left. In every non-trivial sandbox I have built or configured, the cheapest safety comes from things that do not persist, not from things that were not allowed.

The third is that the hardest work in a tool like this lives in the defaults, and the defaults are load-bearing in proportion to how unobvious they were to arrive at. Sandbox uid 0, all:self, -network none, a read-only host /, bind-mount preconditions validated in the wrapper: every one of these was a day I spent not understanding something, and every one of them was subsequently hidden behind a flag the operator does not need to set. A good wrapper is a monument to the bad mornings of the author who wrote it. I take this, on the whole, as a reasonable arrangement.

The repository, again, is github.com/mtclinton/gvisor-exec; it is MIT-licensed and perhaps five hundred lines of Go. The examples under examples/ are the quickest way to acquire a working sense of the CLI. If one is curious to read the defaults and the reasons for them in the place they actually live, spec.go is the file to start with.