A containerd shim is, in the usual case, an undramatic piece of software. It receives a few RPC calls from containerd, forks a Linux process to run the container’s workload, reports the exit code back, and ends. wisp is one of these, and it does not fork. When containerd asks it to start a container under the io.containerd.wisp.v1 runtime, the shim loads the image’s payload as a WebAssembly module and runs it inside a Wasmtime instance. From containerd’s point of view there is another runtime on the host, alongside runc and runsc, doing the same offices for image pull, lifecycle, and stdio. From the workload’s point of view there is no Linux at all — no syscalls, no kernel, no namespaces — only a WASM guest and the small set of WASI bindings it is permitted to touch.

This is the fifth in a series of small sandboxes I have been writing on this blog, and it pulls on the same question as each previous entry: where the right place to draw the isolation boundary is. ironbox drew it at the kernel — namespaces and cgroups, in the manner of every Linux container of the past decade. The gVisor cluster — mini-sentry, a tour of the gVisor front, hijacking signals in Go, running what one did not write — drew it at the syscall, every guest call intercepted and emulated in userspace. A microVM small enough to read drew it at the CPU, with a second kernel under KVM and hardware-enforced separation. The quieter sandbox tightened the syscall boundary without changing where it was. wisp is the fifth answer, and it is the strangest of the five. It draws the boundary by removing one side of it.

The argument is structural rather than enforced. A Linux container draws a line and asks the kernel to keep the contents on the inside of it; a userspace kernel intercepts the line and emulates what should have happened on the other side; a microVM gives the guest its own kernel and asks the hardware to keep the two apart. wisp does none of these things. There is no kernel inside a WASM guest because WASM is not a kernel-oriented machine — it is a stack machine with linear memory, deterministic semantics, and no concept of syscalls. A guest cannot escape what is not there. What follows is an account of what I encountered while wiring this up against real containerd, which was kinder to me in some respects than I had expected and decidedly less kind in others.

The Shim and the Module

containerd’s shim v2 ABI is a small protocol with surprisingly many corners. The shim is a binary; containerd invokes it once with a start subcommand to spawn the long-running server, then connects to it over a Unix socket and sends TTRPC calls for the rest of the container’s life. The calls are the ones one would expect — Create, Start, State, Wait, Kill, Delete, Shutdown, plus a handful of others that are less obvious until they catch you out. Each running container has its own shim process (the “shim-per-container” model, which replaced the older “shim-per-task”). The shim’s job is to translate these calls into whatever the underlying runtime can act on — runc invocations for a Linux container, runsc for a gVisor sandbox, and, in our case, a Wasmtime instance.
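
Laid end to end, one ctr run drives roughly this sequence. A sketch of the call order as this post encounters it, not the full surface:

plain
containerd → shim binary:   `start` subcommand: fork the server,
                            print the socket address on stdout
containerd → TTRPC socket:  Create → Start → Connect → Wait
                            (Kill, if someone asks)
                            Delete → Shutdown: the shim exits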

Wasmtime, for its part, takes module bytes, a WASI context (env vars, args, preopened files, stdio), and a function name; instantiates the module; calls the function; and returns either a clean exit, a trap, or a host-side error. A proc_exit(n) from inside the guest surfaces as a particular kind of trap that the host can downcast and read n from. A division by zero, an out-of-bounds load, or an unreachable instruction surfaces as a different sort of trap. All of these have to be made to look like containerd’s (exit_code, exited_at) shape, and the shim is the gearbox in which the two vocabularies are reconciled.
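
A sketch of that hot path, under the module layout recent wasmtime-wasi releases have used (the preview1 paths have moved between versions, so treat the use lines as approximate rather than as the repo's exact code):

rust
use wasmtime::{Engine, Linker, Module, Store};
use wasmtime_wasi::preview1::{add_to_linker_sync, WasiP1Ctx};
use wasmtime_wasi::WasiCtxBuilder;

fn run_module(engine: &Engine, bytes: &[u8]) -> anyhow::Result<()> {
    let module = Module::new(engine, bytes)?;
    let wasi: WasiP1Ctx = WasiCtxBuilder::new()
        .inherit_stdio() // stdio flows through to the shim's FIFOs
        .build_p1();
    let mut linker: Linker<WasiP1Ctx> = Linker::new(engine);
    add_to_linker_sync(&mut linker, |cx| cx)?;
    let mut store = Store::new(engine, wasi);
    let instance = linker.instantiate(&mut store, &module)?;
    instance
        .get_typed_func::<(), ()>(&mut store, "_start")?
        .call(&mut store, ()) // Ok(()), an I32Exit "exit", or a real trap
}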

I wrote the shim against containerd-shim 0.11 (the Rust crate published from containerd’s rust-extensions repository, the same foundation runwasi builds on) and wasmtime 44. The trait surface for an engine is small enough to read:

rust
pub trait WasmEngine: Send + Sync {
    /// Load the module from the bundle and prepare its WASI context.
    fn create(&self, cfg: &ContainerConfig) -> Result<()>;
    /// Begin executing the guest's `_start` on a background thread.
    fn start(&self) -> Result<()>;
    /// Block until the guest finishes, or until `timeout` elapses.
    fn wait(&self, timeout: Option<Duration>) -> Result<Option<ExitStatus>>;
    /// Interrupt the guest; return its exit status if it has one.
    fn kill(&self) -> Result<Option<ExitStatus>>;
}

The first implementation, NoopEngine, sleeps for five seconds and reports exit 0. The second, WasmtimeEngine, actually runs the guest. Both live behind the same trait so the shim does not know which one it has. The exit-code convention I settled on, after a small skirmish over what to do with traps:

plain
guest behaviour                          exit code
─────────────────────────────────────────────────
_start returns cleanly                   0
proc_exit(n) (downcast I32Exit)          n
trap (OOB, div0, unreachable, …)         1
epoch interruption from kill()           137

The last row is the interesting one. WASM has no signals; “kill” inside a Wasmtime instance does not mean what it means in Linux. The mechanism Wasmtime provides for it is the epoch interruption — the engine carries a counter, each store can ask the engine to trap when the counter advances past a deadline, and Engine::increment_epoch() is the only way to make that deadline pass. I arm the deadline at 1 on store creation, never advance it under normal execution, and call increment_epoch() from kill. The guest sees a trap whose ultimate cause is, simply, that I asked. The run thread then post-processes: if the engine was killed and the natural exit code would have been 1 (since an epoch trap looks like any other trap), rewrite it to 137. The matrix stays tidy and the implementation does not leak through the trap path.
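
In code the arrangement is small. A sketch under the same convention (the killed flag and the surrounding plumbing are elided, and the names approximate the repo's rather than reproduce it):

rust
// Engine setup: compile epoch checks into the guest's code.
let mut config = wasmtime::Config::new();
config.epoch_interruption(true);
let engine = wasmtime::Engine::new(&config)?;

// Store setup: trap as soon as the epoch advances past its starting value.
let mut store = wasmtime::Store::new(&engine, wasi_ctx);
store.set_epoch_deadline(1);

// kill(): the only place the epoch ever moves.
engine.increment_epoch();

// Run-thread post-processing, per the table above.
let exit_code = match run_result {
    Ok(()) => 0,
    Err(e) => match e.downcast_ref::<wasmtime_wasi::I32Exit>() {
        Some(wasmtime_wasi::I32Exit(n)) => *n, // guest called proc_exit(n)
        None if killed => 137,                 // epoch trap rewritten
        None => 1,                             // any other trap
    },
};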

That is enough plumbing to run a hand-rolled (module (func (export "_start") (i32.const 0) (drop))) to clean exit and to interrupt an infinite-loop module in under five seconds. The unit tests passed. I was, for a pleasant interval, under the impression that the project was most of the way done.

Hello, Containerd

The integration test of any containerd runtime is ctr run. I installed the binary at /usr/local/bin/containerd-shim-wisp-v1 (the name containerd expects, derived from the runtime ID io.containerd.wisp.v1 by taking its last two components, wisp and v1, and joining them behind the containerd-shim- prefix), built a hello-world WASM module by writing a five-line Rust crate that did println!("hello") and compiling it for the wasm32-wasip1 target, and packaged the result as an OCI image archive — a tar containing oci-layout, an index.json, and a blobs/sha256/... tree referenced by the index. The archive’s manifest declared os=wasi, architecture=wasm. I ran:

bash
sudo ctr -n default image import /tmp/wisp-hello.tar
sudo ctr -n default run --rm \
    --runtime io.containerd.wisp.v1 \
    wisp.local/hello-wasm:latest hello

The result was a curt, definite error: ctr: image not found. The image had imported without complaint and then immediately not been there. This is a class of bug that announces itself as a mistake and then refuses to identify which one. I checked the manifest with ctr image ls; the entry was gone. I re-imported with -v; the verbose output mentioned image might be filtered out somewhere in the middle of the dump, in a sentence that did not bother to be alarming. ctr image import filters by host platform by default, and a linux/amd64 host does not, on inspection, recognise itself in a wasi/wasm manifest. --all-platforms is the flag, and once added the image stayed. This was the first lesson in what would shortly become a longer one: the well-trodden parts of the containerd toolchain make assumptions about what a container looks like, and a sandbox by absence violates a great many of them.
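
The import that sticks:

bash
# Keep manifests whose platform is not the host's (wasi/wasm included):
sudo ctr -n default image import --all-platforms /tmp/wisp-hello.tar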

The image now in place, the next ctr run produced failed to create TTRPC connection: unsupported protocol. This was less encouraging than not found and, as it turned out, the beginning of an afternoon.

The Two Errors That Were One

The next thing one learns about a containerd shim is that the start subcommand has an unusual contract. containerd invokes the shim binary with a set of flags and the subcommand start; the shim is expected to fork a long-running server, bind a Unix socket, and print the socket’s address on stdout. containerd parses this address out of the shim’s stdout, connects to it, and uses that connection for every subsequent RPC. The protocol is human-readable: a single line, unix:///run/containerd/s/<hash>, with no preamble and no postamble. If anything else lands in containerd’s reader before the address, the URL parses as something else, and the connection fails.

In containerd 2.x, the shim’s stderr is wired into the same reader as its stdout during this lifecycle invocation. I do not know whether this is intentional or whether the abstraction has merely loosened over time; I know only that any byte I write to fd 2 during start mode lands in the buffer containerd is parsing for the URL. The first thing I had written into main was a call to env_logger::Builder::from_env(...).init(). The diagnostic logs that initialisation produced, helpful in development, lodged themselves above the address line in the buffer containerd read, and the address-parser confronted something like [INFO wisp] invoked as io.containerd.wisp.v1: ...\nunix:///run/containerd/s/abc123. The parser took the first thing that looked URL-shaped, which was not a URL, and called the result unsupported protocol.

I removed the env_logger init, anticipating success. The next ctr run produced a different error: dial unix /run/containerd/s/<hash>: connect: no such file or directory. The address was now correct, but the socket on the other end had ceased to exist. This was the second of the two errors, and it had the same cause as the first one wearing a different costume.

Server mode — the long-running TTRPC mode the shim enters after start — has its own log infrastructure. The containerd-shim crate’s bootstrap routine calls containerd_shim::logger::init, which does log::set_boxed_logger against a writer wired to containerd’s per-container log FIFO. The log crate enforces that set_boxed_logger may be called at most once per process. If env_logger had won the race (because I called it from main, before the crate’s bootstrap), the crate’s init returned SetLoggerError, bootstrap returned Err to the shim binary, the child died before binding its socket, and ctr was left dialling a path that never existed. The two errors looked nothing alike at the failure site. They were, mechanically, the same error twice.

The fix was three lines, which I append here with the apology one is obliged to make to a reader who has just been asked to follow three pages of diagnosis for them:

rust
// Don't call env_logger::init here. containerd 2.x merges stderr
// into stdout during `start` (corrupts the address), and in server
// mode the containerd-shim crate's logger::init fights us over
// log::set_boxed_logger (bootstrap fails before binding).
// Let the crate own logging.
shim::run::<WispShim>(RUNTIME_ID, None).await

I keep the clap CLI struct that parses the wire surface as runnable documentation — when the post asks “what does containerd actually send a shim binary?” the answer is right there in Cli — but I do not call it. The struct is exercised from unit tests so it cannot rot. The behaviour of main is to hand off and stay silent. There is, one is obliged to concede, a small comedy in the discovery that the bug was that I had been too loud.

The Method I Did Not Implement

The next run produced Connect is not supported, and aborted before the WASM guest had a chance to run. This was a different sort of error from the previous two — not a wiring fault but a missing method.

The Task trait in containerd-shim 0.11 has a default implementation for every method that returns Error::Unimplemented. The intent is sensible: a shim need only implement the methods it uses, and the rest will produce a polite “not supported” rather than a crash. The catch is that ctr run calls Task::connect immediately after Start — to attach to the task’s stdio before it begins producing output. If the default returns Unimplemented, the run aborts at the precise moment the guest is about to print something.

A Task::connect implementation in wisp does nothing surprising:

rust
async fn connect(
    &self,
    _ctx: &TtrpcContext,
    _req: api::ConnectRequest,
) -> TtrpcResult<api::ConnectResponse> {
    let mut resp = api::ConnectResponse::default();
    resp.shim_pid = std::process::id();
    resp.task_pid = std::process::id();
    resp.version = "wisp".to_string();
    Ok(resp)
}

There is exactly one process — the shim itself — and the WASM guest does not have a PID in any sense Linux would recognise. Returning the shim’s own PID for both fields is the convention runwasi uses, and it is what tools that ask about the task’s PID will get. The educational moment here, if there is one, is that defaults are a polite ambush. Had the trait left connect without a default, the type system would have forced me to implement it; had the default been a panicking unimplemented!(), the omission would at least have crashed loudly. The trait’s default of Err(Unimplemented) is more humane and less safe. Anywhere a library hands one a trait with default Err implementations, one had better look up which methods are actually called in the happy path.

The Rootfs One Mounts Oneself

The shim now spoke the protocol. The next ctr run reached the WasmtimeEngine and asked it to load module.wasm from the rootfs path the shim had received in the CreateTaskRequest. The path was /run/containerd/io.containerd.runtime.v2.task/default/hello/rootfs. It was empty.

In containerd 1.x the rootfs at <bundle>/rootfs/ was prepared by containerd itself: snapshots were mounted, layers were unpacked, the path was an actual directory the shim could open() into. In containerd 2.x this convention was relaxed. The CreateTaskRequest now carries a Vec<Mount> and the shim is expected to either run mount(2) for each mount itself or read directly from the snapshot’s underlying source. The change is a sensible bit of decoupling — different runtimes want different views of the layers — but it caught me out, because every example I had read assumed the older shape.

The cleanest answer would have been to call nix::mount::mount over the overlay specification containerd sent. I did not do this, not yet. I took a shortcut: containerd’s overlay mount carries a lowerdir= option in its mount options string, and the layer containing module.wasm is, for a single-layer WASM image, the only entry in lowerdir. So resolve_rootfs parsed the options string, fished out the lowerdir, and read module.wasm straight from there. The proper mount handling would in due course replace this — it does, in the version of the repo a reader visits today — but the shortcut was sufficient to get the next ctr run to find the module, and the rest of the matrix wanted writing more than the rootfs assembly wanted polishing.
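
A sketch of that shortcut, with field names approximating the containerd-shim protos (the real resolve_rootfs has since given way to proper mount handling):

rust
use std::path::{Path, PathBuf};
use containerd_shim::api::Mount;

// Parse containerd's overlay mount options and read straight from the
// single lowerdir layer instead of mounting anything.
fn resolve_rootfs(mounts: &[Mount], bundle_rootfs: &Path) -> PathBuf {
    for m in mounts {
        if m.type_ == "overlay" {
            for opt in &m.options {
                if let Some(lower) = opt.strip_prefix("lowerdir=") {
                    // Single-layer WASM image: one lowerdir entry, and
                    // module.wasm sits directly inside it.
                    return PathBuf::from(lower.split(':').next().unwrap_or(lower));
                }
            }
        }
    }
    bundle_rootfs.to_path_buf() // fall back to the bundle's rootfs path
}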

The point worth taking from this section, and from the previous two, is one I should like to leave plainly: a sandbox runtime is mostly not sandbox code. It is glue. The Wasmtime side of wisp — the part that runs the WebAssembly — is a small, tidy module that calls perhaps eight Wasmtime functions and has 200 lines of code. The containerd side is more than four times the size, and almost all of the bugs lived in it. The interesting intellectual content of the project is in the engine; the interesting practical content is in the gearbox.

Hello

After the rootfs change, the next ctr run produced this:

plain
$ sudo ctr -n default run --rm \
    --runtime io.containerd.wisp.v1 \
    wisp.local/hello-wasm:latest hello
hello
$

A WebAssembly module compiled from five lines of Rust, packaged as an OCI image, pulled by containerd, handed to a shim that does not fork, run inside a Wasmtime instance that has no concept of Linux, with its stdout wired through a FIFO that containerd attached to ctr’s terminal. Five distinct subsystems agreeing for a moment to produce one word.

I will spare the reader an account of the smaller surprises that followed — the snapshot leftovers from a half-failed previous run that had to be cleaned up by hand, the moment I confused container name with image tag and chased the wrong error for fifteen minutes — and note only that the integration test, having been written to be unflattering, was unflattering precisely until it was not.

After the Hello: The Matrix

The single hello on stdout proved the wiring; the next several days proved the wiring carried weight. Each cell of the exit-code matrix wanted its own probe — a WASM module that would trigger that particular outcome and let me observe what containerd reported. I rewrote the hello-world crate as a multi-mode probe: a single binary whose first argument selected what to do next. proc_exit 42 would call WASI’s proc_exit(42) and let me check that ctr reported 42. trap would deliberately trap. loop would spin forever and let me kill it from another shell. print-stderr would write to fd 2; read-stdin would echo a line.
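
Driving the cells looked like this (a sketch; the container names are illustrative):

bash
sudo ctr -n default run --rm --runtime io.containerd.wisp.v1 \
    wisp.local/probe:latest p1 proc_exit 42   # ctr should report 42
sudo ctr -n default run --rm --runtime io.containerd.wisp.v1 \
    wisp.local/probe:latest p2 trap           # ctr should report 1
# The kill cell takes two shells:
sudo ctr -n default run --rm --runtime io.containerd.wisp.v1 \
    wisp.local/probe:latest p3 loop           # shell A: spins
sudo ctr -n default task kill p3              # shell B: engine maps to 137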

Each cell taught me something. The three worth recording here are the ones whose lessons were not specific to wisp.

The first was a question of where the probe found its mode. The OCI image’s Entrypoint was set to /module.wasm, and I had assumed ctr run --runtime io.containerd.wisp.v1 wisp.local/probe:latest myname trap would hand the guest an argv consisting of ["/module.wasm", "trap"] — the entrypoint preserved at argv[0], my appended argument at argv[1]. So the probe read its mode from args.get(1). Every mode I selected, the probe ran the default — it printed hello. This persisted, with mounting indignity on my part, through several careful rebuilds. The behaviour ctr run actually implements is total replacement: the trailing arguments overwrite the OCI Entrypoint+Cmd entirely rather than being appended to them. The guest’s argv was simply ["trap"]: myname is the container ID, consumed by ctr itself, and nothing of the Entrypoint survives. The mode I had asked for sat in args[0], while the probe, expecting the entrypoint there, looked one slot further along and found nothing it recognised. Reading from args.first() instead of args.get(1) was a one-character fix. The fifteen minutes I had spent hunting for a bug in the WASI argv plumbing was an honest tax on the assumption that ctr worked the way docker works.
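
For the record, the probe's dispatch after the fix, as a sketch with the mode strings from above (the repo's probe crate is the authority):

rust
fn main() {
    // ctr's trailing arguments replace Entrypoint+Cmd wholesale, so the
    // mode is the guest's argv[0]: args.first(), not args.get(1).
    let args: Vec<String> = std::env::args().collect();
    match args.first().map(String::as_str) {
        Some("proc_exit") => {
            // Exit code from args[1], per the `proc_exit 42` invocation;
            // std::process::exit lowers to proc_exit on wasm32-wasip1.
            let code = args.get(1).and_then(|s| s.parse().ok()).unwrap_or(0);
            std::process::exit(code);
        }
        Some("trap") => { /* the trap cell, covered below */ }
        Some("loop") => { /* the kill cell, covered below */ }
        Some("print-stderr") => eprintln!("probe: hello from fd 2"),
        Some("read-stdin") => {
            let mut line = String::new();
            std::io::stdin().read_line(&mut line).ok();
            print!("{line}");
        }
        _ => println!("hello"), // the default every broken mode fell into
    }
}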

The second was the trap cell, and it deserves its own paragraphs because it is the most instructive of the matrix bugs. The probe’s trap mode looked like this:

rust
"trap" => unsafe { std::hint::unreachable_unchecked() },

unreachable_unchecked is the Rust standard library’s way of telling the compiler that a code path will never execute. It is unsafe because if the path does execute, the program is in undefined-behaviour territory. In a debug build it lowers to a panic; in a release build it lowers to whatever the compiler decides is most useful, which may be nothing at all. I was building release with link-time optimisation enabled, and expected an unreachable opcode in the WASM and a Trap::UnreachableCodeReached from Wasmtime when the guest ran. What I got was hello.

The compiler had read the hint with more attention than I had paid to it. unreachable_unchecked is not an instruction; it is a promise to the optimiser — the promise that the call will never be reached. The link-time optimisation pass took me at my word, proved the arm containing the trap was dead code (because no caller could reach it without violating my own promise), and dead-code-eliminated the entire arm. The probe’s match statement now had no "trap" case at all; it fell through to the default arm, which printed hello. The trap was not absent because of a Wasmtime bug. It was absent because I had told the compiler not to emit it, and the compiler had believed me.

The fix was to switch from a UB-hint to an actual instruction:

rust
"trap" => core::arch::wasm32::unreachable(),

core::arch::wasm32::unreachable lowers to a real unreachable opcode in the emitted WASM, regardless of optimisation level. The compiler does not have an opportunity to optimise it away because there is nothing to optimise — it is the instruction itself, embedded in the binary. The trap mode now traps, Wasmtime reports Trap::UnreachableCodeReached, the engine maps it to exit 1, ctr reports 1. The cell works. There is a parallel here to the signals post’s ABIInternal wrapper, and it is worth naming. In both cases the compiler did something kind that turned out to be the bug. The Go toolchain wrapped my signal handler’s entry point in a frame-management prologue I had not written; Rust’s optimiser took my unreachable_unchecked hint as the gospel it had been advertised as. Anywhere the host language meets a low-level contract — signal entry, syscall trampolines, FFI, the WASM ABI — one must assume that the compiler is doing something helpful, and that one will sooner or later have to undo it.

The third was the kill cell, and its lesson concerns Wasmtime’s epoch interruption mechanism more specifically. I needed a probe that ran an unbounded loop, so a separate shell could ctr task kill it and I could verify the engine produced exit 137. The simplest such loop is an empty one:

rust
"loop" => loop { std::hint::black_box(()); },

black_box exists to defeat optimisations that would otherwise eliminate the loop body. With it, the loop body is a no-op the optimiser cannot prove away, and the loop itself remains in the emitted WASM. I built the probe, ran it, killed it from another shell, and observed that the entire host went still. ctr task delete blocked, the shell that issued the kill stopped responding, and Ctrl-C produced no result because the shell was waiting on an RPC that was waiting on a wait that was waiting on a guest that was not going to exit. The mechanism I had relied on — Wasmtime’s epoch interruption — does not work uniformly on every loop a WASM module might contain. The epoch pass instruments certain back-edges in the control-flow graph with a check against the engine’s epoch counter; the empty loop { black_box(()) } produces a particular shape of WASM loop whose back-edge the pass does not instrument. The guest spun, the engine’s increment_epoch() did exactly what it was supposed to do, and nothing trapped because nothing was watching for the trap. The whole host hung on a single uninterruptible loop opcode. Giving the loop body a counter increment — loop { i = i.wrapping_add(1); std::hint::black_box(i); } — produces a back-edge the epoch pass does instrument, and the kill cell works. There is, one is obliged to concede, a small horror in having pinned an entire Linux box on a loop {}, even temporarily. It is the kind of thing one writes a careful comment about and never forgets.

Where wisp Sits, and What It Removes

The five sandboxes I have built on this blog form a small library on a single question. It is worth saying plainly what each of them answers.

ironbox draws the boundary at the kernel. The container shares the host kernel with everything else on the box; isolation is enforced by namespaces (the kernel pretends each container has its own PID space, network, mount namespace, and so on) and by cgroups (the kernel meters each container’s resource use). Every syscall the container makes is a real syscall against the real kernel. A kernel exploit inside, say, setsockopt is a problem for every container on the box, because every container is using the same setsockopt.

The gVisor cluster — mini-sentry, the tour of the gVisor front, the signals notes, the running what one did not write essay — draws the boundary at the syscall. The guest’s syscalls are intercepted in userspace (by ptrace or seccomp-unotify) and emulated by a userspace kernel rather than reaching the host kernel. A kernel exploit in setsockopt is no longer a problem, because the host kernel is no longer running setsockopt — my Go code is. The trade-off is that the userspace kernel is now a non-trivial piece of software with its own bugs.

A microVM small enough to read draws the boundary at the CPU. The guest runs in its own kernel inside a KVM virtual machine; the host kernel never touches the guest’s syscalls because the host kernel is on the other side of a hardware-enforced separation. The trade-off is the price of the second kernel, in memory and boot time, and the need to keep both kernels patched.

The quieter sandbox does not change where the boundary is; it makes the existing boundary smaller. Observation-driven seccomp profiles produce a tighter syscall allowlist than any human will write by hand. The boundary is still at the syscall, where ironbox put it; what has changed is its area.

wisp does not draw a boundary. The WASM guest has no syscalls because WASM is not a syscall-oriented machine. The instructions it executes are stack manipulations, memory loads against its linear memory, and calls to host-provided functions whose signatures are declared in advance. There is no setsockopt to intercept and no setsockopt to delegate; if a guest wants something the host has not declared a function for, the guest cannot ask for it. The isolation is structural — it inheres in the shape of the language — and not enforced by anyone in particular.

This is a different kind of safety property from any of the four above. ironbox’s safety depends on the kernel being correct. mini-sentry’s depends on the userspace kernel being correct. The microVM’s depends on the hardware and the host kernel being correct. wisp’s depends on Wasmtime correctly executing WASM bytecode — which is a smaller claim, against a smaller target, with a smaller attack surface. It is also a more restrictive kind of safety, because most existing software cannot be run inside a WASM guest at all; it must be compiled to a target that does not assume the existence of an operating system. There is no free lunch here. wisp removes a class of attacks by removing the surface they would have run against, and removes a great deal of useful software along with the attacks.

What I Actually Learned

Three things seem to me worth writing down.

The first is that a runtime is mostly its gearbox. The Wasmtime portion of wisp, which is the part that actually executes WebAssembly, is short and tidy and untroubled. The containerd portion, which is the part that translates between containerd’s vocabulary and the engine’s, is four times the size and contained every bug in this account. When one sets out to write a runtime, one is, in practice, signing up to write a translation layer between two existing pieces of software. The novelty of the runtime is in the engine; the cost of the runtime is in the gearbox. This pattern is, I suspect, general.

The second is that one’s tools ambush quietly. The Task::connect method in containerd-shim came with a default that returned Unimplemented; the consequence was that ctr run aborted at the precise moment my code was about to produce output. Rust’s optimiser, given a std::hint::unreachable_unchecked() in a release build with link-time optimisation enabled, dead-code-eliminated the entire arm of a match that called it — because the hint had promised the call would never be reached — and the symptom was that the trap cell of my probe printed hello instead of trapping. Both ambushes shared a shape: a piece of the language’s surrounding machinery, a default trait method or a compiler optimisation pass, was doing something polite and reasonable that turned out, at exactly the wrong site, to be the bug. Anywhere the language hands one a contract whose breach is a courtesy rather than a panic, one had better look up which of those courtesies will obtain in the happy path of one’s code.

The third is the structural one, and it is the most important of the three. There are five places to draw an isolation boundary — the kernel, the syscall, the hardware, the syscall-tightened-by-observation, and the ABI-by-absence — and one chooses among them before one chooses anything else. Each choice has a cost the other choices do not have. ironbox’s cost is sharing a kernel. mini-sentry’s cost is implementing one. The microVM’s cost is running two of them. The quieter sandbox’s cost is the observation pass that comes before the policy. wisp’s cost is that most existing software cannot run inside it. The right question, faced with a real workload, is not which sandbox is best but which cost one would rather pay. The five posts taken together are an attempt to make the choices visible in the same room, side by side, so the question can be asked properly.

The source is at github.com/mtclinton/wisp. The shim’s TTRPC service lives in src/shim.rs; the Wasmtime engine in src/wasmtime_engine.rs; the smoke procedure for ctr run against the full exit-code matrix, along with the hard-won lessons recorded as they were discovered, in TESTING.md. For the architectural prelude — the question the series has been working on — see ironbox and the gVisor cluster linked above.