ironbox - Building a Container Runtime from Scratch in Rust

I built a container runtime. Not a wrapper around runc, not a shim that delegates to someone else’s code — an actual OCI runtime that uses fork, unshare, pivot_root, and mount directly. It’s called ironbox, and it’s on crates.io.

Why

I wanted to understand how containers actually work at the syscall level. Everyone uses Docker or containerd, but few people know what happens between “run this image” and “your process is isolated.” I figured the best way to learn was to build one.

The goal was to start with a containerd shim that delegates everything to runc, then incrementally replace each piece with native Rust until runc isn’t needed at all.

What it does

ironbox is a containerd shim v2 runtime. You install it, point containerd at it, and run containers:

cargo install ironbox
sudo cp ~/.cargo/bin/containerd-shim-ironbox-v1 /usr/local/bin/
sudo ctr run --runtime io.containerd.ironbox.v1 docker.io/library/alpine:latest test1 echo hello

That echo hello runs inside an isolated container with its own PID namespace, mount namespace, network namespace, cgroup, and rootfs — all set up by ironbox using Linux syscalls directly.

How it works

The core of the runtime is a double-fork pattern:

  1. The shim forks a middle process
  2. The middle process calls unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | ...) to create new namespaces
  3. It forks again — the grandchild becomes PID 1 in the new PID namespace
  4. The grandchild sets up rootfs (pivot_root), mounts (/proc, /sys, /dev), devices, environment, rlimits, capabilities, seccomp filters, and then waits
  5. When containerd calls “start”, the shim writes to a sync pipe; the grandchild reads it and calls execvp on the container entrypoint
  6. The middle process waits for the grandchild and propagates the exit code back to the shim

That sync pipe trick is what makes the OCI lifecycle work: create sets up the container but doesn’t start it; start is a separate call. The pipe bridges those two operations across process boundaries.

The hard parts

pivot_root is picky. My first attempt failed with EINVAL because I didn’t make the mount tree private first. The correct sequence is: mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) to stop mount events propagating back to the host, then bind-mount the rootfs onto itself so it becomes a mount point (pivot_root requires its new root to be one), then chdir into it, then pivot_root(".", "oldrootfs"). Getting that order wrong gives you a cryptic EINVAL and no container.
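As a sketch, the sequence looks roughly like this. Treat it as pseudocode: it uses nix-crate-style calls, must run as root inside the new mount namespace, and the `rootfs` variable and `oldrootfs` directory are illustrative names, not necessarily ironbox’s.

```
use nix::mount::{mount, MsFlags};
use nix::unistd::{chdir, pivot_root};

// 1. Stop mount events propagating back to the host.
mount(None::<&str>, "/", None::<&str>,
      MsFlags::MS_SLAVE | MsFlags::MS_REC, None::<&str>)?;
// 2. Bind-mount the rootfs onto itself so it becomes a mount point
//    (pivot_root refuses a new root that isn't one).
mount(Some(rootfs), rootfs, None::<&str>,
      MsFlags::MS_BIND | MsFlags::MS_REC, None::<&str>)?;
// 3. Enter it, pivot, and land in the new root.
//    "oldrootfs" must already exist under the rootfs.
chdir(rootfs)?;
pivot_root(".", "oldrootfs")?;
chdir("/")?;
```

After this the old root is still mounted at /oldrootfs and has to be unmounted and removed before the container starts.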

PID namespace + fork = headaches. When you unshare(CLONE_NEWPID), the calling process is NOT in the new namespace — only its children are. I first tried a single fork, but then sh -c 'echo hello | cat' would fail with “can’t fork: Out of memory” because there was no PID 1 in the namespace to parent the new processes. The double-fork fixed this but introduced a new problem: the shim monitors processes via waitpid on direct children, but the actual container process (the grandchild) isn’t a direct child. I spent a while debugging containers that would run but never exit. The fix was having the middle process wait for the grandchild and propagate the exit code, while the shim monitors the middle process.

Cgroup limits of zero. The OCI spec from containerd sometimes includes a memory section with limit: 0. I was writing that as memory.max=0 in the cgroup, which means zero bytes of memory allowed. Every fork got OOM-killed instantly. The fix was simple — skip limits that are zero or negative — but it took a while to figure out why echo hello worked fine but echo hello | cat got killed.

What’s in the box

The runtime handles the full container lifecycle:

  • Create — double-fork, namespace isolation, rootfs pivot, OCI mounts, cgroup v2 resource limits, capability dropping, seccomp BPF filters, AppArmor/SELinux profiles, uid/gid switching, loopback networking
  • Start — sync pipe signal, process exec
  • Kill — direct kill(2) syscall
  • Delete — cgroup cleanup, rootfs unmount
  • Exec — setns into container namespaces, fork, exec
  • Pause/Resume — cgroup v2 freezer
  • Stats — direct cgroup metrics
  • Checkpoint/Restore — CRIU integration

The code is structured as a set of modules under src/runtime/:

src/runtime/
├── container.rs      — the double-fork + lifecycle
├── exec.rs           — setns + fork/exec for exec
├── rootfs.rs         — pivot_root, mounts, devices
├── namespace.rs      — unshare/setns helpers
├── cgroup.rs         — cgroup v2 create/apply/cleanup
├── capabilities.rs   — capability dropping
├── seccomp.rs        — BPF filter generation
├── apparmor.rs       — AppArmor profile
├── selinux.rs        — SELinux labels
├── network.rs        — loopback setup
├── checkpoint.rs     — CRIU checkpoint/restore
└── io.rs             — FIFO stdio

Testing

There’s an integration test suite that runs 17 tests against a real containerd:

sudo make test

It covers basic execution, PID 1 verification, pipes (fork works in the namespace), cgroup paths, filesystem isolation, loopback networking, capabilities, uid/gid, seccomp, long-running containers with kill, and exec into running containers.

What I learned

Building a container runtime taught me more about Linux than any other project I’ve done. Namespaces, cgroups, mount propagation, pivot_root, capabilities, seccomp BPF — these are things I’d read about but never implemented from scratch. The biggest takeaway is that containers aren’t magic. They’re a handful of syscalls, some careful ordering, and a lot of error handling.

The other thing I learned is that the OCI spec is simultaneously very detailed and full of edge cases. Fields that are “optional” in the spec might be required by containerd. Limits of zero might mean “no limit” or “actually zero.” The spec tells you what to do but not always in what order.

If you want to try it: