A microVM Small Enough to Read

Modern sandboxing arranges itself, if one steps back a few paces, on a rough spectrum. At one end sits the container, a cleverly fenced process that remains, under the fences, a process. At the other sits the virtual machine — a small second computer, with its own kernel, its own interrupts, its own illusion of having the hardware to itself. Between the two, engineering has inserted a pair of curiosities that belong to neither kind: gVisor, which is a kernel in userspace and intercepts its guest’s syscalls from above, and Firecracker, which is a virtual machine stripped of every convenience that a virtual machine ordinarily supplies. The two are, in a sense, responses to the same complaint — that containers are weak and virtual machines are heavy — approached from opposite directions and shaking hands in the middle.

I spent a few weeks at the gVisor end of this spectrum. The earlier writings on this blog — mini-sentry, the signals piece, the tour of gVisor’s own code — are records of what I found there. This post is a record of what happened when I walked, not very patiently, to the other end, and tried to build the small version of the thing that lives there.

The result is mini-firecracker. About 5,300 lines of Go, one external dependency (golang.org/x/sys), Linux and KVM and x86_64 only. It boots a real Linux kernel, serves it a virtio-blk block device backed by a file, serves it a virtio-net interface backed by a tap, gives it a serial console the guest’s shell writes to, and — the piece of Firecracker that made Lambda economical — dumps the running machine to disk and restores it into a fresh process in under three hundred milliseconds.

The headline demonstration, cribbed from the README, is a snapshot taken part of the way through the boot and resumed elsewhere:

```bash
# Take a snapshot 800 ms into the boot.
$ ./mini-fc run \
      --kernel testdata/vmlinux-5.10.245 --mem 512 \
      --drive path=ubuntu-24.04.squashfs,ro \
      --net   tap=minifc0 \
      --cmdline "console=ttyS0 reboot=k panic=1 root=/dev/vda rootfstype=squashfs ro init=/bin/sh" \
      --snapshot-after 800ms --snapshot-to /tmp/snap
[mini-fc] snapshot written to /tmp/snap

# Restore and interact.
$ (sleep 1.5; printf '
    ip link set eth0 up
    ip addr add 172.16.0.2/30 dev eth0
    ping -c 3 172.16.0.1
    exit
  ') | ./mini-fc restore --from /tmp/snap
[mini-fc] restored from /tmp/snap in 262 ms
# ping -c 3 172.16.0.1
PING 172.16.0.1 (172.16.0.1) 56(84) bytes of data.
64 bytes from 172.16.0.1: icmp_seq=1 ttl=64 time=0.264 ms
3 packets transmitted, 3 received, 0% packet loss
```

That is an entire virtual machine’s lifecycle — booted, shelled into, networked, paused, serialised to disk, resumed in another process, continued — running through about five kilobytes of structured state and a 512 MiB memory blob. Having built this on one side of the sandboxing spectrum and mini-sentry on the other, I now have the curious sensation of having laid down two small flagstones at opposite ends of a long path and found, to my surprise, that the path is shorter than it looked.

Most of the VMM Is a Switch Statement

The body of the VMM, once one has granted oneself a few conveniences, is a handful of owned things and a loop.

Each VM is one Go process. It owns one file descriptor into /dev/kvm, one VM-level descriptor derived from that, an in-kernel interrupt chip and a programmable interval timer, a single anonymous mmap that serves as the guest’s RAM and is registered as KVM memory slot zero, and one vCPU driven by a goroutine that runtime.LockOSThread pins to a single OS thread (KVM being insistently a per-thread interface). It also owns a minimal 16550A UART stub at the historic port 0x3f8, which the kernel will speak to the moment it is alive, and a collection of virtio-MMIO device slots at four-kilobyte intervals starting at 0xd0000000.

The hot path, once all of that has been arranged, is a single-threaded loop around KVM_RUN with a switch on the reason the guest stopped:

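```go
// Run drives the single vCPU until the guest halts or an error surfaces.
func (v *VMM) Run() error {
	// KVM expects a vCPU's ioctls to come from one thread, so keep this
	// goroutine on the OS thread it started on.
	runtime.LockOSThread()
	for {
		if err := v.vcpu.Run(); err != nil {
			return err
		}
		switch v.vcpu.ExitReason() {
		case kvm.ExitIO: // port I/O: serial stub, reboot-port detector
			v.handleIO()
		case kvm.ExitMMIO: // MMIO: whichever virtio transport owns the address
			v.handleMMIO()
		case kvm.ExitHLT:
			return nil
		case kvm.ExitShutdown:
			return errShutdown
		case kvm.ExitSystemEvent:
			// ... (handling elided here)
			// ... (further exit reasons elided here)
		}
	}
}
```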

That switch is most of the VMM. KVM_EXIT_IO is dispatched to the serial stub, or to a small “is this a reboot port write?” detector. KVM_EXIT_MMIO is dispatched to whichever virtio transport happens to own the address the guest has just touched. Everything else either returns cleanly or surfaces as an error. Asynchronous work — the goroutine that reads frames from the tap into the guest’s RX queue, the one that fires at a preset time and pauses the machine so that it can be snapshotted — lives alongside and talks to the guest by pulsing the in-kernel interrupt chip with KVM_IRQ_LINE.
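The MMIO half of that dispatch reduces to a little address arithmetic. What follows is a minimal sketch rather than the project's own code: the function name slotFor and the numSlots parameter are invented for illustration, and only the base address and the four-kilobyte stride come from the description above.

```go
package main

import "fmt"

const (
	mmioBase   uint64 = 0xd0000000 // first virtio-MMIO slot, per the post
	mmioStride uint64 = 0x1000     // one slot every 4 KiB
)

// slotFor maps a guest-physical address to a slot index and an offset
// inside that slot. ok is false when no slot owns the address.
// (Hypothetical helper; the real project may structure this differently.)
func slotFor(addr uint64, numSlots int) (slot int, offset uint64, ok bool) {
	if addr < mmioBase {
		return 0, 0, false
	}
	rel := addr - mmioBase
	if rel/mmioStride >= uint64(numSlots) {
		return 0, 0, false
	}
	return int(rel / mmioStride), rel % mmioStride, true
}

func main() {
	// Offset 0x050 is the QueueNotify register in the virtio-MMIO layout.
	slot, off, ok := slotFor(0xd0001050, 2)
	fmt.Printf("slot=%d offset=%#x ok=%v\n", slot, off, ok)
}
```

An address that no slot claims would, in a design like this, surface as an error rather than being silently ignored.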

I had expected, before writing any of this, that the VMM would be a thing of considerable mass. It is not. It is a switch statement and a handful of friends, and the body of it is the one file.

A Kernel Boots Into Apparatus I Had Just Finished Writing

Booting a real Linux kernel into a machine one has just built, without the supplied conveniences of a BIOS, is considerably less magical than it sounds and considerably more so than it ought to be.

Firecracker — and mini-firecracker behind it — uses PVH, the Xen-derived fast-boot protocol that permits a hypervisor to skip every byte of firmware and deposit the kernel directly at a 32-bit protected-mode entry point with a single pointer in EBX. The kernel’s ELF image carries, among its notes, an XEN_ELFNOTE_PHYS32_ENTRY that names the address it wishes to resume from. One loads the program segments into guest memory at their declared p_paddrs, plants a small hvm_start_info struct at a known location and fills it with the cmdline and a memory map, sets up flat thirty-two-bit segments with a stub GDT, puts the start-info pointer in EBX, and calls KVM_RUN. Every subsequent transition — long mode, page tables, interrupt setup, the whole edifice of x86 housekeeping — the kernel arranges on its own behalf.
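The note lookup is the least obvious step in that list, so here is a hedged sketch of it. It assumes the caller has already extracted the raw bytes of a little-endian ELF note segment; the function name findPVHEntry is illustrative and not taken from the project. The note type value 18 is XEN_ELFNOTE_PHYS32_ENTRY from the Xen public headers.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const xenElfNotePhys32Entry = 18 // XEN_ELFNOTE_PHYS32_ENTRY

// align4 rounds n up to the 4-byte boundary that ELF notes are padded to.
func align4(n uint32) uint64 { return (uint64(n) + 3) &^ 3 }

// findPVHEntry scans the raw bytes of a little-endian ELF note segment for
// the Xen PHYS32_ENTRY note and returns the 32-bit entry address it carries.
func findPVHEntry(notes []byte) (uint32, error) {
	for len(notes) >= 12 {
		namesz := binary.LittleEndian.Uint32(notes[0:4])
		descsz := binary.LittleEndian.Uint32(notes[4:8])
		typ := binary.LittleEndian.Uint32(notes[8:12])

		nameEnd := 12 + align4(namesz)
		descEnd := nameEnd + align4(descsz)
		if descEnd > uint64(len(notes)) {
			return 0, errors.New("truncated ELF note")
		}
		name := notes[12 : 12+uint64(namesz)]
		desc := notes[nameEnd : nameEnd+uint64(descsz)]

		// The name is NUL-terminated; the descriptor may be 4 or 8 bytes
		// wide, and only the low 32 bits are read here.
		if string(name) == "Xen\x00" && typ == xenElfNotePhys32Entry && len(desc) >= 4 {
			return binary.LittleEndian.Uint32(desc[:4]), nil
		}
		notes = notes[descEnd:]
	}
	return 0, errors.New("no PHYS32_ENTRY note found")
}

func main() {
	// A hand-built note: name "Xen", type 18, 8-byte descriptor 0x01000000.
	notes := []byte{
		4, 0, 0, 0, 8, 0, 0, 0, 18, 0, 0, 0,
		'X', 'e', 'n', 0,
		0x00, 0x00, 0x00, 0x01, 0, 0, 0, 0,
	}
	entry, err := findPVHEntry(notes)
	fmt.Printf("entry=%#x err=%v\n", entry, err)
}
```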

The package that does this is about six hundred lines, including the ELF parser and the start-info builder. The first kernel banner that emerged from my stubbed-up serial port was, frankly, unearned. Nothing I had written teaches the kernel anything about itself; the apparatus merely lays a table, and the kernel arrives and eats.

virtio, Stripped to Sixteen Bytes and a Kick

The guest, now booted into long mode and printing to a serial port, must be persuaded to see a disk and a network. This is what virtio is for.

virtio is a device-emulation contract between hypervisor and guest, organised around virtqueues in shared memory; each split virtqueue is a descriptor table plus two rings. The guest places buffer descriptors in the available ring; the device takes them, does its work, writes results, and places completed descriptors in the used ring. Both sides notify each other: the guest by writing the queue’s index to a doorbell register, the device by raising an interrupt on a configured IRQ line. This is the whole of the mechanism. There is no driver-specific negotiation more exotic than feature bits, no framing beyond the descriptor’s len field, no locking that is not already implicit in the queue indices.
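To make the descriptor mechanics concrete, here is a small sketch of decoding sixteen-byte split-virtqueue descriptors and following a chain. It operates on a plain byte slice standing in for the descriptor table in guest memory; the type and function names (desc, putDesc, walkChain) are invented for this illustration, while the field layout and flag values follow the virtio specification.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	descSize   = 16 // bytes per split-virtqueue descriptor
	descFNext  = 1  // VIRTQ_DESC_F_NEXT: the Next field is valid
	descFWrite = 2  // VIRTQ_DESC_F_WRITE: device writes into this buffer
)

// desc mirrors the on-wire layout: addr u64, len u32, flags u16, next u16.
type desc struct {
	Addr  uint64
	Len   uint32
	Flags uint16
	Next  uint16
}

// putDesc encodes d into slot i of the table (handy for tests and demos).
func putDesc(table []byte, i uint16, d desc) {
	b := table[int(i)*descSize:]
	binary.LittleEndian.PutUint64(b[0:8], d.Addr)
	binary.LittleEndian.PutUint32(b[8:12], d.Len)
	binary.LittleEndian.PutUint16(b[12:14], d.Flags)
	binary.LittleEndian.PutUint16(b[14:16], d.Next)
}

// walkChain follows Next pointers from head and returns the chain in order.
// It refuses chains longer than the table, which can only mean a loop.
func walkChain(table []byte, head uint16) ([]desc, error) {
	limit := len(table) / descSize
	var chain []desc
	for i := head; ; {
		if len(chain) >= limit {
			return nil, errors.New("descriptor chain loops")
		}
		off := int(i) * descSize
		if off+descSize > len(table) {
			return nil, errors.New("descriptor index out of range")
		}
		b := table[off : off+descSize]
		d := desc{
			Addr:  binary.LittleEndian.Uint64(b[0:8]),
			Len:   binary.LittleEndian.Uint32(b[8:12]),
			Flags: binary.LittleEndian.Uint16(b[12:14]),
			Next:  binary.LittleEndian.Uint16(b[14:16]),
		}
		chain = append(chain, d)
		if d.Flags&descFNext == 0 {
			return chain, nil
		}
		i = d.Next
	}
}

func main() {
	table := make([]byte, 4*descSize)
	putDesc(table, 0, desc{Addr: 0x1000, Len: 16, Flags: descFNext, Next: 2})
	putDesc(table, 2, desc{Addr: 0x2000, Len: 512, Flags: descFWrite})
	chain, err := walkChain(table, 0)
	fmt.Println(len(chain), err)
}
```

The loop guard matters because the table lives in guest memory, and a guest is free to write nonsense into it.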

mini-firecracker implements the modern virtio-MMIO transport (version 2), the split virtqueue, and two device drivers on top of the shared plumbing. The first of these is virtio-blk, backed by a host file — the guest sees /dev/vda, the host sees whatever squashfs or raw image one handed it, and the translation between the two is a few dozen lines of descriptor parsing and pread/pwrite. A fragment of cmdline, virtio_mmio.device=4K@0xd0000000:5, which mini-fc auto-appends per --drive, tells the kernel where to find the device on the bus; one sets root=/dev/vda and adds init=/bin/sh and the guest, having mounted its root and finished its boot, drops the operator at a shell prompt on the same serial port the kernel was just writing to.
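Generating those cmdline fragments is a one-liner per device. The sketch below assumes, purely for illustration, that each further device takes the next four-kilobyte slot and the next IRQ number; the post confirms only the first fragment, and virtioCmdline is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// virtioCmdline builds one virtio_mmio.device= fragment per device.
// Assumption: slot i sits at base + i*4 KiB and uses IRQ firstIRQ + i.
func virtioCmdline(base uint64, firstIRQ, n int) string {
	parts := make([]string, 0, n)
	for i := 0; i < n; i++ {
		addr := base + uint64(i)*0x1000
		parts = append(parts, fmt.Sprintf("virtio_mmio.device=4K@0x%x:%d", addr, firstIRQ+i))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(virtioCmdline(0xd0000000, 5, 2))
}
```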

The second driver is virtio-net, which is structurally identical to virtio-blk — the same transport, the same ring mechanics — and differs only in its having two queues, for receive and transmit, and in being backed by a Linux tap device rather than a file. The transmit path is synchronous: when the guest kicks the TX queue, mini-fc walks the chain of buffers and writes the Ethernet frames out to the tap file descriptor. The receive path is asynchronous, for the good reason that the host kernel is the one deciding when a frame is due; a goroutine blocks on the tap fd, and when a frame arrives it pulls a buffer from the guest’s RX queue, prepends the twelve-byte virtio_net_hdr, copies the frame in, and pulses the shared IRQ to tell the guest there is work. With a tap pre-created on the host, the guest comes up with an eth0 that answers pings at sub-millisecond latency in both directions.
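The header-prepending step on the receive path can be sketched in a few lines. This is an illustration under stated assumptions rather than the project's code: it assumes no offloads were negotiated, so every header field except num_buffers stays zero, and that the frame fits in a single guest buffer. The name rxBuffer is invented.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// netHdrLen is the size of the modern virtio_net_hdr: flags(1) gso_type(1)
// hdr_len(2) gso_size(2) csum_start(2) csum_offset(2) num_buffers(2).
const netHdrLen = 12

// rxBuffer returns the bytes to copy into a guest RX buffer for one frame.
// Assumption: no checksum or segmentation offloads, one buffer per frame.
func rxBuffer(frame []byte) []byte {
	buf := make([]byte, netHdrLen+len(frame))
	binary.LittleEndian.PutUint16(buf[10:12], 1) // num_buffers = 1
	copy(buf[netHdrLen:], frame)
	return buf
}

func main() {
	frame := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff} // start of a broadcast frame
	buf := rxBuffer(frame)
	fmt.Printf("total=%d header=% x\n", len(buf), buf[:netHdrLen])
}
```

The length written to the used ring would then be the header plus the frame, not the frame alone.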

A Computer Dumped and Resumed

Snapshotting and restoring a running virtual machine is the piece of Firecracker whose economics make AWS Lambda make sense. A fleet of pre-booted interpreters, pre-configured to the customer’s exact image, can be thawed into running VMs faster than the request reaches them. To restore in the low tens of milliseconds is to render the concept of a cold start fiscally uninteresting.

mini-firecracker’s snapshot is a directory on disk:

manifest.json — kernel path, cmdline, memory size, drive and net bindings, the configuration needed to reconstruct the VMM’s outer shell.
state.json — the vCPU’s registers, the VM’s own state, each virtio device’s internal bookkeeping, with the larger opaque blobs base64-encoded so the JSON remains human-readable.
memory.bin — the raw bytes of the guest’s RAM.
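The base64 encoding of the opaque blobs falls out of Go's standard library almost for free, since encoding/json writes a []byte field as a base64 string. A minimal sketch follows; the struct and its field names are hypothetical and are not the project's actual state.json schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vcpuState is an illustrative fragment of what state.json might hold.
type vcpuState struct {
	RIP   uint64 `json:"rip"`
	XSave []byte `json:"xsave"` // encoding/json emits []byte as base64
}

// encodeState serialises the state to JSON.
func encodeState(s vcpuState) ([]byte, error) { return json.Marshal(s) }

// decodeState is the inverse; base64 strings become []byte again.
func decodeState(b []byte) (vcpuState, error) {
	var s vcpuState
	err := json.Unmarshal(b, &s)
	return s, err
}

func main() {
	out, err := encodeState(vcpuState{RIP: 0x1000000, XSave: []byte{1, 2, 3}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	back, err := decodeState(out)
	fmt.Println(back.RIP, back.XSave, err)
}
```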

Restore is the reverse of every step that produced a snapshot. One mmaps a fresh memory region and copies the RAM in, replays every KVM_SET_* ioctl in the strict dependency order the kernel requires (clock, then IRQ chips, then the PIT, then CPUID, then SREGS, then REGS, then XSAVE, then XCRS, then FPU, then LAPIC, then MP state, then events, then MSRs), and calls KVM_RUN once more. The whole snapshot pipeline including device serialisation is something under eight hundred lines. Restore lands in sixty-five to seventy-seven milliseconds for a 128 MiB boot-only snapshot, and two hundred and sixty to two hundred and ninety milliseconds for the 512 MiB shell-with-networking case. The larger number is dominated by the memory.bin copy, which an mmap(MAP_PRIVATE, fd) of the snapshot file would skip entirely; that is a future optimisation I have not yet troubled to write.
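For the curious, the MAP_PRIVATE variant might look something like the sketch below. It is not code from the project: mapSnapshot and demo are invented names, the standard library's syscall package stands in for golang.org/x/sys, and the step of registering the mapping with KVM as a memory slot is left out entirely.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// mapSnapshot maps a memory image copy-on-write: pages are read from the
// file on first touch, and guest writes land in private anonymous copies.
func mapSnapshot(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping stays valid after the descriptor closes
	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	return syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE)
}

// demo writes a tiny stand-in for memory.bin, maps it, scribbles on the
// mapping, and checks that the file on disk is untouched.
func demo() error {
	f, err := os.CreateTemp("", "memory-*.bin")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	if _, err := f.Write([]byte("guest RAM")); err != nil {
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	mem, err := mapSnapshot(f.Name())
	if err != nil {
		return err
	}
	defer syscall.Munmap(mem)

	mem[0] = 'G' // a "guest write": triggers copy-on-write for this page
	onDisk, err := os.ReadFile(f.Name())
	if err != nil {
		return err
	}
	if string(onDisk) != "guest RAM" || string(mem) != "Guest RAM" {
		return errors.New("copy-on-write did not behave as expected")
	}
	return nil
}

func main() {
	fmt.Println("demo error:", demo())
}
```

The trade is that restore cost moves from one up-front copy to page faults spread over the guest's first touches of each page.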

I shall now narrate the three episodes, of the many that presented themselves, in which the thing declined to do what I had been expecting it to do.

A Kernel Whose Binary Does Not Know About Command Lines

The kernel cmdline fragment virtio_mmio.device=<size>@<base>:<irq> is the documented means by which Linux learns about an MMIO device when neither Device Tree nor ACPI is available. mini-firecracker generates these fragments automatically, and on the firecracker-CI 5.10 kernel they work exactly as the documentation suggests.

They did not work on the firecracker-CI 6.1 kernel. I spent an embarrassing quantity of time hunting the problem through guest userspace, through the virtio dispatch code, through the cmdline I had been appending, and finally — having exhausted every hypothesis I could produce from the running system — through the kernel binary itself:

```bash
$ strings testdata/vmlinux-6.1.155 | grep -iE "Registering device virtio-mmio|virtio_mmio_cmdline"
$
```

Nothing at all. The cmdline-registration code path had been configured out at build time. Firecracker’s CI kernel is produced under the assumption that device discovery happens via generated ACPI tables on x86, and so the cmdline-registration symbols are stripped from the binary. mini-firecracker does not generate ACPI tables; mini-firecracker accordingly pins the 5.10 kernel, and will until some future version of me decides that writing a small ACPI-table emitter is the evening he has left to spend.

The lesson, which I had not quite absorbed before this, is that “documented Linux interface” and “compiled into every kernel” are entirely distinct propositions. The kernel is a menu; distributions and build systems order from the menu; one’s hypervisor must order from the same menu the guest did, or prepare to eat alone.

A Console That Would Not Deliver

The kernel, booted and speaking through the serial port, was the easy half. When the shell took over and I tried to type into it, an asymmetric failure declared itself. Typing was accepted; each character echoed back on the terminal. ls / produced a clean return to the prompt and nothing else. The kernel had announced, with its customary polite formality, “Run /bin/sh as init process”, and after that every byte of userspace output seemed to vanish into the apparatus I had built.

The shape of the asymmetry was the clue. Kernel printk reaches the UART via the polled write path: a sequence of direct outb instructions to the transmit register’s I/O port, each waiting for the previous byte to drain before emitting the next. Userspace write() calls arrive at the same UART through the tty layer, which uses the interrupt-driven transmit path: write one byte to the transmit holding register, set the transmitter-ready bit in the interrupt-enable register, wait for the interrupt that announces the byte has cleared, and only then emit the second byte. If the interrupt never fires, the tty driver is content to wait forever, the second byte never goes out, and the file descriptor looks, to a naive observer, to be in perfect working order.

My first-pass 16550A stub had no interrupt-identification register and no mechanism for raising IRQ 4 when the transmit holding register became empty. Some forty lines of additional state — tracking the interrupt-enable bits, firing IRQ 4 on transmit-enable and on each write to the holding register, modelling the interrupt-identification register with its priority scheme and its clear-on-read semantics — turned every tty write in the guest’s userspace into something the operator could see. A parallel bug presented itself in the same session: writes to port 0x3f8 while the line-control register’s DLAB bit was set were being routed to the operator’s terminal, when they ought to have been routed to the baud-rate divisor latches. The consequence had been that the first byte of every shell session arrived as a control character, \x0c, until a single check on the DLAB bit ended the exhibition.
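Both fixes fit in a model small enough to show whole. The sketch below is an illustration of the technique, not the project's stub: transmission is treated as instantaneous, only the transmit-empty interrupt is modelled, and the raiseIRQ hook is a hypothetical stand-in for pulsing IRQ 4 through KVM_IRQ_LINE.

```go
package main

import "fmt"

const (
	regTHR = 0 // transmit holding register; divisor latch low when DLAB is set
	regIER = 1 // interrupt enable; divisor latch high when DLAB is set
	regIIR = 2 // interrupt identification (read side)
	regLCR = 3 // line control; bit 7 is DLAB
	regLSR = 5 // line status

	ierTHRE = 0x02 // enable "transmit holding register empty" interrupt
	lcrDLAB = 0x80
	iirNone = 0x01 // no interrupt pending
	iirTHRE = 0x02 // THR-empty interrupt pending
	lsrIdle = 0x60 // THR empty and transmitter idle
)

type uart struct {
	ier, lcr, dll, dlm byte
	threPending        bool
	out                []byte // bytes delivered to the operator
	raiseIRQ           func() // hypothetical hook: pulse IRQ 4
}

// signalTHRE latches and raises the transmit-empty interrupt if enabled.
func (u *uart) signalTHRE() {
	if u.ier&ierTHRE == 0 {
		return
	}
	u.threPending = true
	if u.raiseIRQ != nil {
		u.raiseIRQ()
	}
}

func (u *uart) write(reg uint16, v byte) {
	dlab := u.lcr&lcrDLAB != 0
	switch {
	case reg == regTHR && dlab:
		u.dll = v // baud divisor, not console output
	case reg == regIER && dlab:
		u.dlm = v
	case reg == regTHR:
		u.out = append(u.out, v)
		u.signalTHRE() // the byte "drains" at once, so THR is empty again
	case reg == regIER:
		u.ier = v & 0x0f
		u.signalTHRE() // enabling THRE while THR is empty fires immediately
	case reg == regLCR:
		u.lcr = v
	}
}

func (u *uart) read(reg uint16) byte {
	dlab := u.lcr&lcrDLAB != 0
	switch {
	case reg == regTHR && dlab:
		return u.dll
	case reg == regIER && dlab:
		return u.dlm
	case reg == regIER:
		return u.ier
	case reg == regIIR:
		if u.threPending {
			u.threPending = false // reading IIR clears the THRE condition
			return iirTHRE
		}
		return iirNone
	case reg == regLCR:
		return u.lcr
	case reg == regLSR:
		return lsrIdle
	}
	return 0
}

func main() {
	irqs := 0
	u := &uart{raiseIRQ: func() { irqs++ }}
	u.write(regLCR, lcrDLAB)
	u.write(regTHR, 0x0c) // divisor latch, must not reach the console
	u.write(regLCR, 0x03)
	u.write(regIER, ierTHRE)
	u.write(regTHR, 'h')
	fmt.Printf("out=%q irqs=%d iir=%#x\n", u.out, irqs, u.read(regIIR))
}
```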

The general principle I should like to record here is that a kernel’s satisfaction with its peripheral is not evidence of the peripheral’s correctness. The kernel has polled pathways for its own use and interrupt pathways for everyone else’s; a stub that implements only the former is a stub that supports only the kernel, and will deceive the operator into believing that the guest’s userspace has gone mute.

It Is Always the MSRs

The first snapshot-and-restore round-trip worked. This ought to have been suspicious, but I took it as a gift. The guest resumed, the restore reported seventy-seven milliseconds, and five lines of perfectly correct continuation arrived on the console — and then, with no hesitation, this:

```text
[mini-fc] restored from /tmp/snap1 in 77.258217ms
[    0.472834] i8042: Can't read CTR while initializing i8042
[    0.480069] Segment Routing with IPv6
[    0.482567] bpfilter: Loaded bpfilter_umh pid 148
[    0.483268] traps: PANIC: double fault, error_code: 0x0
[    0.483287] RIP: 0000:0x0
```

RIP at address zero is the tell, and it names the shape of the mistake before one has finished reading it. On x86-64, the SYSCALL instruction reads the kernel-side syscall entry point out of the model-specific register MSR_LSTAR, which lives at 0xC0000082. If MSR_LSTAR is zero at the moment SYSCALL executes, every syscall jumps to RIP equals zero and the CPU double-faults shortly thereafter, as the log above records. The bpfilter kernel module, whose job it is to spawn a small userspace helper, had forked the helper; the helper had made its first syscall; the CPU had done its honest best to find the entry point at the address the MSR pointed to, and found nothing.

I had saved, between the vCPU and the VM, every general-purpose register and every segment, every chip’s state in sufficient detail to resume its interrupts, and the contents of every memory page. I had not saved any of the model-specific registers. The remedy is KVM_GET_MSR_INDEX_LIST, which asks the kernel for the canonical list of MSRs that round-trip safely, together with KVM_GET_MSRS and KVM_SET_MSRS to bulk-read and bulk-write them. The addition is some eighty lines. After it, every snapshot point I tested round-tripped without complaint.
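The ioctls themselves need a live /dev/kvm, but the buffer they exchange can be shown on its own. This sketch lays out struct kvm_msrs as the KVM headers define it (a count, four bytes of padding, then sixteen-byte entries of index, reserved word and value) for a little-endian host such as x86_64. The helper names are invented, and the ioctl calls that would sit between encoding and decoding are omitted.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const msrLSTAR = 0xC0000082 // syscall entry point on x86-64

// encodeMSRs builds a struct kvm_msrs request for the given indices:
//   u32 nmsrs; u32 pad; then per entry: u32 index; u32 reserved; u64 data.
func encodeMSRs(indices []uint32) []byte {
	buf := make([]byte, 8+16*len(indices))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(indices)))
	for i, idx := range indices {
		binary.LittleEndian.PutUint32(buf[8+16*i:], idx) // data stays zero
	}
	return buf
}

// decodeMSRs reads the values back out after the kernel has filled them in.
func decodeMSRs(buf []byte) (map[uint32]uint64, error) {
	if len(buf) < 8 {
		return nil, errors.New("buffer shorter than kvm_msrs header")
	}
	n := uint64(binary.LittleEndian.Uint32(buf[0:4]))
	if uint64(len(buf)) < 8+16*n {
		return nil, errors.New("buffer shorter than declared entry count")
	}
	vals := make(map[uint32]uint64, n)
	for i := uint64(0); i < n; i++ {
		e := buf[8+16*i : 8+16*(i+1)]
		vals[binary.LittleEndian.Uint32(e[0:4])] = binary.LittleEndian.Uint64(e[8:16])
	}
	return vals, nil
}

func main() {
	buf := encodeMSRs([]uint32{msrLSTAR})
	// Stand-in for KVM_GET_MSRS: pretend the kernel filled in a value.
	binary.LittleEndian.PutUint64(buf[16:24], 0xffffffff81000000)
	vals, err := decodeMSRs(buf)
	fmt.Printf("LSTAR=%#x err=%v\n", vals[msrLSTAR], err)
}
```

In a real snapshot path the index list would come from KVM_GET_MSR_INDEX_LIST rather than being hard-coded.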

There is a general principle that announced itself to me at that moment, with the loudness of a dropped tray, and that is this: when a kernel restores from saved state and immediately misbehaves at syscall entry, one forgot the MSRs. When it misbehaves in floating-point code, one forgot the XSAVE area. When its notion of wall-clock time jumps backwards, one forgot the kvmclock MSRs. Every half-resumed machine I have now watched misbehave has turned out to be a symptom of the same missing noun. It is, with a dreary reliability, always the MSRs.

An Accounting of Lines and Milliseconds

Two tables, in lieu of what would otherwise be a much longer paragraph.

The restore timings across the three workloads I routinely test against:

| Workload | Guest memory | Restore time |
| --- | --- | --- |
| Kernel boot only, no devices | 128 MiB | 65–77 ms |
| virtio-blk plus interactive shell | 512 MiB | 285 ms |
| virtio-blk plus virtio-net plus ping | 512 MiB | 262 ms |

And the line counts, by package, for the curious:

| Component | Lines of code |
| --- | --- |
| `pkg/kvm`: ioctl wrappers, state save/restore | ~1,100 |
| `pkg/boot`: ELF, PVH, initial vCPU registers | ~600 |
| `pkg/virtio`: transport, virtqueue, block, net | ~700 |
| `pkg/vmm`: `KVM_RUN` loop, serial, snapshot glue | ~1,100 |
| `pkg/snapshot`: on-disk format | ~250 |
| `pkg/tap`: `/dev/net/tun` wrapper | ~100 |
| Other: CLI, host check, docs, tests | ~1,400 |
| Total | ~5,300 |

A single external dependency, golang.org/x/sys. A statically linked binary of some four megabytes, which is, as these things go, closer to a negotiating position than a figure.

Afterwards

Two propositions have outlasted the code itself, and it is worth setting them down before the memory of the evenings that produced them has faded.

The first is that the conceptual layer of a microVM monitor is exceedingly thin. A working VMM is KVM_RUN inside a switch statement on the exit reason. virtio is a sixteen-byte descriptor format, two queue indices in shared memory, and a doorbell. A snapshot is sixteen or so KVM_GET_* ioctls, an mmap, and a list of model-specific registers one remembered to enumerate. I had expected a body of arcane knowledge; what I found was a short catalogue of contracts, each of which, once named, was not difficult to honour.

What makes Firecracker — or any production microVM monitor — is, I can now report with some confidence, not the microVM. It is everything that surrounds the microVM: the jailer that wraps the VMM in a seccomp profile and a chroot and a cgroup and writes down exactly what system calls the VMM process itself is permitted to make; the REST API that drives the control plane in a form the orchestrator expects; the CPU-template system that pins guest-visible CPUID leaves across host generations so that a fleet of Lambda workers can be live-migrated without the guests noticing they have moved from one generation of silicon to another; the rate limiters on I/O and network; the metrics; the careful accommodation of every CVE that has ever emerged from the KVM stack. The interesting part of a production microVM monitor is not the microVM. It is the production. mini-firecracker stops where the concept has been demonstrated, which is precisely where the easy part ends.

The second is what one gets out of building small versions of things one admires. Reading Firecracker’s upstream Rust is, on its own merits, pleasant; it is a clean codebase and one learns from it. But having built a smaller one has the curious effect of making the larger one legible in a way that reading it alone does not produce. I can now open firecracker/src/vmm/src/persist.rs and recognise, in the ordering of its calls, the shape of the double fault I encountered at midnight and its MSR-shaped remedy. The small version is not a replacement for the large one; it is a reader’s aid to it.

An Honest List of Absences

A small-minded completeness suggests it is worth recording, for any reader tempted to extend the project or mistake it for a production tool, the places where it declines to do what a production tool would. In rough order of how painful their absence is likely to be:

Snapshot-while-idle does not fire. The pause flag is examined at the next KVM_RUN boundary, and a HLT-ed guest stays in the kernel until an interrupt rouses it; the standard remedy is KVM_SET_SIGNAL_MASK paired with a tgkill at the vCPU’s OS thread.

Restore copies the memory from disk into a fresh mmap. The Lambda-grade trick is mmap(MAP_PRIVATE, fd) of the snapshot file itself: the first guest write copy-on-writes the affected page, every unread page stays on disk, and a 512 MiB restore drops from the high two hundreds of milliseconds to something in the low single digits.

No ACPI table emitter exists, so the firecracker-CI 6.1 kernel remains off-limits. The 5.10 kernel is pinned in the testdata.

No jailer, no REST API, no PCI, no GPU, no live migration. mini-firecracker stays small enough to read, which is the entire point.

The next thing on this axis I should like to build is a common bench harness — runc, runsc, mini-fc, and real Firecracker on the same host, on the same workloads, with the same instrumentation — now that I have spent a few weeks with each. The numbers, having been sat with, are likely to be more interesting to read.

The code is at github.com/mtclinton/mini-firecracker. Its sibling on the gVisor axis remains github.com/mtclinton/mini-sentry, and the earlier writeups of that project are linked at the top of this one.