mini-sentry - Building a Userspace Kernel in Go

There is a particular kind of understanding that only comes from building a thing yourself — not reading about it, not studying the source, but sitting down with an empty file and discovering, one segfault at a time, why the original was built the way it was. I had been circling gVisor for months, admiring its architecture from a safe distance, the way one might admire a cathedral without quite grasping the engineering of the flying buttresses. And so I did what seemed like the only honest thing: I built a small one of my own.

mini-sentry is not a container runtime. It is not a syscall filter. It is, if I may put it plainly, a userspace kernel — a program that sits between a sandboxed process and the Linux kernel, intercepts every system call, and handles them in Go. The sandboxed process believes it is speaking to Linux. It is, in fact, speaking to me.

The Appeal of an Interposed Kernel

gVisor is one of the more remarkable pieces of systems software in active production. Google runs it beneath App Engine, Cloud Run, and Cloud Functions, where it serves as a kind of diplomatic intermediary between untrusted workloads and the host kernel. The security proposition is elegant in its simplicity: if the sandboxed process discovers a kernel exploit, it detonates harmlessly against your Go code rather than the real kernel. The blast radius collapses to something manageable.

But gVisor is also two hundred thousand lines of Go — a codebase whose architecture reveals itself reluctantly, like a landscape glimpsed through fog. I wanted to understand the core ideas without the weight of production concerns: syscall interception, userspace handling, filesystem virtualization, process isolation. The smallest version that still captured the essential shape.

Architecture

The result has the same layered structure as its larger cousin, compressed to its fundamentals:

┌──────────────────────────────────────────┐
│  Guest Process (sandboxed)               │
│  Thinks it's talking to Linux            │
└─────────────────┬────────────────────────┘
                  │ syscall
┌─────────────────▼────────────────────────┐
│  Platform (ptrace or seccomp)            │
│  Intercepts syscalls, reads registers    │
└─────────────────┬────────────────────────┘
                  │ SyscallArgs
┌─────────────────▼────────────────────────┐
│  Sentry (Go handlers)                    │
│  The userspace kernel. Handles read,     │
│  write, openat, stat, socket, etc.       │
└─────────────────┬────────────────────────┘
                  │ RPC
┌─────────────────▼────────────────────────┐
│  Gofer (separate process)                │
│  Serves files over Unix socket.          │
│  Filesystem security boundary.           │
└──────────────────────────────────────────┘

The Platform layer knows nothing of meaning. It intercepts syscalls from the sandboxed process — using either PTRACE_SYSEMU or SECCOMP_RET_USER_NOTIF, depending on your appetite for performance — reads the register values, and passes them upward. It is, one might say, the ear that hears without comprehending.

The Sentry is the mind. It maintains a table-based syscall dispatch (much like gVisor’s own SyscallTable.Lookup()), an fd table, and handlers for roughly sixty system calls. When the guest issues read(fd, buf, count), the Sentry serves data from its virtual filesystem. When it calls getpid(), the Sentry returns 1. The real kernel never learns these conversations took place.
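To make the table-based dispatch concrete, here is a minimal sketch. The type and method names (SyscallArgs, Handler, NewSentry) are my own illustrative assumptions rather than mini-sentry's actual API, and the table holds a single entry where the real one holds roughly sixty. Unknown syscalls fall through to -ENOSYS, using the Linux value 38.

```go
package main

import "fmt"

// enosys is the Linux errno for "function not implemented".
const enosys = 38

// SyscallArgs is a hypothetical stand-in for what the Platform passes upward.
type SyscallArgs struct {
	Number uint64
	Args   [6]uint64
}

// Handler emulates one syscall and returns the value destined for RAX.
type Handler func(s *Sentry, a SyscallArgs) int64

// Sentry owns the dispatch table, keyed by syscall number.
type Sentry struct {
	table map[uint64]Handler
}

// NewSentry registers the handlers. Only getpid (39 on x86_64) is wired up here.
func NewSentry() *Sentry {
	s := &Sentry{table: make(map[uint64]Handler)}
	s.table[39] = func(*Sentry, SyscallArgs) int64 { return 1 }
	return s
}

// HandleSyscall looks up the handler and runs it; unknown calls get -ENOSYS.
func (s *Sentry) HandleSyscall(a SyscallArgs) int64 {
	h, ok := s.table[a.Number]
	if !ok {
		return -enosys
	}
	return h(s, a)
}

func main() {
	s := NewSentry()
	fmt.Println(s.HandleSyscall(SyscallArgs{Number: 39}))   // emulated getpid
	fmt.Println(s.HandleSyscall(SyscallArgs{Number: 9999})) // no handler registered
}
```

A map keeps the sketch short; a fixed-size slice indexed by syscall number would serve equally well and avoid hashing on the hot path.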

The Gofer is the boundary. A separate process that mediates all filesystem access, communicating with the Sentry over a Unix socket through a gob-encoded wire protocol. Even if an attacker compromises the Sentry entirely, the damage is contained — the Gofer serves only the files it has been told to serve. This mirrors gVisor’s LISAFS architecture, and it is, I am inclined to think, the most important security decision in the entire design.

The Ptrace Interception Loop

The ptrace platform reduces to a surprisingly tight loop, which I’ll reproduce here because it captures the essential mechanism more clearly than any description could:

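func (p *PtracePlatform) interceptLoop(pid int) (int, error) {
    var ws syscall.WaitStatus
    var regs unix.PtraceRegs
    for {
        // Resume child, stop at next syscall entry, SKIP the real syscall
        ptraceSysemu(pid, 0)

        // Wait for child to stop; if it exited instead, report its status
        if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil {
            return 0, err
        }
        if ws.Exited() {
            return ws.ExitStatus(), nil
        }

        // Read registers → syscall number + args
        if err := unix.PtraceGetRegs(pid, &regs); err != nil {
            return 0, err
        }
        sc := regsToSyscall(&regs)

        // Let the Sentry handle it (action tells the loop what to do next;
        // that handling is omitted from this excerpt)
        ret, action := p.sentry.HandleSyscall(pid, sc)
        _ = action

        // Write return value back into RAX
        setSyscallReturn(&regs, ret)
        if err := unix.PtraceSetRegs(pid, &regs); err != nil {
            return 0, err
        }
    }
}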

The critical detail is PTRACE_SYSEMU — constant 31, an unassuming number for such an extraordinary capability. Unlike PTRACE_SYSCALL, which stops the child on entry and exit while actually executing the syscall, SYSEMU stops only on entry and instructs the kernel to skip execution entirely. Whatever value you write into RAX becomes the result the child process sees. You are, in the most literal sense, the kernel now.
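The register-reading step in that loop leans on the x86_64 syscall convention: the number arrives in orig_rax, the six arguments in rdi, rsi, rdx, r10, r8, and r9, and the result goes back in rax. A sketch of what regsToSyscall and setSyscallReturn might look like follows. The Regs struct is a hypothetical cut-down mirror of unix.PtraceRegs so the example compiles anywhere, and the function bodies are my assumption about their shape, not the project's code.

```go
package main

import "fmt"

// Regs mirrors only the x86_64 fields of unix.PtraceRegs that this sketch needs.
type Regs struct {
	Orig_rax, Rax, Rdi, Rsi, Rdx, R10, R8, R9 uint64
}

// SyscallArgs is the platform-neutral form handed to the Sentry.
type SyscallArgs struct {
	Number uint64
	Args   [6]uint64
}

// regsToSyscall applies the x86_64 convention: number in orig_rax,
// arguments in rdi, rsi, rdx, r10, r8, r9 (in that order).
func regsToSyscall(r *Regs) SyscallArgs {
	return SyscallArgs{
		Number: r.Orig_rax,
		Args:   [6]uint64{r.Rdi, r.Rsi, r.Rdx, r.R10, r.R8, r.R9},
	}
}

// setSyscallReturn stores the result where the guest will read it.
// Negative values are errno codes in two's complement.
func setSyscallReturn(r *Regs, ret int64) {
	r.Rax = uint64(ret)
}

func main() {
	// A guest calling write(1, buf, 5); write is syscall 1 on x86_64.
	r := &Regs{Orig_rax: 1, Rdi: 1, Rsi: 0x7ffd0000, Rdx: 5}
	sc := regsToSyscall(r)
	fmt.Println(sc.Number, sc.Args[0], sc.Args[2])

	setSyscallReturn(r, -2) // -ENOENT, for illustration
	fmt.Printf("%#x\n", r.Rax)
}
```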

The Problem of Passthrough

The first version crashed immediately, which — if one has spent any time at all with systems programming — is the customary greeting.

The culprit was arch_prctl(SET_FS), the syscall responsible for configuring the FS register used by thread-local storage. This is kernel-managed state, and no amount of clever register manipulation from userspace can substitute for the real thing. The same proved true for mmap, brk, mprotect, and signal handling — a whole family of syscalls that must touch the actual kernel because they modify the very substrate on which your sandbox is running.

The solution was a hybrid approach: some syscalls get emulated by the Sentry, others pass through to the real kernel. But here is where PTRACE_SYSEMU reveals its peculiar stubbornness. When it stops you at syscall entry, the kernel has already advanced the instruction pointer past the syscall instruction and flagged the call as emulated. The skip is, as it were, sticky. You cannot simply resume and hope the kernel will oblige.

The workaround requires a small dance: rewind the instruction pointer by two bytes (the width of the syscall instruction on x86_64), restore RAX to the original syscall number — the kernel has already overwritten it with -ENOSYS during the SYSEMU stop — and resume with PTRACE_SYSCALL instead of SYSEMU. The child re-executes the instruction, the kernel runs it for real this time, and you receive a proper syscall-exit stop with the result waiting in RAX.

func (p *PtracePlatform) passthroughSyscall(pid int, regs *unix.PtraceRegs) error {
    var ws syscall.WaitStatus

    rewindSyscallInstruction(regs) // RIP -= 2 on amd64
    restoreSyscallNumber(regs)     // RAX = orig_rax
    if err := unix.PtraceSetRegs(pid, regs); err != nil {
        return err
    }

    // Entry stop (real this time)
    if err := syscall.PtraceSyscall(pid, 0); err != nil {
        return err
    }
    if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil {
        return err
    }

    // Exit stop: kernel ran the syscall, result is in RAX
    if err := syscall.PtraceSyscall(pid, 0); err != nil {
        return err
    }
    _, err := syscall.Wait4(pid, &ws, 0, nil)
    return err
}

This is the same technique gVisor’s ptrace platform uses for syscalls it delegates to the host kernel, and watching yourself arrive at it independently — after the segfault, after the confused staring, after the late-night reading of kernel source — is one of those small satisfactions that makes systems work worthwhile.
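For the curious, the two register helpers in the passthrough routine are tiny. The sketch below shows one plausible shape for them, again against a hypothetical cut-down Regs struct rather than the real unix.PtraceRegs, so it can be run without a traced child. The starting state imitates a SYSEMU entry stop for arch_prctl (syscall 158 on x86_64).

```go
package main

import "fmt"

// Regs holds only the fields touched by the rewind; a stand-in for unix.PtraceRegs.
type Regs struct {
	Orig_rax, Rax, Rip uint64
}

// syscallInsnLen is the size of the x86_64 `syscall` instruction (bytes 0F 05).
const syscallInsnLen = 2

// rewindSyscallInstruction points RIP back at the syscall instruction
// so the child executes it again when resumed.
func rewindSyscallInstruction(r *Regs) { r.Rip -= syscallInsnLen }

// restoreSyscallNumber undoes the -ENOSYS the kernel left in RAX at the entry stop.
func restoreSyscallNumber(r *Regs) { r.Rax = r.Orig_rax }

func main() {
	enosys := int64(-38)
	// State as seen at a SYSEMU entry stop: RIP already past the instruction.
	r := &Regs{Orig_rax: 158, Rax: uint64(enosys), Rip: 0x401002}

	rewindSyscallInstruction(r)
	restoreSyscallNumber(r)
	fmt.Printf("rip=%#x rax=%d\n", r.Rip, r.Rax)
}
```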

Seccomp, or the Thirty-Seven-Fold Improvement

The ptrace platform suffers from an inherent limitation: every syscall, even the ones the kernel handles perfectly well on its own, requires a round-trip context switch to the tracer process. In a tight getpid() benchmark, ptrace manages about 17,000 calls per second. Respectable, perhaps, but not what one would call brisk.

The seccomp platform addresses this with a BPF filter that sorts syscalls at the kernel level, before they ever reach userspace. Calls the Sentry needs to handle — read, write, openat, and their kin — receive SECCOMP_RET_USER_NOTIF, which freezes the child and notifies the Sentry. Everything else — mmap, brk, futex, the whole bureaucratic machinery of memory management — receives SECCOMP_RET_ALLOW and proceeds directly through the kernel. The Sentry never stirs.
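The sorting itself is a short classic-BPF program. What follows is a sketch of how such a filter could be assembled in Go, not the filter mini-sentry ships. The instruction struct matches the kernel's struct sock_filter and the opcode and return constants are the real kernel values, but buildFilter and the small run interpreter are hypothetical helpers I have added so the result can be checked without installing anything. A production filter should also verify the arch field of seccomp_data before trusting the syscall number; that check is left out here for brevity.

```go
package main

import "fmt"

// sockFilter has the same layout as the kernel's struct sock_filter.
type sockFilter struct {
	Code   uint16
	Jt, Jf uint8
	K      uint32
}

const (
	bpfLdWAbs = 0x20 // BPF_LD | BPF_W | BPF_ABS
	bpfJeqK   = 0x15 // BPF_JMP | BPF_JEQ | BPF_K
	bpfRetK   = 0x06 // BPF_RET | BPF_K

	retAllow     = 0x7fff0000 // SECCOMP_RET_ALLOW
	retUserNotif = 0x7fc00000 // SECCOMP_RET_USER_NOTIF

	offNr = 0 // offset of nr within struct seccomp_data
)

// buildFilter returns a program that yields USER_NOTIF for the listed
// syscall numbers and ALLOW for everything else. Jump offsets are 8-bit,
// so this simple layout supports at most 255 entries.
func buildFilter(notify []uint32) []sockFilter {
	n := len(notify)
	prog := []sockFilter{{Code: bpfLdWAbs, K: offNr}}
	for i, nr := range notify {
		// On a match, skip the remaining comparisons and the ALLOW return.
		prog = append(prog, sockFilter{Code: bpfJeqK, Jt: uint8(n - i), K: nr})
	}
	prog = append(prog,
		sockFilter{Code: bpfRetK, K: retAllow},
		sockFilter{Code: bpfRetK, K: retUserNotif},
	)
	return prog
}

// run interprets the three opcodes used above, for testing only.
func run(prog []sockFilter, nr uint32) uint32 {
	var acc uint32
	for pc := 0; pc < len(prog); pc++ {
		in := prog[pc]
		switch in.Code {
		case bpfLdWAbs:
			acc = nr // this sketch only ever loads offset 0
		case bpfJeqK:
			if acc == in.K {
				pc += int(in.Jt)
			} else {
				pc += int(in.Jf)
			}
		case bpfRetK:
			return in.K
		}
	}
	return 0
}

func main() {
	// read=0, write=1, openat=257 on x86_64 go to the Sentry.
	prog := buildFilter([]uint32{0, 1, 257})
	fmt.Printf("write: %#x\n", run(prog, 1))
	fmt.Printf("mmap:  %#x\n", run(prog, 9))
}
```

A linear chain of comparisons is the simplest layout; with sixty-odd entries a binary search over sorted syscall numbers would shorten the worst-case path.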

The bootstrapping, I should note, is its own small adventure. The seccomp() system call must be issued from within the process the filter applies to, but the notification file descriptor needs to end up in the Sentry. The solution is a re-exec dance: the parent fork-and-execs /proc/self/exe with an environment variable flag, the child installs the filter, sends the listener fd back over a Unix socket via SCM_RIGHTS, and then calls execve on the real target program. It is the kind of choreography that looks improbable on paper but works beautifully in practice.
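The fd-passing step is the least familiar part of that dance, so here is a minimal sketch of it using Go's syscall package. It is an illustration under assumptions: sendFD, recvFD, and demo are names I made up, and for the sake of a runnable example both ends of the socketpair live in one process and the descriptor being passed is a pipe rather than a seccomp listener. In the real flow the sender would be the re-exec'd child and the receiver the Sentry.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// sendFD passes fd over the Unix socket as SCM_RIGHTS ancillary data.
// One byte of ordinary payload carries the control message.
func sendFD(sock, fd int) error {
	return syscall.Sendmsg(sock, []byte{0}, syscall.UnixRights(fd), nil, 0)
}

// recvFD receives exactly one descriptor sent by sendFD.
func recvFD(sock int) (int, error) {
	payload := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := syscall.Recvmsg(sock, payload, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, fmt.Errorf("expected 1 control message, got %d", len(msgs))
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, fmt.Errorf("expected 1 fd, got %d", len(fds))
	}
	return fds[0], nil
}

// demo sends the write end of a pipe across a socketpair, writes through
// the received copy, and returns what arrives at the read end.
func demo() (string, error) {
	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	if err := sendFD(pair[0], int(w.Fd())); err != nil {
		return "", err
	}
	fd, err := recvFD(pair[1])
	if err != nil {
		return "", err
	}
	received := os.NewFile(uintptr(fd), "received")
	defer received.Close()

	if _, err := received.Write([]byte("hello")); err != nil {
		return "", err
	}
	buf := make([]byte, 5)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, "demo failed:", err)
		os.Exit(1)
	}
	fmt.Println(got)
}
```

The received descriptor is a new number referring to the same open file, which is exactly what lets the Sentry hold the listener while the child goes on to execve.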

The results are striking: 640,000 getpid() calls per second under seccomp, against 17,000 under ptrace. A roughly thirty-seven-fold improvement. For passthrough syscalls, the Sentry process sleeps undisturbed while the kernel does its ordinary work. This is, not coincidentally, the same fundamental reason gVisor moved from ptrace to its systrap platform, which is built on SECCOMP_RET_TRAP and shared memory.

The Gofer as Security Boundary

The Gofer deserves its own consideration because it embodies a principle that is easy to state and surprisingly hard to internalize: the process that handles untrusted input should not be the process with access to sensitive resources.

The setup follows a familiar pattern. The parent creates a Unix socketpair, fork-and-execs itself with MINI_SENTRY_GOFER=1, and the child enters a request-response loop, serving files from either an in-memory store or a host directory specified by --gofer-root. The Sentry sends RPC requests — open, read, stat, list — and the Gofer responds with file data or errors. The wire protocol is length-prefixed gob encoding over the socket, which has the virtue of being both simple and difficult to exploit.

The Gofer also handles the security details one learns to worry about only after they’ve been exploited: --gofer-deny for blocking specific paths, and symlink resolution on both sides via filepath.EvalSymlinks to prevent the classic escape where a malicious symlink points outside the served directory.

Network Virtualization

The sandbox extends its reach to the network layer by intercepting socket(), connect(), sendto(), recvfrom(), and the sockopt family. When the guest calls connect(fd, addr, len), the Sentry parses the sockaddr_in, checks the destination against a configurable policy (--net-allow and --net-deny) and, if permitted, performs a real net.Dial from the Sentry process itself. The guest receives a virtual file descriptor backed by a net.Conn that the Sentry owns. The host kernel sees only a connection made by the Sentry, never a socket call from the guest, which is rather the point.

Testing the Assumptions

A project like this lives or dies by the quality of its tests, because the failure modes are subtle and the consequences of missing one are, at minimum, a sandbox escape. The test suite approaches the problem from several angles: a Go guest program that exercises sandbox identity and VFS operations, static C binaries — echo, cat, ls, pwd — that test real program execution under interception, an adversarial edge-case binary designed to probe the boundaries, a stress test for sustained operation, and a syscall fuzzer that fires ten thousand random syscalls to see what survives. Go fuzz tests target the path resolver, wire protocol, and syscall argument parsing. Property-based tests verify invariants — that deny rules always block, that virtual files always override host files. All eighteen integration checks run in CI on every push.

What the Segfaults Taught Me

The deepest lesson was one that no amount of source reading could have conveyed: you cannot emulate everything from userspace. PTRACE_SYSEMU presents itself as total control — you are the kernel, you handle every call — but there exists a class of syscalls that modify kernel-managed state so fundamental that no userspace simulation will suffice. Page tables, TLS registers, signal masks. These must touch the real kernel, and the hybrid SYSEMU-to-SYSCALL passthrough mechanism that solves this problem is not obvious until you’ve watched your sandbox segfault on the very first arch_prctl.

The performance story was equally instructive. Ptrace is simple and correct but carries the tax of a context switch on every syscall. Seccomp with USER_NOTIF lets the kernel continue doing what it does well — handling the routine calls — while routing only the interesting ones to your handler. The thirty-seven-fold improvement is not an optimization; it is a fundamentally different architecture for the same problem.

But perhaps the most valuable thing I carried away from this project is harder to quantify. Reading two hundred thousand lines of source code is one kind of understanding. Implementing the core loop yourself and watching it break — discovering the invariants that the source code assumes but never explains — is another kind entirely. There is something in the act of building that no amount of reading can replace, a point where the architecture stops being a diagram and becomes, for a brief and clarifying moment, a machine you can hear running.

The source is at github.com/mtclinton/mini-sentry. To run it: make build && make guest && ./mini-sentry ./cmd/guest/guest. For seccomp mode: ./mini-sentry --platform=seccomp ./cmd/guest/guest. For the benchmark that produced the numbers above: ./mini-sentry --benchmark ./cmd/guest/guest.