Some weeks back I was looking at the seccomp profile that had shipped, alongside one of our services, into production, and had to admit I had no precise account of what it permitted. The profile allowed three hundred and fifty system calls. The service made, on a generous accounting of its hot path, perhaps thirty. The other three hundred and twenty were there because they were in the Docker default we had copied from, and the Docker default contained them because someone, some years ago, had decided it was easier to be liberal than to be specific. It is the architectural equivalent of the famous chair in the corner of the office — the one nobody sits in, with the small sign on it that says do not move.

The other half of the seccomp profiles I have seen in the wild, I should add, are written by hand against a checklist. They are tighter. They are also, on the day someone exercises a code path nobody had thought to test, the cause of an entirely avoidable post-mortem.

What I should have liked, then, was a tool that would simply observe what the service did, for a representative run, and write down the list it had observed. Such a tool exists, in the loose sense — strace -c will produce a syscall summary if asked, and there are several published shell scripts that pipe its output through a JSON formatter. I looked at these and concluded, for reasons I will come to, that they were not the thing. So I wrote one.

What I built is a small Rust binary called sandprint. It traces a target process — by command, or by attaching to a running PID — through eBPF, watches every system call made by that process or any of its children, and at the end emits a tight allowlist in whatever format the sandbox tool ultimately wants: an OCI runtime profile, a SystemCallFilter= line for systemd, a libseccomp C header, raw JSON. The code is at github.com/mtclinton/sandprint, Apache-2.0; about twenty-five hundred lines of Rust and eighty lines of BPF C, which is small enough to read in an afternoon.

A Short Demonstration

The pitch, in thirty seconds:

plain
$ sudo sandprint profile run -- ls /tmp
... INFO BPF tracer loaded and attached
... INFO tracing command pid=2409773
[directory listing]
... INFO child exited status=Exited(Pid(2409773), 0)
Observed 202 syscall events (27 unique syscalls)

   NR       COUNT  NAME
    9          36  mmap
  257          35  openat
    3          26  close
    5          23  fstat
    1          19  write
   59          11  execve
    0           9  read
   10           6  mprotect
  157           6  prctl
   12           5  brk
  217           5  getdents64
   ...

That is ls /tmp running under sandprint. Twenty-seven unique syscalls is the honest, observed footprint of a directory listing — getdents64 and statx for the entries, write for the output, plus the usual glibc startup ritual. A seccomp profile derived from this run is twenty-seven syscalls. A generic container default is north of three hundred. The two profiles do not, on inspection, agree on what ls is for.

plain
$ sandprint profile generate --input trace.json --format oci > seccomp.json
$ wc -l seccomp.json
   71 seccomp.json

Drop that into the OCI runtime spec under linux.seccomp and one has, against modest expectations, a working sandbox.
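For orientation, the document one drops in there has roughly the following shape; the syscall names and the default action below are illustrative, not a verbatim copy of sandprint's output.

json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "close", "fstat", "mmap", "openat", "getdents64"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}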

The Hundred Lines of BPF

The kernel side of sandprint is a BPF program of roughly that length, attached to three raw tracepoints. sys_enter fires on every system call entry, and the program — when the calling task is in our tracked set — pushes a forty-byte event onto a ring buffer. sched_process_fork admits any child of a tracked task to the tracked set automatically; this is how process trees get followed without any userspace bookkeeping. sched_process_exit evicts dead tasks so the tracked-PID hashmap does not grow without bound over a long trace.
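To make the shape concrete, here is a minimal sketch of such a sys_enter handler. The map and struct names are mine, not sandprint's, and the real event carries more fields than the two shown.

c
// Minimal sketch of a sys_enter handler feeding a ring buffer.
// Map and struct names are illustrative, not sandprint's.
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event {
    __u32 tgid;
    __u32 syscall_nr;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);                  /* tgid of a tracked task */
    __type(value, __u8);
} tracked SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("raw_tracepoint/sys_enter")
int handle_sys_enter(struct bpf_raw_tracepoint_args *ctx)
{
    __u32 tgid = bpf_get_current_pid_tgid() >> 32;

    /* Ignore every task that is not in the tracked set. */
    if (!bpf_map_lookup_elem(&tracked, &tgid))
        return 0;

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;                        /* ring buffer full; drop the event */

    e->tgid = tgid;
    e->syscall_nr = ctx->args[1];        /* arg 1 of raw sys_enter is the syscall id */
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";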

The userspace side is a single thread that polls the ring buffer until the target process exits, and converts the event log into a canonical JSON trace at the end.
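sandprint's userspace is Rust, but the loop is easiest to show against libbpf's C API, which exposes the same ring-buffer machinery the Rust bindings wrap. The event struct mirrors the sketch above; the done flag and the map file descriptor are illustrative.

c
// Sketch of the userspace drain loop against libbpf's C API.
#include <stddef.h>
#include <bpf/libbpf.h>

struct event {
    unsigned int tgid;
    unsigned int syscall_nr;
};

static unsigned long long counts[512];   /* per-syscall tallies */

/* Called once for every event the BPF program submits. */
static int handle_event(void *ctx, void *data, size_t len)
{
    const struct event *e = data;
    if (e->syscall_nr < 512)
        counts[e->syscall_nr]++;
    return 0;
}

/* Poll until the traced process exits (signalled through *done),
 * then drain whatever is still queued. */
void drain_events(int events_map_fd, volatile int *done)
{
    struct ring_buffer *rb = ring_buffer__new(events_map_fd, handle_event, NULL, NULL);
    if (!rb)
        return;

    while (!*done)
        ring_buffer__poll(rb, 100 /* ms */);

    ring_buffer__consume(rb);
    ring_buffer__free(rb);
}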

The whole arrangement is unfussy. The BPF program declares only what it must; for example, it needs to read task->tgid to identify the task group, but it does not bring in the whole task_struct. Instead it declares a one-field stub:

c
struct task_struct {
    int tgid;
} __attribute__((preserve_access_index));

CO-RE — Compile Once, Run Everywhere — does the rest. libbpf rewrites the field offset at load time, using BTF (BPF Type Format) information from the running kernel. The same compiled BPF object loads against a 5.10 kernel, a 6.1 kernel, and whatever Ubuntu chose to ship on its newest LTS, without rebuilding. There is no vmlinux.h checked into the repository and no per-distribution build matrix. The cost of this independence is that on a kernel without /sys/kernel/btf/vmlinux the relocations cannot be resolved and the program will not load — which, in practice, is a non-concern on any kernel from the last few years.
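The fork handler is where the stub earns its keep. A sketch, using the stub above and the illustrative tracked map from the earlier sys_enter sketch:

c
// Sketch of the fork handler: read the parent's tgid through the one-field
// stub (CO-RE fixes up the offset at load time) and admit the child if the
// parent is already tracked. Assumes the includes, stub, and `tracked` map
// shown earlier.
#include <bpf/bpf_core_read.h>

SEC("raw_tracepoint/sched_process_fork")
int handle_fork(struct bpf_raw_tracepoint_args *ctx)
{
    struct task_struct *parent = (struct task_struct *)ctx->args[0];
    struct task_struct *child  = (struct task_struct *)ctx->args[1];

    __u32 parent_tgid = BPF_CORE_READ(parent, tgid);
    if (!bpf_map_lookup_elem(&tracked, &parent_tgid))
        return 0;                        /* fork outside the traced tree */

    __u32 child_tgid = BPF_CORE_READ(child, tgid);
    __u8 one = 1;
    bpf_map_update_elem(&tracked, &child_tgid, &one, BPF_ANY);
    return 0;
}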

The program uses only verifier-friendly constructs: bounded array indexing, no unbounded loops, atomic counter updates, paired ring-buffer reserve and submit. The verifier does not, on inspection, complain about it. One is grateful for this, having had occasion in past work to be lectured by the verifier at length.
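Those constructs look, in miniature, like this; the counter map is illustrative rather than sandprint's own.

c
// Illustrative per-syscall counters, keyed by syscall number. The index is
// bounded before use and the increment is atomic, so concurrent CPUs do not race.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 512);
    __type(key, __u32);
    __type(value, __u64);
} syscall_counts SEC(".maps");

static __always_inline void count_syscall(__u32 nr)
{
    if (nr >= 512)
        return;                          /* keep the index in bounds */

    __u64 *cnt = bpf_map_lookup_elem(&syscall_counts, &nr);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);    /* atomic add across CPUs */
}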

What strace Could Not Do

The first question one asks of any new tool is whether it duplicates an old tool. strace -c has been counting system calls and post-processing them into summaries since I was learning what a system call was. Two reasons it did not, on a careful look, suffice.

The first is performance. ptrace-based tracing — which is what strace does — imposes a non-trivial slowdown on the target. Each traced syscall stops the tracee twice, at entry and at exit, and each stop is a round-trip through ptrace's signal-and-wakeup machinery; for a workload that is at all perf-sensitive or timing-sensitive, the version of the workload running under strace behaves differently from the version running in production. One ends up profiling strace, not the service. eBPF tracing, by contrast, is essentially free at this scale: the tracepoint fires anyway, the BPF program drops forty bytes onto a ring buffer, and userspace reads the buffer asynchronously. There is no signalling, no syscall round-trip per traced call, no per-call overhead detectable in benchmarks.

The second reason is conflict. ptrace is exclusive — a process being ptraced by one tool cannot be ptraced by another. One cannot strace a process that is already under gdb, and one cannot gdb a process that is already under strace. eBPF tracing has no such constraint. A process can be traced by sandprint, ptraced by gdb, and audited by auditd, all simultaneously, with none of the three knowing about the others. This is the kind of property one does not value until one needs it.

The cost of going the eBPF route is the capability requirement. The tracer needs CAP_BPF and CAP_PERFMON (or CAP_SYS_ADMIN on older kernels), and the kernel needs to expose BTF at /sys/kernel/btf/vmlinux. On any kernel from the last few years both are standard. On older or hardened kernels they are not, and one is back to ptrace; sandprint does not pretend to be the tool for those cases.

The Trouble With Watching

A profile generated from observation has a defect that a profile written from a specification does not: it can only contain what one's run actually executed. If the test harness one ran sandprint against does not exercise the error path that calls prctl, the generated profile will block prctl in production. The first time the error path runs in earnest, the kernel will deliver a SIGSYS to the offending process (or fail the call with EPERM, if the profile's default action returns an errno instead), and one will have, at that moment, a perfectly characteristic and entirely avoidable production incident.

This is fundamental to any observation-based tool and is not specific to sandprint. oci-seccomp-bpf-hook, syscall2seccomp, the various shell scripts that derive profiles from strace -c outputs — all share it. The honest framing is that a generated profile is a starting point for tightening, not a finished verification.

The workflow that has, in practice, served me is roughly: run sandprint while exercising every code path one cares about, including CI integration tests, soak tests, and error paths; merge the resulting traces with profile merge to union them into a single profile; apply that profile in staging and watch for SIGSYS in the kernel logs and for unexpected EPERM failures in the service's own logs; iterate. The generated profile is wrong on the first run, less wrong on the third, and acceptably tight by the time it has caught everything one's CI suite knows how to catch.

profile diff between two runs has a useful side benefit. It tells one which test paths exercise which syscalls — and, by absence, which paths exercise no new syscalls and are therefore candidates for pruning from the CI matrix. I had not anticipated this when writing the tool. It has been, on more than one occasion, the bit of the tool I most use.

What I Should Like to Build Next

A syscall allowlist alone is a meaningfully weaker sandbox than a syscall allowlist plus a path allowlist. The next feature on the list is argument capture for open and openat: reading the path argument as the call enters the kernel, and emitting an allowlist of paths alongside the syscall list. The mechanism is more involved — one has to walk task->files->fdt->fd[i] to resolve the file-descriptor case, or copy the userspace path string out of the traced process (bpf_probe_read_user_str from the BPF side, or process_vm_readv from userspace) for the pointer case — but it is tractable, and the kernel side of it is no more complicated than what sandprint already does for the syscall numbers themselves.
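A sketch of what the BPF side of path capture might look like, assuming one is willing to hang a handler off the non-raw syscalls:sys_enter_openat tracepoint; the context struct is hand-declared to match that tracepoint's format file, and the names are again mine, not a feature sandprint has today.

c
// Illustrative path capture for openat. Reuses the illustrative `tracked`
// map from the earlier sketch. The context struct is hand-declared to match
// /sys/kernel/tracing/events/syscalls/sys_enter_openat/format.
struct openat_enter_ctx {
    unsigned long long common;       /* common tracepoint header */
    long syscall_nr;
    long dfd;
    const char *filename;
    long flags;
    long mode;
};

struct path_event {
    __u32 tgid;
    char path[256];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} path_events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int handle_openat(struct openat_enter_ctx *ctx)
{
    __u32 tgid = bpf_get_current_pid_tgid() >> 32;
    if (!bpf_map_lookup_elem(&tracked, &tgid))
        return 0;

    struct path_event *e = bpf_ringbuf_reserve(&path_events, sizeof(*e), 0);
    if (!e)
        return 0;

    e->tgid = tgid;
    /* The path is a userspace pointer at syscall entry; copy the string out. */
    bpf_probe_read_user_str(e->path, sizeof(e->path), ctx->filename);
    bpf_ringbuf_submit(e, 0);
    return 0;
}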

Network syscall argument capture (socket family, type, protocol) is the same mechanism applied to a different argument list. Container-runtime integration as a containerd NRI plugin or a runc hook is the most ambitious item; it would let one flip sandprint on for any container workload without changing the command line, and would close the loop with the OCI profile output that already exists. None of the three is hard in any architectural sense. They are, in the language one uses about such things, a matter of finding the time.

The code is at github.com/mtclinton/sandprint, Apache-2.0. Issues and pull requests are, with the standard caveat that I will respond to them on a hobbyist's schedule, welcome.