Closing the attach race without NRI

The single hardest engineering problem in a per-pod egress enforcer is not the policy language, the BPF map shape, or the DNS proxy. It is the attach race.

Between the moment a Kubernetes pod’s cgroup is created and the moment our cgroup_skb/egress program is attached to it, any connect() syscall escapes our enforcement. For a CI runner this matters more than for a long-running workload: a CI job runs for tens of seconds, and if the malicious code is in npm install’s postinstall, the first outbound packet can leave the runner within the first second of the build container starting. We have to be enforcing before that.

This post is about how, and about the obvious-looking solution we explicitly rejected.

Trust assumption: the platform team owns the runner

Before any of the architecture below makes sense, the trust boundary has to be stated. Leitwacht assumes the platform team controls the GitLab Runner deployment: the config.toml, the runner’s pod-spec patch, the BACKEND_ADDR of the agent, the namespaces runners run in. Pods that bypass the init-container barrier, whether created by a workload outside the runner’s reach or by a gitlab-runner deployment in a namespace we don’t expect, are outside the threat model.

Inside that boundary the architecture is fail-closed by construction. Outside it (a developer with kubectl apply rights running their own pod), Leitwacht is not a sandbox. Operators who need cluster-wide enforcement layer a ValidatingAdmissionWebhook on top that requires the init container on every pod in labelled namespaces: a small standalone artefact, but explicitly out of scope for this post.

The rest of this document describes the architecture inside that trust boundary.

What we don’t do: NRI

The textbook synchronous answer is an NRI (Node Resource Interface) plugin. NRI plugins run inside containerd / CRI-O; runtime callbacks block synchronously on plugin response; RunPodSandbox fires before the container’s first PID exists. We tried it. It works.

It’s also wrong for this product.

NRI plugins are node-scoped and synchronous-on-pod-start. Once registered they intercept every pod start on the node: runner pods, system pods, sidecar pods, every workload your platform team runs. A buggy NRI plugin doesn’t fail one pod; it stalls or crashes the runtime’s pod-creation loop. The blast radius is the whole node.

That cost gets paid even when most pods on the node have nothing to do with CI. The kube-system pods, the ingress controllers, the observability daemonsets all now route through our plugin every time they start, just so we can be ready for the runner pod whose policy we actually care about.

Installation also requires a node-level config change (writing to /etc/nri/conf.d/, reloading containerd or CRI-O). On managed Kubernetes (GKE, EKS, AKS) you can’t do that without escaping the managed boundary. In many platform teams’ threat models, modifying the container runtime is a red line.

So: synchronous-on-pod-start gives the right race property, but the cost is a runtime whose pod-creation path depends on our binary’s correctness. The right answer is to keep the race property and drop the runtime coupling.

What we do: containerd events + a zero-cap init container

The actual architecture has two pieces:

  1. An init container in the runner pod. Drops ALL capabilities, runs as non-root, no volume mounts, image is a tiny DNS client. It polls a reserved DNS name (_leitwacht.attached.local) for a TXT record that the agent’s per-netns DNS proxy only answers once enforcement is live in that pod’s network namespace. It exits 0 the moment it sees that record (and the agent version is compatible). If the agent is down, the name never resolves, the init container hits its deadline (default 20s) and exits non-zero, and the pod fails. The kubelet’s init-container ordering guarantee then refuses to start the build container: fail-closed by the K8s primitive itself, not by anything we have to engineer.

  2. A privileged DaemonSet (the agent). Subscribes to the containerd events socket. On /tasks/start events for pause containers (identified by the io.cri-containerd.kind=="sandbox" label), it derives the pod’s parent cgroup from the sandbox cgroup, attaches cgroup_skb/egress there, installs nftables DNS redirects in the pod’s network namespace, and primes the netns DNS proxy. The per-netns DNS proxy is what answers the init container’s poll, so the answer is only available once that pod’s enforcement is in place.

The crucial property: containerd events are asynchronous and non-blocking. We subscribe; containerd doesn’t care whether we keep up. If our process crashes, containerd’s pod-creation loop is unaffected. The synchronization is provided not by intercepting the runtime, but by the init container’s poll, which provides a per-pod synchronization barrier inside the K8s pod lifecycle without touching the runtime at all.
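The gating behaviour of the per-netns proxy can be sketched in a few lines of Go. This is a minimal illustration, not the agent’s actual code: the attachGate type, its method names, and keying by sandbox ID are assumptions made for the example. Only the reserved name and the "attached version=X" answer shape come from the post.

```go
package main

import "sync"

// Reserved name the init container polls (fully qualified form).
const gateName = "_leitwacht.attached.local."

// attachGate is a hypothetical stand-in for the proxy state: it records
// which pod sandboxes have enforcement in place.
type attachGate struct {
	mu       sync.RWMutex
	attached map[string]bool // keyed by sandbox ID (an assumption)
	version  string          // agent version reported in the TXT answer
}

func newAttachGate(version string) *attachGate {
	return &attachGate{attached: make(map[string]bool), version: version}
}

// markAttached is called only after the BPF attach, the nftables redirect
// and the proxy priming have all succeeded for this sandbox.
func (g *attachGate) markAttached(sandboxID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.attached[sandboxID] = true
}

// forget drops the state when the sandbox exits.
func (g *attachGate) forget(sandboxID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.attached, sandboxID)
}

// answerTXT returns the TXT payload for the reserved name, and false for
// any other name or for a sandbox that is not attached yet. Returning
// false is what keeps the init container polling.
func (g *attachGate) answerTXT(sandboxID, qname string) (string, bool) {
	if qname != gateName {
		return "", false
	}
	g.mu.RLock()
	defer g.mu.RUnlock()
	if !g.attached[sandboxID] {
		return "", false
	}
	return "attached version=" + g.version, true
}
```

The point of the sketch is the ordering: the answer exists only as a side effect of a completed attach, so there is no separate readiness signal that can drift out of sync with enforcement.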

Sequence diagram (participants: kubelet, containerd, agent (DaemonSet), init container, build container). Steps in order:

  1. RunPodSandbox (pause)
  2. create pod cgroup + netns
  3. /tasks/start (kind=sandbox)
  4. attach cgroup_skb/egress to pod-parent cgroup
  5. install nftables DNS redirect in pod netns
  6. prime per-netns DNS proxy
  7. start init container
  8. DNS TXT query _leitwacht.attached.local
  9. attached version=X
  10. exit 0
  11. start build container (only now)
  12. first connect(), already enforced

The agent loop

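// Import paths assume containerd 1.7.x and cilium/ebpf.
import (
    "context"
    "fmt"
    "log/slog"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
    "github.com/containerd/containerd"
    eventtypes "github.com/containerd/containerd/api/events"
    "github.com/containerd/typeurl/v2"
)

func (a *Agent) Run(ctx context.Context) error {
    // CRI-managed containers live in the "k8s.io" containerd namespace;
    // without it, LoadContainer below cannot find the sandbox.
    client, err := containerd.New("/run/containerd/containerd.sock",
        containerd.WithDefaultNamespace("k8s.io"))
    if err != nil { return err }
    defer client.Close()

    // Subscribe to /tasks/start (and /tasks/exit for cleanup). The K8s
    // pause container fires /tasks/start strictly before any user
    // container in the pod runs: that's the synchronous edge we need.
    ch, errc := client.Subscribe(ctx,
        `topic=="/tasks/start"`,
        `topic=="/tasks/exit"`)

    for {
        select {
        case ev := <-ch:
            // The envelope payload is a typeurl.Any and must be decoded
            // before the type assertion. /tasks/exit cleanup is elided here.
            decoded, err := typeurl.UnmarshalAny(ev.Event)
            if err != nil { continue }
            ts, ok := decoded.(*eventtypes.TaskStart)
            if !ok { continue }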

            // Pod sandboxes are identified by a containerd CRI label, not
            // by a separate event type: that's where the "sandbox" comes
            // from on this socket.
            ctr, err := client.LoadContainer(ctx, ts.ContainerID)
            if err != nil { continue }
            labels, _ := ctr.Labels(ctx)
            if labels["io.cri-containerd.kind"] != "sandbox" {
                continue
            }

            spec, err := ctr.Spec(ctx)
            // spec.Linux is a pointer in the OCI spec; guard before use.
            if err != nil || spec.Linux == nil { continue }
            // The sandbox's own cgroup is the container scope; the pod
            // parent slice is its parent, which is where we attach.
            podCg := podCgroupFromContainer(spec.Linux.CgroupsPath)

            if err := a.attach(ts.ContainerID, podCg); err != nil {
                slog.Error("attach failed", "pod", ts.ContainerID, "err", err)
                continue
            }

        case err := <-errc:
            return fmt.Errorf("containerd events: %w", err)
        case <-ctx.Done():
            return nil
        }
    }
}

func (a *Agent) attach(sbID, cgPath string) error {
    cgFD, err := openCgroupV2(cgPath)
    if err != nil { return err }

    // Attach to the POD-PARENT cgroup. Cgroup BPF inheritance means every
    // container in the pod (init, helpers, build, sidecars), including
    // sub-cgroups that don't exist yet, runs through the program.
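    // Keep the returned link alive: if it is dropped, the garbage collector
    // can close its fd and detach the program. a.links is an assumed
    // map[string]link.Link field on Agent, released on /tasks/exit.
    l, err := link.AttachCgroup(link.CgroupOptions{
        Path:    cgFD.Path(),
        Attach:  ebpf.AttachCGroupInetEgress,
        Program: a.programs["lw_egress"],
    })
    if err != nil { return err }
    a.links[sbID] = l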

    // Install nftables DNS redirect inside the pod's netns (DNS → :15353).
    if err := a.installNftables(sbID); err != nil { return err }

    // Prime the netns DNS proxy. At sandbox-start this is an audit-all
    // policy; the real project bundle is swapped in when the build
    // container attaches to the same netns.
    return a.primeNetns(sbID)
}

Three architecturally important details:

Pod-parent cgroup, not container cgroup. A K8s pod has a parent cgroup (the pod<UID>.slice) under which each container gets its own scope. We attach cgroup_skb/egress at the parent. cgroup BPF inheritance means the program runs on egress from any descendant, including the build container’s scope, which doesn’t exist yet at attach time. The kernel handles inheritance automatically when descendants are created. Net effect: no race between the agent’s attach and the build container’s first packet.

Pause container as the synchronous edge. The K8s CRI runtime starts the pause container first to establish the pod’s network and IPC namespaces, before any user container. Its /tasks/start fires on the containerd socket strictly before the build container exists. That’s the moment we want to attach BPF: the cgroup is created, the netns is up, no user code is running yet.

Async event delivery, not synchronous callback. Containerd does not wait for the agent to process the event. If the agent is slow, the init container’s poll takes longer to resolve; if the agent is broken, the name never resolves and the pod fails. The runtime is unaffected either way. This is the property NRI doesn’t give you, and the property we wanted.

What the init container looks like

Three lines of shell, basically. The pod-spec patch in the GitLab Runner config is the entire install:

# config.toml: GitLab Runner Kubernetes executor
[[runners]]
  name = "k8s-runner"
  url  = "https://gitlab.com/"
  executor = "kubernetes"

  environment = ["FF_USE_ADVANCED_POD_SPEC_CONFIGURATION=true"]

  [runners.kubernetes]
    namespace = "gitlab-runner"

    [[runners.kubernetes.pod_spec]]
      name = "leitwacht"
      patch = '''
        initContainers:
          - name: leitwacht-attach
            image: registry.gitlab.com/leitwacht/leitwacht-initc:v0.5.0
            securityContext:
              runAsNonRoot: true
              allowPrivilegeEscalation: false
              capabilities:
                drop: ["ALL"]
            command: ["/leitwacht-initc"]
      '''
      patch_type = "strategic"

That’s the whole install. No node access, no runtime configuration, no NRI plugin registration, no privileged init container. The [[runners.kubernetes.pod_spec]] block plus FF_USE_ADVANCED_POD_SPEC_CONFIGURATION=true is the GitLab Runner mechanism for arbitrary pod-spec injection. The runner re-reads its config on SIGHUP, no restart required.

The init container’s job is one DNS lookup in a loop:

until deadline:
  txt = lookup TXT _leitwacht.attached.local.
  if txt starts with "attached" and version compatible: exit 0
  sleep poll-interval
exit 1

Zero-cap, no mounts, fails closed. The agent’s per-netns DNS proxy is the synchronization primitive. The poll defaults to a 20s deadline at 200ms intervals, and the TXT record carries the agent version so the init can refuse an agent older than itself.

Latency and failure modes

The path is short by construction:

  • The containerd event is delivered asynchronously over the events socket as soon as the pause container’s task starts.
  • The agent’s attach work (derive pod cgroup, attach cgroup_skb/egress, install the nftables DNS redirect, prime the netns proxy) happens in one pass; the agent logs its duration_ms per attach.
  • The init container does a single DNS lookup per poll interval (default 200ms) and exits as soon as the record resolves.

In practice the init barrier overlaps with the init container’s own image pull and startup, which is typically the larger cost, so the added pod-startup latency is dominated by work that would happen anyway rather than by the attach itself.

Failure modes, ordered by likelihood:

  1. Agent down. The init container’s TXT lookups never resolve to an attached record; init exits non-zero at the deadline; kubelet refuses to start the build container; pod fails. The init emits a structured JSON error to stderr ("leitwacht did not attach within deadline"), which operators can grep in Loki. Fail-closed.
  2. Agent up but slow on attach. The init container keeps polling for the configured deadline (default 20s). If the agent primes that pod’s netns proxy within the window, the record resolves and the pod proceeds. Otherwise same as (1).
  3. Containerd event lost (agent restarted mid-event). The events stream is fire-and-forget, no replay buffer. Recovery: on agent startup, the watcher lists running containers (client.Containers(ctx)) and re-emits a sandbox-start event for any running kind=sandbox pause container, which drives the same pod-cgroup attach a fresh event would. Recovery runs once at startup.
  4. A pod created without our init container. Rogue runner config, or a workload masquerading as a runner. The agent sees the sandbox event and attaches BPF anyway, but without the init container synchronization there is no fail-closed guarantee that the build container hasn’t already started.

The first failure mode is the only one that affects a normal CI job in production deployments, and it fails closed. That’s the design property the architecture buys.

What this trades against NRI

For completeness, the matrix that drove the choice:

| | NRI plugin | containerd events + init-container barrier |
|---|---|---|
| Race-free | yes (synchronous interception) | yes (init-container barrier blocks build container) |
| Failure blast radius | entire node’s pod-creation loop | one pod (init container fails, build container fails to start) |
| Install touches node config | yes (/etc/nri/conf.d/, runtime restart) | no (GitLab Runner config.toml only) |
| Works on managed K8s (GKE/EKS/AKS) | only with custom node images | yes, unmodified |
| Affects non-runner pods on the node | yes (intercepts every pod) | no (pod-spec patch only on runner pods) |
| Init container caps required | n/a | none (drop: ["ALL"]) |
| Agent caps required | privileged (BPF, NET_ADMIN) | privileged (BPF, NET_ADMIN), same as NRI version |
| Failure when subscriber broken | runtime stalls | pods fail per-init-timeout |

The first column has the elegance. The second column has the operations story.

What’s next

The containerd-events architecture opens up two follow-ons:

  • CRI-O parity. The agent’s container lifecycle is already behind a ContainerWatcher interface, so a CRI-O implementation that watches its socket would land the same race property on OpenShift / RHCOS clusters. Not implemented yet.
  • Credential and /proc/<pid>/mem LSM hooks under the same barrier. The agent already ships lsm/file_open programs (kernel 5.7+ with CONFIG_BPF_LSM=y) that watch for credential file opens (/.ssh/id_rsa, /.aws/credentials, /.kube/config, etc.) and deny /proc/<pid>/mem opens by monitored containers with EPERM, unless the container is in observe-only mode. Folding their attach into the same pod-sandbox edge as the egress filter, so they go live before the build container’s first instruction too, is the work in progress. Same dispatch, one layer deeper.

Both lean on the synchronous-init-barrier plus async-runtime-discovery pattern. Without that split, you’re back to either NRI’s blast radius or a K8s informer’s latency.