What is ITDR (Identity Threat Detection and Response)?

ITDR is a category of security controls that monitor identity behavior — not just authentication outcomes — and act on anomalies. Where IAM decides whether a user can log in, ITDR decides whether the way they are logging in right now looks like them. Behavioral baselines, peer-group comparisons, and risk scoring drive automated response actions like step-up MFA or session quarantine.

Does Cisco ISE include ITDR natively?

No. ISE handles every authentication as a discrete event. It does not maintain per-identity behavioral baselines, does not produce an identity risk score, and does not have a native quarantine-on-anomaly response loop. ISE provides the telemetry (pxGrid) and the enforcement mechanism (CoA via ERS) — the scoring and decision layer must be built on top.

How does pxGrid enable real-time identity threat detection?

pxGrid 2.0 is a WebSocket-based publish-subscribe bus that streams every RADIUS and TACACS+ session event from ISE in real time, with full attribute fidelity. An ITDR subscriber receives session-started, session-updated, and session-ended events as they happen, which is the foundation for behavioral scoring that completes the loop within seconds rather than minutes.

What is the difference between SIEM correlation and ITDR?

SIEM correlation stitches events from ISE, Stealthwatch, XDR, and other sources after the fact, minutes or hours later. ITDR evaluates behavioral risk as each session event arrives and can quarantine within seconds of the RADIUS exchange. Same telemetry; different latency budget and different placement in the decision loop.

Can you build production-grade ITDR for under fifty dollars per month on AWS?

Yes, for a single enterprise at roughly 10,000 authentications per day. The architecture is Fargate for the persistent pxGrid subscriber, three Lambda functions for scoring and response, DynamoDB for baselines and active scores, S3 plus Athena for the raw event archive, and SNS to Slack for alerts. The persistent Fargate task is the only non-serverless piece and the largest single line item.

ITDR on Cisco ISE: Behavioral Identity Scoring with AWS Serverless

Cisco ISE knows who authenticated, where, when, and how. It does not know whether that authentication looks like that identity.

ISE treats every RADIUS or TACACS+ exchange as a discrete event. There is no behavioral baseline. No identity risk score. No native ITDR. The way most teams close that gap today is SIEM correlation — stitch ISE, Stealthwatch, XDR, and Splunk together and let the SOC see the anomaly minutes or hours after the fact. By then the session is established, the lateral move happened, the screenshot is in the breach report.

The interesting question is whether you can close the loop in seconds instead of hours using telemetry Cisco already collects and an enforcement mechanism ISE already provides. The answer is yes, and the architecture is small enough to ship in two weekends.

This post walks through the design.

The gap

The shape of the problem is simpler than the headlines make it sound.

Figure: today’s ISE is stateless; ITDR needs state. Comparison of SIEM correlation after the fact versus a risk score as a policy condition.

ISE’s policy engine evaluates an auth event against static conditions: which NAD, which SSID, which AD group, which MAC address. Those conditions either match or they do not. There is no concept of “this auth looks unlike previous auths from this identity.” There is no risk score that can be a policy condition.

The fix is not a new product. The telemetry already exists. The enforcement mechanism already exists. The missing piece is a scoring loop that watches every event, compares it to a per-identity behavioral baseline, and pushes the resulting risk score back into ISE in time for the policy engine to act on it.

Architecture: closed loop through ISE

The full system fits on one diagram. Enterprise on the left, ISE in the middle, AWS-hosted ITDR engine on the right.

Figure: three-tier architecture showing the pxGrid subscriber on Fargate, the Risk Lambda scoring engine, DynamoDB state, the S3 plus Athena data lake, and the Response Engine pushing CoA back to ISE.

Three steps repeat for every session:

Ingest. A Fargate task subscribes to pxGrid 2.0 and streams every RADIUS and TACACS+ event into Kinesis with full attribute fidelity. Fargate (not Lambda) because pxGrid is a persistent WebSocket connection — Lambda’s 15-minute execution cap would force a reconnect every quarter hour, and the resulting cold-starts would lose events.
Score. A Risk Lambda compares each incoming event to the identity’s baseline in DynamoDB. Six signals (next section), each weighted, summed into a 0-100 score. Score and supporting attributes get written to DynamoDB; raw events land in S3 for audit and offline analysis via Athena.
Enforce. When the score crosses the configured threshold, a Response Engine Lambda calls ISE ERS with a CoA reauth that pushes the session into a quarantine VLAN. The session disconnects, reauthenticates, lands in the new VLAN. Total elapsed time from auth event to quarantine: seconds.
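
The three steps above can be sketched as one small handler. This is a minimal illustration, not the engine's actual code: the raw event keys, the `SessionEvent` type, and the injected `score_fn` and `send_coa` callables are all hypothetical names, and the threshold of 75 is the value quoted in the Decision Explainer section.

```python
from dataclasses import dataclass
from typing import Callable

QUARANTINE_THRESHOLD = 75  # value quoted later in the post; tune per environment


@dataclass
class SessionEvent:
    session_id: str
    identity: str
    nad: str


def normalize(raw: dict) -> SessionEvent:
    """Ingest: map a raw event onto the fields the scorer needs.

    The key names are hypothetical, not the pxGrid schema.
    """
    return SessionEvent(raw["session_id"], raw["user"], raw["nad"])


def handle_event(
    raw: dict,
    score_fn: Callable[[SessionEvent], int],
    send_coa: Callable[[SessionEvent], None],
) -> str:
    """Run one pass of the loop: ingest, score, enforce."""
    event = normalize(raw)
    score = score_fn(event)          # Score: compare against the baseline
    if score >= QUARANTINE_THRESHOLD:
        send_coa(event)              # Enforce: caller supplies the ISE call
        return "quarantine"
    return "allow"
```

Passing the CoA call in as a function keeps the decision logic testable without an ISE node, which matters when the first thing you need to prove is the CoA itself.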

The control loop is entirely closed within the existing Cisco ecosystem. No new agent on the endpoint. No NAC replacement. No “next-gen” anything. ISE keeps doing what it already does well; the ITDR layer adds the scoring and decision behavior on top.

Six behavioral signals, weighted

Deep learning is not required. Explainable beats clever for a security control that has to be defensible during an incident review.

Figure: the six behavioral signals and their weights (time-of-day 25, new location/NAD 30, device fingerprint 15, impossible travel 40, peer-group drift 20, TACACS+ burst 35).

Signal	Weight	What it catches
Time-of-day	+25	Histogram of auth times per identity. An auth outside the ±2σ window fires this signal. Catches credential-theft auths at 3 AM.
New location / NAD	+30	An auth from a Network Access Device this identity has not used in the last 90 days. A strong signal on its own, since most lateral movement crosses NAD boundaries.
Device fingerprint	+15	New OS + cert + MAC combination outside the identity’s baseline device set. Catches stolen credentials reused from an attacker laptop.
Impossible travel	+40	Two auths within a time window that no human can physically traverse between. The classic “logged in from Bogotá and Berlin in 90 seconds” signal.
Peer-group drift	+20	Identity behaving unlike others in the same AD group. Catches role creep and slow-burn compromised accounts that no longer act like their peers.
TACACS+ burst	+35	Service account hitting N switches in M minutes. Classic lateral-movement signature when an attacker pivots through admin access.
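
As one way to picture the time-of-day row, here is a minimal sketch of a ±2σ check. It makes two assumptions that are not in the original design: it uses circular statistics instead of a literal histogram so that a night-shift baseline around midnight does not break the window, and it applies a hypothetical `min_sigma` floor so a very regular user does not get a zero-width window.

```python
import math


def circular_stats(hours):
    """Return (mean_hour, std_hours), treating the 24-hour clock as a circle."""
    angles = [2 * math.pi * h / 24 for h in hours]
    s = sum(math.sin(a) for a in angles) / len(angles)
    c = sum(math.cos(a) for a in angles) / len(angles)
    r = min(math.hypot(s, c), 1.0)   # resultant length, clamped against float drift
    if r == 0.0:
        return 0.0, float("inf")     # auths spread evenly: no usable window
    mean_hour = (math.atan2(s, c) * 24 / (2 * math.pi)) % 24
    std_hours = math.sqrt(-2 * math.log(r)) * 24 / (2 * math.pi)
    return mean_hour, std_hours


def time_of_day_fires(history_hours, auth_hour, n_sigma=2.0, min_sigma=0.5):
    """True when auth_hour falls outside mean ± n_sigma·σ of the baseline."""
    mean_hour, std_hours = circular_stats(history_hours)
    sigma = max(std_hours, min_sigma)
    diff = abs(auth_hour - mean_hour) % 24
    distance = min(diff, 24 - diff)  # shortest way around the clock
    return distance > n_sigma * sigma


office_hours = [9.0, 9.5, 10.0, 8.5, 9.0]
print(time_of_day_fires(office_hours, 3.0))   # a 3 AM auth
print(time_of_day_fires(office_hours, 9.25))  # a normal morning auth
```

In a deployment the `history_hours` list would come from the identity's baseline record rather than a literal.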

These weights are starting values. The point is they are static, inspectable, and defensible in a postmortem. Tuning happens in a config file, not in a black box. The V3 backlog (later in this post) replaces the static weights with learned weights per peer group — but only after V1 and V2 have run in production for long enough to have ground-truth labels.

A session with a Time-of-day anomaly (+25) plus a New NAD (+30) plus a New device (+15) scores 70, already in the “elevated” band. Add impossible travel (+40) and the raw sum is 110, past the top of the 0-100 scale and well over the quarantine threshold, so the session is quarantined within seconds of the auth.
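
The arithmetic is simple enough to show directly. The sketch below uses the weights from the table; the signal key names are illustrative, and clamping the sum to 100 is an assumption made here to keep the result on the 0-100 scale.

```python
WEIGHTS = {
    "time_of_day": 25,
    "new_nad": 30,
    "device_fingerprint": 15,
    "impossible_travel": 40,
    "peer_group_drift": 20,
    "tacacs_burst": 35,
}


def risk_score(fired):
    """Return (raw_sum, clamped_score) for the signals that fired."""
    raw = sum(WEIGHTS[name] for name in fired)
    return raw, min(raw, 100)  # clamp is an assumption, see lead-in


print(risk_score(["time_of_day", "new_nad", "device_fingerprint"]))
# (70, 70)
print(risk_score(["time_of_day", "new_nad", "device_fingerprint", "impossible_travel"]))
# (110, 100)
```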

What an ITDR overview should look like

The console is one screen. Real-time risk distribution plus active session triage. No drill-down maze, no “create a saved search,” no parsing JSON.

Figure: ITDR Console identity risk overview showing 3247 active sessions, 1184 identities, 28 elevated, 4 high/critical, and a risk score distribution histogram.

Top-line cards: active sessions, identities, count of elevated and high/critical sessions. Each card shows velocity — “↑ 7 in last hour” tells the analyst whether risk is accumulating or steady-state.

The risk score distribution histogram is the answer to the operational question, “is my environment normal right now?” In a healthy state most sessions cluster in the 0-30 band (the green/teal mass on the left of the chart). When the right tail starts growing — the amber and red bins — something is happening. That visual is the first thing the SOC sees in the morning.

Backing data: a pxGrid event rate of 8.4k events/minute is normal for ~3,000 active sessions. If the rate drops below 2k/min, you are probably looking at a pxGrid connection problem rather than a quiet network.
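
That floor can double as a subscriber health check. A minimal sketch, assuming a rolling 60-second window and a hypothetical `EventRateMonitor` class that the subscriber feeds with one timestamp per event; the 2,000 default is the figure above and would need adjusting for smaller environments.

```python
from collections import deque


class EventRateMonitor:
    """Counts events in a rolling window and flags a suspiciously low rate."""

    def __init__(self, floor_per_minute=2000, window_seconds=60):
        self.floor_per_minute = floor_per_minute
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, ts):
        """Note one event at timestamp ts (seconds, non-decreasing)."""
        self._times.append(ts)

    def rate(self, now):
        """Events seen in the window ending at `now`."""
        cutoff = now - self.window_seconds
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times)

    def looks_unhealthy(self, now):
        return self.rate(now) < self.floor_per_minute
```

A check like this is worth alerting on separately from risk scores: a silent subscriber produces a reassuringly empty histogram.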

Explainability: why was the session quarantined?

The first time you quarantine a vice president at 8 AM Monday, this is what saves you.

Figure: Decision Explainer showing the breakdown of a quarantined session: time-of-day anomaly +25, new NAD +30, new device fingerprint +15, impossible travel 0, peer-group within tolerance. Total score 78 exceeds threshold 75; CoA reauth issued, moving the session to QUARANTINE_VLAN_999.

Every quarantine emits a Decision Explainer record: which signals fired, what their weights were, what the final score was, and what enforcement action was taken. The session ID and the affected NAD identify the exact endpoint. The verdict line spells out the policy that triggered: Score 78 exceeds threshold quarantine ≥ 75. CoA reauth issued to bog-sw-04 moving session into QUARANTINE_VLAN_999.
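
A record like that can be a plain dictionary serialized to JSON. The sketch below is one possible shape, not the console's actual schema: every field name, the example session and NAD identifiers, and the action string are hypothetical.

```python
import json


def build_explainer(session_id, nad, contributions, threshold, action):
    """Assemble an audit record for one scoring decision.

    contributions maps signal name -> points added (0 if it did not fire).
    """
    score = sum(contributions.values())
    quarantined = score >= threshold
    return {
        "session_id": session_id,
        "nad": nad,
        "signals_fired": {k: v for k, v in contributions.items() if v > 0},
        "signals_quiet": sorted(k for k, v in contributions.items() if v == 0),
        "score": score,
        "threshold": threshold,
        "action": action if quarantined else "none",
        "verdict": (
            f"Score {score} meets or exceeds threshold {threshold}."
            if quarantined
            else f"Score {score} is below threshold {threshold}."
        ),
    }


record = build_explainer(
    session_id="example-session-01",
    nad="example-sw-01",
    contributions={"time_of_day": 25, "new_nad": 30,
                   "tacacs_burst": 35, "impossible_travel": 0},
    threshold=75,
    action="coa_reauth_to_quarantine_vlan",
)
print(json.dumps(record, indent=2))
```

Recording the quiet signals alongside the ones that fired lets a reviewer see what was checked, not only what tripped.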

This is non-negotiable for a security control that takes autonomous action. The hour after a VP sees their session drop is going to be a hallway conversation with the CISO. “The model said so” is not an answer. “Time-of-day was anomalous, the NAD is one this identity has never used, and the device fingerprint does not match — total 78, threshold 75” is an answer.

It is also the thing that makes the static weights defensible. A reviewer can see the calculation. A reviewer can change a weight. A reviewer can disable a signal during a known event (release window, office relocation, IT-mandated reimaging).
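
To make the "change a weight, disable a signal" point concrete, here is a minimal sketch of a config-driven scorer. The JSON layout and the `enabled` flag are assumptions for illustration; the weights and threshold are the ones used in this post.

```python
import json

# Hypothetical config layout; new_nad is switched off for an office relocation.
CONFIG_TEXT = """
{
  "threshold": 75,
  "signals": {
    "time_of_day":        {"weight": 25, "enabled": true},
    "new_nad":            {"weight": 30, "enabled": false},
    "device_fingerprint": {"weight": 15, "enabled": true},
    "impossible_travel":  {"weight": 40, "enabled": true},
    "peer_group_drift":   {"weight": 20, "enabled": true},
    "tacacs_burst":       {"weight": 35, "enabled": true}
  }
}
"""


def effective_weights(config):
    """Weights for the signals that are currently enabled."""
    return {
        name: spec["weight"]
        for name, spec in config["signals"].items()
        if spec["enabled"]
    }


def score(config, fired):
    """Sum enabled weights for fired signals; disabled signals add nothing."""
    weights = effective_weights(config)
    return min(100, sum(weights.get(name, 0) for name in fired))


config = json.loads(CONFIG_TEXT)
print(score(config, ["time_of_day", "new_nad", "device_fingerprint"]))  # 40
```

With `new_nad` disabled, the same three signals that would otherwise total 70 come to 40, and the config diff is the audit trail for why.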

All serverless. All AWS.

One non-serverless piece: Fargate for the persistent pxGrid subscriber. Everything else fits inside Lambda execution windows comfortably.

Figure: ITDR stack. Ingest: Fargate, pxGrid 2.0, Kinesis. Compute: Lambda × 3, Python, NumPy. Storage: DynamoDB, S3, Athena. Integration: ISE ERS, OpenAPI, SSM Parameter Store. Frontend: React, CloudFront, API Gateway, SNS, Slack. POC cost under fifty dollars per month for a single enterprise at ten thousand authentications per day.

Stack rationale:

Fargate + pxGrid 2.0 + Kinesis for ingest. Fargate because pxGrid is a long-lived WebSocket. Kinesis because back-pressure handling on event spikes is free with the service.
Lambda × 3 for the scoring path: parse, score, enforce. Python with NumPy for the histogram math. Sub-second p99 latency at the volume this design targets.
DynamoDB for hot state (current identity scores, baseline windows). S3 + Athena for the long-tail data lake — raw events archived indefinitely, queryable on demand for incident response.
ISE ERS + OpenAPI for enforcement. ERS for CoA reauth (the proven path); OpenAPI for the modern endpoint group manipulation when ERS gets deprecated. SSM Parameter Store for the ISE credentials.
React + CloudFront for the console. API Gateway in front of read APIs. SNS to Slack for alerts on high/critical scores.

For a single enterprise at roughly 10,000 authentications per day, the monthly bill comes in under $50. The largest line item is the persistent Fargate task. Lambda invocations and DynamoDB read/write capacity round out to a few dollars each. S3 storage is essentially free at that scale.

That price point is the unlock. ITDR vendors typically charge $4-10 per identity per month. For a 1,000-employee company that is $4,000-10,000 per month for behavioral identity detection. Building it on the telemetry you already have, with the enforcement mechanism you already have, changes the conversation.
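
The comparison is plain multiplication, shown here using only the figures quoted above. The function name is illustrative, and the $50 ceiling applies to the 10,000-authentications-per-day scenario, so treat the ratio as rough rather than exact for any particular headcount.

```python
def vendor_monthly_range(identities, low_per_identity=4, high_per_identity=10):
    """Monthly vendor cost range in dollars at per-identity pricing."""
    return identities * low_per_identity, identities * high_per_identity


SELF_BUILT_CEILING = 50  # dollars per month, from the estimate above

low, high = vendor_monthly_range(1000)
print(low, high)                                              # 4000 10000
print(low // SELF_BUILT_CEILING, high // SELF_BUILT_CEILING)  # 80 200
```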

Two weekends to closed loop

The fastest way to ship something this size is to cut scope hard and prove the CoA before you optimize anything else.

Figure: shipping plan. V1 (two weekends): prove the loop with one signal, time-of-day anomaly only; quarantine VLAN; syslog to Slack. V2 (plus two weeks): the real product adds NAD, device, velocity, peer-group, and TACACS+ burst signals, swaps syslog for pxGrid, ships the React dashboard, and builds the Decision Explainer. V3 (backlog): adds Duo step-up MFA, forwards to Splunk/XDR, and replaces static weights with learned weights per peer group.

V1 — 2 weekends. Prove the loop. Syslog (not pxGrid yet) → Kinesis → Scoring Lambda → DynamoDB → Response Engine → CoA. One signal: time-of-day anomaly. One response: quarantine VLAN. Alerts to CloudWatch + SNS to Slack. The whole point of V1 is to prove that a Lambda can issue a CoA against ISE and the session actually quarantines in seconds. That is the architectural risk. Everything else is incremental.

V2 — +2 weeks. Real product. Add the remaining five signals — NAD, device, velocity, peer-group, TACACS+ burst. Swap syslog ingestion for pxGrid 2.0. Ship the React dashboard with the risk score distribution and the session triage view. Build the Decision Explainer so quarantines are defensible. This is what most environments would call “production ready.”

V3 — Backlog. Once V2 settles. Add Duo (or any TOTP/FIDO provider) as a third response action — step-up MFA before quarantine for medium-risk scores. Forward enriched events to Splunk or your existing XDR. Replace the static weights with learned weights per peer group, using the labeled corpus from V2.
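
Of the signals V2 adds, the TACACS+ burst check ("N switches in M minutes") is a sliding-window count of distinct devices per account. A minimal sketch follows; the class name and the defaults of 5 devices in 300 seconds are hypothetical placeholders for N and M, and timestamps are assumed to arrive in order.

```python
from collections import defaultdict, deque


class TacacsBurstDetector:
    """Fires when one account touches too many distinct devices in a window."""

    def __init__(self, max_devices=5, window_seconds=300):
        self.max_devices = max_devices
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # account -> deque of (ts, device)

    def observe(self, account, device, ts):
        """Record one TACACS+ auth; return True if the burst signal fires."""
        events = self._events[account]
        events.append((ts, device))
        # Drop events that have aged out of the window.
        while events and events[0][0] < ts - self.window_seconds:
            events.popleft()
        distinct_devices = {d for _, d in events}
        return len(distinct_devices) >= self.max_devices


detector = TacacsBurstDetector()
for i in range(5):
    fired = detector.observe("svc-example", f"sw-{i}", ts=i * 10)
print(fired)  # True: five distinct switches inside 40 seconds
```

Counting distinct devices rather than raw auths keeps a chatty automation job on a single switch from tripping the signal.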

The crucial constraint on V1 is “prove the CoA.” Every other temptation gets deferred. Beautiful dashboard before the CoA works? No. Six signals before the CoA works? No. ML before the CoA works? Definitely not. The CoA is the architectural integration point — if that fails, the whole design fails. If it works, the rest is content.

Same telemetry. Same enforcement. Built differently.

The thing I keep coming back to with this design is that nothing in it is novel. pxGrid is a decade-old technology. CoA is older. Lambda scoring against DynamoDB baselines is a pattern that has existed for years in payments fraud detection. AWS serverless cost economics are well understood.

What is new is the assembly. The 2022-2026 wave of identity-based attacks (Lapsus$, Scattered Spider, Storm-0501, MFA fatigue, helpdesk social engineering) does not care about the perimeter, does not care about endpoint hardening, and frequently passes ISE policy cleanly because the credentials are legitimate. The only signal that catches those attacks is behavioral, evaluated in seconds, with an enforcement loop that closes back to the access decision before the session matters.

The vendors selling that today price it like a separate product. The telemetry to do it has been on every ISE deployment for ten years. The Lambda math is high school statistics. The cost is rounding error against an ISE subscription.

If you have thought about behavioral identity scoring on top of ISE, or you are already doing something similar — I would like to compare notes. The architecture is straightforward enough that what is missing is mostly the public conversation about it. Drop a comment.

Building a Production-Grade Cisco ISE Deployment in One Sitting — the underlying ISE deployment this layer sits on top of
The TACACS+ Lockout That Taught Me Three Things About IOS-XE AAA — companion postmortem on the device-admin half of ISE
Non-Human Identity Security Explained — The 45:1 NHI Crisis in 2026 — adjacent identity-security context
MFA Fatigue Attacks: Palo Alto Unit 42 Analysis for Security+ Students — one of the attack classes this ITDR layer is specifically designed to catch
Splunk + Cisco ISE Syslog Integration RADIUS Dashboard — the SIEM-correlation approach this design is an evolution of

Architecture sketched against Cisco ISE 3.4 with pxGrid 2.0, Catalyst 9800-CL and CSR8000V as NADs, AD as the identity source, and AWS us-east-1 for the ITDR engine. The full ISE base layer is in the cisco-ise-automation repo; the ITDR engine code is in a separate repo coming next.
