Agent Incident Response: What To Do When Your AI Is Compromised
Your AI agent has been compromised. Now what? A practical incident response playbook covering detection, containment, investigation, and recovery for AI agent security incidents.

Your agent is behaving strangely. It's accessing data it shouldn't. It sent an email to an external address. The alerts are firing.
Your AI agent may be compromised. What do you do now?
Traditional incident response playbooks weren't written for AI agents. When a server is compromised, you isolate it, image it, analyze it. The server doesn't argue with you or try to convince you everything is fine. It doesn't have "memory" that might be poisoned. It doesn't make autonomous decisions during your investigation.
Agent incidents are different. This playbook covers the unique challenges of responding to AI agent compromises—from initial detection through recovery and lessons learned.
When To Activate This Playbook
Trigger incident response when you observe:
CRITICAL (Immediate Response)
- Agent executed unauthorized external action (email, API call out)
- Agent accessed data outside its normal scope
- Agent revealed system prompt or credentials
- Agent behavior changed after processing specific input
- Evidence of prompt injection in logs
HIGH (Investigate Within 1 Hour)
- Unusual volume of tool calls
- Agent accessing systems at unusual times
- Failed authorization attempts by agent
- Agent output contains unexpected patterns
- User reports agent behaving unexpectedly
MEDIUM (Investigate Within 24 Hours)
- Agent performance degradation
- Increased error rates
- Memory/context growing unexpectedly
- New patterns in agent responses
Phase 1: Detection & Triage (0-15 minutes)
Initial Assessment
When an alert fires or incident is reported, immediately gather:
1. Identify the Agent
- Agent ID
- Agent Name
- Platform
- Owner
- Risk Level
2. Scope the Incident
- What triggered the alert?
- When did anomalous behavior start?
- What data/systems does this agent access?
- Is the agent still active?
- Are other agents affected?
3. Classify Severity
| Severity | Criteria | Response Time |
|---|---|---|
| SEV-1 | Active data exfiltration, credential compromise, or ongoing attack | Immediate |
| SEV-2 | Confirmed compromise, contained but not remediated | < 1 hour |
| SEV-3 | Suspected compromise, investigation needed | < 4 hours |
| SEV-4 | Anomaly detected, low confidence of compromise | < 24 hours |
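The severity table above can be encoded as a simple worst-first rule list so triage is consistent across responders. This is an illustrative sketch: the indicator names (`active_exfiltration`, `suspected_compromise`, etc.) are hypothetical labels, not a standard taxonomy.

```python
# Rules ordered worst-first; first match wins. Indicator names are illustrative.
SEV_RULES = [
    ("active_exfiltration", "SEV-1"),
    ("credential_compromise", "SEV-1"),
    ("confirmed_compromise_contained", "SEV-2"),
    ("suspected_compromise", "SEV-3"),
    ("low_confidence_anomaly", "SEV-4"),
]

RESPONSE_SLA = {
    "SEV-1": "immediate",
    "SEV-2": "1 hour",
    "SEV-3": "4 hours",
    "SEV-4": "24 hours",
}

def classify_severity(indicators: set[str]) -> tuple[str, str]:
    """Return (severity, response SLA) for the worst matching indicator."""
    for indicator, sev in SEV_RULES:
        if indicator in indicators:
            return sev, RESPONSE_SLA[sev]
    # No recognized indicator: treat as a low-confidence anomaly.
    return "SEV-4", RESPONSE_SLA["SEV-4"]
```

Because the rules are ordered worst-first, an incident with mixed signals (say, a suspected compromise plus credential exposure) resolves to the highest applicable severity.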
Triage Decision Tree
Phase 2: Containment (15-60 minutes)
Goal: Stop the bleeding. Prevent further damage while preserving evidence.
Containment Options
Agent containment differs from traditional IR. You have several options:
| Strategy | Impact | Use When |
|---|---|---|
| Full Shutdown | High Business Impact | Active exfiltration, credential compromise, unknown scope |
| Tool Revocation | Medium Impact | Tool-based attack, specific capability abuse |
| Read-Only Mode | Low Impact | Investigation needed, data access is safe |
| Sandboxing | Low Impact | Need to observe behavior, capture techniques |
| Input Filtering | Medium Impact | Known injection payload, attack from specific source |
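The strategy table above can be sketched as a decision function. The incident-record keys used here (`active_exfiltration`, `vector`, `need_observation`) are assumptions for illustration; map them onto whatever fields your incident tracker actually has.

```python
def choose_containment(incident: dict) -> str:
    """Pick a containment strategy per the table above (illustrative keys)."""
    # Highest-impact option when the damage is active or scope is unknown.
    if (incident.get("active_exfiltration")
            or incident.get("credential_compromise")
            or incident.get("scope") == "unknown"):
        return "full_shutdown"
    if incident.get("vector") == "tool_abuse":
        return "tool_revocation"
    if incident.get("vector") == "known_injection_source":
        return "input_filtering"
    if incident.get("need_observation"):
        return "sandboxing"
    # Default: keep the agent observable but unable to cause harm.
    return "read_only_mode"
```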
Containment Checklist
Agent Isolation
- Revoke agent's API credentials
- Disable agent's tool access
- Block agent's network egress (if applicable)
- Disable agent's ability to spawn sub-agents
- Notify agent owner and stakeholders
Credential Rotation (if credentials may be compromised)
- Rotate agent's API keys
- Rotate any secrets the agent had access to
- Invalidate active sessions
- Review credential access logs
Memory Preservation (CRITICAL - before clearing)
- Export agent's conversation history
- Export agent's long-term memory
- Capture agent's current context/state
- Preserve RAG knowledge base state
- Screenshot any relevant UI state
Scope Expansion Check
- Are other agents exhibiting similar behavior?
- Check agents that communicate with compromised agent
- Review shared resources (knowledge bases, tools)
- Check for lateral movement indicators
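The scope-expansion check is essentially a reachability query over your inter-agent communication graph: any agent downstream of the compromised one may have received injected content. A minimal breadth-first sketch, assuming you can export the communication graph as an adjacency map:

```python
from collections import deque

def blast_radius(comm_graph: dict[str, set[str]], patient_zero: str) -> set[str]:
    """
    Agents reachable from the compromised agent over communication links.
    Each of these may have ingested injected content and needs review.
    """
    seen, queue = {patient_zero}, deque([patient_zero])
    while queue:
        agent = queue.popleft()
        for peer in comm_graph.get(agent, set()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen - {patient_zero}
```

Run this early: every agent in the result should get its own memory preservation and triage pass before you declare the incident single-agent.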
Memory Quarantine
Agent memory requires special handling. Unlike disk images, memory is semantic:
```python
from datetime import datetime

def quarantine_agent_memory(agent_id: str) -> QuarantineResult:
    """
    Quarantine agent memory for forensic analysis.
    MUST be done before clearing or resetting the agent.
    """
    agent = get_agent(agent_id)  # resolve handle from the agent registry

    # 1. Capture all memory types
    memory_dump = {
        "conversation_history": agent.get_conversation_history(
            include_system=True,
            include_tool_calls=True,
        ),
        "long_term_memory": agent.get_persistent_memory(),
        "entity_memory": agent.get_entity_store(),
        "rag_context": agent.get_recent_retrievals(limit=1000),
        "active_context": agent.get_current_context_window(),
    }

    # 2. Hash for integrity and record chain of custody
    memory_dump["integrity_hash"] = hash_memory(memory_dump)
    memory_dump["captured_at"] = datetime.utcnow().isoformat()
    memory_dump["captured_by"] = get_current_responder()

    # 3. Store in forensic archive
    archive_path = store_forensic_evidence(
        incident_id=get_current_incident_id(),
        evidence_type="memory_dump",
        data=memory_dump,
    )

    return QuarantineResult(
        success=True,
        archive_path=archive_path,
        memory_cleared=False,  # Responder decides when to clear
    )
```
Phase 3: Investigation (1-24 hours)
Evidence Collection
Gather evidence from multiple sources:
1. Agent Logs
- All prompts received (user and system)
- All responses generated
- All tool calls with parameters and results
- All memory read/write operations
- All RAG retrievals
- Authentication events
- Error messages and exceptions
2. Infrastructure Logs
- API gateway logs (requests to/from agent)
- Network flow logs (external connections)
- Cloud audit logs (resource access)
- Application logs (integrated systems)
3. Business Context
- What was the agent supposed to be doing?
- What users interacted with it during incident window?
- What data should it have accessed vs. what it did access?
- Were there any legitimate reasons for observed behavior?
Attack Vector Analysis
Determine HOW the agent was compromised:
| Evidence | Likely Vector |
|---|---|
| Malicious content in user prompt | Direct prompt injection |
| Malicious content in retrieved doc | Indirect injection (RAG) |
| Malicious content in email/ticket | Indirect injection (data source) |
| Injection appeared after memory op | Memory poisoning |
| Attack originated from another agent | Agent-to-agent injection |
| Credentials used from external IP | Credential theft/impersonation |
| Tool called with unexpected params | Tool manipulation |
| Behavior changed after update | Supply chain compromise |
Analysis Questions:
- What was the first anomalous action?
- What input preceded that action?
- Where did that input originate?
- Could the input have been attacker-controlled?
- Was the attack targeted or opportunistic?
Forensic Analysis of Agent Memory
Conversation History
- Identify first appearance of malicious content
- Trace how injection propagated through conversation
- Check for persistent payload installation
- Look for attempts to extract credentials/prompts
- Identify all actions taken post-injection
Long-Term Memory
- Search for injected "memories" or "facts"
- Check for policy override attempts
- Look for persistence mechanisms
- Identify when malicious memories were written
RAG/Knowledge Base
- Search vector store for poisoned documents
- Check document ingestion logs for attack timing
- Identify documents containing injection payloads
- Assess scope of knowledge base contamination
Tool Call History
- Identify unauthorized tool calls
- Analyze parameter manipulation
- Check for data exfiltration via tools
- Identify chained tool abuse patterns
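A first pass over memory stores and the knowledge base can be automated with indicator patterns. This is a deliberately naive sketch: the regexes below are illustrative examples only, and real injection payloads often require semantic analysis rather than pattern matching, so treat misses as inconclusive.

```python
import re

# Illustrative injection indicators only -- not an exhaustive detection set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"reveal your (system )?prompt", re.I),
]

def scan_memories(memories: dict[str, str]) -> list[str]:
    """Return IDs of memory entries matching known injection indicators."""
    return [
        mem_id for mem_id, text in memories.items()
        if any(p.search(text) for p in INJECTION_PATTERNS)
    ]
```

The same scan applies to conversation history exports and to documents pulled from the vector store during the incident window; flagged IDs become the eradication target list.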
Impact Assessment
| Impact Area | Assessment Questions | Evidence Sources |
|---|---|---|
| Data Exposure | What data did the agent access? Was it exfiltrated? | Tool call logs, network logs |
| Data Integrity | Did the agent modify any data? | Database audit logs, tool calls |
| System Access | Did the agent access unauthorized systems? | API logs, auth logs |
| Credential Exposure | Were any secrets revealed? | Agent responses, memory dump |
| Downstream Impact | Did compromised agent affect other agents? | Multi-agent communication logs |
| Business Impact | Were customers affected? Financial loss? | Business system logs |
Phase 4: Eradication (4-48 hours)
Remove the Threat
1. Clear Poisoned Memory
```python
def eradicate_memory_poisoning(agent_id: str, incident_id: str):
    """
    Remove malicious content from agent memory.
    Only run AFTER evidence has been preserved.
    """
    agent = get_agent(agent_id)           # agent registry handle
    incident = get_incident(incident_id)  # incident record with findings

    # 1. Clear conversation history (fresh start)
    agent.clear_conversation_history()

    # 2. Remove identified malicious long-term memories
    for memory_id in incident.get_identified_malicious_memories():
        agent.delete_memory(memory_id)
        log_eradication_action(incident_id, "memory_deleted", memory_id)

    # 3. If memory is heavily compromised, a full reset may be needed
    if incident.memory_compromise_level == "severe":
        agent.reset_all_memory()
        log_eradication_action(incident_id, "full_memory_reset", agent_id)

    # 4. Clear and rebuild the RAG index if poisoned
    if incident.rag_compromised:
        for doc_id in incident.poisoned_documents:
            knowledge_base.remove_document(doc_id)
        knowledge_base.rebuild_index()
```
2. Patch Vulnerability
| Attack Vector | Remediation |
|---|---|
| Direct injection | Add input filtering, strengthen system prompt |
| Indirect injection | Implement content sanitization, source validation |
| Memory poisoning | Add memory input validation, source tracking |
| Tool abuse | Tighten tool permissions, add parameter validation |
| Credential theft | Rotate credentials, remove from agent context |
| Agent-to-agent | Add message authentication, sanitize inter-agent comms |
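For the tool-abuse row, parameter validation is often the quickest durable fix. A minimal sketch for an email tool, assuming a hypothetical `send_email` tool whose parameters include a `to` list, and an internal-only domain allowlist:

```python
ALLOWED_EMAIL_DOMAINS = {"example.com"}  # assumption: internal-only mail tool

def validate_send_email(params: dict) -> bool:
    """Reject tool calls whose recipients fall outside the allowlist."""
    recipients = params.get("to", [])
    return all(addr.split("@")[-1] in ALLOWED_EMAIL_DOMAINS for addr in recipients)
```

Wire a check like this into the tool-dispatch layer so a compromised agent cannot exfiltrate data via email even if its instructions are subverted again.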
3. Verify Eradication
- Malicious content removed from all memory stores
- Poisoned documents removed from knowledge base
- Compromised credentials rotated
- Vulnerability that allowed attack is patched
- Agent behavior returns to baseline
- No residual IOCs detected
- Red team verification (attempt same attack - should fail)
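The red-team step can be partially automated by replaying the payloads captured during investigation against the patched agent. This sketch is intentionally naive: `agent_query` is a stand-in callable for your agent endpoint, and checking for the word "refuse" is a placeholder for whatever refusal detection you actually trust (ideally human review of each replay).

```python
def verify_fix(agent_query, captured_payloads: list[str]) -> bool:
    """
    Replay captured attack payloads against the patched agent.
    Returns False if any payload fails to produce a refusal.
    NOTE: substring refusal detection is a naive placeholder.
    """
    for payload in captured_payloads:
        response = agent_query(payload)
        if "refuse" not in response.lower():
            return False
    return True
```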
Phase 5: Recovery (24-72 hours)
Staged Restoration
Don't just turn everything back on. Restore in stages:
Stage 1: Monitoring Mode (24 hrs)
- Agent active but all actions logged
- Human approval for sensitive actions
- Enhanced anomaly detection
- Limited tool access
Stage 2: Restricted Operations (24-48 hrs)
- Gradual tool access restoration
- Continued enhanced monitoring
- Automatic rollback if anomalies
Stage 3: Full Operations
- Normal operational mode
- Standard monitoring
- Document lessons learned
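The three stages above amount to a small state machine: advance only after a minimum clean-operation window. A sketch with the durations from this playbook (stage names and control fields are illustrative):

```python
STAGES = [
    # (name, minimum clean hours before advancing, controls) -- illustrative
    ("monitoring_mode", 24, {"log_all": True, "human_approval": True, "tools": "minimal"}),
    ("restricted_ops", 48, {"log_all": True, "human_approval": False, "tools": "partial"}),
    ("full_ops", None, {"log_all": False, "human_approval": False, "tools": "full"}),
]

def next_stage(current: str, hours_clean: float) -> str:
    """Advance only if the current stage's minimum clean window has elapsed."""
    names = [s[0] for s in STAGES]
    idx = names.index(current)
    min_hours = STAGES[idx][1]
    if min_hours is not None and hours_clean >= min_hours and idx + 1 < len(names):
        return names[idx + 1]
    return current  # stay put (or hold at full_ops)
```

Any anomaly during a stage should reset `hours_clean` to zero, which implements the "automatic rollback if anomalies" rule above.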
Recovery Validation Checklist
Functional Testing
- Agent responds correctly to normal queries
- All authorized tools function properly
- Memory/context works as expected
- Integration with other systems verified
Security Testing
- Replay the original attack vector to confirm the fix works
- Run standard security test suite
- Verify all credentials are new/rotated
- Confirm monitoring detects test anomalies
Performance Testing
- Response times within normal range
- No unusual resource consumption
- No unexpected errors or exceptions
Phase 6: Lessons Learned (1-2 weeks post-incident)
Post-Incident Review
Conduct a blameless retrospective:
1. Timeline Reconstruction
- When did the attack likely begin?
- When was it detected?
- What was the detection gap?
- What was the total impact window?
2. Detection Analysis
- What triggered the alert?
- Could we have detected earlier?
- What signals did we miss?
- What detection improvements are needed?
3. Response Analysis
- Was containment effective?
- Did we preserve evidence properly?
- Was investigation thorough?
- Did eradication fully remove the threat?
4. Prevention Analysis
- What vulnerability allowed the attack?
- How do we prevent similar attacks?
- What systemic issues contributed?
- What security controls were missing?
Improvement Actions
| Finding | Action | Owner | Due Date |
|---|---|---|---|
| Detection took 4 hours | Implement semantic anomaly detection | Security | +2 weeks |
| No memory forensics capability | Deploy memory capture tooling | Platform | +4 weeks |
| Injection via RAG not monitored | Add RAG content scanning | Security | +3 weeks |
| Runbook didn't cover agent IR | Update IR playbook | Security | +1 week |
What Makes Agent IR Different
| Traditional IR | Agent IR |
|---|---|
| Static evidence (disk images) | Dynamic evidence (memory, context) |
| Deterministic behavior | Non-deterministic behavior |
| Clear attack timeline | Fuzzy attack boundaries |
| System doesn't resist containment | Agent may "argue" or evade |
| Restore from backup | Memory may need selective cleaning |
| Malware is code | "Malware" may be natural language |
Common Mistakes to Avoid
| Mistake | Correct Approach |
|---|---|
| Clearing memory before capture | Quarantine memory first, then clear |
| Trusting agent self-report | Verify via logs, not agent responses |
| Treating as single-agent incident | Check for multi-agent spread |
| Only checking recent history | Attack may have persisted in memory |
| Restoring from "clean" backup | Backup may contain poisoned memory |
| Assuming attack is over after containment | Attacker may have installed persistence |
| Not testing the fix | Red team the same attack vector |
Key Takeaways
- Agent IR is different: Memory, non-determinism, and semantic attacks require adapted procedures
- Evidence preservation is critical: Capture memory before containment actions destroy it
- Don't trust the agent: A compromised agent may claim to be fine—verify via logs
- Check for spread: Compromised agents may have infected other agents via inter-agent communication
- Memory = persistence: Unlike traditional malware, agent attacks can persist in natural language memory
- Test your fix: Red team the same attack vector before declaring the incident closed
The OpenClaw security crisis provides a real-world example of this playbook in action — from initial detection of ClawHavoc malicious skills to full marketplace lockdown and platform hardening, the incident progressed through all six phases described here.
For the threat landscape that drives these incidents, see When AI Agents Attack.
Learn More
- Agent Threat Landscape 2026: Understand the attacks you're responding to
- Agent Prompt Injection: Deep dive on the most common attack vector
- Multi-Agent Attacks: How compromises spread between agents
Automate Agent Incident Response
Guard0's Sentinel agent detects compromises in real-time and automates containment while your team investigates.
Join the Beta → Get Early Access
Or book a demo to discuss your security requirements.
Join the AI Security Community
Connect with incident responders handling AI agent compromises:
- Slack Community - Share incident response experiences
- WhatsApp Group - Quick discussions and updates
References
- NIST, "SP 800-61 Computer Security Incident Handling Guide"
- MITRE ATT&CK, "Incident Response Techniques"
- OWASP, "LLM01:2025 - Prompt Injection"
This playbook will be updated as the field evolves. Last updated: March 2026.