Root Cause Analysis AI
Problem Solving & Analysis
The Prompt
The Logic
1. Timeline Reconstruction Prevents “Story Bias” and Makes Causality Testable
WHY IT WORKS: Teams often jump to causal explanations before agreeing on what happened. A minute-level timeline aligns everyone on sequence: change events, alerts, traffic shifts, dependency failures, and mitigation steps. Once time ordering is explicit, many hypotheses become falsifiable (“could not be the cause because it happened after the impact started”). This reduces narrative drift and the tendency to pick the most confident storyteller’s explanation. Timeline discipline also surfaces “hidden interventions” (manual restarts, feature flags, backfills) that quietly change system behavior. In complex incidents, timeline errors are common and can waste hours.
EXAMPLE: Suppose checkout failures began at 10:02, but an engineer remembers “we deployed at 10:15,” so the team blames the deploy. A reconstructed timeline shows a database connection pool saturation spike at 10:01, triggered by a traffic burst from an email campaign at 09:59. The deploy at 10:15 was actually mitigation, not cause. In postmortems, teams who build a timeline first typically reduce false attribution and produce higher-quality corrective actions (monitoring and scaling) instead of superficial fixes (“rollback”) that don’t address the real trigger.
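The timeline discipline above can be sketched in a few lines: merge events from different sources, sort by timestamp, and tag anything that happened after impact began as a non-cause. The event names and timestamps below are illustrative, mirroring the checkout example.

```python
from datetime import datetime

# Hypothetical incident events pulled from deploy logs, alerting, and the
# campaign tool; names and timestamps are illustrative.
events = [
    ("10:15", "deploy",     "checkout-service v2.4.1 rolled out"),
    ("10:02", "impact",     "checkout 5xx rate crosses 1%"),
    ("09:59", "traffic",    "email campaign sent, traffic +22%"),
    ("10:01", "saturation", "DB connection pool spikes to 100%"),
]

IMPACT_START = datetime.strptime("10:02", "%H:%M")

# Sort so sequence, not memory, drives the analysis; tag each event by
# whether it could even be a cause (it must precede the impact).
timeline = []
for ts, kind, desc in sorted(events, key=lambda e: e[0]):
    t = datetime.strptime(ts, "%H:%M")
    if kind == "impact":
        tag = "impact start"
    elif t < IMPACT_START:
        tag = "candidate cause"
    else:
        tag = "after impact (mitigation, not cause)"
    timeline.append((ts, kind, tag))
    print(f"{ts}  {kind:<10} {desc}  [{tag}]")
```

Run against the example above, the 10:15 deploy is automatically tagged as mitigation rather than cause, because it postdates impact.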
2. Competing Hypotheses Reduce Anchoring and Increase RCA Accuracy
WHY IT WORKS: Human cognition anchors on the first plausible explanation, then selectively searches for confirming evidence. Requiring 3–5 competing hypotheses forces disconfirmation: you must ask “what evidence would prove this wrong?” This mirrors scientific reasoning and improves RCA accuracy for ambiguous failures, where the first plausible story is often wrong. It also prevents blame-driven hypotheses (“person X changed something”) and shifts attention toward system mechanisms. In reliability engineering, hypothesis-driven debugging reduces time-to-resolution and post-incident recurrence because fixes align with the true mechanism.
EXAMPLE: Incident: API latency doubled. Hypothesis A: bad deploy. Hypothesis B: dependency timeout (payment provider). Hypothesis C: database index regression. Hypothesis D: network packet loss. You then map each to observable evidence: deploy diff timestamps, dependency error codes, DB query plans, packet loss metrics. If payment timeouts began before any deploy and coincide with provider 5xx errors, Hypothesis B becomes primary. This avoids deploying “fixes” that do nothing. Teams that institutionalize hypothesis lists often see fewer repeat incidents because they test multiple failure channels rather than “fix the obvious.”
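The elimination logic in this example can be made mechanical: each hypothesis carries an onset time and an evidence check, and anything that started after impact, or leaves no observable trace, is struck from the list. The hypothesis table below is illustrative.

```python
# Illustrative hypothesis table for the latency incident above; onset times
# and evidence flags are hypothetical, stood in for real log/metric queries.
impact_start = "10:02"

hypotheses = {
    "bad deploy":                 {"onset": "10:15", "evidence_seen": False},
    "payment provider timeouts":  {"onset": "09:58", "evidence_seen": True},
    "DB index regression":        {"onset": None,    "evidence_seen": False},
    "network packet loss":        {"onset": "10:30", "evidence_seen": False},
}

surviving = []
for name, h in hypotheses.items():
    # Eliminate if the mechanism started after impact (it cannot be the
    # cause) or if its predicted evidence is absent from logs/metrics.
    # Same-format "HH:MM" strings compare correctly lexicographically.
    if h["onset"] is None or h["onset"] > impact_start or not h["evidence_seen"]:
        print(f"ELIMINATED: {name}")
    else:
        surviving.append(name)
        print(f"SURVIVES:   {name}")
```

Only the payment-provider hypothesis survives both filters, matching the narrative above: timeouts began before any deploy and coincide with provider 5xx errors.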
3. Fault Trees Expose Multi-Factor Failures Where No Single Cause Explains the Incident
WHY IT WORKS: Many incidents are not single-point failures; they are cascades: a traffic spike + missing rate limits + slow DB queries + insufficient autoscaling. A fault tree or cause map makes AND/OR relationships explicit, revealing that removing any one contributing factor might have prevented impact. This helps prioritize fixes: you can choose the cheapest “break the chain” action even if it’s not the deepest root cause. Fault trees also reduce political conflict because they show systemic interaction rather than a single scapegoat.
EXAMPLE: Checkout outage: OR node “payment gateway unreachable” OR “internal service crashed.” Internal service crash itself is an AND node: “thread pool exhaustion” AND “retry storm” AND “no circuit breaker.” The root isn’t just “payment gateway slow”; it’s also “we retried without backoff and had no bulkheads.” CAPA then includes circuit breakers, capped retries, and dependency budgets. Organizations using cause maps tend to improve resilience faster because they address multiple weak links rather than arguing over “the one true cause.”
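The AND/OR structure described above can be represented and evaluated directly, which makes the “break the chain” insight concrete: flipping any single factor under an AND node prevents the cascade. This is a minimal sketch; the tree shape mirrors the checkout example but is illustrative.

```python
# Minimal fault-tree evaluator. A node is either a leaf (factor name) or a
# tuple of ("AND"/"OR", [children]). Node names mirror the checkout example.
def evaluate(node, facts):
    """Return True if this node's failure condition is satisfied."""
    if isinstance(node, str):
        return facts.get(node, False)
    op, children = node
    results = [evaluate(c, facts) for c in children]
    return all(results) if op == "AND" else any(results)

tree = ("OR", [
    "payment gateway unreachable",
    ("AND", ["thread pool exhaustion", "retry storm", "no circuit breaker"]),
])

facts = {
    "thread pool exhaustion": True,
    "retry storm": True,
    "no circuit breaker": True,
}
print(evaluate(tree, facts))  # outage condition holds

# "Break the chain": removing any one AND factor prevents the cascade,
# even though the other contributing factors remain.
facts["no circuit breaker"] = False
print(evaluate(tree, facts))
```

This is why a cheap fix (adding a circuit breaker) can be chosen over the deepest root cause: it severs one required branch of the AND node.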
4. Counterfactual Analysis Turns Postmortems Into Resilience Design
WHY IT WORKS: “What caused it?” is only half the value; “what would have prevented customer impact?” yields actionable resilience investments. Counterfactuals force you to evaluate safeguards: could alerts have fired earlier? Could rate limiting have reduced blast radius? Could feature flags have enabled graceful degradation? This shifts output from explanation to prevention. Importantly, counterfactuals also expose when root cause removal is not enough—e.g., even if you fix the bug, another bug could still cause an outage unless safeguards exist.
EXAMPLE: If a cache stampede caused DB overload, the counterfactual might be “request coalescing” or “stale-while-revalidate” caching. If a bad deploy caused errors, the counterfactual might be “canary + automated rollback.” If a dependency failed, the counterfactual could be “circuit breaker + fallback response.” Teams that include counterfactuals often reduce recurrence because they add layered defenses rather than only patching the specific bug that happened this time.
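One of the counterfactuals named above, request coalescing, can be sketched as a single-flight wrapper: concurrent cache misses for the same key share one backend call instead of stampeding the database. Class and method names here are illustrative, not from a specific library.

```python
import threading

# Single-flight request coalescing sketch: the first caller for a key becomes
# the "leader" and performs the real load; concurrent callers wait and share
# the leader's result. (If the loader raises, waiters see None; a production
# version would propagate the error.)
class Coalescer:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> {"event": Event, "result": value}

    def get(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                entry = {"event": threading.Event(), "result": None}
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        if leader:
            try:
                entry["result"] = loader(key)
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["event"].set()  # wake waiters after result is stored
            return entry["result"]
        entry["event"].wait()
        return entry["result"]
```

Under a stampede, N concurrent misses produce one backend call instead of N, directly reducing the DB overload described in the counterfactual.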
5. CAPA with Ownership and Deadlines Prevents “Postmortem Theater”
WHY IT WORKS: Postmortems often end with vague actions (“improve monitoring”) that never ship. CAPA (Corrective and Preventive Actions) becomes effective when actions are concrete, prioritized, assigned, and time-bound (24h, 30/60/90 days). Splitting corrective vs preventive ensures immediate stabilization plus long-term systemic improvement. Adding verification plans (“how will we know it’s fixed?”) prevents checkbox work. In operational excellence programs, action specificity and accountability are among the strongest predictors of reduced repeat incidents.
EXAMPLE: Instead of “add alerts,” specify: “Create alert: checkout 5xx rate > 1% for 5 minutes (SEV-2), route to on-call; add dependency latency alert for payment provider p95 > 800ms; add saturation alert for DB connections > 85% for 10 minutes.” Assign owners and due dates. For prevention: “Implement circuit breaker for payment provider with exponential backoff; canary deploy with auto-rollback; run quarterly load test.” Teams that do this typically see fewer “we already discussed this last time” frustrations and measurable stability improvements.
6. Verification Plans Close the Loop and Prevent Recurrence
WHY IT WORKS: Fixes can be incorrect or incomplete. Verification validates changes with tests, monitoring, and controlled experiments. The verification plan defines: how to reproduce the failure, what metrics should change, what alarms should fire, and what success looks like. This is essential for confidence in high-stakes systems. Without verification, you risk “fixing the wrong thing” and learning only after the next incident.
EXAMPLE: If the incident involved DB connection exhaustion, verification might include: load test that simulates peak traffic; verify connection pool remains < 80%; verify p95 latency stays under target; verify circuit breaker trips properly when provider slows; verify auto-scaling triggers before saturation. In software reliability, teams that verify CAPA reduce repeat incidents substantially because they validate resilience rather than assuming it.
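The verification plan above amounts to turning each success criterion into a pass/fail check against post-fix load-test metrics. The metric names and values below are hypothetical; in practice they would come from your monitoring system.

```python
# Hypothetical metrics captured during a post-fix load test that replays
# peak traffic; each verification criterion becomes a boolean check.
metrics = {
    "db_pool_utilization_peak": 0.78,  # fraction of connection pool in use
    "p95_latency_ms": 640,
    "breaker_tripped_under_slowdown": True,
    "autoscale_before_saturation": True,
}

checks = [
    ("DB pool stays < 80% at peak",            metrics["db_pool_utilization_peak"] < 0.80),
    ("p95 latency under target (800ms)",       metrics["p95_latency_ms"] < 800),
    ("circuit breaker trips on slow provider", metrics["breaker_tripped_under_slowdown"]),
    ("autoscaling fires before saturation",    metrics["autoscale_before_saturation"]),
]

for name, ok in checks:
    print(f"{'PASS' if ok else 'FAIL'}: {name}")
print("CAPA verified" if all(ok for _, ok in checks) else "CAPA incomplete")
```

A failing check here is cheap; discovering the same gap during the next incident is not.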
Example Output Preview
Sample: RCA for “Checkout Failures Spike” (Realistic Metrics)
Impact: 4.8% checkout requests failed (HTTP 502/504) for 118 minutes (10:02–12:00). Estimated 2,960 failed checkouts; projected lost revenue $142,000 (avg order $48). SLA breach: availability 99.71% for the day (target 99.9%).
Primary Failure Mode: Payment-service thread pool exhaustion due to retry storm against slow payment gateway; retries amplified load 3.2×. DB connection pool hit 100% at 10:07, causing cascading timeouts across checkout pipeline.
Contributing Factors: (1) No circuit breaker on payment gateway, (2) Retry policy lacked jitter/backoff, (3) Autoscaling triggered on CPU only (missed saturation), (4) No alert on dependency latency, (5) Email campaign caused traffic +22% without capacity review.
Counterfactual: Circuit breaker + capped retries would have reduced error rate to <0.5% and limited impact to <10 minutes; saturation alert on DB connections >85% would have detected 14 minutes earlier.
CAPA (Next 24h): Add capped retries (max 1), increase pool sizes safely, add alerts on dependency p95 > 800ms. 30/60/90: circuit breakers, bulkheads, canary deploys, quarterly load tests.
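The 24-hour CAPA item above, capped retries, can be sketched as a retry wrapper with exponential backoff and full jitter, which also fixes the “retry policy lacked jitter/backoff” contributing factor. The function name, cap, and delay values are illustrative.

```python
import random
import time

# Capped retry with exponential backoff and full jitter. A hard cap on
# retries bounds load amplification (max 1 retry = at most 2x), and jitter
# spreads retries so clients don't synchronize into a retry storm.
def call_with_retry(fn, max_retries=1, base_delay=0.1, max_delay=2.0):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # budget exhausted; surface the failure
            # Full jitter: sleep a uniform amount up to the (capped)
            # exponential backoff for this attempt.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

With `max_retries=1`, a dependency slowdown can at most double request volume, instead of the 3.2x amplification observed in the sample incident.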
Prompt Chain Strategy
Step 1: RCA Draft + Hypotheses
Prompt: Use the main RCA prompt with timeline and evidence.
Expected Output: Full RCA report (5,000–7,000 words) with hypotheses, elimination logic, and CAPA.
Step 2: CAPA Implementation Plan
Prompt: “Turn the CAPA list into Jira-ready tasks: owner roles, estimates, dependencies, success criteria, and rollout plan.”
Expected Output: Execution plan that teams can implement immediately.
Step 3: Prevention via GameDay
Prompt: “Design a GameDay exercise to simulate recurrence. Include failure injection steps, expected alarms, and pass/fail.”
Expected Output: A resilience drill that validates prevention mechanisms.
Human-in-the-Loop Refinements
Require Evidence Tags for Every Claim
Force every causal statement to cite evidence (log line, metric, trace). Technique: tag claims as EVIDENCE / INFERENCE / ASSUMPTION. This reduces speculation and speeds consensus.
Run a “Disconfirming Evidence” Round
Ask each participant to present one piece of evidence that contradicts the leading hypothesis. Technique: 10-minute round-robin before finalizing root cause.
Separate “Trigger” From “Root Cause”
Triggers (traffic spike) are not the same as root causes (no rate limiting). Technique: document both so fixes address prevention, not just avoiding triggers.
Quantify Uncertainty Explicitly
When logs are missing, state confidence levels and what data would raise confidence. Technique: add a “missing evidence” section and a data collection action.
Assign CAPA Owners Before the Postmortem Ends
Convert learnings into action. Technique: every CAPA item must have an owner role + due date + verification metric.
Review Repeat Incidents Quarterly
Track recurrence themes (timeouts, retries, bad deploys). Technique: quarterly reliability review that aggregates RCAs into systemic roadmaps.