// AIOps Architecture Analysis

RCA Agent Architecture
Comparison Guide

Splunk · Dynatrace · Knowledge Graph · AWS · Bitbucket · Change Requests · Incidents
Architecture 01
Naive Fan-out with LLM Supervisor
// Spawn 20–30 agents → each calls all MCPs → supervisor synthesises
The honest truth: This is what most teams build first. 20 agents × 6 MCP servers = 120 parallel MCP calls per incident. Each agent loads ALL tool definitions upfront. One colleague reported 66,000 tokens consumed before typing the first prompt. At 20 agents, that's 1.32M tokens before any reasoning starts. For a P1 incident this is slow AND expensive AND often wrong because the supervisor context window fills up with redundant data.
Agents Active: 20–30
Token Cost: ~1.3M+
MCP Calls: 120+
MTTR (P1): 15–30m
Accuracy: ~35%
Cross-Agent Knowledge: None
// Flow Diagram — Naive Fan-out
P1 INCIDENT (Dynatrace alert fires) → LLM SUPERVISOR (Opus): spawns all agents, waits, synthesises.
AGENT 1 (Splunk logs), AGENT 2 (Dynatrace), AGENT 3 (KG traversal), AGENT 4 (AWS / deploy), AGENT 5 (Bitbucket): each also calls the 5 other MCPs. AGENTS 6…30 duplicate the same work (wasteful).
✗ No shared state. Agents unaware of each other.
MCP SERVERS (each agent calls ALL of these): Splunk · Dynatrace · AWS · Bitbucket · CRs · KG
⏳ Supervisor context explodes collecting all results.
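To make the waste concrete, here is a minimal sketch of the fan-out pattern in Python. query_mcp and the incident payload are hypothetical placeholders for illustration, not a real MCP SDK:

```python
# Naive fan-out: every agent loads every tool and queries every server.
# query_mcp is a placeholder stub standing in for a real MCP tool call.
import asyncio

MCP_SERVERS = ["splunk", "dynatrace", "knowledge_graph", "aws", "bitbucket", "change_requests"]

async def query_mcp(server: str, incident: dict) -> dict:
    return {"server": server, "incident": incident["id"], "payload": "..."}  # raw, uncompressed

async def run_agent(agent_id: int, incident: dict) -> dict:
    # Each agent calls all 6 MCP servers, duplicating its siblings' work.
    results = {s: await query_mcp(s, incident) for s in MCP_SERVERS}
    return {"agent": agent_id, "raw": results}

async def investigate(incident: dict, n_agents: int = 25) -> list[dict]:
    # 25 agents x 6 servers = 150 MCP calls per incident, most of them duplicates.
    dumps = await asyncio.gather(*(run_agent(i, incident) for i in range(n_agents)))
    # A single Opus supervisor now has to synthesise 25 raw dumps in one
    # context window: this is where the token cost and the bottleneck live.
    return dumps

asyncio.run(investigate({"id": "P1-4711"}))
```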
Core Problems
  • N×M MCP calls: 25 agents × 6 MCPs = 150 calls, most duplicated
  • Context explosion: Supervisor sees 25 raw result dumps
  • Zero cross-agent learning: Agent 3 doesn't know what Agent 1 found
  • No early stopping: All agents run to completion even if answer found at T+2min
  • LLM supervisor bottleneck: Single point of synthesis = single point of failure
  • Wasted Haiku potential: Using Opus for simple log fetches
When it's OK
  • Truly independent parallel investigations with no overlap
  • Low-volume incident environments (<5/day)
  • When you don't own the infrastructure and can't implement caching
  • Prototype / proof-of-concept stage only
Architecture 02
Bidirectional Dijkstra
// Forward from symptom + backward from known-good → meet at root cause
Agents Active: 8–12
Token Cost: ~400K
MCP Calls: 30–40
MTTR (P1): 8–12m
Accuracy: ~55%
Cross-Agent Knowledge: Partial
// Flow Diagram — Bidirectional Dijkstra
INCIDENT NODE (symptom): checkout-api 503s → ORCHESTRATOR: splits frontiers, manages visited sets.
← UPSTREAM FRONTIER: orders-db · deploy v2.3.1 · config change.   DOWNSTREAM FRONTIER →: payment-svc · notif-svc · user-svc.
CONVERGENCE? Frontiers meet at the same node, but a static graph is assumed.
⚠ BLIND SPOT: an upstream finding can't redirect downstream agents.
⚠ BLIND SPOT: edge weights are static, with no evidence reweighting.
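A compact sketch of the traversal itself, on an illustrative toy graph (node names and edge weights are made up for the example). The two blind spots show up directly in code: weights are fixed before the search starts, and the only cross-frontier signal is the convergence check itself.

```python
# Bidirectional Dijkstra over a static dependency graph (sketch).
import heapq

def bidirectional_dijkstra(graph, symptom, baseline):
    dist = {symptom: {symptom: 0.0}, baseline: {baseline: 0.0}}
    frontier = {symptom: [(0.0, symptom)], baseline: [(0.0, baseline)]}
    while frontier[symptom] and frontier[baseline]:
        for side in (symptom, baseline):
            if not frontier[side]:
                continue
            cost, node = heapq.heappop(frontier[side])
            other = baseline if side == symptom else symptom
            if node in dist[other]:               # frontiers met: early termination
                return node, cost + dist[other][node]
            for nbr, weight in graph.get(node, {}).items():
                new_cost = cost + weight          # weight never changes mid-run
                if new_cost < dist[side].get(nbr, float("inf")):
                    dist[side][nbr] = new_cost
                    heapq.heappush(frontier[side], (new_cost, nbr))
    return None, float("inf")

graph = {
    "checkout-api": {"payment-svc": 1.0, "orders-db": 2.0},
    "payment-svc": {"deploy-v2.3.1": 1.0},
    "healthy-baseline": {"deploy-v2.3.1": 1.0, "config-change": 3.0},
    "deploy-v2.3.1": {}, "orders-db": {}, "config-change": {},
}
print(bidirectional_dijkstra(graph, "checkout-api", "healthy-baseline"))
# -> ('deploy-v2.3.1', 3.0): the meeting node is reported with no validation step
```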
What it solves
  • 3x fewer agents vs naive fan-out — split work across two frontiers
  • Early termination when frontiers converge on same node
  • Natural separation of cause-chain vs blast-radius investigation
  • ~70% fewer graph nodes explored vs single-direction traversal
Remaining blind spots
  • Static edge weights: Can't update graph priorities mid-run when evidence arrives
  • No cross-frontier signal: Upstream deploy finding doesn't change downstream focus
  • False convergence: Frontiers can meet at wrong node with no validation
  • No hypothesis scoring: Binary visited/not-visited, no confidence ranking
Architecture 03
Beam Search + Shared Blackboard
// Multi-hypothesis parallel beams with live evidence propagation between frontiers
Agents Active: 6–10
Token Cost: ~180K
MCP Calls: 12–18
MTTR (P1): 4–7m
Accuracy: ~80%
Cross-Agent Knowledge: Full
// Flow Diagram — Beam Search + Shared Blackboard
P1 INCIDENT → Triage agent seeds 3 initial hypotheses.
SHARED BLACKBOARD: { hypotheses[], confidence_scores, evidence_for[], evidence_against[], redirect_instructions, visited_nodes, temporal_anchor, halt_signal }. All agents READ + WRITE here. No direct agent-to-agent messages.
UP BEAM 1: DB path, conf 0.72 · UP BEAM 2 ★: deploy path, conf 0.84 ↑ · UP BEAM 3: network path, conf 0.21 → PRUNED
DOWN BEAM 1: payment-svc, conf 0.76 ↑ · DOWN BEAM 2: notif-svc, conf 0.41 · DOWN BEAM 3: user-svc, conf 0.18 → PRUNED
★ Upstream deploy finding redirects downstream focus.
ARBITER AGENT: prunes beams, updates redirect_instructions, fires targeted MCP pulls on demand.
ADVERSARIAL VALIDATOR: tries to DISPROVE the top-2 hypotheses (ACH).
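A sketch of the blackboard record and the arbiter's pruning step. Field names mirror the diagram above; the beam width, hypothesis names, and confidence values are assumptions for illustration:

```python
# Shared blackboard plus arbiter pruning (sketch).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Blackboard:
    hypotheses: list[str] = field(default_factory=list)
    confidence_scores: dict[str, float] = field(default_factory=dict)
    evidence_for: dict[str, list[str]] = field(default_factory=dict)
    evidence_against: dict[str, list[str]] = field(default_factory=dict)
    redirect_instructions: list[str] = field(default_factory=list)
    visited_nodes: set[str] = field(default_factory=set)
    temporal_anchor: datetime | None = None
    halt_signal: bool = False

def arbiter_prune(bb: Blackboard, beam_width: int = 2) -> None:
    """Keep only the top-k hypotheses; redirect agents working on pruned beams."""
    ranked = sorted(bb.hypotheses, key=lambda hyp: bb.confidence_scores.get(hyp, 0.0), reverse=True)
    for pruned in ranked[beam_width:]:
        bb.hypotheses.remove(pruned)
        bb.redirect_instructions.append(f"drop '{pruned}', reinforce '{ranked[0]}'")

bb = Blackboard(hypotheses=["deploy path", "db path", "network path"],
                confidence_scores={"deploy path": 0.84, "db path": 0.72, "network path": 0.21})
arbiter_prune(bb)
print(bb.redirect_instructions)   # network path pruned, agents pointed at the deploy path
```

Because all coordination state lives in one structure, agents never message each other directly: they write findings and read redirect_instructions on their next loop iteration.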
Key improvements
  • Blackboard enables cross-frontier learning — upstream deploy finding instantly redirects downstream agents via redirect_instructions
  • Beam pruning: Low-confidence beams killed early, agents redeployed
  • Demand-driven MCP: Only called when hypothesis requires specific signal
  • ACH validator: Forces active disproof, prevents hallucinated root causes
  • 95.8% accuracy reported in the MA-RCA paper using this pattern
Remaining limitations
  • Beams still explore neighbors uniformly: a beam doesn't know which neighbor is more promising until it visits it
  • Arbiter is a periodic check — slight lag before redirect propagates
  • Beam width is fixed: 3 beams may be too many for simple incidents, too few for very wide graphs
Architecture 04 — RECOMMENDED
Bidirectional A* + Live Heuristic Blackboard
// Evidence-weighted graph traversal where h(n) updates in real-time as agents discover signals
Agents Active: 4–8
Token Cost: ~90K
MCP Calls: 6–10
MTTR (P1): 2–5m
Accuracy: ~92%
Cross-Agent Knowledge: Real-time
// Flow Diagram — Bidirectional A* with Live Heuristic
P1 INCIDENT → Triage: set temporal anchor T-30min, seed h(n).
LIVE HEURISTIC BLACKBOARD: h(n) = w1·temporal(n) + w2·anomaly(n) + w3·blast_centrality(n) + w4·historical_hit(n). Heuristic weights update in real time as evidence arrives. f(n) = g(n) + h(n): each agent picks its next hop by this score, not a random walk. redirect_instructions are auto-generated from h(n) changes.
FORWARD A*: from the symptom node toward likely causes; reads h(n), picks the best hop.   BACKWARD A*: from the healthy baseline toward the fault boundary; reads h(n), picks the best hop.
auth-svc h=0.12 → SKIP. deploy v2.3.1 h=0.89 ↑ → EXPAND (scored 0.89 because: temporal(T-26min) + anomaly + blast_central). payment-svc h=0.81. notif-svc h=0.23.
CONVERGENCE: db.pool.max regression. Both frontiers at max h(n), confidence 0.93.
★ KEY: h(n) is recalculated on every hop using live blackboard state. When upstream finds the deploy signal, h(payment-svc) auto-increases and downstream redirects itself.
HAIKU SIGNAL AGENTS (demand-driven), only spawned when an h(n) calculation needs a missing signal: Splunk · Dynatrace · AWS env · Bitbucket diff · Change Requests · Incident history. Results compressed before entering the blackboard.
Why A* wins over Beam Search: In Beam Search, agents still visit neighbors in an uninformed order until the Arbiter corrects them. With A*, each agent computes f(n)=g(n)+h(n) before choosing the next hop, so it avoids committing to low-promise paths in the first place. The heuristic h(n) is computed from the live blackboard, meaning upstream findings automatically change what both frontiers prioritise without a separate arbiter step. This eliminates the lag between discovery and redirect.
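A sketch of one frontier's expansion loop under this scheme. The weights, feature values, and toy graph are illustrative, and because h(n) here is a promise score (higher means more suspicious), the sketch uses 1 - h(n) as the estimated remaining cost so the standard minimum-f expansion applies:

```python
# One frontier of the bidirectional A* search (sketch). Real feature values
# come from blackboard state written by the Haiku signal agents.
import heapq

WEIGHTS = {"temporal": 0.4, "anomaly": 0.3, "blast_centrality": 0.2, "historical_hit": 0.1}

def h(node: str, blackboard: dict) -> float:
    """Promise score: w1*temporal + w2*anomaly + w3*blast_centrality + w4*historical_hit."""
    feats = blackboard["features"].get(node, {})
    return sum(w * feats.get(name, 0.0) for name, w in WEIGHTS.items())

def expand_frontier(start: str, graph: dict, blackboard: dict, max_hops: int = 10) -> list[str]:
    # f(n) = g(n) + (1 - h(n)); h is re-read from the live blackboard every time
    # a neighbor is scored, so fresh evidence reorders the frontier by itself.
    g = {start: 0.0}
    frontier = [(1.0 - h(start, blackboard), start)]
    visited = []
    while frontier and len(visited) < max_hops:
        _, node = heapq.heappop(frontier)
        visited.append(node)
        for nbr, step_cost in graph.get(node, {}).items():
            new_g = g[node] + step_cost
            if new_g < g.get(nbr, float("inf")):
                g[nbr] = new_g
                heapq.heappush(frontier, (new_g + (1.0 - h(nbr, blackboard)), nbr))
    return visited

blackboard = {"features": {
    "deploy-v2.3.1": {"temporal": 1.0, "anomaly": 0.9, "blast_centrality": 0.8, "historical_hit": 0.7},
    "auth-svc":      {"temporal": 0.1, "anomaly": 0.1, "blast_centrality": 0.2, "historical_hit": 0.1},
}}
graph = {"checkout-api": {"deploy-v2.3.1": 1.0, "auth-svc": 1.0}, "deploy-v2.3.1": {}, "auth-svc": {}}
print(expand_frontier("checkout-api", graph, blackboard))  # deploy-v2.3.1 expanded before auth-svc
```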
What makes this different
  • h(n) is your ML model: Train the w1-w4 weights on historical incidents from your Splunk/Dynatrace data (see the training sketch after this list)
  • Agents are self-redirecting: No arbiter needed for mid-course corrections
  • Admissible heuristic = optimal path: If h never overestimates the true remaining cost, A* reaches the root-cause node along a minimum-cost path, with no wasted agent steps
  • Temporal anchor cuts ~70% of graph: h(n)=0 for nodes with no events in T±45min window
  • Haiku-only data fetching: Signal agents are Haiku 4.5, only spawned when h(n) gap detected
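One way the w1-w4 training could look, assuming you can label nodes from past incidents as root-cause or not and extract the four feature values for each. The rows below are illustrative, not real data:

```python
# Fit h(n) weights from labelled incident history (sketch).
from sklearn.linear_model import LogisticRegression

# Each row: [temporal, anomaly, blast_centrality, historical_hit]; label 1 = confirmed root cause.
X = [
    [0.9, 0.8, 0.7, 0.6],   # deploy node from a past P1 -> was the root cause
    [0.1, 0.2, 0.3, 0.0],   # unrelated service          -> was not
    [0.8, 0.9, 0.5, 0.4],
    [0.2, 0.1, 0.6, 0.1],
]
y = [1, 0, 1, 0]

model = LogisticRegression().fit(X, y)
w1, w2, w3, w4 = model.coef_[0]   # plug these back into h(n); re-fit as history grows
print(w1, w2, w3, w4)
```

Any model that outputs a calibrated 0-1 suspicion score works here; logistic regression is just the simplest place to start.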
Design work required
  • Heuristic function design is not trivial — needs tuning per your environment
  • Admissibility check: If h overestimates the true remaining cost, A* loses its optimality guarantee. Start conservative (underestimate); a back-test sketch follows this list
  • Blackboard schema must be agreed and stable — schema drift breaks all agents
  • Cold start problem: w4 (historical hit rate) needs incident history to be meaningful
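A minimal back-test for the admissibility point above, assuming you have past incidents where the confirmed root-cause node, and therefore the true remaining cost from any visited node, is known. The pairs below are illustrative:

```python
# Check that the cost form of the heuristic, 1 - h(n), never exceeds the true
# remaining cost to the confirmed root cause on historical data.
def is_admissible(cases: list[tuple[float, float]]) -> bool:
    return all((1.0 - h_val) <= true_cost for h_val, true_cost in cases)

history = [(0.89, 2.0), (0.40, 1.0), (0.10, 3.0)]   # (h at node, true hops remaining)
print(is_admissible(history))   # False means: scale h down before trusting optimality
```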
Full Comparison
Architecture Decision Matrix
// Honest numbers for your Splunk + Dynatrace + KG + AWS + Bitbucket + CR + Incident setup
Dimension | Naive Fan-out | Bidirectional Dijkstra | Beam + Blackboard | Bidirectional A* ★
Agents spawned (P1) | 20–30 | 8–12 | 6–10 | 4–8
Token cost (est.) | 1.3M+ | ~400K | ~180K | ~90K
MCP calls per incident | 120+ (every agent, all servers) | 30–40 (split by frontier) | 12–18 (demand-driven) | 6–10 (only when h(n) gap)
Cross-frontier knowledge sharing | None | None (independent frontiers) | Via blackboard (arbiter lag) | Real-time (h(n) auto-updates)
Upstream finding redirects downstream | Never | Never | Yes, via arbiter (~30s lag) | Yes, instantly via h(n) recompute
False convergence risk | High (no cross-check) | High (no validation) | Low (ACH validator) | Very low (ACH + admissible h)
RCA accuracy (est.) | ~35% (ReAct baseline) | ~55% | ~80–85% | ~90–95%
MTTR for P1 | 15–30 min | 8–12 min | 4–7 min | 2–5 min
LLM supervisor bottleneck | Yes (single choke point) | Partial | No (arbiter is lightweight) | No (agents self-direct)
Graph traversal nodes visited | All N nodes | ~√N nodes | 3×depth nodes | Minimum possible
Handles dynamic edge weights | No | No (static graph) | Partial (confidence scores) | Yes (core mechanism)
Model tier for data fetching | Opus for everything | Sonnet + Opus | Haiku + Sonnet + Opus | Haiku + Sonnet + Opus
Implementation complexity | Low | Medium | Medium-high | High (heuristic design)
Requires historical incident data | No | No | No | For w4 weight only
⟶ Recommendation for Your Setup
Start with Architecture 3 (Beam + Blackboard) — it gives you 80-85% accuracy and 4-7 min MTTR without the heuristic design work A* requires. The blackboard pattern solves your core problem of upstream findings not reaching downstream agents. It works from day one with your existing Splunk, Dynatrace, AWS, Bitbucket, and KG MCPs.

Migrate to Architecture 4 (Bidirectional A*) once you have 3-6 months of incident history to train the w1-w4 heuristic weights. At that point the h(n) function becomes genuinely predictive for your specific environment, and you drop from ~180K to ~90K tokens per incident and from 4-7 min to 2-5 min MTTR.

The LLM supervisor bottleneck you called out is real and specific — it's worst in Architecture 1. Both Arch 3 and 4 eliminate it by having agents write structured findings to the blackboard rather than report up to a supervisor. The Arbiter in Arch 3 and the h(n) recompute in Arch 4 replace the supervisor's routing decisions with deterministic, data-driven mechanisms.