The LLM Agent Debugging Problem: Building Observability from Scratch
A practical guide to debugging LLM agents, with structured logging utilities, trace visualization components, and replay infrastructure you can steal for your own projects.
ACTIVE_PHASE: PALLAV // 18 MIN READ
Last month I shipped an agent that helps sales teams query their CRM data. It worked great in testing. In production, it started generating SQL queries against tables that didn't exist, confidently formatting the empty results into polished summaries, and sending them to users. No errors anywhere. HTTP 200s across the board. The logs showed clean, successful completions.
It took me four hours to find the root cause. The agent's context window had silently dropped the database schema after a long conversation, so it started hallucinating table names. The tool call succeeded (the query ran, returned zero rows), and the agent interpreted "no results" as "no sales this quarter." A VP got a report showing zero revenue.
That experience broke something in my brain about how I approach agent reliability. I had structured logging. I had error tracking. None of it caught a failure that wasn't an error -- it was an agent making a reasonable decision from bad information.
Why Agent Bugs Don't Look Like Normal Bugs
Traditional software fails loudly. An uncaught exception, a 500 status code, a type error -- you get a stack trace pointing at the line. Agent failures are different. The system behaves exactly as designed at every individual step, but the emergent behavior is wrong.
Here's my taxonomy of agent failure modes, built from a year of production incidents:
| Failure Mode | What Happens | Why Logs Miss It |
|---|---|---|
| Context drift | Agent loses critical info as context window fills | Each step logs correctly; the missing context is invisible |
| Tool misrouting | Agent picks the wrong tool for the task | Tool call succeeds -- wrong tool, right execution |
| Hallucinated parameters | Agent invents plausible but wrong arguments | Arguments look valid; no schema validation failure |
| Reasoning collapse | Agent's chain-of-thought becomes circular or contradictory | Each thought is logged but no one reads 200 lines of reasoning |
| Silent degradation | Response quality drops gradually over a conversation | No single step is wrong -- quality is a gradient |
| Goal drift | Agent subtly reinterprets the original task | The agent is solving a problem, just not the problem |
Notice the pattern: every failure mode involves steps that succeed individually but compose into something broken. You need to see the trajectory, not individual frames.
The Mental Model: Agent State Machine
Before writing any debugging infrastructure, I needed a model for what "correct" agent behavior looks like. Every agent I've built or debugged follows the same lifecycle: IDLE -> THINKING -> ACTING -> OBSERVING -> (loop or DONE). Formalizing this turned vague bug reports ("the agent did something weird") into precise diagnoses ("it transitioned from THINKING to DONE without going through OBSERVING -- it ignored the tool result").
The value of this model is that it makes illegal transitions detectable at runtime. I encode it in the tracer as a transition table -- if the agent tries to go from THINKING to DONE when there are pending tool calls, or from ACTING to ACTING without an OBSERVING step in between, the tracer flags it immediately:
type AgentState = 'idle' | 'thinking' | 'acting' | 'observing' | 'done' | 'error';
// Valid transitions: from -> allowed next states
const TRANSITIONS: Record<AgentState, AgentState[]> = {
idle: ['thinking'],
thinking: ['acting', 'done', 'error'], // done = no tool calls needed
acting: ['observing', 'error'], // must observe tool results
observing: ['thinking', 'error'], // back to reasoning
done: [], // terminal
error: [], // terminal
};
export class AgentStateMachine {
private state: AgentState = 'idle';
private history: { from: AgentState; to: AgentState; spanId: string }[] = [];
transition(to: AgentState, spanId: string): { valid: boolean; violation?: string } {
const allowed = TRANSITIONS[this.state];
if (!allowed.includes(to)) {
const violation = `Illegal transition: ${this.state} -> ${to} at span ${spanId}. ` +
`Allowed from ${this.state}: [${allowed.join(', ')}]`;
this.history.push({ from: this.state, to, spanId }); // record the rejected attempt for the audit trail
return { valid: false, violation };
}
this.history.push({ from: this.state, to, spanId });
this.state = to;
return { valid: true };
}
// Detect loops: the last three-state pattern repeating back-to-back (e.g. A->B->C->A->B->C)
detectLoop(): { looping: boolean; pattern?: string } {
if (this.history.length < 6) return { looping: false };
const recent = this.history.slice(-6).map(h => h.to).join('->');
const half = this.history.slice(-3).map(h => h.to).join('->');
if (recent === `${half}->${half}`) {
return { looping: true, pattern: half };
}
return { looping: false };
}
}
In the CRM bug, this would have caught the agent transitioning from THINKING to DONE at turn 31 -- exactly when the schema dropped out of context. The agent "decided" it was done because it couldn't see the tools anymore, not because it had an answer. The state machine makes this visible: a THINKING -> DONE transition with pending tool context is a red flag.
The Structured Logging Foundation
With the mental model in place, the tracer captures agent execution as a tree of spans (not a flat list). Each span knows its parent, its type, and carries typed metadata. The state machine is wired into every span creation, so illegal transitions are caught live.
import { randomUUID } from 'crypto';
type SpanKind = 'agent' | 'llm_call' | 'tool_call' | 'reasoning' | 'observation' | 'error';
interface Span {
id: string;
traceId: string;
parentId: string | null;
kind: SpanKind;
name: string;
startTime: number;
endTime?: number;
status: 'running' | 'ok' | 'error';
input?: unknown;
output?: unknown;
meta: Record<string, unknown>;
children: Span[];
}
export class AgentTracer {
private spans: Map<string, Span> = new Map();
private traceId: string;
private stateMachine = new AgentStateMachine();
private onFlush?: (spans: Span[]) => void;
constructor(opts?: { traceId?: string; onFlush?: (spans: Span[]) => void }) {
this.traceId = opts?.traceId ?? randomUUID();
this.onFlush = opts?.onFlush;
}
startSpan(kind: SpanKind, name: string, parentId?: string | null, input?: unknown): Span {
const span: Span = {
id: randomUUID(),
traceId: this.traceId,
parentId: parentId ?? null,
kind, name, input,
startTime: Date.now(),
status: 'running',
meta: {},
children: [],
};
// Validate state transition
const agentState = spanKindToState(kind);
if (agentState) {
const result = this.stateMachine.transition(agentState, span.id);
if (!result.valid) {
span.meta.stateViolation = result.violation;
console.warn(`[AgentTracer] ${result.violation}`);
}
}
this.spans.set(span.id, span);
if (parentId) {
this.spans.get(parentId)?.children.push(span);
}
return span;
}
endSpan(id: string, output?: unknown, status: 'ok' | 'error' = 'ok'): void {
const span = this.spans.get(id);
if (!span) return;
span.endTime = Date.now();
span.output = output;
span.status = status;
}
// Capture context window state at every LLM call
recordContextWindow(spanId: string, messages: { role: string; tokens: number }[]): void {
const span = this.spans.get(spanId);
if (!span) return;
const total = messages.reduce((sum, m) => sum + m.tokens, 0);
span.meta.contextWindowUtilization = total / 128_000; // adjust per model
span.meta.contextMessages = messages.length;
span.meta.contextTokens = total;
}
getTrace(): Span[] {
return Array.from(this.spans.values()).filter(s => s.parentId === null);
}
flush(): void { this.onFlush?.(Array.from(this.spans.values())); }
}
function spanKindToState(kind: SpanKind): AgentState | null {
const map: Partial<Record<SpanKind, AgentState>> = {
llm_call: 'thinking', tool_call: 'acting', observation: 'observing',
};
return map[kind] ?? null;
}The context window field matters most
That contextWindowUtilization field has caught more bugs than everything else combined. When it crosses 0.8, you're in the danger zone for context drift. When it crosses 0.9, the agent should compress or summarize -- never silently truncate.
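As a concrete sketch of that policy, a small helper can map utilization to an action. The 0.8/0.9 cutoffs mirror the rule of thumb above, and the `contextAction` name is illustrative, not part of the tracer:

```typescript
// Sketch: map context window utilization to an action. Tune the
// thresholds per model and workload.
type ContextAction = 'ok' | 'warn' | 'compress';

function contextAction(usedTokens: number, windowTokens: number): ContextAction {
  const utilization = usedTokens / windowTokens;
  if (utilization >= 0.9) return 'compress'; // summarize -- never silently truncate
  if (utilization >= 0.8) return 'warn';     // danger zone for context drift
  return 'ok';
}

// e.g. contextAction(120_000, 128_000) -> 'compress' (~94% utilization)
```

Wire the 'warn' case to your alerting and the 'compress' case to a summarization step before the next LLM call.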
Wiring it into an agent loop is straightforward. The tracer wraps every LLM call and tool call, recording context window state at each step:
const tracer = new AgentTracer({
onFlush: (spans) => saveToDatabase(spans),
});
async function agentLoop(query: string): Promise<string> {
const rootSpan = tracer.startSpan('agent', 'sales-query-agent', null, { query });
try {
let messages = [systemPrompt, { role: 'user', content: query }];
for (let turn = 0; turn < 10; turn++) {
const llmSpan = tracer.startSpan('llm_call', `turn-${turn}`, rootSpan.id);
tracer.recordContextWindow(llmSpan.id, messages.map(m => ({
role: m.role, tokens: estimateTokens(m.content),
})));
const response = await llm.chat(messages);
tracer.endSpan(llmSpan.id, response.content);
if (!response.toolCalls?.length) {
tracer.endSpan(rootSpan.id, response.content);
tracer.flush();
return response.content;
}
// Echo the assistant turn (with its tool calls) back into context before appending tool results
messages.push({ role: 'assistant', content: response.content, toolCalls: response.toolCalls });
for (const call of response.toolCalls) {
const toolSpan = tracer.startSpan('tool_call', call.name, rootSpan.id, call.arguments);
try {
const result = await executeTool(call.name, call.arguments);
tracer.endSpan(toolSpan.id, result);
messages.push({ role: 'tool', content: JSON.stringify(result) });
} catch (err) {
tracer.endSpan(toolSpan.id, { error: String(err) }, 'error');
}
}
}
tracer.endSpan(rootSpan.id, 'max iterations', 'error');
tracer.flush();
return 'Unable to complete the task.';
} catch (err) {
tracer.endSpan(rootSpan.id, { error: String(err) }, 'error');
tracer.flush();
throw err;
}
}
The estimateTokens function used above is a rough heuristic -- split on whitespace and divide the word count by 0.75 (English averages roughly 0.75 words per token, i.e. ~1.33 tokens per word). For exact counts, use tiktoken or your model provider's tokenizer. The heuristic is good enough for context window monitoring since you're watching for 80%+ utilization, not counting exact tokens.
// Good enough for monitoring. Use tiktoken for exact counts.
function estimateTokens(text: string): number {
if (!text) return 0;
return Math.ceil(text.split(/\s+/).length / 0.75);
}
Don't Log Full Prompts in Production
This deserves its own section because I've seen it go wrong at multiple companies. In development, log everything -- full prompts, full responses, full tool results. In production, log summaries and token counts only.
Three reasons: (1) User prompts contain PII. Logging them to your observability backend means your Datadog or Elastic instance now holds customer data subject to GDPR/CCPA, which your compliance team probably didn't sign off on. (2) Full prompts at scale are expensive to store. At roughly 4 bytes per token, a busy agent generating 100K traces/day with 128K-token context windows produces ~50GB of raw prompt data per day -- and if you snapshot the full context at every turn instead of once per trace, multiply that by the turn count and you're quickly into terabytes. (3) Full prompt logs bloat your log pipeline and slow down search for everyone.
// Use a log level flag, not a blanket policy
const LOG_LEVEL = process.env.AGENT_LOG_LEVEL ?? 'summary'; // full | summary | minimal
function logLLMCall(span: Span, messages: Message[]) {
if (LOG_LEVEL === 'full') {
span.input = messages; // dev only
} else if (LOG_LEVEL === 'summary') {
span.input = {
messageCount: messages.length,
roles: messages.map(m => m.role),
lastUserMessage: messages.findLast(m => m.role === 'user')?.content?.slice(0, 100),
tokenEstimate: messages.reduce((sum, m) => sum + estimateTokens(m.content), 0),
};
}
// minimal: no input logged at all
}
Existing Tools and Their Design Trade-offs
I've used the major agent observability tools in production. Here's my assessment, but with an important caveat: the "gaps" I describe are often deliberate design choices, not oversights.
| Tool | Good At | Intentional Trade-off | Best For |
|---|---|---|---|
| LangSmith | Chain tracing, eval datasets, prompt versioning | Flat call list over tree view -- stays framework-agnostic | LangChain-native projects |
| Arize Phoenix | Model-agnostic tracing, drift detection | Generic span model over agent-specific patterns | ML teams with existing Arize infra |
| LangFuse | Open source, clean trace UI, cost tracking | Broad compatibility over deep agent introspection | Teams that want self-hosted |
| Braintrust | Eval-first workflow, scoring | Eval focus over live debugging | Systematic evaluation pipelines |
| Custom (this post) | Context inspection, replay, state validation | Setup cost, maintenance burden, team onboarding | Teams with unique agent architectures |
These tools treat agent execution as flat LLM call lists partly by design -- to support any model, any framework, any orchestration pattern. That's a reasonable trade-off. The cost of custom tooling is real: you're maintaining debugging infrastructure instead of using it. For most teams, LangSmith or LangFuse covers 80% of what you need. The custom approach is worth it when you have agent architectures that don't map cleanly to call-list models -- multi-agent systems, agents with persistent state, or the kind of context-window bugs I described above.
Replay: Time-Travel Debugging for Agents
The most powerful technique I've found is replay -- taking a recorded trace and re-executing it with modifications. Want to know what happens if the database tool returns different data? Replay with a mocked tool response. Want to test a prompt change? Replay with the new system prompt and compare.
The core idea: walk the recorded span list, replaying each step from the original trace. When you override a tool result, set a diverged flag. After divergence, LLM calls must be re-executed (the model now sees different context), but tool calls can still use overrides or originals. Non-diverged steps replay from recorded data without hitting any APIs.
// Helpers the replay relies on. flattenSpans walks the span tree depth-first;
// llm.chat and executeTool are the same clients used in the agent loop.
function flattenSpans(roots: Span[]): Span[] {
  return roots.flatMap(s => [s, ...flattenSpans(s.children)]);
}
interface SpanDiff { spanId: string; field: string; original: unknown; replayed: unknown }
interface ReplayOptions {
toolOverrides?: Map<string, unknown>; // spanId -> mock result
systemPromptOverride?: string; // replace system prompt and force re-execution
breakpointSpanId?: string; // pause at this span
onStep?: (span: Span, state: ReplayState) => Promise<'continue' | 'stop'>;
}
interface ReplayState {
messages: Array<{ role: string; content: string }>;
currentSpanIndex: number;
totalSpans: number;
divergedFromOriginal: boolean;
}
export async function replayTrace(
originalSpans: Span[],
options: ReplayOptions = {},
): Promise<{ result: string; diffs: SpanDiff[] }> {
const flatSpans = flattenSpans(originalSpans);
const diffs: SpanDiff[] = [];
let messages: Array<{ role: string; content: string }> = [];
let diverged = false;
let replayedFirstLLM = false;
// If system prompt is overridden, we diverge from the start
if (options.systemPromptOverride) {
messages.push({ role: 'system', content: options.systemPromptOverride });
diverged = true;
}
for (let i = 0; i < flatSpans.length; i++) {
const original = flatSpans[i];
if (options.breakpointSpanId === original.id) break;
// Step-through callback: lets UI pause between spans
if (options.onStep) {
const action = await options.onStep(original, {
messages,
currentSpanIndex: i,
totalSpans: flatSpans.length,
divergedFromOriginal: diverged,
});
if (action === 'stop') break;
}
// Re-run the first LLM call with the overridden system prompt.
// Uses a flag instead of i === 0 because the first span may be
// an 'agent' or 'reasoning' span, not an llm_call.
if (options.systemPromptOverride && !replayedFirstLLM && original.kind === 'llm_call') {
replayedFirstLLM = true;
const response = await llm.chat(messages);
// Feed the fresh response back into context so later steps see it
messages.push({ role: 'assistant', content: response.content });
diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: response.content });
continue;
}
if (original.kind === 'tool_call') {
const override = options.toolOverrides?.get(original.id);
if (override !== undefined) {
messages.push({ role: 'tool', content: JSON.stringify(override) });
diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: override });
diverged = true;
continue;
}
}
if (original.kind === 'llm_call' && diverged) {
// After divergence, must re-run -- cached output reflects old context
const response = await llm.chat(messages);
// The replayed response becomes context for subsequent steps
messages.push({ role: 'assistant', content: response.content });
if (response.content !== original.output) {
diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: response.content });
}
continue;
}
// No divergence: replay from recorded data
if (original.output) {
messages.push({
role: original.kind === 'tool_call' ? 'tool' : 'assistant',
content: String(original.output),
});
}
}
return { result: messages.at(-1)?.content ?? '', diffs };
}
Known limitation: span ordering
The replay function flattens the span tree with a depth-first traversal. If the original trace has interleaved tool calls from different agent turns (e.g., parallel tool execution), the reconstructed messages array may not match the original order. For agents with parallel tool calling, you need to reconstruct messages by timestamp rather than tree position. This implementation handles sequential tool execution correctly, which covers most single-agent architectures.
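The ordering half of that fix is straightforward to sketch, assuming the tracer's Span shape trimmed to the fields involved (the re-execution races are the hard part and aren't addressed here):

```typescript
// Sketch: flatten by wall-clock start time instead of tree position,
// so parallel tool calls replay in the order they actually ran.
interface TimedSpan { id: string; startTime: number; children: TimedSpan[] }

function flattenByTimestamp(roots: TimedSpan[]): TimedSpan[] {
  const all: TimedSpan[] = [];
  const walk = (s: TimedSpan) => { all.push(s); s.children.forEach(walk); };
  roots.forEach(walk);
  // Stable sort: earlier-starting spans replay earlier, regardless of
  // where they sit in the span tree.
  return all.sort((a, b) => a.startTime - b.startTime);
}
```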
The onStep callback turns this into a step-through debugger. In the trace viewer UI, wire it to a "next step" button that resolves the promise -- IDE-style stepping through an agent's execution history.
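One way to wire that up -- a sketch, with `createStepper` as hypothetical glue between the UI button and `replayTrace`:

```typescript
// Sketch: a pausable onStep callback. replayTrace awaits onStep before each
// span; the returned promise resolves only when the UI calls next().
function createStepper() {
  let resolveNext: (() => void) | null = null;
  let stopped = false;
  return {
    next(): void { resolveNext?.(); resolveNext = null; }, // wire to "next step" button
    stop(): void { stopped = true; resolveNext?.(); },     // wire to "stop" button
    onStep: async (): Promise<'continue' | 'stop'> => {
      if (stopped) return 'stop';
      await new Promise<void>(res => { resolveNext = res; });
      return stopped ? 'stop' : 'continue';
    },
  };
}
```

Pass `stepper.onStep` as `options.onStep` (it ignores the span/state arguments it receives) and call `stepper.next()` from the button handler.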
Trace Assertions
Structured traces enable automated assertions -- checks that run after every agent execution and flag problems before they reach users. I chose these six because they map directly to the failure taxonomy above: context drift (assertion 1), reasoning collapse and tool loops (2), fire-and-forget tool bugs (3), runaway agents (4, 5), and illegal state transitions (6). The thresholds are defaults; tune them per agent based on your workload.
interface AssertionConfig {
maxContextUtilization: number; // default: 0.9
maxDuplicateToolCalls: number; // default: 3
maxLLMCalls: number; // default: 10 (a 15-step workflow agent needs higher)
maxDurationMs: number; // default: 30_000
}
const DEFAULTS: AssertionConfig = {
maxContextUtilization: 0.9,
maxDuplicateToolCalls: 3,
maxLLMCalls: 10,
maxDurationMs: 30_000,
};
export function runAssertions(
spans: Span[],
config: Partial<AssertionConfig> = {},
): AssertionResult[] {
const cfg = { ...DEFAULTS, ...config };
const flat = flattenSpans(spans);
const results: AssertionResult[] = [];
// 1. Context window headroom
const maxCtx = Math.max(
...flat.filter(s => s.meta.contextWindowUtilization)
.map(s => s.meta.contextWindowUtilization as number), 0
);
results.push({
name: 'context-window-headroom',
passed: maxCtx < cfg.maxContextUtilization,
message: `Peak context: ${Math.round(maxCtx * 100)}%` +
(maxCtx >= cfg.maxContextUtilization ? ' -- high risk of context drift' : ''),
severity: 'error',
});
// 2. No tool call loops (same tool + same args repeated)
const toolSigs = flat.filter(s => s.kind === 'tool_call')
.map(s => `${s.name}:${JSON.stringify(s.input)}`);
const maxDupes = Math.max(
...Array.from(new Set(toolSigs)).map(sig => toolSigs.filter(s => s === sig).length), 0
);
results.push({
name: 'no-tool-loops',
passed: maxDupes < cfg.maxDuplicateToolCalls,
message: maxDupes >= cfg.maxDuplicateToolCalls
? `Tool called ${maxDupes}x with identical args -- possible loop` : 'No loops detected',
severity: 'error',
});
// 3. No orphaned tool calls (started but never completed)
const orphaned = flat.filter(s => s.kind === 'tool_call' && (!s.endTime || s.status === 'running'));
results.push({
name: 'no-orphaned-tools',
passed: orphaned.length === 0,
message: orphaned.length > 0 ? `${orphaned.length} tool calls never completed` : 'All tools completed',
severity: 'error',
});
// 4. LLM call budget
const llmCount = flat.filter(s => s.kind === 'llm_call').length;
results.push({
name: 'llm-call-budget',
passed: llmCount <= cfg.maxLLMCalls,
message: `${llmCount} LLM calls` + (llmCount > cfg.maxLLMCalls ? ` -- exceeds budget of ${cfg.maxLLMCalls}` : ''),
severity: 'warning',
});
// 5. Execution time
const root = flat.find(s => s.kind === 'agent');
const duration = root?.endTime ? root.endTime - root.startTime : Infinity;
results.push({
name: 'execution-time',
passed: duration < cfg.maxDurationMs,
message: `${Math.round(duration / 1000)}s` +
(duration >= cfg.maxDurationMs ? ` -- exceeds ${cfg.maxDurationMs / 1000}s budget` : ''),
severity: 'warning',
});
// 6. State machine violations (from tracer)
const violations = flat.filter(s => s.meta.stateViolation);
results.push({
name: 'no-state-violations',
passed: violations.length === 0,
message: violations.length > 0
? `${violations.length} illegal state transitions detected`
: 'All state transitions valid',
severity: 'error',
});
return results;
}
I run these in two places: as a post-execution check in development (fail the test if any assertion fires), and as an async monitor in production (pipe to alerting). The loop detection alone has caught three production incidents where an agent retried a failing tool with the same parameters.
Note the configurable thresholds. A CRM query agent with 3-4 tool calls needs different budgets than a research agent that legitimately makes 20 LLM calls across multiple sources. Default to strict, then loosen per agent when you have data showing the higher thresholds are expected behavior.
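In CI, the two-place split reduces to a single gate over the assertion results -- a sketch assuming the AssertionResult shape used above (`verdict` is my name, not part of the assertion module):

```typescript
// Sketch: collapse assertion results into one verdict for a CI gate or
// production alert: error-severity failures block, warnings only log.
interface AssertionResult {
  name: string;
  passed: boolean;
  message: string;
  severity: 'error' | 'warning';
}

function verdict(results: AssertionResult[]): 'pass' | 'warn' | 'fail' {
  if (results.some(r => !r.passed && r.severity === 'error')) return 'fail';
  if (results.some(r => !r.passed)) return 'warn'; // only warning-severity failures
  return 'pass';
}
```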
The Debugging Dashboard
After several iterations, I converged on a three-panel layout: trace tree (where am I?), detail pane (what happened here?), and context inspector (what did the agent know?). The context window panel on the right is the piece most tools are missing. In the CRM bug, it would have immediately shown the schema message being truncated at turn 31.
I won't include the full TraceViewer React component here -- it's a standard collapsible tree renderer with span-kind color coding, duration labels, and inline JSON inspection. The interesting parts are the context window warning badges (red at >90%, amber at >80%) and the state violation markers.
Log Entry Schema
Each entry type serves a different debugging purpose. The context.snapshot entry is the one I wish every agent framework emitted by default -- it's the fastest way to diagnose context drift.
| Entry Type | Key Fields | Debugging Use |
|---|---|---|
| agent.start | traceId, query, availableTools[] | Reconstruct initial conditions |
| llm.request | spanId, messageCount, model, temperature, contextUtil | What the model saw (summary, not full prompt) |
| llm.response | spanId, content summary, toolCalls[], usage, latency | What the model produced and at what cost |
| tool.call | spanId, toolName, arguments, parentSpanId | Verify the agent chose the right tool |
| tool.result | spanId, result summary, duration, status | Check tool output format and content |
| context.snapshot | spanId, tokenCounts, messageCount, utilization, truncatedMessages[] | Detect context drift before it causes problems |
| state.transition | spanId, from, to, valid, violation? | Catch illegal state machine transitions |
| agent.end | traceId, result summary, totalDuration, assertionResults[] | Overall execution summary with automated checks |
Patterns That Pay Off
1. Record everything, display selectively
Capture full execution traces always. Build your UI to show summaries by default and let developers drill into details on demand. Storage is cheap; re-running a failing agent to capture missing data is not.
2. Make context window a first-class metric
Track context window utilization like you track CPU usage. Set alerts at 80%. Log what got truncated. Every context drift bug I've seen was predictable from utilization metrics.
3. Diff traces, not outputs
When comparing a working run to a broken run, don't just diff the final output. Diff the traces span-by-span. The first span where inputs or outputs diverge points at the root cause.
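A minimal version of that span-by-span diff, as a sketch (`LiteSpan` trims the Span shape to the compared fields):

```typescript
// Sketch: walk two flattened traces in parallel and return the index of the
// first span whose name, input, or output diverges (-1 if the traces match).
interface LiteSpan { name: string; input?: unknown; output?: unknown }

function firstDivergence(a: LiteSpan[], b: LiteSpan[]): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i].name !== b[i].name ||
        JSON.stringify(a[i].input) !== JSON.stringify(b[i].input) ||
        JSON.stringify(a[i].output) !== JSON.stringify(b[i].output)) {
      return i; // start the root-cause hunt at this span
    }
  }
  return a.length === b.length ? -1 : n; // one trace is a prefix of the other
}
```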
4. Build replay into your pipeline from day one
Retrofitting replay into an existing agent is painful. The trace format, the tool abstraction layer, the deterministic execution mode -- these are much easier to build at the start than to bolt on at 2 AM during an incident.
Open Problems I Haven't Solved
Rather than a generic roadmap, here are the specific problems I'm still working through:
- Replay with parallel tool calls. The current replay function assumes sequential tool execution. When agents call multiple tools in parallel (common with function-calling models), the span ordering becomes ambiguous. I've tried timestamp-based reconstruction but it introduces race conditions in the replay that weren't in the original execution.
- Cross-agent trace linking. When Agent A delegates to Agent B, the traces are separate. Parent-child linking across agent boundaries sounds simple, but the agents often run in different processes or even different services. Propagating a trace ID through tool-call boundaries without coupling the agents is unsolved in my current setup.
- Anomaly detection without labeled data. I can detect hard failures (assertion violations, state machine errors). Detecting soft failures -- response quality that's subtly worse than usual -- requires baselines I don't have. Statistical approaches (comparing context utilization distributions across runs) show promise but generate too many false positives to be useful in alerting.
- Cost attribution that's actually useful. I track total token spend per trace, but what I really want is cost-per-user-query broken down by which tool calls were productive vs. wasted. An agent that takes 3 attempts to get a SQL query right costs 3x -- but my current cost tracking just shows the total, not that the first two attempts were thrown away.
The code in this post is extracted from production systems. The AgentTracer, the state machine, the assertions, and the replay function are designed to be dropped into any TypeScript agent codebase with minimal modification. Use them as a starting point, then adapt the thresholds and state transitions to match your agent's actual behavior.