Tags: llm, agents, observability
25 min read

The LLM Agent Debugging Problem: Building Observability from Scratch

A practical guide to debugging LLM agents, with structured logging utilities, trace visualization components, and replay infrastructure you can steal for your own projects.

By Pallav

Picture an agent that answers questions by querying a database. It works great in testing. Then, running for real, it starts generating SQL queries against tables that don't exist, confidently formatting the empty results into polished summaries, and handing them back to whoever asked. No errors anywhere. HTTP 200s across the board. The logs show clean, successful completions.

The root cause hides for hours. The agent's context window has silently dropped the database schema after a long conversation, so it starts hallucinating table names. The tool call succeeds (the query runs, returns zero rows), and the agent interprets "no results" as "nothing happened this period." What comes back is a confident, polished summary built on a table that never existed.

This kind of failure changes how you think about agent reliability. Structured logging? Present. Error tracking? Present. None of it catches a failure that isn't an error -- it's an agent making a reasonable decision from bad information.


Why Agent Bugs Don't Look Like Normal Bugs

Traditional software fails loudly. An uncaught exception, a 500 status code, a type error -- you get a stack trace pointing at the line. Agent failures are different. The system behaves exactly as designed at every individual step, but the emergent behavior is wrong.

Here's a taxonomy of agent failure modes, drawn from the ways these systems tend to break:

| Failure Mode | What Happens | Why Logs Miss It |
|---|---|---|
| Context drift | Agent loses critical info as context window fills | Each step logs correctly; the missing context is invisible |
| Tool misrouting | Agent picks the wrong tool for the task | Tool call succeeds -- wrong tool, right execution |
| Hallucinated parameters | Agent invents plausible but wrong arguments | Arguments look valid; no schema validation failure |
| Reasoning collapse | Agent's chain-of-thought becomes circular or contradictory | Each thought is logged but no one reads 200 lines of reasoning |
| Silent degradation | Response quality drops gradually over a conversation | No single step is wrong -- quality is a gradient |
| Goal drift | Agent subtly reinterprets the original task | The agent is solving a problem, just not the problem |

Notice the pattern: every failure mode involves steps that succeed individually but compose into something broken. You need to see the trajectory, not individual frames.

[Figure: agent execution trace. Steps: LLM Call (prompt + context) -> Reasoning (chain-of-thought) -> Tool Call sql_query() -> Observation [] (empty rows) -> LLM Call #2 (interpret result) -> Reasoning "nothing in Q3" -> Response "zero results". Bug: context lost schema; hallucinated table -> 0 rows -> wrong answer.]

What traditional logs show:
OK [INFO] LLM call completed (1.2s, 3847 tokens)
OK [INFO] Tool sql_query executed successfully (0.3s)
OK [INFO] Response generated (0.8s, 512 tokens)
No errors. No warnings. All green. Completely wrong output.
An agent execution trace showing how a context drift bug produces correct-looking logs at every step while generating a completely wrong answer.

The Mental Model: Agent State Machine

Before writing any debugging infrastructure, I needed a model for what "correct" agent behavior looks like. Every agent I've built or debugged follows the same lifecycle: IDLE -> THINKING -> ACTING -> OBSERVING -> back to THINKING, looping until THINKING exits to DONE. Formalizing this turned vague bug reports ("the agent did something weird") into precise diagnoses ("it went from ACTING straight back to THINKING without going through OBSERVING -- it ignored the tool result").

[Figure: agent state machine diagram. States: IDLE, THINKING, ACTING, OBSERVING, DONE, ERROR. Edge labels: query, tool call, execute, no tool call, exception, tool error, result -> next turn.]
Common bugs: stuck in THINKING -> ACTING loop (no progress) | skips OBSERVING (ignores tool results) | THINKING -> DONE (premature completion)
Healthy pattern: IDLE -> [THINKING -> ACTING -> OBSERVING]* -> THINKING -> DONE (exits via reasoning, not error)
Agent lifecycle state machine. Most production bugs involve unexpected transitions -- skipping observation, looping between thinking and acting, or exiting prematurely.

The value of this model is that it makes illegal transitions detectable at runtime. I encode it in the tracer as a transition table: if the agent tries to go from ACTING to ACTING, or from ACTING back to THINKING, without an OBSERVING step in between, the tracer flags it immediately. THINKING -> DONE is the normal exit and therefore legal in the table, so flagging it when tool calls are still pending takes a separate check on top of the table.

Two things to look for in the code below: the TRANSITIONS map (the entire correctness model lives in those six entries), and detectLoop, which compares the last three transitions against the three before them to catch the most common pathology -- the same three-state cycle repeating without progress.

lib/state-machine.ts
typescript
type AgentState = 'idle' | 'thinking' | 'acting' | 'observing' | 'done' | 'error';

// Valid transitions: from -> allowed next states
const TRANSITIONS: Record<AgentState, AgentState[]> = {
  idle:      ['thinking'],
  thinking:  ['acting', 'done', 'error'],  // done = no tool calls needed
  acting:    ['observing', 'error'],        // must observe tool results
  observing: ['thinking', 'error'],         // back to reasoning
  done:      [],                            // terminal
  error:     [],                            // terminal
};

export class AgentStateMachine {
  private state: AgentState = 'idle';
  private history: { from: AgentState; to: AgentState; spanId: string }[] = [];

  transition(to: AgentState, spanId: string): { valid: boolean; violation?: string } {
    const allowed = TRANSITIONS[this.state];

    if (!allowed.includes(to)) {
      const violation = `Illegal transition: ${this.state} -> ${to} at span ${spanId}. ` +
        `Allowed from ${this.state}: [${allowed.join(', ')}]`;
      // The attempt is recorded, but this.state is not advanced, so later
      // transitions are validated against the last legal state.
      this.history.push({ from: this.state, to, spanId });
      return { valid: false, violation };
    }

    this.history.push({ from: this.state, to, spanId });
    this.state = to;
    return { valid: true };
  }

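  // Detect loops: the last three transitions exactly repeat the three before them.
  // Heuristic: a healthy agent that takes two consecutive think/act/observe
  // turns also matches, so treat a hit as a prompt to inspect, not proof of a bug.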
  detectLoop(): { looping: boolean; pattern?: string } {
    if (this.history.length < 6) return { looping: false };

    const recent = this.history.slice(-6).map(h => h.to).join('->');
    const half = this.history.slice(-3).map(h => h.to).join('->');
    if (recent === `${half}->${half}`) {
      return { looping: true, pattern: half };
    }
    return { looping: false };
  }
}

In the schema-drift bug above, the transition to watch is THINKING -> DONE at turn 31 -- exactly when the schema dropped out of context. The agent "decided" it was done because it could no longer see the context it needed, not because it had an answer. The transition table alone permits THINKING -> DONE (it is the normal exit), so surfacing this case takes one extra check alongside the table: a THINKING -> DONE transition with pending tool context is a red flag.
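
The red-flag check described above can be sketched as a small guard that runs next to the transition table. This is a minimal sketch, not part of the tracer shown in this post: `checkPrematureDone`, `DoneGuardInput`, and the `pendingToolCalls` counter are hypothetical names, and it assumes the caller tracks how many tool calls have been requested but not yet observed.

```typescript
// Hypothetical guard: flags THINKING -> DONE while tool context is pending.
// The transition table treats THINKING -> DONE as a legal exit, so this
// check is layered on top of the table rather than encoded in it.
type GuardState = 'idle' | 'thinking' | 'acting' | 'observing' | 'done' | 'error';

interface DoneGuardInput {
  from: GuardState;
  to: GuardState;
  pendingToolCalls: number; // assumed: requested but not yet observed
  turn: number;
}

function checkPrematureDone(input: DoneGuardInput): string | null {
  const { from, to, pendingToolCalls, turn } = input;
  if (from === 'thinking' && to === 'done' && pendingToolCalls > 0) {
    return `Premature completion at turn ${turn}: THINKING -> DONE with ` +
      `${pendingToolCalls} pending tool call(s)`;
  }
  return null; // nothing suspicious about this transition
}
```

A non-null return value could be stored on the span next to `stateViolation`, so the same assertion that counts state violations picks it up.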


The Structured Logging Foundation

The span tree and the tracer

With the mental model in place, the tracer captures agent execution as a tree of spans (not a flat list). Each span knows its parent, its type, and carries typed metadata. The state machine is wired into every span creation, so illegal transitions are caught live.

Two methods are doing the load-bearing work below. startSpan is where the state machine is enforced -- every new span checks the transition is legal before being recorded. recordContextWindow is the small method that's caught more production bugs than everything else combined: it stores per-call utilization on the span itself, so context drift becomes a queryable property of the trace rather than something you have to reconstruct after the fact. Treat the contextWindowUtilization field as a first-class metric -- crossing 0.8 is the danger zone, and crossing 0.9 should trigger compression or summarization, never silent truncation.

lib/agent-logger.ts
typescript
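import { randomUUID } from 'crypto';
import { AgentStateMachine } from './state-machine';

// AgentState is not exported from state-machine.ts, so derive it from the
// transition() signature instead of redeclaring the union here.
type AgentState = Parameters<AgentStateMachine['transition']>[0];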

type SpanKind = 'agent' | 'llm_call' | 'tool_call' | 'reasoning' | 'observation' | 'error';

interface Span {
  id: string;
  traceId: string;
  parentId: string | null;
  kind: SpanKind;
  name: string;
  startTime: number;
  endTime?: number;
  status: 'running' | 'ok' | 'error';
  input?: unknown;
  output?: unknown;
  meta: Record<string, unknown>;
  children: Span[];
}

export class AgentTracer {
  private spans: Map<string, Span> = new Map();
  private traceId: string;
  private stateMachine = new AgentStateMachine();
  private onFlush?: (spans: Span[]) => void;

  constructor(opts?: { traceId?: string; onFlush?: (spans: Span[]) => void }) {
    this.traceId = opts?.traceId ?? randomUUID();
    this.onFlush = opts?.onFlush;
  }

  startSpan(kind: SpanKind, name: string, parentId?: string | null, input?: unknown): Span {
    const span: Span = {
      id: randomUUID(),
      traceId: this.traceId,
      parentId: parentId ?? null,
      kind, name, input,
      startTime: Date.now(),
      status: 'running',
      meta: {},
      children: [],
    };

    // ── LOOK HERE ── State machine enforcement.
    // Every new span checks the transition is legal before being recorded.
    // Illegal transitions are flagged on the span itself (not thrown) so the
    // trace stays intact and the violation is queryable after the fact.
    const agentState = spanKindToState(kind);
    if (agentState) {
      const result = this.stateMachine.transition(agentState, span.id);
      if (!result.valid) {
        span.meta.stateViolation = result.violation;
        console.warn(`[AgentTracer] ${result.violation}`);
      }
    }

    this.spans.set(span.id, span);
    if (parentId) {
      this.spans.get(parentId)?.children.push(span);
    }
    return span;
  }

  endSpan(id: string, output?: unknown, status: 'ok' | 'error' = 'ok'): void {
    const span = this.spans.get(id);
    if (!span) return;
    span.endTime = Date.now();
    span.output = output;
    span.status = status;
  }

  // ── LOOK HERE ── This is the bug-catcher.
  // Storing per-call utilization on the span turns context drift into a
  // queryable property of the trace. Every assertion, alert, and dashboard
  // panel that catches context-related bugs reads from this single field.
  recordContextWindow(spanId: string, messages: { role: string; tokens: number }[]): void {
    const span = this.spans.get(spanId);
    if (!span) return;
    const total = messages.reduce((sum, m) => sum + m.tokens, 0);
    span.meta.contextWindowUtilization = total / 128_000; // adjust per model
    span.meta.contextMessages = messages.length;
    span.meta.contextTokens = total;
  }

  getTrace(): Span[] {
    return Array.from(this.spans.values()).filter(s => s.parentId === null);
  }

  flush(): void { this.onFlush?.(Array.from(this.spans.values())); }
}

function spanKindToState(kind: SpanKind): AgentState | null {
  const map: Partial<Record<SpanKind, AgentState>> = {
    llm_call: 'thinking', tool_call: 'acting', observation: 'observing',
  };
  return map[kind] ?? null;
}
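
The two utilization thresholds mentioned earlier (0.8 as the danger zone, 0.9 as the point where compression or summarization should kick in) can be turned into an explicit decision. A minimal sketch; `contextAction` and `ContextAction` are illustrative names, and treating the thresholds as inclusive is an assumption.

```typescript
type ContextAction = 'ok' | 'warn' | 'compress';

// Maps a contextWindowUtilization value (0..1) to an action.
// Defaults follow the thresholds named in the text; tune them per model.
function contextAction(
  utilization: number,
  warnAt = 0.8,
  compressAt = 0.9,
): ContextAction {
  if (utilization >= compressAt) return 'compress'; // summarize, never truncate silently
  if (utilization >= warnAt) return 'warn';         // danger zone: surface it in the trace
  return 'ok';
}
```

Calling this right after `recordContextWindow` gives the agent loop a chance to compress before the model is invoked, rather than discovering the problem in the trace afterwards.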

Wiring it into the agent loop

The tracer is only useful if it's threaded through every LLM call and every tool call -- no exceptions. The loop below is the minimum viable wiring: a root agent span at the top, an llm_call span on every turn (with recordContextWindow called before the model is invoked, so utilization is captured against the prompt that actually went out), and a tool_call span around each tool execution. Two things to look for: the endSpan calls in both the success and exception paths (every span must close, or the trace tree is corrupt), and the recordContextWindow call inside the turn loop -- that single line is what makes context drift detectable in the first place.

lib/agent-loop.ts
typescript
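// Assumed to exist elsewhere in the app (not shown in this post):
// llm, executeTool, systemPrompt, saveToDatabase, estimateTokens.
import { AgentTracer } from './agent-logger';

const tracer = new AgentTracer({
  onFlush: (spans) => saveToDatabase(spans),
});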

async function agentLoop(query: string): Promise<string> {
  const rootSpan = tracer.startSpan('agent', 'data-query-agent', null, { query });

  try {
    let messages = [systemPrompt, { role: 'user', content: query }];

    for (let turn = 0; turn < 10; turn++) {
      const llmSpan = tracer.startSpan('llm_call', `turn-${turn}`, rootSpan.id);

      // ── LOOK HERE ── Capture context BEFORE the LLM call.
      // Recording utilization against the prompt that's about to be sent
      // (not the response that comes back) is what makes context drift
      // detectable -- you see exactly what the model saw at each turn.
      tracer.recordContextWindow(llmSpan.id, messages.map(m => ({
        role: m.role, tokens: estimateTokens(m.content),
      })));

      const response = await llm.chat(messages);
      tracer.endSpan(llmSpan.id, response.content);

      if (!response.toolCalls?.length) {
        tracer.endSpan(rootSpan.id, response.content);
        tracer.flush();
        return response.content;
      }

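      // Keep the assistant turn in the transcript so the next call can see
      // what it asked for (real chat APIs also expect the tool-call IDs here).
      messages.push({ role: 'assistant', content: response.content });

      for (const call of response.toolCalls) {
        const toolSpan = tracer.startSpan('tool_call', call.name, rootSpan.id, call.arguments);
        let toolMessage: string;
        try {
          const result = await executeTool(call.name, call.arguments);
          tracer.endSpan(toolSpan.id, result);
          toolMessage = JSON.stringify(result);
        } catch (err) {
          // ── LOOK HERE ── Every span MUST close, even on failure.
          // Orphaned spans corrupt the trace tree and break replay --
          // 'no-orphaned-tools' is a first-class assertion further down.
          tracer.endSpan(toolSpan.id, { error: String(err) }, 'error');
          // Surface the failure to the model instead of dropping it silently.
          toolMessage = JSON.stringify({ error: String(err) });
        }

        // Record the observation so the state machine sees ACTING -> OBSERVING.
        // Without this span, the next llm_call is flagged as an illegal
        // ACTING -> THINKING transition. Several tool calls in one turn still
        // trip the table as written (OBSERVING -> ACTING is not allowed).
        const obsSpan = tracer.startSpan('observation', call.name, rootSpan.id, toolMessage);
        tracer.endSpan(obsSpan.id, toolMessage);
        messages.push({ role: 'tool', content: toolMessage });
      }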
    }

    tracer.endSpan(rootSpan.id, 'max iterations', 'error');
    tracer.flush();
    return 'Unable to complete the task.';
  } catch (err) {
    tracer.endSpan(rootSpan.id, { error: String(err) }, 'error');
    tracer.flush();
    throw err;
  }
}

The estimateTokens helper used above is a one-liner: Math.ceil(text.split(/\s+/).length / 0.75). It's a rough heuristic -- English averages ~0.75 words per token (≈1.33 tokens per word) -- and that's good enough for context-window monitoring, since you're watching for 80%+ utilization, not counting exact tokens. For exact counts, swap in tiktoken or your model provider's tokenizer.
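
Written out as a function, the heuristic looks like the sketch below. The empty-string guard is an addition that the one-liner does not have: splitting an empty string on whitespace yields one empty element, which would otherwise be counted as a word.

```typescript
// Rough token estimate: word count divided by ~0.75 words per token.
// A heuristic for utilization monitoring only, not for billing or hard limits.
function estimateTokens(text: string): number {
  const trimmed = text.trim();
  if (trimmed === '') return 0; // ''.split(/\s+/) is [''], which would count as 1 word
  return Math.ceil(trimmed.split(/\s+/).length / 0.75);
}
```

Code, JSON tool results, and non-English text tend to tokenize less efficiently than English prose, so expect this to undercount for those payloads.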


Don't Log Full Prompts in Production

This deserves its own section because it goes wrong so easily. In development, log everything -- full prompts, full responses, full tool results. In production, log summaries and token counts only.

Three reasons: (1) User prompts contain PII. Logging them to your observability backend means your Datadog or Elastic instance now holds customer data subject to GDPR/CCPA, which your compliance team probably didn't sign off on. (2) Full prompts at scale are expensive to store. A busy agent generating 100K traces/day with 128K-token context windows pushes 12.8 billion tokens/day through even a single full-context call per trace; at roughly 4 bytes of text per token, that is on the order of 50 GB/day of raw log data, multiplied by however many LLM calls each trace makes. (3) Full prompt logs slow down your log pipeline and make searching slower for everyone.

typescript
// Use a log level flag, not a blanket policy
const LOG_LEVEL = process.env.AGENT_LOG_LEVEL ?? 'summary'; // full | summary | minimal

function logLLMCall(span: Span, messages: Message[]) {
  if (LOG_LEVEL === 'full') {
    span.input = messages; // dev only
  } else if (LOG_LEVEL === 'summary') {
    span.input = {
      messageCount: messages.length,
      roles: messages.map(m => m.role),
      // Still raw user text, even truncated: redact or drop it if PII is a concern.
      lastUserMessage: messages.findLast(m => m.role === 'user')?.content?.slice(0, 100),
      tokenEstimate: messages.reduce((sum, m) => sum + estimateTokens(m.content), 0),
    };
  }
  // minimal: no input logged at all
}

Log entry schema

Here's the entry schema the tracer emits, with each type serving a different debugging purpose. The context.snapshot entry is the one I wish every agent framework emitted by default -- it's the fastest way to diagnose context drift.

| Entry Type | Key Fields | Debugging Use |
|---|---|---|
| agent.start | traceId, query, availableTools[] | Reconstruct initial conditions |
| llm.request | spanId, messageCount, model, temperature, contextUtil | What the model saw (summary, not full prompt) |
| llm.response | spanId, content summary, toolCalls[], usage, latency | What the model produced and at what cost |
| tool.call | spanId, toolName, arguments, parentSpanId | Verify the agent chose the right tool |
| tool.result | spanId, result summary, duration, status | Check tool output format and content |
| context.snapshot | spanId, tokenCounts, messageCount, utilization, truncatedMessages[] | Detect context drift before it causes problems |
| state.transition | spanId, from, to, valid, violation? | Catch illegal state machine transitions |
| agent.end | traceId, result summary, totalDuration, assertionResults[] | Overall execution summary with automated checks |
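
To make the schema concrete, here is a sketch of two of the entry types as a TypeScript discriminated union, plus a helper that finds the first snapshot in which a given message was truncated. The field names come from the table; the field types, the `type` discriminant, and the `firstTruncation` helper are assumptions made for illustration.

```typescript
// Field names follow the schema table; the field types are assumed.
interface ContextSnapshotEntry {
  type: 'context.snapshot';
  spanId: string;
  tokenCounts: Record<string, number>; // assumed: tokens per role
  messageCount: number;
  utilization: number;                 // 0..1
  truncatedMessages: string[];         // assumed: labels of messages dropped from context
}

interface StateTransitionEntry {
  type: 'state.transition';
  spanId: string;
  from: string;
  to: string;
  valid: boolean;
  violation?: string;
}

type LogEntry = ContextSnapshotEntry | StateTransitionEntry;

// Returns the spanId of the first snapshot where `label` (e.g. 'schema')
// appears in truncatedMessages, or null if it was never truncated.
function firstTruncation(entries: LogEntry[], label: string): string | null {
  for (const entry of entries) {
    if (entry.type === 'context.snapshot' && entry.truncatedMessages.indexOf(label) !== -1) {
      return entry.spanId;
    }
  }
  return null;
}
```

With entries shaped like this, "when did the schema fall out?" becomes a single scan over the log rather than a manual read of the trace.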

Existing Tools and Their Design Trade-offs

I've used the major agent observability tools in production. Here's my honest take on where each one stopped being enough for the kinds of bugs I was hitting -- and what I had to build around it.

| Tool | What it does well | Where it fell short for me |
|---|---|---|
| LangSmith | Chain tracing, eval datasets, prompt versioning | Treats every step as a flat LLM call. When an agent re-entered the same THINKING -> ACTING loop four times, the UI showed eight rows of green checkmarks -- no concept of 'this is the same step repeating' or 'context is degrading turn-over-turn.' I had to export the trace JSON and grep for the loop manually. |
| Arize Phoenix | Model-agnostic tracing, drift detection | The generic span model means tool-call arguments and LLM context land in the same attributes blob. Filtering 'show me only the spans where context utilization crossed 80%' wasn't possible without writing a custom processor on top -- and at that point I was rebuilding half my own tracer anyway. |
| LangFuse | Open source, clean trace UI, cost tracking | Cost tracking is excellent, but the trace viewer doesn't surface what was in context at each LLM call -- only the messages sent. For the schema-drift bug, the question 'when did the schema fall out?' is unanswerable from the LangFuse UI. You see the prompt that was sent; you don't see that it's missing 4,000 tokens of system context that should have been there. |
| Braintrust | Eval-first workflow, scoring | Built for offline scoring runs, not live incident debugging. When a user reported a bad agent response on a Tuesday afternoon, the workflow for 'pull this specific trace, replay it with a tweaked prompt, see what changes' was multi-step and async. I needed something I could click through in 30 seconds. |
| Custom (this post) | Context inspection, replay, state validation | You pay for it: roughly two engineer-weeks of upfront work, then a few hours a month maintaining the schema as agents evolve. Worth it for me because the failure modes I care about (context drift, illegal transitions, replay-with-overrides) aren't first-class in any of the above. Not worth it if your agents are simple LangChain chains. |

None of this is a knock on the tools. LangSmith and LangFuse are excellent at what they're designed for, and for most teams they cover the 80% case. The custom approach is only worth it when you've hit a wall with the off-the-shelf options on the specific failure modes you keep seeing -- multi-agent traces, context-window bugs, or replay-driven debugging. If you're shipping a single LangChain agent that mostly works, stop reading and go install LangFuse.


Replay: Time-Travel Debugging for Agents

The most powerful technique I've found is replay -- taking a recorded trace and re-executing it with modifications. Want to know what happens if the database tool returns different data? Replay with a mocked tool response. Want to test a prompt change? Replay with the new system prompt and compare.

The divergence model

The core idea: walk the recorded span list, replaying each step from the original trace. When you override a tool result, set a diverged flag. After divergence, LLM calls must be re-executed (the model now sees different context), but tool calls can still use overrides or originals. Non-diverged steps replay from recorded data without hitting any APIs.

Before reading the implementation, hold this decision tree in your head -- it's the entire model:

text
for each span in original trace:
  if span is an LLM call:
    if diverged?  ──► RE-RUN against live model (context has changed)
    else          ──► REPLAY from recorded output (no API call)

  if span is a tool call:
    if has override?  ──► USE override, set diverged = true
    else              ──► REPLAY from recorded output

# diverged starts false, flips true on first tool override
# (or starts true if a system-prompt override was passed in)

The whole replay function is just this loop with a step-through callback bolted on. Everything else -- the diff tracking, the breakpoint support, the onStep UI hook -- is bookkeeping around the same four cases.
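
The decision tree can be exercised on its own, without a model or any tools, by planning the action for each span up front. A minimal sketch: `planReplay`, `PlannedSpan`, and `ReplayAction` are hypothetical names used only here, and the plan covers just the two span kinds the tree mentions.

```typescript
type ReplayAction = 'rerun' | 'replay' | 'override';

interface PlannedSpan {
  id: string;
  kind: 'llm_call' | 'tool_call';
}

// Returns the action for each span, in order, without calling any model or tool.
function planReplay(
  spans: PlannedSpan[],
  overriddenSpanIds: string[],
  startDiverged = false, // true when a system-prompt override was passed in
): ReplayAction[] {
  let diverged = startDiverged;
  return spans.map((span): ReplayAction => {
    if (span.kind === 'llm_call') {
      return diverged ? 'rerun' : 'replay';
    }
    if (overriddenSpanIds.indexOf(span.id) !== -1) {
      diverged = true; // everything downstream now sees different context
      return 'override';
    }
    return 'replay';
  });
}
```

Dry-running the plan before a replay shows how many live LLM calls an override will trigger, which is worth knowing before paying for them.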

The replay function

The function below has three branches worth tracing carefully before reading: (1) the system-prompt override path, which seeds the conversation with the new system prompt and sets the diverged flag from the start, so the first LLM call and every one after it is re-executed live; (2) the tool-override path, which flips the diverged flag mid-trace and turns every subsequent LLM call into a live re-execution; and (3) the no-divergence default, which replays from the recorded output field without ever hitting an API. The onStep callback is what makes this usable as a step-through debugger -- it returns a Promise that the UI resolves on a 'next' button click.

lib/agent-replay.ts
typescript
interface ReplayOptions {
  toolOverrides?: Map<string, unknown>;  // spanId -> mock result
  systemPromptOverride?: string;         // replace system prompt and force re-execution
  breakpointSpanId?: string;             // pause at this span
  onStep?: (span: Span, state: ReplayState) => Promise<'continue' | 'stop'>;
}

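interface ReplayState {
  messages: Array<{ role: string; content: string }>;
  currentSpanIndex: number;
  totalSpans: number;
  divergedFromOriginal: boolean;
}

// One recorded-vs-replayed difference; shape inferred from how diffs are pushed below.
interface SpanDiff {
  spanId: string;
  field: string;
  original: unknown;
  replayed: unknown;
}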

export async function replayTrace(
  originalSpans: Span[],
  options: ReplayOptions = {},
): Promise<{ result: string; diffs: SpanDiff[] }> {
  const flatSpans = flattenSpans(originalSpans);
  const diffs: SpanDiff[] = [];
  let messages: Array<{ role: string; content: string }> = [];
  let diverged = false;
  let replayedFirstLLM = false;

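  // If system prompt is overridden, we diverge from the start
  if (options.systemPromptOverride) {
    messages.push({ role: 'system', content: options.systemPromptOverride });
    diverged = true;
  }

  // Seed the conversation with the original user query, which agentLoop
  // records as the root agent span's input ({ query }). Without it, any
  // re-run LLM call has no question to answer.
  const rootInput = flatSpans.find(s => s.kind === 'agent')?.input as { query?: string } | undefined;
  if (rootInput?.query) {
    messages.push({ role: 'user', content: rootInput.query });
  }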

  for (let i = 0; i < flatSpans.length; i++) {
    const original = flatSpans[i];

    if (options.breakpointSpanId === original.id) break;

    // Step-through callback: lets UI pause between spans
    if (options.onStep) {
      const action = await options.onStep(original, {
        messages,
        currentSpanIndex: i,
        totalSpans: flatSpans.length,
        divergedFromOriginal: diverged,
      });
      if (action === 'stop') break;
    }

    // Re-run the first LLM call with the overridden system prompt.
    // Uses a flag instead of i === 0 because the first span may be
    // an 'agent' or 'reasoning' span, not an llm_call.
    if (options.systemPromptOverride && !replayedFirstLLM && original.kind === 'llm_call') {
      replayedFirstLLM = true;
      const response = await llm.chat(messages);
      // Keep the live response in the transcript so later steps (and the
      // final result) reflect the replayed run, not the recording.
      messages.push({ role: 'assistant', content: response.content });
      diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: response.content });
      continue;
    }

    if (original.kind === 'tool_call') {
      const override = options.toolOverrides?.get(original.id);
      if (override !== undefined) {
        messages.push({ role: 'tool', content: JSON.stringify(override) });
        diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: override });
        diverged = true;
        continue;
      }
    }

    if (original.kind === 'llm_call' && diverged) {
      // After divergence, must re-run -- cached output reflects old context
      const response = await llm.chat(messages);
      messages.push({ role: 'assistant', content: response.content });
      if (response.content !== original.output) {
        diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: response.content });
      }
      continue;
    }

    // No divergence: replay from recorded data. Only llm_call and tool_call
    // spans belong in the transcript; the root 'agent' span's output is the
    // final answer and must not be injected as a message.
    if ((original.kind === 'llm_call' || original.kind === 'tool_call') && original.output) {
      messages.push({
        role: original.kind === 'tool_call' ? 'tool' : 'assistant',
        // String() on an object would produce "[object Object]"
        content: typeof original.output === 'string'
          ? original.output
          : JSON.stringify(original.output),
      });
    }
  }

  return { result: messages.at(-1)?.content ?? '', diffs };
}

The onStep callback turns this into a step-through debugger. In the trace viewer UI, wire it to a "next step" button that resolves the promise -- IDE-style stepping through an agent's execution history. The implementation handles sequential tool execution correctly, which covers most single-agent architectures; parallel tool execution is the open problem I describe at the end of the post.
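
One way to wire the "next step" button, sketched without any UI framework. `createStepper` is a hypothetical helper, not part of the replay module: it hands back an `onStep`-compatible callback together with the `next` and `stop` functions a button handler would call.

```typescript
type StepAction = 'continue' | 'stop';

// Returns an onStep callback plus the controls that resolve its pending Promise.
// S is the span type and T the replay-state type; both are left generic here.
function createStepper<S, T>() {
  let release: ((action: StepAction) => void) | null = null;
  let current: { span: S; state: T } | null = null;

  return {
    // Pass this as options.onStep: replay pauses until next() or stop() is called.
    onStep(span: S, state: T): Promise<StepAction> {
      current = { span, state };
      return new Promise<StepAction>(resolve => { release = resolve; });
    },
    // What the UI should render while paused.
    current: () => current,
    next(): void { release?.('continue'); release = null; },
    stop(): void { release?.('stop'); release = null; },
  };
}
```

In a viewer, the "next" button's click handler would call `stepper.next()`, and the detail pane would render `stepper.current()` while the replay is paused.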


Trace Assertions

Structured traces enable automated assertions -- checks that run after every agent execution and flag problems before they reach users. I chose these six because they map directly to the failure taxonomy above: context drift (assertion 1), reasoning collapse and tool loops (2), fire-and-forget tool bugs (3), runaway agents (4, 5), and illegal state transitions (6). The thresholds are defaults; tune them per agent based on your workload.

lib/agent-assertions.ts
typescript
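// Result shape inferred from the objects pushed in runAssertions below.
interface AssertionResult {
  name: string;
  passed: boolean;
  message: string;
  severity: 'error' | 'warning';
}

interface AssertionConfig {
  maxContextUtilization: number;  // default: 0.9
  maxDuplicateToolCalls: number;  // default: 3
  maxLLMCalls: number;            // default: 10 (a 15-step workflow agent needs higher)
  maxDurationMs: number;          // default: 30_000
}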

const DEFAULTS: AssertionConfig = {
  maxContextUtilization: 0.9,
  maxDuplicateToolCalls: 3,
  maxLLMCalls: 10,
  maxDurationMs: 30_000,
};

export function runAssertions(
  spans: Span[],
  config: Partial<AssertionConfig> = {},
): AssertionResult[] {
  const cfg = { ...DEFAULTS, ...config };
  const flat = flattenSpans(spans);
  const results: AssertionResult[] = [];

  // 1. Context window headroom
  const maxCtx = Math.max(
    ...flat.filter(s => s.meta.contextWindowUtilization)
      .map(s => s.meta.contextWindowUtilization as number), 0
  );
  results.push({
    name: 'context-window-headroom',
    passed: maxCtx < cfg.maxContextUtilization,
    message: `Peak context: ${Math.round(maxCtx * 100)}%` +
      (maxCtx >= cfg.maxContextUtilization ? ' -- high risk of context drift' : ''),
    severity: 'error',
  });

  // 2. No tool call loops (same tool + same args repeated)
  const toolSigs = flat.filter(s => s.kind === 'tool_call')
    .map(s => `${s.name}:${JSON.stringify(s.input)}`);
  const maxDupes = Math.max(
    ...Array.from(new Set(toolSigs)).map(sig => toolSigs.filter(s => s === sig).length), 0
  );
  results.push({
    name: 'no-tool-loops',
    passed: maxDupes < cfg.maxDuplicateToolCalls,
    message: maxDupes >= cfg.maxDuplicateToolCalls
      ? `Tool called ${maxDupes}x with identical args -- possible loop` : 'No loops detected',
    severity: 'error',
  });

  // 3. No orphaned tool calls (started but never completed)
  const orphaned = flat.filter(s => s.kind === 'tool_call' && (!s.endTime || s.status === 'running'));
  results.push({
    name: 'no-orphaned-tools',
    passed: orphaned.length === 0,
    message: orphaned.length > 0 ? `${orphaned.length} tool calls never completed` : 'All tools completed',
    severity: 'error',
  });

  // 4. LLM call budget
  const llmCount = flat.filter(s => s.kind === 'llm_call').length;
  results.push({
    name: 'llm-call-budget',
    passed: llmCount <= cfg.maxLLMCalls,
    message: `${llmCount} LLM calls` + (llmCount > cfg.maxLLMCalls ? ` -- exceeds budget of ${cfg.maxLLMCalls}` : ''),
    severity: 'warning',
  });

  // 5. Execution time
  const root = flat.find(s => s.kind === 'agent');
  const duration = root?.endTime ? root.endTime - root.startTime : Infinity;
  results.push({
    name: 'execution-time',
    passed: duration < cfg.maxDurationMs,
    // A root span that never ended has no measurable duration; say so
    // instead of printing "Infinitys".
    message: Number.isFinite(duration)
      ? `${Math.round(duration / 1000)}s` +
        (duration >= cfg.maxDurationMs ? ` -- exceeds ${cfg.maxDurationMs / 1000}s budget` : '')
      : 'Root agent span never ended -- duration unknown',
    severity: 'warning',
  });

  // 6. State machine violations (from tracer)
  const violations = flat.filter(s => s.meta.stateViolation);
  results.push({
    name: 'no-state-violations',
    passed: violations.length === 0,
    message: violations.length > 0
      ? `${violations.length} illegal state transitions detected`
      : 'All state transitions valid',
    severity: 'error',
  });

  return results;
}

I run these in two places: as a post-execution check in development (fail the test if any assertion fires), and as an async monitor in production (pipe to alerting). The loop detection alone has caught three separate incidents where an agent retried a failing tool with the same parameters.

Note the configurable thresholds. A simple query agent with 3-4 tool calls needs different budgets than a research agent that legitimately makes 20 LLM calls across multiple sources. Default to strict, then loosen per agent when you have data showing the higher thresholds are expected behavior.
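
Both replayTrace and runAssertions call a flattenSpans helper that the post never shows. The following is one plausible version, a sketch rather than the author's implementation: a depth-first, pre-order walk that de-duplicates by id. The de-duplication matters because the tracer's flush() hands over every span, roots and descendants alike, each with its children already attached.

```typescript
// Depth-first, pre-order flatten of a span forest, de-duplicated by id.
// Works on any node type that has an id and a children array of the same type.
function flattenSpans<T extends { id: string; children: T[] }>(spans: T[]): T[] {
  const seen = new Set<string>();
  const out: T[] = [];

  const visit = (span: T): void => {
    if (seen.has(span.id)) return; // already reached through its parent
    seen.add(span.id);
    out.push(span);
    for (const child of span.children) visit(child);
  };

  for (const span of spans) visit(span);
  return out;
}
```

Because children are appended in creation order, the flattened list preserves the order in which spans were started, which is what a step-by-step replay needs.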


The Debugging Dashboard

After several iterations, I converged on a three-panel layout: trace tree (where am I?), detail pane (what happened here?), and context inspector (what did the agent know?). The context window panel on the right is the piece most tools are missing. In the schema-drift bug, it would have immediately shown the schema message being truncated at turn 31.

[Figure: debugging dashboard wireframe for trace "data-query-agent" (12 spans, 3 LLM calls, 2 tool calls, 0 errors, 2.4s total, context peak 87%). Panels: Trace Tree (agent 2.4s; LLM turn-0 890ms; tool sql_query 340ms; LLM turn-1 720ms; tool format_chart 180ms; LLM turn-2 310ms) with a timeline view and Replay / Step Through / Compare Run actions; Detail Pane for sql_query (input: SELECT SUM(amount) FROM q3_sales_2024; output: { "rows": [], "count": 0 }; anomaly: table "q3_sales_2024" not in schema, which has quarterly_sales and revenue); Context Window (utilization bar, token breakdown by category, message list with a "schema msg truncated at turn 31" warning, state transitions).]
Debugging dashboard wireframe. Three simultaneous views: trace tree (where), detail pane (what), context inspector (why).

The implementation isn't the interesting part -- it's a standard collapsible tree, a JSON inspector, and a sidebar. What matters is what each panel surfaces by default. The version that earned its keep ended up looking like this:

  • Trace tree -- catches tool misrouting and reasoning collapse. One row per span, color-coded by kind, with duration and status inline. Collapse everything below the agent root by default. When a tool was called that shouldn't have been, you spot it in the tree at a glance; when reasoning becomes circular, the repeated THINKING rows tell you immediately. The goal is answering 'where am I?' in one glance, not maximizing information density.
  • Detail pane -- catches hallucinated parameters. Input, output, metadata, and an anomaly-detection footer that runs the assertion suite against the selected span. Tool inputs that reference nonexistent fields, table names that aren't in the schema, IDs that don't match any record -- they all get flagged here, in the same panel where the user is already looking. If the span fails an assertion, the failure reason goes here, not in a separate panel.
  • Context inspector -- catches context drift and silent degradation. The panel most observability tools are missing. Show context utilization as a single horizontal bar, then break it down by message role (system, schema, conversation, tool results). For long traces, show a small inline warning when a known message has been truncated. In the schema-drift bug, this is the panel that would have surfaced the schema disappearing at turn 31 without anyone needing to ask the right question.
  • State badges -- catches goal drift and illegal transitions. A tiny row above the trace tree showing the last five state transitions, with illegal ones highlighted in red. THINKING -> DONE with pending tool context lights up before the user reads the response. This is how you catch 'agent skipped OBSERVING' or 'agent gave up halfway' bugs without scrolling through 50 spans.
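
The truncation warning in the context inspector can be approximated by comparing which messages are visible at each LLM call. The following is a minimal sketch rather than the dashboard's actual code. It assumes every message carries a stable `id` and that messages which must stay in context (such as the schema) are tagged with a `kind`. The names `TrackedMessage`, `LlmCallSnapshot`, and `findDroppedMessages` are hypothetical.

```typescript
// Hypothetical shapes: a stable `id` per message and an optional `kind` tag
// are assumptions made for this sketch, not part of any specific SDK.
interface TrackedMessage {
  id: string;
  role: string;
  kind?: string; // e.g. 'schema' for messages that must stay in context
}

interface LlmCallSnapshot {
  turn: number;
  messages: TrackedMessage[];
}

interface DroppedMessageWarning {
  turn: number; // first turn at which the message was missing
  messageId: string;
  kind: string;
}

// Walk LLM calls in turn order and report the first turn at which each
// pinned message (one whose kind is in `pinnedKinds`) is no longer present.
function findDroppedMessages(
  calls: LlmCallSnapshot[],
  pinnedKinds: string[] = ['schema'],
): DroppedMessageWarning[] {
  const warnings: DroppedMessageWarning[] = [];
  const seen = new Map<string, string>(); // message id -> kind
  const reported = new Set<string>();

  const ordered = [...calls].sort((a, b) => a.turn - b.turn);
  for (const call of ordered) {
    const present = new Set(call.messages.map((m) => m.id));

    // A pinned message seen on an earlier turn but absent now was dropped.
    seen.forEach((kind, id) => {
      if (!present.has(id) && !reported.has(id)) {
        warnings.push({ turn: call.turn, messageId: id, kind });
        reported.add(id);
      }
    });

    // Remember the pinned messages visible at this turn.
    for (const m of call.messages) {
      if (m.kind !== undefined && pinnedKinds.includes(m.kind)) {
        seen.set(m.id, m.kind);
      }
    }
  }
  return warnings;
}
```

Each warning carries the turn where the message first went missing, which is the number the inspector would show inline. Matching on `id` only catches messages that are dropped entirely; a message that is shortened in place would need a content-length comparison as well.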

The one piece of the context inspector worth showing concretely is how the per-role token breakdown is computed. The tracer already records the raw messages array on each LLM call span; the inspector groups those messages by role and sums their token counts, with a separate bucket for schema messages so they stand out from the rest of the system context.

lib/context-inspector.ts
typescript
type Role = 'system' | 'schema' | 'user' | 'assistant' | 'tool';

interface ContextBreakdown {
  total: number;
  utilization: number;
  byRole: Record<Role, number>;
}

export function inspectContext(
  messages: Array<{ role: string; content: string; meta?: { kind?: Role } }>,
  modelLimit = 128_000,
): ContextBreakdown {
  const byRole: Record<Role, number> = {
    system: 0, schema: 0, user: 0, assistant: 0, tool: 0,
  };

  for (const msg of messages) {
    // Schema messages are tagged at injection time with meta.kind = 'schema'
    // so they don't get folded into the generic 'system' bucket.
    const role = (msg.meta?.kind ?? msg.role) as Role;
    byRole[role] = (byRole[role] ?? 0) + estimateTokens(msg.content);
  }

  const total = Object.values(byRole).reduce((sum, n) => sum + n, 0);
  return { total, utilization: total / modelLimit, byRole };
}
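
The snippet above calls `estimateTokens` without showing it. If your codebase doesn't already have such a helper, a rough stand-in might look like the sketch below. The four-characters-per-token ratio is an assumption (a common rule of thumb for English text), not a property of any particular model, and the `charsPerToken` parameter is hypothetical.

```typescript
// Rough stand-in for a real tokenizer. Assumes roughly 4 characters per
// token, which is only a rule of thumb for English text; swap in the
// model's actual tokenizer when you need exact counts.
function estimateTokens(text: string, charsPerToken = 4): number {
  if (text.length === 0) return 0;
  // Round up so short messages never count as zero tokens.
  return Math.ceil(text.length / charsPerToken);
}
```

Because the count is approximate, the resulting utilization figure is an estimate too, so leave some headroom in any threshold you alert on.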

Every other piece of UI I built for this dashboard ended up being decoration. If you're starting from scratch, build those four panels first and resist the urge to add anything else until you've used the dashboard on three real bugs.


Patterns That Pay Off

Four habits that compound, ranked by how many incidents they've saved me from.

  • Record everything, display selectively. The schema-drift bug eats hours when you have to re-run the agent just to capture the trace you need. If the failing trace is already on disk, you catch the schema drop in fifteen minutes. Storage is cheap; reproducing a failure twelve hours later is not.
  • Make context utilization a first-class metric. Track it like you track CPU. Alert at 80%. Every context-drift bug I've run into -- including the schema-drift one -- was predictable from a utilization graph nobody was looking at. The fix isn't smarter prompts; it's a 50-line dashboard panel.
  • Diff traces, not outputs. When you have one working run and one broken run, don't compare the final answers -- compare the spans in order. The first span where the two traces diverge is almost always the root cause. In the schema-drift bug, the diff would have pointed at turn 31 (the first LLM call that no longer saw the schema) immediately.
  • Build replay before you need it. Retrofitting replay into an existing agent is painful: the trace format, the tool abstraction, the deterministic mode -- all of it is much easier to build on day one than at 2am during an incident. You'll only appreciate this advice the first time you try to debug a production failure that you can't reproduce in dev.
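
The "diff traces, not outputs" habit reduces to finding the first span where two runs disagree. Here is a minimal sketch, assuming both traces are flat, ordered arrays of spans with JSON-serializable inputs and outputs. The `TraceSpan` shape and the `firstDivergence` name are hypothetical, so adapt them to whatever your tracer records.

```typescript
// Hypothetical span shape for this sketch; adapt to your tracer's format.
interface TraceSpan {
  kind: 'llm' | 'tool';
  name: string;
  input: unknown;
  output: unknown;
}

interface Divergence {
  index: number;
  field: 'kind' | 'name' | 'input' | 'output' | 'length';
  good?: TraceSpan;
  bad?: TraceSpan;
}

// JSON.stringify is sensitive to key order. That is acceptable for spans
// serialized by the same tracer, but it is not a general deep-equality check.
function sameValue(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Return the first position where the working and broken traces differ,
// or null when the two traces are identical.
function firstDivergence(good: TraceSpan[], bad: TraceSpan[]): Divergence | null {
  const shared = Math.min(good.length, bad.length);
  for (let i = 0; i < shared; i++) {
    const g = good[i];
    const b = bad[i];
    if (g.kind !== b.kind) return { index: i, field: 'kind', good: g, bad: b };
    if (g.name !== b.name) return { index: i, field: 'name', good: g, bad: b };
    if (!sameValue(g.input, b.input)) return { index: i, field: 'input', good: g, bad: b };
    if (!sameValue(g.output, b.output)) return { index: i, field: 'output', good: g, bad: b };
  }
  if (good.length !== bad.length) {
    // One trace is a strict prefix of the other.
    return { index: shared, field: 'length', good: good[shared], bad: bad[shared] };
  }
  return null;
}
```

Inputs are checked before outputs so that a span fed different context is reported as an input divergence rather than as a different answer. The comparison is exact, so it is most useful when both runs come from the same prompt under deterministic settings; free-form LLM output will otherwise differ on the very first call.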

Open Problems I Haven't Solved

Rather than a generic roadmap, here are the specific problems I'm still working through:

  • Replay with parallel tool calls. The current replay function assumes sequential tool execution. When agents call multiple tools in parallel (common with function-calling models), the span ordering becomes ambiguous. I've tried timestamp-based reconstruction but it introduces race conditions in the replay that weren't in the original execution.
  • Cross-agent trace linking. When Agent A delegates to Agent B, the traces are separate. Parent-child linking across agent boundaries sounds simple, but the agents often run in different processes or even different services. Propagating a trace ID through tool-call boundaries without coupling the agents is unsolved in my current setup.
  • Anomaly detection without labeled data. I can detect hard failures (assertion violations, state machine errors). Detecting soft failures -- response quality that's subtly worse than usual -- requires baselines I don't have. Statistical approaches (comparing context utilization distributions across runs) show promise but generate too many false positives to be useful in alerting.
  • Cost attribution that's actually useful. I track total token spend per trace, but what I really want is cost-per-user-query broken down by which tool calls were productive vs. wasted. An agent that takes 3 attempts to get a SQL query right costs 3x -- but my current cost tracking just shows the total, not that the first two attempts were thrown away.
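
For the cost-attribution problem, one crude starting point is to treat back-to-back calls to the same tool as retries and count only the last attempt as productive. This is a heuristic sketch under that assumption, not a solution: it will mislabel legitimate repeated calls, and the `CostSpan` shape and `splitCost` name are hypothetical.

```typescript
// Hypothetical shape: one entry per tool-call attempt, in execution order,
// with the tokens you attribute to that attempt.
interface CostSpan {
  tool: string;
  tokens: number;
}

interface CostSplit {
  total: number;
  productive: number;
  wasted: number;
}

// Heuristic: an attempt counts as wasted when the very next attempt calls
// the same tool again, on the assumption that the agent was retrying.
function splitCost(spans: CostSpan[]): CostSplit {
  let productive = 0;
  let wasted = 0;
  for (let i = 0; i < spans.length; i++) {
    const next = spans[i + 1];
    const retried = next !== undefined && next.tool === spans[i].tool;
    if (retried) {
      wasted += spans[i].tokens;
    } else {
      productive += spans[i].tokens;
    }
  }
  return { total: productive + wasted, productive, wasted };
}
```

With three `sql_query` attempts of 1,000 tokens each followed by one 200-token `format_chart` call, this reports 2,000 tokens wasted out of 3,200. A better signal would compare tool arguments or check whether a later step actually used the result.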

Everything in this post comes from debugging agents like this one. The AgentTracer, the state machine, the assertions, and the replay function are all built to run against live agent traffic.

They're designed to be dropped into any TypeScript agent codebase with minimal modification. Take them as a starting point, then adapt the thresholds and state transitions to match the failures your agents actually exhibit.

Imagine the other ending. The same agent flags a context-drift warning at 82% utilization before anything breaks. You compress the conversation history and fix it before the next batch of queries goes out. Nobody gets a wrong number. Nobody loses an afternoon to it. The infrastructure is the easy part -- the hard part is paying attention to it before the next quiet failure slips through. Build it before you need it.