The LLM Agent Debugging Problem: Building Observability from Scratch
A practical guide to debugging LLM agents, with structured logging utilities, trace visualization components, and replay infrastructure you can steal for your own projects.
ACTIVE_PHASE: PALLAV // 18 MIN READ
Last month I shipped an agent that helps sales teams query their CRM data. It worked great in testing. In production, it started generating SQL queries against tables that didn't exist, confidently formatting the empty results into polished summaries, and sending them to users. No errors anywhere. HTTP 200s across the board. The logs showed clean, successful completions.
It took me four hours to find the root cause. The agent's context window had silently dropped the database schema after a long conversation, so it started hallucinating table names. The tool call succeeded (the query ran, returned zero rows), and the agent interpreted "no results" as "no sales this quarter." A VP got a report showing zero revenue.
That experience broke something in my brain about how I approach agent reliability. I had structured logging. I had error tracking. None of it caught a failure that wasn't an error -- it was an agent making a reasonable decision from bad information.
Why Agent Bugs Don't Look Like Normal Bugs
Traditional software fails loudly. An uncaught exception, a 500 status code, a type error -- you get a stack trace pointing at the line. Agent failures are different. The system behaves exactly as designed at every individual step, but the emergent behavior is wrong.
Here's my taxonomy of agent failure modes, built from a year of production incidents:
| Failure Mode | What Happens | Why Logs Miss It |
|---|---|---|
| Context drift | Agent loses critical info as context window fills | Each step logs correctly; the missing context is invisible |
| Tool misrouting | Agent picks the wrong tool for the task | Tool call succeeds -- wrong tool, right execution |
| Hallucinated parameters | Agent invents plausible but wrong arguments | Arguments look valid; no schema validation failure |
| Reasoning collapse | Agent's chain-of-thought becomes circular or contradictory | Each thought is logged but no one reads 200 lines of reasoning |
| Silent degradation | Response quality drops gradually over a conversation | No single step is wrong -- quality is a gradient |
| Goal drift | Agent subtly reinterprets the original task | The agent is solving a problem, just not the problem |
Notice the pattern: every failure mode involves steps that succeed individually but compose into something broken. You need to see the trajectory, not individual frames.
The Mental Model: Agent State Machine
Before writing any debugging infrastructure, I needed a model for what "correct" agent behavior looks like. Every agent I've built or debugged follows the same lifecycle: IDLE -> THINKING -> ACTING -> OBSERVING -> (loop or DONE). Formalizing this turned vague bug reports ("the agent did something weird") into precise diagnoses ("it transitioned from THINKING to DONE without going through OBSERVING -- it ignored the tool result").
The value of this model is that it makes illegal transitions detectable at runtime. I encode it in the tracer as a transition table -- if the agent tries to go from THINKING to DONE when there are pending tool calls, or from ACTING to ACTING without an OBSERVING step in between, the tracer flags it immediately:
type AgentState = 'idle' | 'thinking' | 'acting' | 'observing' | 'done' | 'error';
// Valid transitions: from -> allowed next states
const TRANSITIONS: Record<AgentState, AgentState[]> = {
idle: ['thinking'],
thinking: ['acting', 'done', 'error'], // done = no tool calls needed
acting: ['observing', 'error'], // must observe tool results
observing: ['thinking', 'error'], // back to reasoning
done: [], // terminal
error: [], // terminal
};
export class AgentStateMachine {
private state: AgentState = 'idle';
private history: { from: AgentState; to: AgentState; spanId: string }[] = [];
transition(to: AgentState, spanId: string): { valid: boolean; violation?: string } {
const allowed = TRANSITIONS[this.state];
if (!allowed.includes(to)) {
const violation = `Illegal transition: ${this.state} -> ${to} at span ${spanId}. ` +
`Allowed from ${this.state}: [${allowed.join(', ')}]`;
this.history.push({ from: this.state, to, spanId }); // record the rejected attempt for the audit trail
return { valid: false, violation };
}
this.history.push({ from: this.state, to, spanId });
this.state = to;
return { valid: true };
}
// Detect loops: the last three-state pattern repeating back-to-back (e.g. A->B->C->A->B->C)
detectLoop(): { looping: boolean; pattern?: string } {
if (this.history.length < 6) return { looping: false };
const recent = this.history.slice(-6).map(h => h.to).join('->');
const half = this.history.slice(-3).map(h => h.to).join('->');
if (recent === `${half}->${half}`) {
return { looping: true, pattern: half };
}
return { looping: false };
}
}
In the CRM bug, this would have caught the agent transitioning from THINKING to DONE at turn 31 -- exactly when the schema dropped out of context. The agent "decided" it was done because it couldn't see the tools anymore, not because it had an answer. The state machine makes this visible: a THINKING -> DONE transition with pending tool context is a red flag.
The Structured Logging Foundation
With the mental model in place, the tracer captures agent execution as a tree of spans (not a flat list). Each span knows its parent, its type, and carries typed metadata. The state machine is wired into every span creation, so illegal transitions are caught live.
import { randomUUID } from 'crypto';
type SpanKind = 'agent' | 'llm_call' | 'tool_call' | 'reasoning' | 'observation' | 'error';
interface Span {
id: string;
traceId: string;
parentId: string | null;
kind: SpanKind;
name: string;
startTime: number;
endTime?: number;
status: 'running' | 'ok' | 'error';
input?: unknown;
output?: unknown;
meta: Record<string, unknown>;
children: Span[];
}
export class AgentTracer {
private spans: Map<string, Span> = new Map();
private traceId: string;
private stateMachine = new AgentStateMachine();
private onFlush?: (spans: Span[]) => void;
constructor(opts?: { traceId?: string; onFlush?: (spans: Span[]) => void }) {
this.traceId = opts?.traceId ?? randomUUID();
this.onFlush = opts?.onFlush;
}
startSpan(kind: SpanKind, name: string, parentId?: string | null, input?: unknown): Span {
const span: Span = {
id: randomUUID(),
traceId: this.traceId,
parentId: parentId ?? null,
kind, name, input,
startTime: Date.now(),
status: 'running',
meta: {},
children: [],
};
// Validate state transition
const agentState = spanKindToState(kind);
if (agentState) {
const result = this.stateMachine.transition(agentState, span.id);
if (!result.valid) {
span.meta.stateViolation = result.violation;
console.warn(`[AgentTracer] ${result.violation}`);
}
}
this.spans.set(span.id, span);
if (parentId) {
this.spans.get(parentId)?.children.push(span);
}
return span;
}
endSpan(id: string, output?: unknown, status: 'ok' | 'error' = 'ok'): void {
const span = this.spans.get(id);
if (!span) return;
span.endTime = Date.now();
span.output = output;
span.status = status;
}
// Capture context window state at every LLM call
recordContextWindow(spanId: string, messages: { role: string; tokens: number }[]): void {
const span = this.spans.get(spanId);
if (!span) return;
const total = messages.reduce((sum, m) => sum + m.tokens, 0);
span.meta.contextWindowUtilization = total / 128_000; // adjust per model
span.meta.contextMessages = messages.length;
span.meta.contextTokens = total;
}
getTrace(): Span[] {
return Array.from(this.spans.values()).filter(s => s.parentId === null);
}
flush(): void { this.onFlush?.(Array.from(this.spans.values())); }
}
function spanKindToState(kind: SpanKind): AgentState | null {
const map: Partial<Record<SpanKind, AgentState>> = {
llm_call: 'thinking', tool_call: 'acting', observation: 'observing',
};
return map[kind] ?? null;
}The context window field matters most
That contextWindowUtilization field has caught more bugs than everything else combined. When it crosses 0.8, you're in the danger zone for context drift. When it crosses 0.9, the agent should compress or summarize -- never silently truncate.
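As a concrete sketch of that policy, a small helper can map utilization to an action. The 0.8/0.9 cutoffs mirror the rule of thumb above, and the `contextAction` name is illustrative, not part of the tracer:

```typescript
// Sketch: map context window utilization to an action. Tune the
// thresholds per model and workload.
type ContextAction = 'ok' | 'warn' | 'compress';

function contextAction(usedTokens: number, windowTokens: number): ContextAction {
  const utilization = usedTokens / windowTokens;
  if (utilization >= 0.9) return 'compress'; // summarize -- never silently truncate
  if (utilization >= 0.8) return 'warn';     // danger zone for context drift
  return 'ok';
}

// e.g. contextAction(120_000, 128_000) -> 'compress' (~94% utilization)
```

Wire the 'warn' case to your alerting and the 'compress' case to a summarization step before the next LLM call.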
Wiring it into an agent loop is straightforward. The tracer wraps every LLM call and tool call, recording context window state at each step:
const tracer = new AgentTracer({
onFlush: (spans) => saveToDatabase(spans),
});
async function agentLoop(query: string): Promise<string> {
const rootSpan = tracer.startSpan('agent', 'sales-query-agent', null, { query });
try {
let messages = [systemPrompt, { role: 'user', content: query }];
for (let turn = 0; turn < 10; turn++) {
const llmSpan = tracer.startSpan('llm_call', `turn-${turn}`, rootSpan.id);
tracer.recordContextWindow(llmSpan.id, messages.map(m => ({
role: m.role, tokens: estimateTokens(m.content),
})));
const response = await llm.chat(messages);
tracer.endSpan(llmSpan.id, response.content);
if (!response.toolCalls?.length) {
tracer.endSpan(rootSpan.id, response.content);
tracer.flush();
return response.content;
}
// Echo the assistant turn (with its tool calls) back into context before appending tool results
messages.push({ role: 'assistant', content: response.content, toolCalls: response.toolCalls });
for (const call of response.toolCalls) {
const toolSpan = tracer.startSpan('tool_call', call.name, rootSpan.id, call.arguments);
try {
const result = await executeTool(call.name, call.arguments);
tracer.endSpan(toolSpan.id, result);
messages.push({ role: 'tool', content: JSON.stringify(result) });
} catch (err) {
tracer.endSpan(toolSpan.id, { error: String(err) }, 'error');
}
}
}
tracer.endSpan(rootSpan.id, 'max iterations', 'error');
tracer.flush();
return 'Unable to complete the task.';
} catch (err) {
tracer.endSpan(rootSpan.id, { error: String(err) }, 'error');
tracer.flush();
throw err;
}
}
The estimateTokens function used above is a rough heuristic -- split on whitespace and divide the word count by 0.75 (English averages roughly 0.75 words per token, i.e. ~1.33 tokens per word). For exact counts, use tiktoken or your model provider's tokenizer. The heuristic is good enough for context window monitoring since you're watching for 80%+ utilization, not counting exact tokens.
// Good enough for monitoring. Use tiktoken for exact counts.
function estimateTokens(text: string): number {
if (!text) return 0;
return Math.ceil(text.split(/\s+/).length / 0.75);
}
Don't Log Full Prompts in Production
This deserves its own section because I've seen it go wrong at multiple companies. In development, log everything -- full prompts, full responses, full tool results. In production, log summaries and token counts only.
Three reasons: (1) User prompts contain PII. Logging them to your observability backend means your Datadog or Elastic instance now holds customer data subject to GDPR/CCPA, which your compliance team probably didn't sign off on. (2) Full prompts at scale are expensive to store. At roughly 4 bytes per token, a busy agent generating 100K traces/day with 128K-token context windows produces ~50GB of raw prompt data per day -- and if you snapshot the full context at every turn instead of once per trace, multiply that by the turn count and you're quickly into terabytes. (3) Full prompt logs bloat your log pipeline and slow down search for everyone.
// Use a log level flag, not a blanket policy
const LOG_LEVEL = process.env.AGENT_LOG_LEVEL ?? 'summary'; // full | summary | minimal
function logLLMCall(span: Span, messages: Message[]) {
if (LOG_LEVEL === 'full') {
span.input = messages; // dev only
} else if (LOG_LEVEL === 'summary') {
span.input = {
messageCount: messages.length,
roles: messages.map(m => m.role),
lastUserMessage: messages.findLast(m => m.role === 'user')?.content?.slice(0, 100),
tokenEstimate: messages.reduce((sum, m) => sum + estimateTokens(m.content), 0),
};
}
// minimal: no input logged at all
}
Existing Tools and Their Design Trade-offs
I've used the major agent observability tools in production. Here's my assessment, but with an important caveat: the "gaps" I describe are often deliberate design choices, not oversights.
| Tool | Good At | Intentional Trade-off | Best For |
|---|---|---|---|
| LangSmith | Chain tracing, eval datasets, prompt versioning | Flat call list over tree view -- stays framework-agnostic | LangChain-native projects |
| Arize Phoenix | Model-agnostic tracing, drift detection | Generic span model over agent-specific patterns | ML teams with existing Arize infra |
| LangFuse | Open source, clean trace UI, cost tracking | Broad compatibility over deep agent introspection | Teams that want self-hosted |
| Braintrust | Eval-first workflow, scoring | Eval focus over live debugging | Systematic evaluation pipelines |
| Custom (this post) | Context inspection, replay, state validation | Setup cost, maintenance burden, team onboarding | Teams with unique agent architectures |
These tools treat agent execution as flat LLM call lists partly by design -- to support any model, any framework, any orchestration pattern. That's a reasonable trade-off. The cost of custom tooling is real: you're maintaining debugging infrastructure instead of using it. For most teams, LangSmith or LangFuse covers 80% of what you need. The custom approach is worth it when you have agent architectures that don't map cleanly to call-list models -- multi-agent systems, agents with persistent state, or the kind of context-window bugs I described above.
Replay: Time-Travel Debugging for Agents
The most powerful technique I've found is replay -- taking a recorded trace and re-executing it with modifications. Want to know what happens if the database tool returns different data? Replay with a mocked tool response. Want to test a prompt change? Replay with the new system prompt and compare.
The core idea: walk the recorded span list, replaying each step from the original trace. When you override a tool result, set a diverged flag. After divergence, LLM calls must be re-executed (the model now sees different context), but tool calls can still use overrides or originals. Non-diverged steps replay from recorded data without hitting any APIs.
// Helpers the replay relies on. flattenSpans walks the span tree depth-first;
// llm.chat and executeTool are the same clients used in the agent loop.
function flattenSpans(roots: Span[]): Span[] {
  return roots.flatMap(s => [s, ...flattenSpans(s.children)]);
}
interface SpanDiff { spanId: string; field: string; original: unknown; replayed: unknown }
interface ReplayOptions {
toolOverrides?: Map<string, unknown>; // spanId -> mock result
systemPromptOverride?: string; // replace system prompt and force re-execution
breakpointSpanId?: string; // pause at this span
onStep?: (span: Span, state: ReplayState) => Promise<'continue' | 'stop'>;
}
interface ReplayState {
messages: Array<{ role: string; content: string }>;
currentSpanIndex: number;
totalSpans: number;
divergedFromOriginal: boolean;
}
export async function replayTrace(
originalSpans: Span[],
options: ReplayOptions = {},
): Promise<{ result: string; diffs: SpanDiff[] }> {
const flatSpans = flattenSpans(originalSpans);
const diffs: SpanDiff[] = [];
let messages: Array<{ role: string; content: string }> = [];
let diverged = false;
let replayedFirstLLM = false;
// If system prompt is overridden, we diverge from the start
if (options.systemPromptOverride) {
messages.push({ role: 'system', content: options.systemPromptOverride });
diverged = true;
}
for (let i = 0; i < flatSpans.length; i++) {
const original = flatSpans[i];
if (options.breakpointSpanId === original.id) break;
// Step-through callback: lets UI pause between spans
if (options.onStep) {
const action = await options.onStep(original, {
messages,
currentSpanIndex: i,
totalSpans: flatSpans.length,
divergedFromOriginal: diverged,
});
if (action === 'stop') break;
}
// Re-run the first LLM call with the overridden system prompt.
// Uses a flag instead of i === 0 because the first span may be
// an 'agent' or 'reasoning' span, not an llm_call.
if (options.systemPromptOverride && !replayedFirstLLM && original.kind === 'llm_call') {
replayedFirstLLM = true;
const response = await llm.chat(messages);
// Feed the fresh response back into context so later steps see it
messages.push({ role: 'assistant', content: response.content });
diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: response.content });
continue;
}
if (original.kind === 'tool_call') {
const override = options.toolOverrides?.get(original.id);
if (override !== undefined) {
messages.push({ role: 'tool', content: JSON.stringify(override) });
diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: override });
diverged = true;
continue;
}
}
if (original.kind === 'llm_call' && diverged) {
// After divergence, must re-run -- cached output reflects old context
const response = await llm.chat(messages);
// The replayed response becomes context for subsequent steps
messages.push({ role: 'assistant', content: response.content });
if (response.content !== original.output) {
diffs.push({ spanId: original.id, field: 'output', original: original.output, replayed: response.content });
}
continue;
}
// No divergence: replay from recorded data
if (original.output) {
messages.push({
role: original.kind === 'tool_call' ? 'tool' : 'assistant',
content: String(original.output),
});
}
}
return { result: messages.at(-1)?.content ?? '', diffs };
}
Known limitation: span ordering
The replay function flattens the span tree with a depth-first traversal. If the original trace has interleaved tool calls from different agent turns (e.g., parallel tool execution), the reconstructed messages array may not match the original order. For agents with parallel tool calling, you need to reconstruct messages by timestamp rather than tree position. This implementation handles sequential tool execution correctly, which covers most single-agent architectures.
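The ordering half of that fix is straightforward to sketch, assuming the tracer's Span shape trimmed to the fields involved (the re-execution races are the hard part and aren't addressed here):

```typescript
// Sketch: flatten by wall-clock start time instead of tree position,
// so parallel tool calls replay in the order they actually ran.
interface TimedSpan { id: string; startTime: number; children: TimedSpan[] }

function flattenByTimestamp(roots: TimedSpan[]): TimedSpan[] {
  const all: TimedSpan[] = [];
  const walk = (s: TimedSpan) => { all.push(s); s.children.forEach(walk); };
  roots.forEach(walk);
  // Stable sort: earlier-starting spans replay earlier, regardless of
  // where they sit in the span tree.
  return all.sort((a, b) => a.startTime - b.startTime);
}
```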
The onStep callback turns this into a step-through debugger. In the trace viewer UI, wire it to a "next step" button that resolves the promise -- IDE-style stepping through an agent's execution history.
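One way to wire that up -- a sketch, with `createStepper` as hypothetical glue between the UI button and `replayTrace`:

```typescript
// Sketch: a pausable onStep callback. replayTrace awaits onStep before each
// span; the returned promise resolves only when the UI calls next().
function createStepper() {
  let resolveNext: (() => void) | null = null;
  let stopped = false;
  return {
    next(): void { resolveNext?.(); resolveNext = null; }, // wire to "next step" button
    stop(): void { stopped = true; resolveNext?.(); },     // wire to "stop" button
    onStep: async (): Promise<'continue' | 'stop'> => {
      if (stopped) return 'stop';
      await new Promise<void>(res => { resolveNext = res; });
      return stopped ? 'stop' : 'continue';
    },
  };
}
```

Pass `stepper.onStep` as `options.onStep` (it ignores the span/state arguments it receives) and call `stepper.next()` from the button handler.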
Trace Assertions
Structured traces enable automated assertions -- checks that run after every agent execution and flag problems before they reach users. I chose these six because they map directly to the failure taxonomy above: context drift (assertion 1), reasoning collapse and tool loops (2), fire-and-forget tool bugs (3), runaway agents (4, 5), and illegal state transitions (6). The thresholds are defaults; tune them per agent based on your workload.
interface AssertionConfig {
maxContextUtilization: number; // default: 0.9
maxDuplicateToolCalls: number; // default: 3
maxLLMCalls: number; // default: 10 (a 15-step workflow agent needs higher)
maxDurationMs: number; // default: 30_000
}
const DEFAULTS: AssertionConfig = {
maxContextUtilization: 0.9,
maxDuplicateToolCalls: 3,
maxLLMCalls: 10,
maxDurationMs: 30_000,
};
export function runAssertions(
spans: Span[],
config: Partial<AssertionConfig> = {},
): AssertionResult[] {
const cfg = { ...DEFAULTS, ...config };
const flat = flattenSpans(spans);
const results: AssertionResult[] = [];
// 1. Context window headroom
const maxCtx = Math.max(
...flat.filter(s => s.meta.contextWindowUtilization)
.map(s => s.meta.contextWindowUtilization as number), 0
);
results.push({
name: 'context-window-headroom',
passed: maxCtx < cfg.maxContextUtilization,
message: `Peak context: ${Math.round(maxCtx * 100)}%` +
(maxCtx >= cfg.maxContextUtilization ? ' -- high risk of context drift' : ''),
severity: 'error',
});
// 2. No tool call loops (same tool + same args repeated)
const toolSigs = flat.filter(s => s.kind === 'tool_call')
.map(s => `${s.name}:${JSON.stringify(s.input)}`);
const maxDupes = Math.max(
...Array.from(new Set(toolSigs)).map(sig => toolSigs.filter(s => s === sig).length), 0
);
results.push({
name: 'no-tool-loops',
passed: maxDupes < cfg.maxDuplicateToolCalls,
message: maxDupes >= cfg.maxDuplicateToolCalls
? `Tool called ${maxDupes}x with identical args -- possible loop` : 'No loops detected',
severity: 'error',
});
// 3. No orphaned tool calls (started but never completed)
const orphaned = flat.filter(s => s.kind === 'tool_call' && (!s.endTime || s.status === 'running'));
results.push({
name: 'no-orphaned-tools',
passed: orphaned.length === 0,
message: orphaned.length > 0 ? `${orphaned.length} tool calls never completed` : 'All tools completed',
severity: 'error',
});
// 4. LLM call budget
const llmCount = flat.filter(s => s.kind === 'llm_call').length;
results.push({
name: 'llm-call-budget',
passed: llmCount <= cfg.maxLLMCalls,
message: `${llmCount} LLM calls` + (llmCount > cfg.maxLLMCalls ? ` -- exceeds budget of ${cfg.maxLLMCalls}` : ''),
severity: 'warning',
});
// 5. Execution time
const root = flat.find(s => s.kind === 'agent');
const duration = root?.endTime ? root.endTime - root.startTime : Infinity;
results.push({
name: 'execution-time',
passed: duration < cfg.maxDurationMs,
message: `${Math.round(duration / 1000)}s` +
(duration >= cfg.maxDurationMs ? ` -- exceeds ${cfg.maxDurationMs / 1000}s budget` : ''),
severity: 'warning',
});
// 6. State machine violations (from tracer)
const violations = flat.filter(s => s.meta.stateViolation);
results.push({
name: 'no-state-violations',
passed: violations.length === 0,
message: violations.length > 0
? `${violations.length} illegal state transitions detected`
: 'All state transitions valid',
severity: 'error',
});
return results;
}
I run these in two places: as a post-execution check in development (fail the test if any assertion fires), and as an async monitor in production (pipe to alerting). The loop detection alone has caught three production incidents where an agent retried a failing tool with the same parameters.
Note the configurable thresholds. A CRM query agent with 3-4 tool calls needs different budgets than a research agent that legitimately makes 20 LLM calls across multiple sources. Default to strict, then loosen per agent when you have data showing the higher thresholds are expected behavior.
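In CI, the two-place split reduces to a single gate over the assertion results -- a sketch assuming the AssertionResult shape used above (`verdict` is my name, not part of the assertion module):

```typescript
// Sketch: collapse assertion results into one verdict for a CI gate or
// production alert: error-severity failures block, warnings only log.
interface AssertionResult {
  name: string;
  passed: boolean;
  message: string;
  severity: 'error' | 'warning';
}

function verdict(results: AssertionResult[]): 'pass' | 'warn' | 'fail' {
  if (results.some(r => !r.passed && r.severity === 'error')) return 'fail';
  if (results.some(r => !r.passed)) return 'warn'; // only warning-severity failures
  return 'pass';
}
```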
The Debugging Dashboard
After several iterations, I converged on a three-panel layout: trace tree (where am I?), detail pane (what happened here?), and context inspector (what did the agent know?). The context window panel on the right is the piece most tools are missing. In the CRM bug, it would have immediately shown the schema message being truncated at turn 31.
I won't include the full TraceViewer React component here -- it's a standard collapsible tree renderer with span-kind color coding, duration labels, and inline JSON inspection. The interesting parts are the context window warning badges (red at >90%, amber at >80%) and the state violation markers.
Log Entry Schema
Each entry type serves a different debugging purpose. The context.snapshot entry is the one I wish every agent framework emitted by default -- it's the fastest way to diagnose context drift.
| Entry Type | Key Fields | Debugging Use |
|---|---|---|
| agent.start | traceId, query, availableTools[] | Reconstruct initial conditions |
| llm.request | spanId, messageCount, model, temperature, contextUtil | What the model saw (summary, not full prompt) |
| llm.response | spanId, content summary, toolCalls[], usage, latency | What the model produced and at what cost |
| tool.call | spanId, toolName, arguments, parentSpanId | Verify the agent chose the right tool |
| tool.result | spanId, result summary, duration, status | Check tool output format and content |
| context.snapshot | spanId, tokenCounts, messageCount, utilization, truncatedMessages[] | Detect context drift before it causes problems |
| state.transition | spanId, from, to, valid, violation? | Catch illegal state machine transitions |
| agent.end | traceId, result summary, totalDuration, assertionResults[] | Overall execution summary with automated checks |
Patterns That Pay Off
1. Record everything, display selectively
Capture full execution traces always. Build your UI to show summaries by default and let developers drill into details on demand. Storage is cheap; re-running a failing agent to capture missing data is not.
2. Make context window a first-class metric
Track context window utilization like you track CPU usage. Set alerts at 80%. Log what got truncated. Every context drift bug I've seen was predictable from utilization metrics.
3. Diff traces, not outputs
When comparing a working run to a broken run, don't just diff the final output. Diff the traces span-by-span. The first span where inputs or outputs diverge points at the root cause.
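A minimal version of that span-by-span diff, as a sketch (`LiteSpan` trims the Span shape to the compared fields):

```typescript
// Sketch: walk two flattened traces in parallel and return the index of the
// first span whose name, input, or output diverges (-1 if the traces match).
interface LiteSpan { name: string; input?: unknown; output?: unknown }

function firstDivergence(a: LiteSpan[], b: LiteSpan[]): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i].name !== b[i].name ||
        JSON.stringify(a[i].input) !== JSON.stringify(b[i].input) ||
        JSON.stringify(a[i].output) !== JSON.stringify(b[i].output)) {
      return i; // start the root-cause hunt at this span
    }
  }
  return a.length === b.length ? -1 : n; // one trace is a prefix of the other
}
```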
4. Build replay into your pipeline from day one
Retrofitting replay into an existing agent is painful. The trace format, the tool abstraction layer, the deterministic execution mode -- these are much easier to build at the start than to bolt on at 2 AM during an incident.
Open Problems I Haven't Solved
Rather than a generic roadmap, here are the specific problems I'm still working through:
- Replay with parallel tool calls. The current replay function assumes sequential tool execution. When agents call multiple tools in parallel (common with function-calling models), the span ordering becomes ambiguous. I've tried timestamp-based reconstruction but it introduces race conditions in the replay that weren't in the original execution.
- Cross-agent trace linking. When Agent A delegates to Agent B, the traces are separate. Parent-child linking across agent boundaries sounds simple, but the agents often run in different processes or even different services. Propagating a trace ID through tool-call boundaries without coupling the agents is unsolved in my current setup.
- Anomaly detection without labeled data. I can detect hard failures (assertion violations, state machine errors). Detecting soft failures -- response quality that's subtly worse than usual -- requires baselines I don't have. Statistical approaches (comparing context utilization distributions across runs) show promise but generate too many false positives to be useful in alerting.
- Cost attribution that's actually useful. I track total token spend per trace, but what I really want is cost-per-user-query broken down by which tool calls were productive vs. wasted. An agent that takes 3 attempts to get a SQL query right costs 3x -- but my current cost tracking just shows the total, not that the first two attempts were thrown away.
The code in this post is extracted from production systems. The AgentTracer, the state machine, the assertions, and the replay function are designed to be dropped into any TypeScript agent codebase with minimal modification. Use them as a starting point, then adapt the thresholds and state transitions to match your agent's actual behavior.