AI Observability: Debugging Systems That Think In Tokens

Have you ever tried debugging an AI feature where the user says, "It gave me a weird answer," and your logs only show a normal 200 response?

That's the moment traditional observability starts feeling incomplete. The request didn't crash. The database didn't time out. CPU looked fine. But the product still failed.

AI systems don't only fail with exceptions. They fail through bad retrieval, confusing prompts, wrong tool calls, missing context, stale memory, unsafe assumptions, high latency, and answers that sound correct until a human reads them twice.

Figure: An AI observability dashboard with panels for prompts, model responses, tool calls, retrieval results, traces, token usage, latency, cost, evals, and errors, all in one view.

Normal Logs Are Not Enough

In a classic web app, you usually debug with request logs, stack traces, database queries, metrics, and maybe distributed traces.

Those still matter, but AI adds another layer. You need to know what prompt was sent, which context was retrieved, which model responded, which tools were called, how much it cost, how long each step took, and whether the final answer was actually good.

Debugging AI without that is like trying to diagnose a restaurant complaint by only checking whether the kitchen lights were on.

What You Need To Capture

User input. The original request, safely redacted where needed.
System instructions. The rules and constraints active for that run.
Retrieved context. Documents, chunks, metadata, and scores used by the model.
Model calls. Model name, parameters, latency, token usage, and response.
Tool calls. Tool name, arguments, outputs, errors, and approval steps.
Final output. The answer shown to the user.
Feedback and evals. Human ratings, automated checks, and known test cases.

That's the trace of an AI system. Not just "request started" and "request finished," but the decision path in between.
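
One way to hold those fields together is a small record type per run. The sketch below is illustrative only: the class names `Step` and `RunTrace`, the field names, and the email-only `redact` rule are assumptions, not a standard schema, and real redaction usually needs far more than one regex.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

# Hypothetical redaction rule: mask anything that looks like an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Replace email-like substrings before the text is stored."""
    return EMAIL_RE.sub("[redacted-email]", text)


@dataclass
class Step:
    type: str                       # "retrieval", "model_call", "tool_call", ...
    latency_ms: int
    detail: dict = field(default_factory=dict)   # step-specific payload


@dataclass
class RunTrace:
    run_id: str
    user_input: str                 # stored already redacted
    system_prompt_version: str      # which instructions were active
    steps: list = field(default_factory=list)
    final_output: str = ""
    feedback: Optional[str] = None  # e.g. "thumbs_up" or "thumbs_down"


trace = RunTrace(
    run_id="run_123",
    user_input=redact("My email is jane@example.com, where is my order?"),
    system_prompt_version="support-answer@2026-04-04",
)
trace.steps.append(
    Step("retrieval", 120, {"doc_ids": ["doc_1", "doc_7"], "scores": [0.82, 0.64]})
)
trace.steps.append(
    Step("model_call", 900, {"model": "example-model", "input_tokens": 2400})
)
trace.final_output = "Your order shipped yesterday."

# asdict() converts nested dataclasses too, so the whole run serializes in one call.
print(json.dumps(asdict(trace), indent=2))
```

Redacting at construction time means the raw input never reaches storage, which is usually easier to reason about than scrubbing logs afterwards.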

Traces Tell The Story

A trace for an AI agent should show every meaningful step.

The user asked a question. The system rewrote it. Retrieval found five chunks. The model chose a tool. The tool returned partial data. The model answered. The user disliked the answer.

That chain matters because AI failures are often chain failures. One weak step poisons the next.
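
To make the chain visible, each step can be wrapped so that its timing and outcome are recorded even when it fails. This is a minimal sketch using only the standard library; the `Tracer` class and its `step` method are hypothetical names, and a production system would more likely rely on an existing tracing library.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Collects one record per step, in the order the steps finished."""

    def __init__(self, run_id: str):
        self.run_id = run_id
        self.steps = []

    @contextmanager
    def step(self, step_type: str, **attrs):
        record = {"type": step_type, "status": "ok", **attrs}
        start = time.perf_counter()
        try:
            yield record            # the caller can attach outputs to the record
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise                   # tracing should not swallow failures
        finally:
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
            self.steps.append(record)


tracer = Tracer("run_123")

with tracer.step("retrieval", top_k=5) as rec:
    rec["chunk_ids"] = ["c1", "c2", "c3", "c4", "c5"]   # stand-in for a real search

try:
    with tracer.step("tool_call", tool="search_docs"):
        raise TimeoutError("upstream took too long")     # simulated tool failure
except TimeoutError:
    pass  # the application decides how to recover; the trace keeps the evidence

for s in tracer.steps:
    print(s["type"], s["status"], s["latency_ms"])
```

Because the failed tool call still lands in `tracer.steps`, a later wrong answer can be traced back to the step that weakened it.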

Figure: An AI debugging workflow, from user request through prompt construction, retrieval, model call, tool calls, and final response to evaluation checkpoints.

A Simple Trace Shape

JSON ai_trace.json

{
  "run_id": "run_123",
  "user_id": "user_456",
  "steps": [
    {"type": "retrieval", "top_k": 8, "latency_ms": 120},
    {"type": "model_call", "model": "example-model", "tokens": 2400},
    {"type": "tool_call", "tool": "search_docs", "status": "ok"},
    {"type": "final_answer", "feedback": "thumbs_down"}
  ]
}

This is simplified, but the idea is practical. You want enough structure to answer: what happened, why did it happen, and where did it go wrong?
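
With that structure in place, the "where did it go wrong?" question can start as a few lines of code. The sketch below walks the trace shown above and lists steps that look suspicious; the `summarize` function and its two rules (a non-ok status, a thumbs-down) are assumptions chosen for illustration.

```python
import json

# The example trace from above, as it might arrive from a log store.
RAW_TRACE = """
{
  "run_id": "run_123",
  "user_id": "user_456",
  "steps": [
    {"type": "retrieval", "top_k": 8, "latency_ms": 120},
    {"type": "model_call", "model": "example-model", "tokens": 2400},
    {"type": "tool_call", "tool": "search_docs", "status": "ok"},
    {"type": "final_answer", "feedback": "thumbs_down"}
  ]
}
"""


def summarize(trace: dict) -> list:
    """Return one human-readable finding per suspicious step."""
    findings = []
    for i, step in enumerate(trace["steps"]):
        # Steps without an explicit status are treated as successful.
        if step.get("status", "ok") != "ok":
            findings.append(f"step {i} ({step['type']}): status={step['status']}")
        if step.get("feedback") == "thumbs_down":
            findings.append(f"step {i} ({step['type']}): negative user feedback")
    return findings


trace = json.loads(RAW_TRACE)
for finding in summarize(trace):
    print(finding)
# prints: step 3 (final_answer): negative user feedback
```

Here every step reports success and the only signal is the user's thumbs-down, which is exactly the quiet kind of failure that status codes miss.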

Prompt And Context Versioning

If prompts are part of the system, prompts need versioning.

A small system prompt change can alter behavior across many requests. A new retrieval filter can change which documents the model sees. A different model version can change reasoning style, latency, or tool usage.

Prompts are like database migrations for behavior. You wouldn't change a schema in production without tracking it. Don't change AI instructions without tracking them either.

Pro Tips

Version prompts. Store prompt templates with IDs and release notes.
Record context sources. Save document IDs, timestamps, and retrieval scores.
Track model versions. A model upgrade is a behavior change.
Capture tool schemas. Tool argument formats affect agent behavior.
Compare runs. Replay known examples against new prompts or models.

A prompt registry can be very simple:

YAML prompts/support-answer.yaml

id: support-answer
version: 2026-04-04
rules:
  - Answer only from retrieved context.
  - Cite the source document when possible.
  - Say you are unsure if context is insufficient.

The point is not fancy tooling. The point is change control.
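
A version label only helps if it is stamped onto every run, and a content hash catches edits that forgot to bump the label. The following is a sketch under assumptions: the registry entry is mirrored as a plain dict to avoid a YAML dependency, and `prompt_fingerprint` and `run_metadata` are hypothetical helper names.

```python
import hashlib
import json

# The registry entry from above, mirrored as a dict for this example.
PROMPT = {
    "id": "support-answer",
    "version": "2026-04-04",
    "rules": [
        "Answer only from retrieved context.",
        "Cite the source document when possible.",
        "Say you are unsure if context is insufficient.",
    ],
}


def prompt_fingerprint(prompt: dict) -> str:
    """Short, stable hash of the prompt content, independent of key order."""
    canonical = json.dumps(prompt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run_metadata(prompt: dict, model: str) -> dict:
    """Fields to attach to every trace so runs can be grouped by prompt and model."""
    return {
        "prompt_id": prompt["id"],
        "prompt_version": prompt["version"],
        "prompt_sha": prompt_fingerprint(prompt),
        "model": model,
    }


meta = run_metadata(PROMPT, "example-model")
print(meta)

# An edit that keeps the same version label still changes the fingerprint.
edited = {**PROMPT, "rules": PROMPT["rules"] + ["Keep answers under 150 words."]}
print(prompt_fingerprint(edited) != meta["prompt_sha"])  # prints: True
```

Grouping feedback and eval results by `prompt_sha` then shows whether a behavior change lines up with a prompt change.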

Cost And Latency Are Product Signals

AI features can fail financially before they fail technically.

A workflow that calls the model six times, retrieves too many chunks, and uses a large model for every step may work beautifully in testing and become painful at scale. Latency matters too. Users don't care that your agent had a thoughtful inner journey if the answer arrives after they've already switched tabs.

Think of tokens like database queries in the early days of web apps. At first, nobody watches them closely. Then the bill arrives.

Figure: AI observability in motion, with prompts, tool calls, and token streams flowing into dashboards that surface latency, cost, and failure modes.

Metrics Worth Tracking

Cost per request. Especially by feature, tenant, or workflow.
Tokens per step. Input and output tokens tell different stories.
Latency per model call. One slow step can dominate the whole experience.
Tool-call count. Agents that wander usually cost more.
Failure rate by step. Retrieval failures are different from tool failures.
User feedback. A cheap bad answer is still bad.

A basic cost log might include:

JSON ai_usage_event.json

{
  "feature": "support_rag",
  "tenant_id": "tenant_42",
  "input_tokens": 3200,
  "output_tokens": 480,
  "tool_calls": 2,
  "latency_ms": 4300,
  "estimated_cost_usd": 0.018
}

You don't need perfect accounting on day one. You do need enough visibility to spot runaway workflows.
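
Spotting a runaway workflow can begin with a handful of thresholds applied to events like the one above. In this sketch the per-token prices and the limits are made-up placeholders, not any provider's real rates, and the second event (`research_agent`) is invented to show what a flagged run looks like.

```python
# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate; input and output tokens are priced separately."""
    cost = (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + output_tokens / 1000 * PRICE_PER_1K["output"]
    )
    return round(cost, 6)


def flag_runaways(events, max_cost_usd=0.05, max_tool_calls=5, max_latency_ms=10_000):
    """Return (event, reasons) pairs for events that exceed any threshold."""
    flagged = []
    for event in events:
        cost = estimate_cost_usd(event["input_tokens"], event["output_tokens"])
        reasons = []
        if cost > max_cost_usd:
            reasons.append(f"cost ${cost:.4f}")
        if event["tool_calls"] > max_tool_calls:
            reasons.append(f"{event['tool_calls']} tool calls")
        if event["latency_ms"] > max_latency_ms:
            reasons.append(f"{event['latency_ms']} ms")
        if reasons:
            flagged.append((event, reasons))
    return flagged


events = [
    # The usage event from above.
    {"feature": "support_rag", "tenant_id": "tenant_42", "input_tokens": 3200,
     "output_tokens": 480, "tool_calls": 2, "latency_ms": 4300},
    # An invented agent run that wandered.
    {"feature": "research_agent", "tenant_id": "tenant_42", "input_tokens": 48000,
     "output_tokens": 6000, "tool_calls": 14, "latency_ms": 41000},
]

flagged = flag_runaways(events)
for event, reasons in flagged:
    print(event["feature"], "->", ", ".join(reasons))
```

Fixed thresholds are crude, but they are enough to turn a surprising bill into an alert on the day the workflow starts misbehaving.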

Evals Are Your Regression Tests

AI behavior changes even when code doesn't.

That's why evals matter. An eval is a repeatable check that tells you whether your AI system still behaves acceptably on known examples. It's not always a unit test. Sometimes it's a score, a rubric, a human review queue, or a golden dataset.

Evals are like smoke detectors. They don't prevent fire, but they tell you when something is burning before the whole house smells like smoke.

Common Eval Types

Exact checks. Useful for structured outputs, JSON, SQL, or classification labels.
Source checks. Useful for RAG systems that must cite expected documents.
Rubric scoring. Useful for answer quality, tone, completeness, and safety.
Tool-use checks. Useful for agents that must call the right tool.
Regression sets. Useful for real bugs that should never return.

A small structured-output eval might look like this:

Python evals/test_json_output.py

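# `ai_response` is assumed to be a pytest fixture that returns the parsed response as a dict.
def test_response_contains_required_fields(ai_response):
    # Both fields must be present before their values are checked.
    assert "summary" in ai_response
    assert "risk_level" in ai_response
    # risk_level must be one of a closed set of labels.
    assert ai_response["risk_level"] in {"low", "medium", "high"}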

Simple? Yes. Useful? Also yes. Not every eval needs to be a research project.
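
Source checks and tool-use checks can be just as small. The sketch below runs a tiny regression set against a stand-in pipeline; `fake_pipeline`, the document IDs, and the case format are all hypothetical, and in practice the pipeline argument would be the real system under test.

```python
# Each case is a known example plus what an acceptable run must contain.
REGRESSION_SET = [
    {"question": "How do I reset my password?",
     "expected_sources": {"doc_auth_01"}, "expected_tool": "search_docs"},
    {"question": "What is the refund window?",
     "expected_sources": {"doc_billing_03"}, "expected_tool": "search_docs"},
]


def fake_pipeline(question: str) -> dict:
    """Stand-in for the real system: returns cited sources and tools called."""
    if "password" in question:
        return {"sources": ["doc_auth_01", "doc_auth_02"], "tools": ["search_docs"]}
    return {"sources": ["doc_faq_09"], "tools": []}


def run_evals(pipeline, cases) -> list:
    results = []
    for case in cases:
        out = pipeline(case["question"])
        # Source check: every expected document must be among the cited ones.
        source_ok = case["expected_sources"] <= set(out["sources"])
        # Tool-use check: the expected tool must have been called at least once.
        tool_ok = case["expected_tool"] in out["tools"]
        results.append({
            "question": case["question"],
            "source_ok": source_ok,
            "tool_ok": tool_ok,
            "passed": source_ok and tool_ok,
        })
    return results


results = run_evals(fake_pipeline, REGRESSION_SET)
pass_rate = sum(r["passed"] for r in results) / len(results)

for r in results:
    print("PASS" if r["passed"] else "FAIL", r["question"])
print(f"pass rate: {pass_rate:.0%}")
```

Running the same set before and after a prompt or model change turns "it feels worse" into a pass rate that can be compared.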

Final Tips

The AI bugs that scare me most are not loud crashes. They're quiet wrong answers with normal HTTP status codes. That's why observability needs to include prompts, context, tools, cost, latency, and quality signals.

My opinion: AI observability will become a normal part of production engineering, just like logs and traces did. The teams that invest early will debug faster and trust their systems more.

Log the journey, not just the destination. Good luck debugging the token machine 👊
