Observability For AI Features In Production

AI features are different from normal application features. A normal API endpoint usually has clear behavior: you send input, you get output, and you can test exact values. You can monitor latency, errors, CPU, memory, database queries, and queue failures. The contract is tight, and when it breaks, the failure has a shape you can recognize.

AI features are messier. The same user question can produce slightly different answers. The model may call a tool, retrieval may return weak documents, the prompt may grow too large, and token cost can spike without warning. A small prompt change may quietly reduce quality, and a model upgrade may improve one workflow while breaking another.

So observability for AI is not only "did the request fail?" You also need to ask: did the model receive the right context, did retrieval find the right documents, did the model call the right tools? Was the answer useful, was it safe, was it too slow, was it too expensive? Could we debug this later? If you cannot answer these questions, your AI feature is not production-ready yet.

Figure: the AI request lifecycle. A user request flows through the prompt builder, retrieval, model, tool calls, response, and feedback, with observability probes at each step.

Start with the AI request lifecycle

Before adding dashboards, define the lifecycle of one AI request. A common AI feature looks like this:

Text

User input
  -> validation
  -> prompt construction
  -> optional retrieval
  -> model call
  -> optional tool call
  -> response validation
  -> response shown to user
  -> user feedback

Each step can fail in a different way. Input validation can fail because the user asks for unsupported behavior. Prompt construction can fail because the template is missing variables. Retrieval can fail because documents are missing, stale, or irrelevant. The model call can fail because of latency, provider errors, rate limits, or poor output. Tool calls can fail because external APIs fail. Response validation can fail because the answer does not match the required schema. And user feedback can reveal that the answer was technically valid but not useful.

That means your logs should not only say:

JSON

{
  "status": "success"
}

They should tell the story of the request.
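
One way to make a log tell that story is to record one entry per lifecycle stage under a shared request ID. The sketch below is only an illustration under stated assumptions: the names `Stage`, `StageRecord`, `RequestTrace`, and `runStage` are hypothetical, stages are synchronous for brevity (real model and tool calls would be async), and a real system would likely ship these records to a tracing backend instead of keeping them in memory.

```typescript
// Hypothetical stage names mirroring the lifecycle listed above.
type Stage =
  | 'validation'
  | 'prompt_construction'
  | 'retrieval'
  | 'model_call'
  | 'tool_call'
  | 'response_validation';

type StageRecord = {
  stage: Stage;
  status: 'success' | 'failure';
  durationMs: number;
  error?: string;
};

type RequestTrace = {
  requestId: string;
  stages: StageRecord[];
};

// Runs one stage, records its outcome on the trace, and rethrows on failure
// so the caller's normal error handling still applies.
function runStage<T>(
  trace: RequestTrace,
  stage: Stage,
  work: () => T,
  now: () => number = Date.now,
): T {
  const startedAt = now();
  try {
    const result = work();
    trace.stages.push({ stage, status: 'success', durationMs: now() - startedAt });
    return result;
  } catch (err) {
    trace.stages.push({
      stage,
      status: 'failure',
      durationMs: now() - startedAt,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}

// Usage: the trace shows which stage broke, not just that the request failed.
const trace: RequestTrace = { requestId: 'ai_req_01HX9', stages: [] };
runStage(trace, 'validation', () => true);
try {
  runStage(trace, 'retrieval', () => {
    throw new Error('index unavailable');
  });
} catch {
  // The failure is already recorded on the trace.
}
console.log(JSON.stringify(trace));
```

With a record per stage, a failed request can be read as "validation succeeded, retrieval failed with index unavailable" rather than a bare error status.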

What to log for prompts

Prompt logs are useful, but they must be handled carefully. Prompts can contain personal data, customer data, secrets, internal documents, or sensitive business information. Do not blindly log everything forever. A safe approach is to log structured metadata by default and store full prompts only in controlled environments or with redaction.

Example prompt log:

JSON

{
  "request_id": "ai_req_01HX9",
  "feature": "support_reply_assistant",
  "prompt_template": "support_reply_v4",
  "template_version": 4,
  "system_prompt_hash": "sha256:8f91...",
  "user_prompt_length": 1840,
  "final_prompt_tokens": 3120,
  "redaction_applied": true,
  "created_at": "2026-05-03T18:40:12Z"
}

Notice the hash. You do not always need to store the entire system prompt in every log; a hash plus a version is often enough to connect a production request to the exact prompt that produced it.
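
As a rough sketch of how such a hash could be produced, the example below uses `createHash` from Node's built-in `node:crypto` module. The helper names `hashPrompt` and `buildPromptLog` and the `PromptLog` shape are assumptions made for illustration. Token counts are left out because they depend on the provider's tokenizer.

```typescript
import { createHash } from 'node:crypto';

// Returns a stable fingerprint such as "sha256:<64 hex characters>".
function hashPrompt(text: string): string {
  const digest = createHash('sha256').update(text, 'utf8').digest('hex');
  return `sha256:${digest}`;
}

// Hypothetical shape: a metadata-only prompt log that never contains raw text.
type PromptLog = {
  request_id: string;
  prompt_template: string;
  template_version: number;
  system_prompt_hash: string;
  user_prompt_length: number;
};

function buildPromptLog(
  requestId: string,
  template: { name: string; version: number; systemPrompt: string },
  userPrompt: string,
): PromptLog {
  return {
    request_id: requestId,
    prompt_template: template.name,
    template_version: template.version,
    system_prompt_hash: hashPrompt(template.systemPrompt),
    user_prompt_length: userPrompt.length,
  };
}

const promptLog = buildPromptLog(
  'ai_req_01HX9',
  { name: 'support_reply_v4', version: 4, systemPrompt: 'You are a support assistant.' },
  'Can I get a refund?',
);
console.log(JSON.stringify(promptLog));
```

The same system prompt always yields the same hash, so two requests can be compared without either log containing the prompt text.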

For debugging, you may also store redacted prompt snapshots:

TypeScript

function redactPrompt(prompt: string): string {
  return prompt
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, '[EMAIL]')
    .replace(/\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, '[CARD_NUMBER]')
    .replace(/sk-[A-Za-z0-9_-]+/g, '[API_KEY]');
}

This is simple, not perfect. In real systems, redaction should be layered and tested. But the principle is important: observability should not become a data leak.
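
To illustrate what "layered and tested" could look like, here is a minimal sketch that turns the three patterns above into a named rule table and checks them against a few known inputs. The `RedactionRule` type, `redactWithReport`, and the sample cases are assumptions for illustration, not a complete redaction system.

```typescript
type RedactionRule = { name: string; pattern: RegExp; replacement: string };

// The patterns are the ones from redactPrompt; the rule table is an assumption.
const rules: RedactionRule[] = [
  { name: 'email', pattern: /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, replacement: '[EMAIL]' },
  { name: 'card_number', pattern: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, replacement: '[CARD_NUMBER]' },
  { name: 'api_key', pattern: /sk-[A-Za-z0-9_-]+/g, replacement: '[API_KEY]' },
];

// Applies every rule in order and reports which rules changed the text,
// so the log can record what kind of data was removed.
function redactWithReport(prompt: string): { text: string; applied: string[] } {
  const applied: string[] = [];
  let text = prompt;
  for (const rule of rules) {
    const next = text.replace(rule.pattern, rule.replacement);
    if (next !== text) applied.push(rule.name);
    text = next;
  }
  return { text, applied };
}

// A tiny regression suite: each case pairs an input with the expected output.
const cases = [
  {
    input: 'Contact jane@example.com, card 4111 1111 1111 1111, key sk-test_123',
    expected: 'Contact [EMAIL], card [CARD_NUMBER], key [API_KEY]',
  },
  { input: 'No secrets here', expected: 'No secrets here' },
];

for (const c of cases) {
  const { text, applied } = redactWithReport(c.input);
  console.log(text === c.expected ? 'pass' : 'FAIL', JSON.stringify(applied));
}
```

Tests like these also surface false positives. For example, the API key pattern matches the tail of an ordinary word such as "risk-free", which is one reason a single regex layer should not be trusted on its own.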

What to log for retrieval

RAG features need retrieval observability. If an AI answer is bad, the model may not be the problem. The retrieved context may be weak, the index may be stale, or the user's question may sit outside the knowledge base entirely. Log what retrieval returned:

JSON

{
  "request_id": "ai_req_01HX9",
  "retrieval": {
    "index": "support_docs_prod",
    "query": "refund after annual renewal",
    "top_k": 5,
    "documents": [
      {
        "doc_id": "refund-policy-2026",
        "title": "Refund Policy",
        "score": 0.82,
        "version": "2026-04-12"
      },
      {
        "doc_id": "billing-faq",
        "title": "Billing FAQ",
        "score": 0.64,
        "version": "2026-02-01"
      }
    ]
  }
}

This helps you answer practical questions. Did we retrieve the correct policy? Was the document stale? Did the top result have a low score? Did the user ask a question outside the knowledge base? You can also track retrieval quality over time:

TypeScript

type RetrievalMetric = {
  query: string;
  topScore: number;
  resultCount: number;
  clickedDocumentId?: string;
  userFeedback?: 'helpful' | 'not_helpful';
};

function detectWeakRetrieval(metric: RetrievalMetric): boolean {
  return metric.resultCount === 0 || metric.topScore < 0.65;
}

Weak retrieval should be visible. Otherwise, teams blame the model when the actual issue is missing content.
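
A minimal sketch of making weak retrieval visible over time: the share of requests with weak retrieval, plus a simple staleness check on a document's version date. The 0.65 threshold is the one used above. `weakRetrievalRate`, `isStale`, and the 90-day window are assumptions for illustration.

```typescript
type RetrievalSample = { topScore: number; resultCount: number };

const WEAK_SCORE_THRESHOLD = 0.65; // same cut-off as detectWeakRetrieval

function isWeak(sample: RetrievalSample): boolean {
  return sample.resultCount === 0 || sample.topScore < WEAK_SCORE_THRESHOLD;
}

// Fraction of requests whose retrieval was weak (0 when there are no samples).
function weakRetrievalRate(samples: RetrievalSample[]): number {
  if (samples.length === 0) return 0;
  return samples.filter(isWeak).length / samples.length;
}

// Treats a document as stale when its version date (ISO "YYYY-MM-DD") is older
// than maxAgeDays. The window is a hypothetical policy, not a recommendation.
// An unparseable date yields NaN and is reported as not stale.
function isStale(version: string, now: Date, maxAgeDays: number): boolean {
  const ageMs = now.getTime() - Date.parse(version);
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}

const samples: RetrievalSample[] = [
  { topScore: 0.82, resultCount: 5 },
  { topScore: 0.4, resultCount: 5 },
  { topScore: 0, resultCount: 0 },
  { topScore: 0.7, resultCount: 3 },
];
console.log('weak retrieval rate:', weakRetrievalRate(samples));

const today = new Date('2026-05-03T00:00:00Z');
console.log('billing-faq stale:', isStale('2026-02-01', today, 90));
console.log('refund-policy-2026 stale:', isStale('2026-04-12', today, 90));
```

A rising weak-retrieval rate on a dashboard points at content or indexing gaps before anyone starts tuning the prompt or swapping the model.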

Figure: RAG observability. A user question hits a vector search index, which returns top-k documents with similarity scores, version stamps, and freshness badges. The documents then feed the model, with warnings on stale or low-similarity results.

What to log for tool calls

AI agents often call tools: database lookups, internal APIs, search services, code runners, ticket systems, or deployment systems. Tool calls need serious observability because they can change real systems.

Log the tool name, the input schema version, the sanitized arguments, the result status, the duration, the retry count, the authorization context, and whether the tool was read-only or write-enabled. Example:

JSON

{
  "request_id": "ai_req_01HX9",
  "tool_call": {
    "tool": "get_customer_subscription",
    "tool_version": "v2",
    "mode": "read_only",
    "duration_ms": 182,
    "status": "success",
    "arguments_redacted": {
      "customer_id": "cus_[REDACTED]"
    }
  }
}

For write tools, add extra guardrails:

JSON

{
  "tool": "cancel_subscription",
  "mode": "write",
  "requires_human_confirmation": true,
  "confirmation_id": "confirm_7831",
  "executed": false
}

A production AI assistant should not casually execute dangerous actions because the model "thought it was right." Read-only tools are safer. Write tools need approvals, audit logs, permissions, and rollback plans.
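
The approval requirement can be enforced in code rather than left to the model. The following is a minimal sketch, assuming a hypothetical `decideToolExecution` gate and an in-memory set of approved confirmation IDs. A real system would look approvals up in a durable store and also check permissions.

```typescript
type ToolMode = 'read_only' | 'write';

type ToolRequest = {
  tool: string;
  mode: ToolMode;
  confirmationId?: string; // present only after a human approved the action
};

type ToolDecision = { allowed: true } | { allowed: false; reason: string };

// Read-only tools pass through; write tools need an approved confirmation.
function decideToolExecution(
  request: ToolRequest,
  approvedConfirmations: ReadonlySet<string>,
): ToolDecision {
  if (request.mode === 'read_only') return { allowed: true };
  if (!request.confirmationId) {
    return { allowed: false, reason: 'missing_confirmation' };
  }
  if (!approvedConfirmations.has(request.confirmationId)) {
    return { allowed: false, reason: 'confirmation_not_approved' };
  }
  return { allowed: true };
}

const approved = new Set(['confirm_7831']);

const readDecision = decideToolExecution(
  { tool: 'get_customer_subscription', mode: 'read_only' },
  approved,
);
const blockedDecision = decideToolExecution(
  { tool: 'cancel_subscription', mode: 'write' },
  approved,
);
const approvedDecision = decideToolExecution(
  { tool: 'cancel_subscription', mode: 'write', confirmationId: 'confirm_7831' },
  approved,
);

// Every decision, allowed or not, is worth writing to the audit log.
console.log(JSON.stringify({ readDecision, blockedDecision, approvedDecision }));
```

Logging the denied decisions matters as much as the allowed ones: a spike in `missing_confirmation` shows how often the model tries to act without approval.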

Latency: measure the full path, not only the model

AI latency is often multi-part. A slow response may include prompt building, retrieval, model generation, tool calls, response validation, streaming delay, and frontend rendering. Track each part separately:

TypeScript

type AiTiming = {
  requestId: string;
  promptBuildMs: number;
  retrievalMs: number;
  modelMs: number;
  toolMs: number;
  validationMs: number;
  totalMs: number;
};

function logTiming(timing: AiTiming): void {
  console.log(JSON.stringify({
    event: 'ai_timing',
    ...timing,
  }));
}

A dashboard should show p50, p95, and p99 latency. Average latency hides pain: if most requests finish in two seconds but 5% take thirty, the average still looks tolerable while one user in twenty waits half a minute.
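
To make the point about averages concrete, here is a small worked example using the nearest-rank percentile method (one of several common definitions, so a metrics backend may report slightly different values). The `percentile` helper and the synthetic sample set are assumptions for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of the
// samples are less than or equal to it. p must be in the range (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Synthetic data: 95 requests take 2 s and 5 requests take 30 s.
const latenciesMs: number[] = [
  ...Array.from({ length: 95 }, () => 2000),
  ...Array.from({ length: 5 }, () => 30000),
];

console.log('mean', mean(latenciesMs)); // 3400
console.log('p50', percentile(latenciesMs, 50)); // 2000
console.log('p95', percentile(latenciesMs, 95)); // 2000
console.log('p99', percentile(latenciesMs, 99)); // 30000
```

In this data set the mean of 3.4 seconds describes no real request, and with exactly 5% slow requests the nearest-rank p95 still reads 2 seconds. Only p99 exposes the thirty-second tail, which is a reason to chart several percentiles together.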

For streaming responses, also track time to first token:

JSON

{
  "request_id": "ai_req_01HX9",
  "time_to_first_token_ms": 740,
  "time_to_complete_ms": 8420
}

Time to first token matters because users feel the product is alive when streaming starts quickly.
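
A minimal sketch of how these two numbers could be captured: a small timer that is started with the request, notified on every streamed chunk, and read when the stream ends. `StreamTimer` is a hypothetical helper, and the injectable clock exists only so the example is deterministic.

```typescript
class StreamTimer {
  private readonly now: () => number;
  private readonly startedAt: number;
  private firstTokenAt: number | null = null;

  constructor(now: () => number = Date.now) {
    this.now = now;
    this.startedAt = now();
  }

  // Call for every streamed chunk; only the first call records a timestamp.
  onChunk(): void {
    if (this.firstTokenAt === null) this.firstTokenAt = this.now();
  }

  // Call once when the stream ends. time_to_first_token_ms is null when the
  // stream finished without producing any chunk.
  finish(requestId: string) {
    const endedAt = this.now();
    return {
      request_id: requestId,
      time_to_first_token_ms:
        this.firstTokenAt === null ? null : this.firstTokenAt - this.startedAt,
      time_to_complete_ms: endedAt - this.startedAt,
    };
  }
}

// Deterministic usage with a fake clock: start, first chunk, end of stream.
const ticks = [1000, 1740, 9420];
let tickIndex = 0;
const fakeClock = () => ticks[tickIndex++];

const timer = new StreamTimer(fakeClock);
timer.onChunk();
timer.onChunk(); // later chunks do not change time to first token
const streamTiming = timer.finish('ai_req_01HX9');
console.log(JSON.stringify(streamTiming));
```

In production the default `Date.now` clock would be used, with `onChunk` called inside the loop that forwards streamed chunks to the client.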

Token usage and cost

Token usage is not just a billing detail. It is a product health signal. A feature can become expensive because prompts include too much irrelevant context, retrieval returns too many long chunks, conversation history is not summarized, the model is too powerful for a simple task, agents call each other repeatedly, or retries happen silently. Log token usage per feature:

JSON

{
  "feature": "pr_summary_assistant",
  "model": "example-large-model",
  "input_tokens": 18420,
  "output_tokens": 1620,
  "cached_input_tokens": 9000,
  "estimated_cost_usd": 0.084,
  "request_id": "ai_req_45KQ"
}

Then create cost dashboards by feature, customer, team, or workflow. A useful metric is cost per successful outcome, not only cost per request. For example:

Text

support_reply_assistant
- 10,000 requests
- $420 model cost
- 6,800 helpful responses
- cost per helpful response ≈ $0.062

That is much more useful than saying "we spent $420."
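
The arithmetic is simple enough to compute next to the raw counters. This sketch assumes a hypothetical `FeatureUsage` record and reuses the numbers from the example: $420 / 6,800 helpful responses is about $0.0618, while the cost per request is only $0.042.

```typescript
type FeatureUsage = {
  feature: string;
  requests: number;
  modelCostUsd: number;
  helpfulResponses: number;
};

// Returns null instead of dividing by zero when there is nothing to divide by.
function costPerRequest(usage: FeatureUsage): number | null {
  return usage.requests === 0 ? null : usage.modelCostUsd / usage.requests;
}

function costPerHelpfulResponse(usage: FeatureUsage): number | null {
  return usage.helpfulResponses === 0
    ? null
    : usage.modelCostUsd / usage.helpfulResponses;
}

const usage: FeatureUsage = {
  feature: 'support_reply_assistant',
  requests: 10000,
  modelCostUsd: 420,
  helpfulResponses: 6800,
};

console.log(costPerRequest(usage)?.toFixed(4)); // "0.0420"
console.log(costPerHelpfulResponse(usage)?.toFixed(4)); // "0.0618"
```

The gap between the two numbers is the price of unhelpful answers: the lower the helpful rate, the further cost per helpful response drifts above cost per request.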

Failed generations and schema validation

AI output can fail even when the API request succeeds. Maybe the response is not valid JSON, maybe it misses required fields, maybe it includes text when your application expects structured data. Use schema validation:

TypeScript

import { z } from 'zod';

const PrSummarySchema = z.object({
  summary: z.string().min(20),
  changedBehavior: z.array(z.string()),
  riskyFiles: z.array(z.string()),
  testsRun: z.array(z.string()),
  missingTests: z.array(z.string()),
});

function parsePrSummary(raw: unknown) {
  const result = PrSummarySchema.safeParse(raw);

  if (!result.success) {
    throw new Error(`Invalid AI response schema: ${result.error.message}`);
  }

  return result.data;
}

Then log validation failures:

JSON

{
  "event": "ai_response_validation_failed",
  "feature": "pr_summary_assistant",
  "template_version": 3,
  "model": "example-large-model",
  "error": "missing required field: testsRun"
}

This lets you detect prompt regressions quickly.
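
One way to turn those logs into a regression signal is to compare the validation failure rate across template versions. The sketch below assumes a hypothetical `ValidationOutcome` record emitted once per response. The sample data is made up for illustration.

```typescript
type ValidationOutcome = {
  feature: string;
  templateVersion: number;
  valid: boolean;
};

// Groups outcomes by template version and returns the failure rate per version.
function failureRateByTemplate(outcomes: ValidationOutcome[]): Map<number, number> {
  const totals = new Map<number, { total: number; failed: number }>();
  for (const outcome of outcomes) {
    const entry = totals.get(outcome.templateVersion) ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (!outcome.valid) entry.failed += 1;
    totals.set(outcome.templateVersion, entry);
  }
  const rates = new Map<number, number>();
  totals.forEach((entry, version) => {
    rates.set(version, entry.failed / entry.total);
  });
  return rates;
}

// Made-up sample: version 4 fails twice as often as version 3.
const outcomes: ValidationOutcome[] = [
  { feature: 'pr_summary_assistant', templateVersion: 3, valid: true },
  { feature: 'pr_summary_assistant', templateVersion: 3, valid: true },
  { feature: 'pr_summary_assistant', templateVersion: 3, valid: true },
  { feature: 'pr_summary_assistant', templateVersion: 3, valid: false },
  { feature: 'pr_summary_assistant', templateVersion: 4, valid: false },
  { feature: 'pr_summary_assistant', templateVersion: 4, valid: false },
  { feature: 'pr_summary_assistant', templateVersion: 4, valid: true },
  { feature: 'pr_summary_assistant', templateVersion: 4, valid: true },
];

failureRateByTemplate(outcomes).forEach((rate, version) => {
  console.log(`template v${version}: ${(rate * 100).toFixed(0)}% invalid`);
});
```

A jump in the rate right after a template change is a strong hint that the new prompt, not the model or the provider, caused the regression.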

Figure: production observability dashboard for AI features, with panels for p95 latency, token cost, retrieval quality, tool failures, schema validation errors, and user feedback.

User feedback is part of observability

AI quality is not only technical. Users can tell you when an answer was helpful, wrong, too long, unsafe, or irrelevant. Do not collect only thumbs up/down; add lightweight reason categories:

TypeScript

type AiFeedback = {
  requestId: string;
  rating: 'positive' | 'negative';
  reason?:
    | 'incorrect'
    | 'missing_context'
    | 'too_verbose'
    | 'unsafe'
    | 'not_actionable'
    | 'other';
  comment?: string;
};

This feedback can feed evaluation datasets. If users repeatedly mark answers as missing_context, the problem may be retrieval. If they mark answers as too_verbose, the prompt may need tighter formatting rules. If they mark answers as incorrect, you need deeper analysis: bad prompt, bad context, weak model, ambiguous user input, or missing business rule.
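
The triage described above can be sketched as a tally of negative feedback by reason, with each reason pointing at the area to look at first. The `suspectedArea` mapping simply restates the paragraph. The helper names and sample data are assumptions for illustration.

```typescript
type Reason =
  | 'incorrect'
  | 'missing_context'
  | 'too_verbose'
  | 'unsafe'
  | 'not_actionable'
  | 'other';

type Feedback = {
  requestId: string;
  rating: 'positive' | 'negative';
  reason?: Reason;
};

// Only the reasons discussed in the text are mapped; the rest need manual review.
const suspectedArea: Partial<Record<Reason, string>> = {
  missing_context: 'retrieval',
  too_verbose: 'prompt formatting rules',
  incorrect: 'deeper analysis (prompt, context, model, input, business rules)',
};

// Counts negative feedback per reason; negative feedback without a reason is 'other'.
function countNegativeReasons(feedback: Feedback[]): Map<Reason, number> {
  const counts = new Map<Reason, number>();
  for (const item of feedback) {
    if (item.rating !== 'negative') continue;
    const reason = item.reason ?? 'other';
    counts.set(reason, (counts.get(reason) ?? 0) + 1);
  }
  return counts;
}

const feedback: Feedback[] = [
  { requestId: 'r1', rating: 'negative', reason: 'missing_context' },
  { requestId: 'r2', rating: 'negative', reason: 'missing_context' },
  { requestId: 'r3', rating: 'negative', reason: 'too_verbose' },
  { requestId: 'r4', rating: 'positive' },
  { requestId: 'r5', rating: 'negative' },
];

countNegativeReasons(feedback).forEach((count, reason) => {
  console.log(reason, count, suspectedArea[reason] ?? 'manual review');
});
```

Because every feedback record carries the request ID, each count can be traced back to the prompts, retrieval results, and tool calls that produced the answer.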

Evaluation and regression testing

Production observability tells you what happened. Evaluations help you prevent known failures from coming back. Create a small dataset of realistic cases:

JSON

[
  {
    "id": "refund_annual_plan_001",
    "input": "Can I get a refund if my annual plan renewed yesterday?",
    "expected_traits": [
      "mentions refund window",
      "does not promise refund automatically",
      "asks for account details if needed"
    ],
    "forbidden_traits": [
      "invented policy",
      "asks for full credit card number"
    ]
  }
]

You can run this dataset whenever you change the prompt template, the retrieval index, the model, the tool definitions, the system instructions, or the response schema. The goal is not perfect testing. The goal is catching obvious regressions before users do.
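
A minimal sketch of a runner for such a dataset follows. It rests on several assumptions: traits are checked with hypothetical keyword predicates (real evaluations often use stricter rules or a grading model), `generate` is a synchronous stand-in for the actual AI feature, and the names `EvalCase`, `runEval`, and `traitChecks` are invented for this example.

```typescript
type EvalCase = {
  id: string;
  input: string;
  expectedTraits: string[];
  forbiddenTraits: string[];
};

type TraitCheck = (answer: string) => boolean;
type EvalResult = { id: string; passed: boolean; missing: string[]; violated: string[] };

// Hypothetical keyword-based checks, keyed by trait name.
const traitChecks = new Map<string, TraitCheck>([
  ['mentions refund window', (a) => /refund window/i.test(a)],
  ['asks for full credit card number', (a) => /full (credit )?card number/i.test(a)],
]);

function runEval(
  cases: EvalCase[],
  generate: (input: string) => string,
  checks: Map<string, TraitCheck>,
): EvalResult[] {
  // Fail loudly on unknown traits so a typo cannot silently pass a case.
  const check = (trait: string, answer: string): boolean => {
    const fn = checks.get(trait);
    if (!fn) throw new Error(`no check registered for trait: ${trait}`);
    return fn(answer);
  };
  return cases.map((c) => {
    const answer = generate(c.input);
    const missing = c.expectedTraits.filter((t) => !check(t, answer));
    const violated = c.forbiddenTraits.filter((t) => check(t, answer));
    return { id: c.id, passed: missing.length === 0 && violated.length === 0, missing, violated };
  });
}

const evalCases: EvalCase[] = [
  {
    id: 'refund_annual_plan_001',
    input: 'Can I get a refund if my annual plan renewed yesterday?',
    expectedTraits: ['mentions refund window'],
    forbiddenTraits: ['asks for full credit card number'],
  },
];

// Stand-in for the real feature: returns a canned answer.
const stubAnswer = () =>
  'Refunds depend on the refund window in your plan terms. Could you share your account email?';

console.log(JSON.stringify(runEval(evalCases, stubAnswer, traitChecks)));
```

Running the same cases before and after a prompt, model, or index change gives a pass/fail diff that is cheap enough to check on every change.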

A simple AI observability schema

Here is a practical event model:

TypeScript

type AiEvent =
  | {
      type: 'ai.request.started';
      requestId: string;
      feature: string;
      userId?: string;
      promptTemplate: string;
    }
  | {
      type: 'ai.retrieval.completed';
      requestId: string;
      topScore: number;
      resultCount: number;
      documentIds: string[];
    }
  | {
      type: 'ai.model.completed';
      requestId: string;
      model: string;
      inputTokens: number;
      outputTokens: number;
      durationMs: number;
    }
  | {
      type: 'ai.tool.completed';
      requestId: string;
      toolName: string;
      status: 'success' | 'failure';
      durationMs: number;
    }
  | {
      type: 'ai.response.validation_failed';
      requestId: string;
      error: string;
    }
  | {
      type: 'ai.feedback.received';
      requestId: string;
      rating: 'positive' | 'negative';
      reason?: string;
    };

You can send these events to your normal observability stack. The exact vendor matters less than consistency.
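
As a sketch of what consistency can mean in practice, the example below emits each event as one timestamped JSON line and uses an exhaustive `switch` so that adding a new event type forces every consumer to handle it. The union is a trimmed copy of the one above. `emit`, `isFailure`, and the `Sink` type are hypothetical names.

```typescript
// Trimmed copy of the AiEvent union so the sketch is self-contained.
type AiEvent =
  | {
      type: 'ai.model.completed';
      requestId: string;
      model: string;
      inputTokens: number;
      outputTokens: number;
      durationMs: number;
    }
  | {
      type: 'ai.tool.completed';
      requestId: string;
      toolName: string;
      status: 'success' | 'failure';
      durationMs: number;
    }
  | { type: 'ai.response.validation_failed'; requestId: string; error: string };

// A sink is anything that accepts one serialized line (stdout, a file, an agent).
type Sink = (line: string) => void;

function emit(event: AiEvent, sink: Sink, now: () => Date = () => new Date()): void {
  sink(JSON.stringify({ timestamp: now().toISOString(), ...event }));
}

// Exhaustive switch: the `never` assignment stops compiling if a new event
// type is added to the union without a matching case here.
function isFailure(event: AiEvent): boolean {
  switch (event.type) {
    case 'ai.model.completed':
      return false;
    case 'ai.tool.completed':
      return event.status === 'failure';
    case 'ai.response.validation_failed':
      return true;
    default: {
      const unhandled: never = event;
      return unhandled;
    }
  }
}

const lines: string[] = [];
const toolEvent: AiEvent = {
  type: 'ai.tool.completed',
  requestId: 'ai_req_01HX9',
  toolName: 'get_customer_subscription',
  status: 'failure',
  durationMs: 182,
};
emit(toolEvent, (line) => lines.push(line), () => new Date('2026-05-03T18:40:12Z'));
console.log(lines[0], isFailure(toolEvent));
```

Because the sink is just a function, the same events can go to stdout during development and to a log shipper in production without changing the call sites.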

Privacy and retention

AI observability can collect sensitive information if you are not careful. Set clear rules: redact secrets before logging, avoid storing raw prompts by default, set retention limits, separate debugging access from general analytics access, record prompt template versions, store document IDs instead of full documents when possible, and audit access to AI traces. This is especially important for internal assistants that can read customer support tickets, invoices, medical records, legal documents, or private engineering docs.
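
Rules like these are easier to enforce when they live in code. The following sketch encodes a retention policy per kind of stored trace. The kinds and the day counts are purely illustrative assumptions, and actual limits should come from your own legal and security requirements.

```typescript
type TraceKind = 'metadata' | 'redacted_prompt' | 'raw_prompt';

// Hypothetical retention windows in days; 0 means "do not store at all".
const retentionDays: Record<TraceKind, number> = {
  metadata: 90,
  redacted_prompt: 14,
  raw_prompt: 0,
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Checked before writing: raw prompts are not stored under this policy.
function shouldStore(kind: TraceKind): boolean {
  return retentionDays[kind] > 0;
}

// Checked by a cleanup job: true once a record has outlived its window.
function isExpired(kind: TraceKind, storedAt: Date, now: Date): boolean {
  return now.getTime() - storedAt.getTime() > retentionDays[kind] * DAY_MS;
}

const storedAt = new Date('2026-04-01T00:00:00Z');
const checkedAt = new Date('2026-05-03T00:00:00Z'); // 32 days later

console.log('store raw prompt:', shouldStore('raw_prompt'));
console.log('redacted prompt expired:', isExpired('redacted_prompt', storedAt, checkedAt));
console.log('metadata expired:', isExpired('metadata', storedAt, checkedAt));
```

Keeping the policy in one table makes it reviewable: a change to how long prompts are kept shows up as a one-line diff instead of being scattered across logging calls.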

Final thoughts

AI features need observability because they are probabilistic, context-sensitive, and expensive. You need more than HTTP 200 and p95 latency. You need to see prompts, retrieval, tool calls, tokens, cost, validation failures, user feedback, and evaluation results.

The best AI observability systems do not only help you debug failures. They help you improve the product. They show where retrieval is weak, where prompts waste context, where model upgrades changed behavior, and where users do not trust the answer. That is the difference between an AI demo and an AI product. A demo only needs to work once. A product needs to keep working tomorrow.
