It usually starts as a script. You paste an OpenAI key into a .env.local, call generateText, and watch a coherent paragraph appear in your terminal. You wrap it in a Next.js route handler, hook it up to a textarea, and demo it on the team Slack. People react with the fire emoji. You ship a Loom video.
Then you put it in front of real users and the seams show within a day. The model returns JSON with a stray comment block, and JSON.parse throws. Someone pastes a five-megabyte transcript and you get a 400 from OpenAI about token limits. The request takes 12 seconds, the user clicks "Generate" four times, and your monthly bill has a new shape. None of this is the model being broken. It's the model being a non-deterministic remote service that you wrapped like a normal API call.
This is the gap between "AI demo" and "AI feature." Most of it is closed with the same TypeScript habits you'd apply to any external integration — boundary validation, structured failure modes, retry semantics, idempotency. The model just happens to be where the chaos lives.
LLMs Are Not APIs, They Are Untrusted Subprocesses
A normal API has a contract. A GET /users/:id returns either a user shape or an error you can branch on. You can write a type for it once and call it a day.
An LLM has a tendency. You ask for JSON; you usually get JSON. Sometimes you get JSON wrapped in triple backticks. Sometimes you get an apology before the JSON. Sometimes you get the JSON with an extra field you never asked for, because the model thought it would be helpful.
The right mental model is: an LLM call is a subprocess running untrusted code that returns a string. Every byte of that string crosses a trust boundary on the way back into your app. Your job, in TypeScript, is to put a real boundary there — not a comment that says "// TODO: validate".
Validate At The Boundary With Zod 4 And generateObject
If you need structured output, do not ask the model for JSON in a prompt and then JSON.parse it. The Vercel AI SDK exposes generateObject (and streamObject) for exactly this: you give it a Zod schema and it handles the model-side instructions, parsing, and schema validation for you.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const Profile = z.object({
  name: z.string().min(1),
  skills: z.array(z.string()).max(20),
  seniority: z.enum(["junior", "mid", "senior"]),
  contactEmail: z.email().optional(),
});

export type Profile = z.infer<typeof Profile>;

export async function extractProfile(bio: string): Promise<Profile> {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: Profile,
    prompt: `Extract a structured profile from this bio:\n\n${bio}`,
  });
  return object;
}
A few things worth pointing out. gpt-4o-mini is a sensible default for extraction work: it's fast, cheap, and good enough that paying more than ten times as much for gpt-4o is rarely justified. z.email() is the Zod 4 spelling (the old z.string().email() still works but is deprecated in favor of the top-level string formats). And generateObject throws a typed error (NoObjectGeneratedError) if the model's output can't be parsed or doesn't match your schema, so you don't have to chase trailing commas yourself.
When You Do Need To Parse Free Text, Use safeParse
Sometimes you're dealing with a model response that's mostly natural language with one extracted field, or you're calling a provider that doesn't support structured outputs cleanly. In that case, parse defensively:
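const Result = z.object({ summary: z.string(), tags: z.array(z.string()) });
let json: unknown;
try {
  json = JSON.parse(rawText); // JSON.parse itself throws on malformed input, so guard it too
} catch {
  throw new Error("model output was not valid JSON");
}
const parsed = Result.safeParse(json);
if (!parsed.success) {
  // structured error you can log or surface to retry logic
  throw new Error("model output failed schema: " + JSON.stringify(z.flattenError(parsed.error)));
}
return parsed.data;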
safeParse returns a discriminated union instead of throwing, which composes cleanly with whatever error envelope your route handler uses. z.flattenError (Zod 4) gives you a serializable view that's safe to log.
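The failure modes described earlier (JSON wrapped in triple backticks, an apology before the JSON) can often be handled before JSON.parse ever runs. The following is a minimal sketch, not a robust parser: extractJsonCandidate is a hypothetical helper name, and it assumes the payload is a single top-level JSON object.

```typescript
// Hypothetical helper: pull the most likely JSON object out of a chatty model reply.
// Heuristic only; it assumes the payload is a single top-level JSON object.
const FENCE = "`".repeat(3); // a Markdown code fence, built here to avoid a literal one
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

export function extractJsonCandidate(raw: string): string | null {
  // Case 1: the model wrapped the payload in a Markdown code fence.
  const fenced = raw.match(FENCED_BLOCK);
  const text = fenced?.[1] ?? raw;

  // Case 2: prose before or after the object. Keep the outermost { ... } span.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  return text.slice(start, end + 1);
}
```

The returned string is still untrusted: it goes through JSON.parse and safeParse exactly like any other model output. A null result is a signal to retry or fail, not to guess.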
Streaming Is The Real Latency Fix
The honest answer to "the model takes ten seconds" is not "make it faster." It's "show the user something within 200ms." Streaming is non-negotiable for chat-style features. streamText from the Vercel AI SDK returns a result whose toUIMessageStreamResponse() method produces a Response you can return directly from a Next.js route handler:
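import { convertToModelMessages, streamText, type UIMessage } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  // useChat sends UI messages; streamText expects model messages
  const { messages }: { messages: UIMessage[] } = await req.json();
  const result = streamText({
    model: openai("gpt-4o"),
    system: "You are a concise product help assistant.",
    // await is harmless if this returns synchronously in your SDK version
    messages: await convertToModelMessages(messages),
  });
  return result.toUIMessageStreamResponse();
}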
On the client side, useChat from @ai-sdk/react consumes that stream, manages the message array, and gives you a controller you can wire to a form. The user sees tokens within a few hundred milliseconds. The total time to finish doesn't change, but the perceived latency drops by an order of magnitude.
Background Work For Anything Over 30 Seconds
For long-running generation — analyzing a PDF, regenerating an entire document, multi-step agent runs — streaming buys you nothing because the user shouldn't be staring at a tab anyway. Accept the request, return a 202 Accepted with a job ID, run the work in a queue (Inngest, BullMQ, Cloudflare Queues, whatever your stack uses), and notify the UI over SSE or a poll. This also lets your retry policy live somewhere sane instead of inside a request handler that times out at 60 seconds.
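The accept-then-poll flow can be sketched as follows. This is an illustration under stated assumptions, not a production queue: the in-memory Map stands in for a real queue and job table, and acceptJob and getJob are hypothetical names. It relies on the global Response and crypto.randomUUID available in recent Node.js and edge runtimes.

```typescript
type Job =
  | { status: "queued" | "running" }
  | { status: "done"; result: string }
  | { status: "failed"; error: string };

// In-memory stand-in for a real queue plus job table (hypothetical; swap in your own).
const jobs = new Map<string, Job>();

export function acceptJob(work: () => Promise<string>): Response {
  const jobId = crypto.randomUUID();
  jobs.set(jobId, { status: "queued" });

  // Fire and forget: the HTTP request does not wait for the work to finish.
  void (async () => {
    jobs.set(jobId, { status: "running" });
    try {
      jobs.set(jobId, { status: "done", result: await work() });
    } catch (err) {
      jobs.set(jobId, { status: "failed", error: String(err) });
    }
  })();

  // 202 Accepted: the client polls (or subscribes) using the job ID.
  return new Response(JSON.stringify({ jobId }), {
    status: 202,
    headers: { "content-type": "application/json" },
  });
}

// What a polling endpoint would look up.
export function getJob(jobId: string): Job | undefined {
  return jobs.get(jobId);
}
```

On serverless platforms the fire-and-forget part will not survive the end of the request, which is exactly why the work belongs in a real queue. The status shape and the 202 handshake carry over unchanged.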
Retries Need Backoff, Jitter, And A Budget
OpenAI 429s exist. Anthropic 529s exist. Network blips exist. A naive retry loop turns one bad minute into a stampede. The pattern that actually behaves:
async function withRetry<T>(op: () => Promise<T>, opts = { max: 3, baseMs: 500 }) {
  let lastErr: unknown;
  for (let attempt = 0; attempt < opts.max; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      if (!isRetryable(err) || attempt === opts.max - 1) throw err;
      const backoff = opts.baseMs * 2 ** attempt;
      const jitter = Math.random() * backoff;
      await new Promise((r) => setTimeout(r, backoff + jitter));
    }
  }
  throw lastErr;
}
The jitter matters more than the backoff. Without it, every client that hit the same 429 retries at exactly the same moment after the same delay, and the rate limiter sees the same wave again. isRetryable is a small predicate: 429, 5xx, network errors, yes; 400, 401, 422, no.
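One possible shape for that predicate is sketched below. The property names it inspects (status, statusCode, code) are assumptions, since each HTTP client and SDK exposes errors differently, and the list of network error codes is illustrative rather than exhaustive.

```typescript
// Hypothetical error shape: many HTTP clients and SDK errors expose a numeric status,
// but the exact property name varies by library, so two common spellings are checked.
type MaybeHttpError = { status?: unknown; statusCode?: unknown; code?: unknown };

// Assumed set of Node.js network error codes worth retrying.
const RETRYABLE_NETWORK_CODES = new Set(["ECONNRESET", "ETIMEDOUT", "EAI_AGAIN", "ECONNREFUSED"]);

export function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as MaybeHttpError;

  const status =
    typeof e.status === "number" ? e.status
    : typeof e.statusCode === "number" ? e.statusCode
    : undefined;

  if (status !== undefined) {
    // 429 (rate limited) and any 5xx, which includes Anthropic's 529, are transient.
    // Everything else with a status (400, 401, 422, ...) will fail the same way again.
    return status === 429 || (status >= 500 && status <= 599);
  }

  // No HTTP status: treat known network-level failures as retryable.
  return typeof e.code === "string" && RETRYABLE_NETWORK_CODES.has(e.code);
}
```

Anything the predicate doesn't recognize is treated as non-retryable, which is the safer default when each retry costs money.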
Idempotency Stops The Quadruple-Click Bug
The user clicks "Generate" four times because the spinner doesn't feel like progress. Without idempotency, you just paid for four completions and stored four results. Add a client-supplied key:
const key = crypto.randomUUID(); // generated once when the form mounts
await fetch("/api/generate", {
  method: "POST",
  headers: { "Idempotency-Key": key, "content-type": "application/json" },
  body: JSON.stringify({ input }),
});
On the server, before you call the model, check whether you've already produced a result for that key. If yes, return the cached result. If no, run the model and store the result against the key with a sensible TTL (an hour is usually enough). Redis with SET NX EX is the boring, correct primitive for this.
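The server-side half might look like the sketch below. To stay self-contained it uses an in-memory store whose setIfAbsent mimics Redis SET NX EX. The KeyStore interface, MemoryStore, and runOnce are hypothetical names, and a real deployment would implement the same interface on top of Redis.

```typescript
// Minimal stand-in for the Redis operations the pattern needs (hypothetical interface).
interface KeyStore {
  setIfAbsent(key: string, value: string, ttlSeconds: number): Promise<boolean>; // SET NX EX
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  get(key: string): Promise<string | null>;
  delete(key: string): Promise<void>;
}

export class MemoryStore implements KeyStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();

  private read(key: string): string | null {
    const entry = this.entries.get(key);
    return entry && entry.expiresAt > Date.now() ? entry.value : null;
  }
  private write(key: string, value: string, ttlSeconds: number): void {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
  async get(key: string) { return this.read(key); }
  async set(key: string, value: string, ttlSeconds: number) { this.write(key, value, ttlSeconds); }
  async delete(key: string) { this.entries.delete(key); }
  async setIfAbsent(key: string, value: string, ttlSeconds: number) {
    // Check and write with no await in between, so the claim is atomic.
    if (this.read(key) !== null) return false;
    this.write(key, value, ttlSeconds);
    return true;
  }
}

type Outcome = { status: "done"; result: string } | { status: "in_progress" };

export async function runOnce(
  store: KeyStore,
  key: string,
  generate: () => Promise<string>,
  ttlSeconds = 3600, // one hour
): Promise<Outcome> {
  const pending: Outcome = { status: "in_progress" };
  // Claim the key first; only the first request for this key gets to call the model.
  const claimed = await store.setIfAbsent(key, JSON.stringify(pending), ttlSeconds);
  if (!claimed) {
    const existing = await store.get(key);
    return existing === null ? pending : (JSON.parse(existing) as Outcome);
  }
  try {
    const done: Outcome = { status: "done", result: await generate() };
    await store.set(key, JSON.stringify(done), ttlSeconds);
    return done;
  } catch (err) {
    await store.delete(key); // release the claim so the client can try again
    throw err;
  }
}
```

Claiming the key before calling the model, rather than checking for a cached result and then running, is what closes the window in which four near-simultaneous clicks would each see "no result yet". Duplicate requests that arrive while the first is still running get an in_progress outcome, which a route handler could map to a 409 or a "still working" response.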
Cost And Token Limits Are Product Constraints, Not Footnotes
A free-tier user shouldn't be able to burn $40 of completions in an afternoon. Cap input tokens before you call the model, not after. The provider will tell you the input was too large via a 400, but by then you've already spent the round-trip and your logs are full of preventable errors. Use a tokenizer (tiktoken for OpenAI, Anthropic's messages.countTokens endpoint via @anthropic-ai/sdk for Claude) to measure and truncate the prompt yourself, and set maxOutputTokens on the call (named maxTokens before AI SDK 5) so a runaway response doesn't keep going.
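A minimal sketch of capping input before the call follows. The token counter is injected so a real tokenizer can be plugged in. The default estimateTokens uses a rough four-characters-per-token rule of thumb, which is an assumption and not a substitute for the provider's tokenizer. truncateToBudget is a hypothetical helper name.

```typescript
// A token counter is injected so a real tokenizer can replace the rough estimate.
type CountTokens = (text: string) => number;

// Rough heuristic (an assumption): about four characters per token for English text.
export const estimateTokens: CountTokens = (text) => Math.ceil(text.length / 4);

export function truncateToBudget(
  text: string,
  maxInputTokens: number,
  countTokens: CountTokens = estimateTokens,
): { text: string; truncated: boolean } {
  if (countTokens(text) <= maxInputTokens) return { text, truncated: false };

  // Binary search for the longest prefix that fits the budget.
  // Assumes a longer prefix never counts as fewer tokens than a shorter one.
  let lo = 0;
  let hi = text.length;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (countTokens(text.slice(0, mid)) <= maxInputTokens) lo = mid;
    else hi = mid - 1;
  }
  return { text: text.slice(0, lo), truncated: true };
}
```

The truncated flag lets the UI tell the user their transcript was cut, instead of silently summarizing half of it. Cutting from the end is the simplest policy; for chat history you may prefer dropping the oldest messages instead.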
A One-Sentence Mental Model
A production AI feature is just a normal feature where the slowest, least predictable dependency happens to live behind the most expensive HTTPS call you make — treat it accordingly, validate at the boundary, stream what you can, and never let the model decide your retry policy.






