You spend a Friday afternoon "improving the prompt." You add a STEP BY STEP, a few BOLD WORDS, an example output, the phrase "be helpful and accurate." The eval moves from 71% to 73%. You celebrate. You ship.
Then a colleague asks why the assistant doesn't know about the customer's last invoice. You realize you never gave it that information — the prompt was beautifully phrased and gracefully formatted, but the actual data the model needed wasn't in the request at all. The eval suite didn't catch it because the eval suite tested the prompt, not the pipeline.
This is the difference between prompt engineering and context engineering. Prompt engineering is the wording; context engineering is the data pipeline that decides what wording even has a chance of working. In production, the pipeline is where most of the wins live, and TypeScript is a good place to build it.
Tokens Are A Budget, Not An Afterthought
A model doesn't see characters or words. It sees tokens: chunks of text whose count depends on that model's tokenizer. "Hello world" is two tokens for GPT-4o, while a long, rare word like "antidisestablishmentarianism" splits into several. You can't ignore the difference, because every model has a hard context limit, every input and output token costs money, and every additional token of context adds latency.
Counting tokens before you call the model is what makes the rest of context engineering possible:
import { encoding_for_model } from "tiktoken";

// One encoder instance is reused across calls; building it is the expensive part.
const enc = encoding_for_model("gpt-4o");

export function tokenCount(text: string): number {
  return enc.encode(text).length;
}

export function truncateToTokens(text: string, max: number): string {
  const tokens = enc.encode(text);
  if (tokens.length <= max) return text;
  // Keep the first `max` tokens; decode() returns UTF-8 bytes, hence the TextDecoder.
  const slice = tokens.slice(0, max);
  return new TextDecoder().decode(enc.decode(slice));
}
tiktoken is the canonical OpenAI tokenizer, and on the server it works fine. For the browser or the edge, js-tiktoken is a slower but pure-JS alternative. For Claude, the most accurate option is the messages.countTokens endpoint exposed through @anthropic-ai/sdk, which counts server-side instead of relying on a local approximation. Local counts are never a perfect prediction of what the provider will bill: chat formatting and tool definitions add overhead, and a tokenizer built for one vendor's models can drift noticeably on another's. Treat them as estimates and leave a safety margin in the budget.
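When no real tokenizer is available at all (for example, in an environment where loading one is impractical), one fallback is a coarse estimate plus a safety margin. The sketch below is an illustration under stated assumptions: `estimateTokens`, `fitsBudget`, the roughly-four-characters-per-token rule of thumb for English prose, and the 10% default margin are all hypothetical choices, not anything a provider guarantees.

```typescript
// Rough fallback for when no real tokenizer is available.
// ASSUMPTION: ~4 characters per token, a common rule of thumb for English prose.
// Code, non-English text, and unusual symbols can tokenize far less efficiently.
const CHARS_PER_TOKEN = 4;

export function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Treat the budget as smaller than it really is, so estimation error
// is less likely to push a request over the provider's limit.
export function fitsBudget(text: string, budget: number, margin = 0.1): boolean {
  const usable = Math.floor(budget * (1 - margin));
  return estimateTokens(text) <= usable;
}
```

An estimate like this is only good enough for early rejection of obviously oversized inputs. Anything that decides what gets dropped from a prompt should use a real count.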
Build A Token Budget, Not A Concatenation
Once you know how many tokens each piece of context costs, the prompt-building code becomes a budgeting problem. You have a fixed input budget — call it 8,000 tokens to leave room for the response — and you have a list of context blocks: system instructions, user input, retrieval results, recent conversation history. Pack them in priority order and stop when you'd overflow.
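type Block = { name: string; content: string; required?: boolean; priority: number };

// Lower `priority` numbers are more important and get packed first.
export function packContext(blocks: Block[], budget: number): Block[] {
  const sorted = [...blocks].sort((a, b) => a.priority - b.priority);
  const kept = new Set<Block>();
  let used = 0;

  // Pass 1: required blocks always go in, so charge them against the budget first.
  for (const b of sorted) {
    if (!b.required) continue;
    kept.add(b);
    used += tokenCount(b.content);
  }

  // Pass 2: fill what's left with optional blocks, most important first.
  for (const b of sorted) {
    if (b.required) continue;
    const cost = tokenCount(b.content);
    if (used + cost <= budget) {
      kept.add(b);
      used += cost;
    }
  }

  // Return the survivors in priority order.
  return sorted.filter((b) => kept.has(b));
}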
The required flag exists because some blocks aren't optional — the user's actual question, the system instructions, the schema for structured output. Mark them as required, count their tokens first, and then fill the remaining budget with retrieval and history. When you blow the budget, drop the lowest-priority optional blocks instead of truncating the user's question.
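The budget number itself is simple arithmetic, and it helps to compute it instead of hard-coding it. A minimal sketch, assuming the model has a single context window shared by input and output (some models publish separate limits, in which case this does not apply directly); `inputBudget` and its parameter names are hypothetical, and the figures in the usage line are illustrative, not any particular model's limits.

```typescript
// Hypothetical helper: derive the input budget from the model's limits.
export type BudgetInputs = {
  contextWindow: number;   // total tokens the model accepts (input + output)
  maxOutputTokens: number; // tokens reserved for the response
  safetyMargin?: number;   // slack for tokenizer drift and message framing
};

export function inputBudget({
  contextWindow,
  maxOutputTokens,
  safetyMargin = 0,
}: BudgetInputs): number {
  const budget = contextWindow - maxOutputTokens - safetyMargin;
  if (budget <= 0) {
    throw new Error("Nothing left for input: reserved tokens exceed the context window");
  }
  return budget;
}

// Illustrative numbers only.
const exampleBudget = inputBudget({
  contextWindow: 16_000,
  maxOutputTokens: 4_000,
  safetyMargin: 500,
});
// exampleBudget === 11_500
```

Failing loudly when the reservation exceeds the window is a design choice: a silent negative budget would make every optional block get dropped with no hint as to why.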
Use Tags, Not Prose, To Separate Sections
Models reliably recognize structural markers. Triple backticks, XML-style tags, and explicit headers all work; mixing them inside one prompt is the part that confuses models. Pick one and stay consistent. XML tags are particularly robust:
// Minimal shapes assumed for this example; use your own domain types.
type User = { id: string; plan: string; name: string };
type Order = { id: string; status: string; totalCents: number };

export function buildSupportPrompt(input: { user: User; recent: Order[]; question: string }) {
  // Format money deterministically so the model never has to do arithmetic.
  const orders = input.recent
    .map((o) => `- #${o.id} ${o.status} $${(o.totalCents / 100).toFixed(2)}`)
    .join("\n");

  return `
You are a support assistant for ShopCo. Answer using only the data provided.
If you don't have enough information, say "I don't know" rather than guessing.

<user>
id: ${input.user.id}
plan: ${input.user.plan}
name: ${input.user.name}
</user>

<recent_orders>
${orders}
</recent_orders>

<question>
${input.question}
</question>

Remember: if the data above doesn't answer the question, say "I don't know".
`.trim();
}
A few things earn their keep here. Instructions go at the top, and the constraint ("if you don't know, say so") is repeated after the question so the model sees it right before it answers. Data sits in tagged blocks the model can refer to by name in its reasoning. Numbers come pre-formatted: leaving raw cents in the prompt asks the model to do arithmetic that you can do deterministically.
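Once more than one prompt builder exists, it can help to centralize the tagging and number formatting so every prompt uses the same conventions. The helpers below are a hypothetical sketch (`tag`, `formatCents`, and `joinSections` are names introduced here, not part of any library), assuming amounts are stored as integer cents.

```typescript
// Hypothetical helpers for keeping tag usage consistent across prompts.

// Wrap a block of text in an XML-style tag. Empty bodies return "" so the
// prompt never contains a dangling, empty section.
export function tag(name: string, body: string): string {
  const trimmed = body.trim();
  return trimmed === "" ? "" : `<${name}>\n${trimmed}\n</${name}>`;
}

// Format integer cents as a dollar string using integer math only.
export function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const dollars = Math.floor(abs / 100);
  const rest = String(abs % 100).padStart(2, "0");
  return `${sign}$${dollars}.${rest}`;
}

// Join only the non-empty sections, separated by a blank line.
export function joinSections(sections: string[]): string {
  return sections.filter((s) => s !== "").join("\n\n");
}
```

With these, a section for a user with no recent orders simply disappears from the prompt instead of showing up as an empty tag the model has to interpret.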
The Lost-In-The-Middle Problem Is Real
Long contexts have a known weakness: information placed in the middle is recalled worse than information at the start or the end. Liu et al. (2023, "Lost in the Middle") demonstrated this on multi-document question answering and key-value retrieval tasks, and similar position effects are still reported for newer models with much larger windows. The practical implication for context engineering: when you have ten retrieved documents, ranking matters. Put the most relevant document first, the second-most-relevant last, and the also-rans in between. Don't dump them in the order your vector DB returned them.
// Expects docs sorted most-relevant first.
function arrangeForRecall<T>(docs: T[]): T[] {
  // [doc0, doc2, doc4, ..., doc5, doc3, doc1]: highest relevance at the edges
  const front: T[] = [];
  const back: T[] = [];
  docs.forEach((d, i) => (i % 2 === 0 ? front.push(d) : back.unshift(d)));
  return [...front, ...back];
}
It's a small thing. It can move real eval numbers, so measure it against your own evals.
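The edge-first arrangement only works if the input is already sorted by relevance, which is not guaranteed once scores come from a separate re-ranking step. A small sketch that makes the sort explicit; `Scored` and `rankAndArrange` are hypothetical names, and the scores in the usage example are made up.

```typescript
type Scored<T> = { doc: T; score: number };

// Sort by descending relevance, then interleave so the strongest
// documents land at the start and end of the context.
export function rankAndArrange<T>(scored: Scored<T>[]): T[] {
  const ranked = [...scored].sort((a, b) => b.score - a.score);
  const front: T[] = [];
  const back: T[] = [];
  ranked.forEach(({ doc }, i) => {
    if (i % 2 === 0) front.push(doc);
    else back.unshift(doc);
  });
  return [...front, ...back];
}

const arranged = rankAndArrange([
  { doc: "C", score: 0.5 },
  { doc: "A", score: 0.9 },
  { doc: "D", score: 0.2 },
  { doc: "B", score: 0.7 },
]);
// arranged is ["A", "C", "D", "B"]: best first, second-best last
```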
Retrieval Is Part Of The Prompt
If your feature uses RAG, the retrieval step is the prompt. The model can only use what you fetched. Two things that quietly destroy retrieval quality:
- Embedding the wrong query. Embedding the user's literal question often misses, because the question is short and underspecified. Run a small "query rewrite" step (cheap model, structured output) that turns "what about my refund" into "refund status for user 42 order 9182" before you embed. This is one of the highest-leverage changes you can make.
- Retrieving too much. Stuffing the top 50 results into the prompt feels safe, but most of those documents are noise. Retrieve a small candidate set (say, the top 8), run a re-ranker (a cheap cross-encoder or a small LLM with a "score relevance 0-10" prompt), and keep the top 3. Smaller, higher-quality context usually beats large, mediocre context.
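The two fixes above compose into one small pipeline: rewrite, search, score, keep. The sketch below is one possible shape for it. All three dependencies (`rewriteQuery`, `search`, `scoreRelevance`) are hypothetical and injected, so nothing here assumes a particular embedding model, vector store, or re-ranker; the defaults of 8 and 3 just mirror the numbers in the text.

```typescript
// Hypothetical dependency bundle; supply your own implementations.
type RetrievalDeps<D> = {
  rewriteQuery: (question: string) => Promise<string>;
  search: (query: string, topK: number) => Promise<D[]>;
  scoreRelevance: (query: string, doc: D) => Promise<number>; // higher = more relevant
};

export async function retrieveContext<D>(
  question: string,
  deps: RetrievalDeps<D>,
  opts: { fetchK?: number; keepK?: number } = {},
): Promise<D[]> {
  const { fetchK = 8, keepK = 3 } = opts;

  // 1. Expand the short user question into a specific search query.
  const query = await deps.rewriteQuery(question);

  // 2. Fetch a small candidate set.
  const candidates = await deps.search(query, fetchK);

  // 3. Score every candidate against the rewritten query.
  const scored = await Promise.all(
    candidates.map(async (doc) => ({ doc, score: await deps.scoreRelevance(query, doc) })),
  );

  // 4. Keep only the best few, highest score first.
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, keepK)
    .map((s) => s.doc);
}
```

Injecting the dependencies also makes the pipeline easy to test with fakes, which matters because retrieval bugs rarely show up in prompt-only evals.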
Summarize History, Don't Dump It
For multi-turn features, the conversation history grows with every turn. By turn 20 it is eating most of your budget, or no longer fits at all. The fix is a rolling summary. Keep the last N turns verbatim and replace the older ones with a short summary the model produced earlier:
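// Assumes the Vercel AI SDK, which these call shapes match.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

type Turn = { role: "user" | "assistant"; content: string };

export async function rollingSummary(turns: Turn[], keep = 6) {
  if (turns.length <= keep) return { summary: "", recent: turns };

  // Split at an explicit index; slice(-keep) would return everything when keep is 0.
  const cut = turns.length - keep;
  const old = turns.slice(0, cut);
  const recent = turns.slice(cut);

  const transcript = old.map((t) => `${t.role}: ${t.content}`).join("\n");
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: `Summarize this conversation in under 200 words, preserving facts:\n\n${transcript}`,
  });

  return { summary: text, recent };
}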
The summary is part of the next prompt as its own block (<conversation_summary>). The recent turns go in verbatim. The model gets continuity without paying for the entire transcript every turn.
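Rendering those two pieces into the prompt might look like the following. This is a sketch, not a prescribed format: `renderHistory` is a hypothetical helper, and the `<recent_turns>` tag name is an assumption chosen to sit alongside `<conversation_summary>`.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Hypothetical helper: render a rolling summary plus recent turns as prompt blocks.
export function renderHistory(summary: string, recent: Turn[]): string {
  const parts: string[] = [];

  // Only emit the summary block once there is something to summarize.
  if (summary.trim() !== "") {
    parts.push(`<conversation_summary>\n${summary.trim()}\n</conversation_summary>`);
  }

  // Recent turns go in verbatim, one per line, oldest first.
  if (recent.length > 0) {
    const lines = recent.map((t) => `${t.role}: ${t.content}`).join("\n");
    parts.push(`<recent_turns>\n${lines}\n</recent_turns>`);
  }

  return parts.join("\n\n");
}
```

Skipping the summary block while it is still empty keeps short conversations from carrying an empty tag the model might try to interpret.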
Cache The Stable Parts Of The Prompt
OpenAI, Anthropic, and Google all charge less for input tokens that repeat an exact prefix they have recently seen. Anthropic exposes this as explicit prompt caching with cache_control markers. OpenAI applies it automatically once a prompt reaches 1,024 tokens. The structural implication: put the stable stuff (system instructions, schema, long static documents) at the start of the prompt, and the changing stuff (user query, recent context) at the end, because any change early in the prompt invalidates the cache for everything after it. The saving is significant, often 50-90% on the cached portion, though it is not always free: Anthropic charges a premium for writing to the cache, so the discount pays off only when the prefix is actually reused.
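For the explicit variant, the stable-first layout shows up directly in the request body. The sketch below builds a plain object in the shape of Anthropic's Messages API, with a `cache_control` marker of type `ephemeral` on the system block. The function name `buildCachedRequest` and its argument names are hypothetical, and the model ID is left as a parameter so none is assumed.

```typescript
// One text block of a system prompt; `cache_control` marks the end of a cacheable prefix.
type TextBlock = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };

// Hypothetical builder: returns a request body, it does not call the API.
export function buildCachedRequest(args: {
  model: string;
  maxTokens: number;
  stableInstructions: string; // system prompt, schema, long static documents
  userMessage: string;        // changes on every request
}) {
  const system: TextBlock[] = [
    // Everything up to and including this block is eligible for caching.
    { type: "text", text: args.stableInstructions, cache_control: { type: "ephemeral" } },
  ];

  return {
    model: args.model,
    max_tokens: args.maxTokens,
    system,
    // The volatile part goes last so it never invalidates the cached prefix.
    messages: [{ role: "user" as const, content: args.userMessage }],
  };
}
```

The returned object is meant to be passed to `client.messages.create(...)` from @anthropic-ai/sdk. Providers enforce a minimum prefix length before caching applies, so a short system prompt may not be cached at all.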
A One-Sentence Mental Model
Context engineering is the unglamorous part of LLM features that decides whether they work — measure tokens, budget the prompt by priority, structure context with consistent tags, rank retrieved documents for the lost-in-the-middle effect, summarize history rather than concatenating it, and put the stable parts first so caching can do its job.






