Architecture des systèmes IA — Maîtrise6 / 9

Cost Engineering — Token Budgets That Hold

An AI feature that delights at 100 users can bankrupt you at 100,000. Cost is an architectural constraint, designed in — not discovered on the invoice.

Publié le 15 mai 20261 min de lectureHaythem Rehouma · Claude Mastery

Traditional software gets cheaper per user as you scale. AI software gets more expensive — every request costs tokens. If unit economics aren't designed in, growth is the thing that kills you.

Budget per request

Decide, per feature, a token budget the way you'd cap DB queries. Know the input + output token cost of a typical request and the worst case. "Cost per request × requests/month" is a spreadsheet you can fix before it's an invoice you can't.

Model tiering

Not every step needs your best model. Use a cheap, fast model for routing, classification, query rewriting, and faithfulness checks; reserve the expensive model for the step where quality is the product. This is often a 2–5x cost cut at equal quality.

Cache everything cacheable

Prompt/response cache for stable, repeated requests.
Prompt caching (provider-side) for the large, unchanging prefix of a prompt.
Retrieval cache so popular queries don't re-search.

A cache hit is a near-free request.

Trade quality for cost deliberately

Coûts maîtrisés. Ensuite : faire vite — latence et débit à l'échelle.

Budget per request

Model tiering

Cache everything cacheable

Trade quality for cost deliberately

Skills Claude reliés à installer

Partager cet article

Série — Architecture des systèmes IA — Maîtrise

Continuer

Le cours Claude Mastery