How to Cut Your AI API Bill 70% in 2026 (Without Touching Quality)

AI API bills can grow 10× in a month without code changes. Here's a battle-tested playbook to cut them by 70% without measurable quality loss.

If your AI API bill jumped this quarter and you can't explain why, you're not alone. The 2026 model lineup is the most expensive-on-the-surface generation we've shipped — flagship reasoning models from OpenAI, Anthropic, Google, and xAI now routinely run $5–$30 per million input tokens and $25–$180 on output. But the dirty secret of 2026 LLM economics is that 70% of most bills is waste — caching that isn't turned on, models that are oversized, output that's uncapped, and chat history that's being re-sent on every turn.

This is a focused playbook of the five highest-leverage changes you can make this week, ordered by how much money each one typically saves. None of them require a model swap or a quality regression — they're pure efficiency.

The 5-minute audit that finds your biggest leak

Before tuning anything, run this audit. Pull your last 30 days of usage data from your provider dashboard (OpenAI Platform, Anthropic Console, or Google AI Studio) and answer four questions:

What % of your tokens are being served by your most expensive model? If it's over 20%, you have a routing problem.
What's your cached-input ratio? If it's under 30% on a chatbot or RAG workload, you're leaving 50–90% on the table.
What's your average output-to-input ratio? If it's over 0.3, you probably have uncapped responses.
How much of your spend is on conversation tokens (re-sent history)? Most teams discover this is their #1 line item.

These four numbers point directly at the five plays below.

1. Route by complexity — the #1 lever

Most production LLM workloads are 80/20: 80% of requests are simple (classification, routing, structured extraction, short chat), 20% are hard (multi-step reasoning, long-form research, complex coding refactors). If you're sending everything to GPT-5.5 or Claude Opus 4.7, you're paying the flagship rate for work that a $0.25/M model handles equally well.

The pattern. Use a small, cheap "router" model to triage incoming requests, then send each to the appropriate tier:

Hard reasoning, multi-step research, complex code refactors → GPT-5.5 Pro, Claude Opus 4.7, Gemini 3.1 Pro
Production chat, coding, RAG, JSON extraction → Claude Sonnet 4.6, GPT-5, Gemini 3.5 Flash, Grok 4.3
Classification, tagging, simple extraction, routing decisions → GPT-5 nano, Gemini 3.1 Flash-Lite, Cohere Command R7B

The price spread is enormous. Cohere Command R7B is 4,800× cheaper per token than GPT-5.5 Pro. Even a single tier-down — GPT-5.5 → GPT-5, or Opus → Sonnet — usually halves your bill on the affected workload.

Real example. A friend was routing every customer support ticket through Opus 4.7 ($5/$25 per 1M). Adding a 50-token GPT-5 nano triage step that classified tickets into "needs reasoning" vs "needs lookup" cut Opus volume by 78% with no measurable change in CSAT.

Use the calculator to see the actual dollar impact across your usage profile.

2. Turn on prompt caching (every provider has it now)

Every frontier provider in 2026 supports prompt caching, and almost nobody uses it correctly.

The economics. Cached input tokens cost 80–90% less than fresh ones across the board:

Provider	Fresh input	Cached input	Discount
OpenAI GPT-5.5	$5.00 / 1M	$0.50 / 1M	90%
Claude Sonnet 4.6	$3.00 / 1M	$0.30 / 1M	90%
Gemini 3.5 Flash	$1.50 / 1M	$0.15 / 1M	90%

What to cache. Anything that doesn't change between requests:

System prompts (every chatbot has a 500–2,000 token system message that's identical on every turn)
Few-shot examples and tool definitions
Reference documents in RAG (cache the retrieved chunks for repeat queries on the same document)
The first N turns of long conversations

The gotcha. Most providers price cache writes slightly higher than fresh tokens (1.25× for 5-minute TTL, 2× for 1-hour TTL on Anthropic) and the cache must hit a minimum size (1,024 tokens on Anthropic, varies elsewhere). For high-traffic endpoints, the 5-minute TTL pays back in under a minute.

If your cached-input ratio is under 30% and you're running anything chat-shaped, this is the single change with the highest payback per hour of work.

3. Use batch / flex pricing for anything that can wait

Every major provider offers a "process within a few minutes/hours" tier at 50% off. Almost nobody uses it.

The economics:

Provider	Standard	Batch / Flex	Savings
OpenAI (Batch API)	full price	50% off	50%
Anthropic (Batch API)	full price	50% off	50%
Google Gemini (Flex / Batch)	full price	50% off	50%

What's safe to batch:

Nightly summarization runs
User-uploaded document processing where a 2-minute wait is acceptable
Bulk classification and labeling
Eval suites and offline testing
Background tasks that aren't user-facing

A team I worked with was running their entire eval suite on the standard tier ($340/month). Moving it to the Batch API dropped it to $170 with no code changes — just a different endpoint and a webhook for completion.

4. Cap your output tokens

Output tokens cost 4–6× more than input tokens across every 2026 model. A chatty model with no max_tokens set is the single most common source of runaway bills.

The data. On a typical chat workload at 750 input + 250 output tokens per turn (the calculator's defaults), output is 47% of your bill with Claude Sonnet 4.6 — already significant. Let an Opus 4.7 model ramble to 2,000 tokens unprompted and output becomes 89% of the bill for that turn.

The fix:

Set max_tokens (or max_completion_tokens on newer OpenAI endpoints) explicitly on every request. 500–800 is a sane default for chat; bump for code generation.
Use a system instruction like "Be concise. Use bullet points. Stop when the answer is complete."
For structured outputs, use the provider's JSON / structured-output mode — it's both faster and shorter.

Total cost of this change: 30 seconds. Typical savings: 15–30% of the output line item.

5. Stop re-sending the entire conversation

Because LLMs are stateless, every chat turn re-sends the full conversation history. By turn 20 of a chat, you might be paying for 15,000 input tokens of context that adds almost nothing to the answer.

The patterns that work:

Sliding window. Keep the last N turns (4–8 for most chat) plus the system prompt; drop the rest.
Summarize-and-discard. Every M turns, run a cheap summarization pass (GPT-5 mini or Haiku 4.5) and replace the older turns with a 200-token summary.
Tool memory. If the conversation history is mostly "I asked the model to look something up and it returned data," store the result and discard the call. Cheaper and the model often performs better with cleaner context.

This is the #1 line-item killer of 2026 chatbots. One product I audited was spending $4,200/month on a Claude Sonnet bot; 62% of that bill was re-sent chat history on conversations longer than 10 turns. Switching to a sliding window of the last 6 turns + a one-paragraph summary cut total spend to $1,540 with zero change in user-reported quality.

Putting it together

You don't need all five changes. Most teams can hit the 70% reduction with just two:

Prompt caching on the static parts of every request.
Routing so flagship models only see the requests that actually need them.

If you do all five, expect 60–80% off your current bill with no quality loss. The exact number depends on which leaks dominate yours — run the audit above first.

When you're done, use the calculator to model the new bill and share it with finance — they'll like the numbers.