Free tiers are great until you have real traffic. Once you're past prototyping and serving users, you're paying per token — and the difference between a naive implementation and a cost-aware one is routinely 10x, sometimes more. Same outputs, a tenth of the bill.
This is a practical playbook for getting your production LLM cost as low as it can go without wrecking quality. Prices below are as of June 2026 and move constantly — always confirm against the provider's current pricing page before you commit.
1. Pick the genuinely cheap models (and know the floor)
The single biggest lever is which model you call. The bottom of the market in mid-2026 is remarkably cheap:
- Gemini Flash-Lite class — among the cheapest proprietary options, reported around $0.10 / $0.40 per million tokens (input/output) for the lite tier.
- DeepSeek — a V-series model reported around $0.14 / $0.28 per million tokens, with very large context windows.
- Hosted Llama (Maverick-class) — reported around $0.15 / $0.60 per million tokens for a capable open model.
- Small GPT "nano"-class models — cheap on standard pricing and very cheap with batch discounts (see below).
The lesson: a frontier model can cost 20–50x what a capable cheap model costs. Most production calls — classification, extraction, routing, summarization, structured output — do not need frontier quality. Run an eval and use the cheapest model that passes it.
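To make the gap concrete, here is a rough sketch that prices one uniform monthly workload at the reported rates from the list above. The workload (1M requests at 1,500 input and 200 output tokens each), the short model keys, and the "frontier-example" price of $3 / $15 per million tokens are illustrative assumptions, not quotes from any provider.

```python
# Reported mid-2026 prices from the list above, in USD per million tokens
# as (input, output). Keys are shorthand labels, not real model ids.
PRICES = {
    "flash-lite": (0.10, 0.40),
    "deepseek-v": (0.14, 0.28),
    "llama-maverick": (0.15, 0.60),
    "frontier-example": (3.00, 15.00),  # assumption for contrast, not a quoted price
}


def monthly_cost(model, requests, in_tokens, out_tokens):
    """Return the monthly bill in USD for a uniform workload."""
    in_price, out_price = PRICES[model]
    total_in = requests * in_tokens
    total_out = requests * out_tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000


if __name__ == "__main__":
    for name in PRICES:
        cost = monthly_cost(name, requests=1_000_000, in_tokens=1_500, out_tokens=200)
        print(f"{name:18s} ${cost:,.2f}")
```

Under these assumptions the lite tier comes to about $230 a month and the hypothetical frontier price to about $7,500, a gap of roughly 30x. Swap in your own traffic numbers and current prices before drawing conclusions.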
The FreeAIRouter directory tracks which providers and aggregators surface these cheap models and how they're categorized, including high-volume providers like SiliconFlow and aggregators such as OpenRouter that let you price-shop across many models behind one endpoint.
2. Right-size: don't pay frontier prices for a coin-flip
The most expensive mistake in production is sending every request to your best model "to be safe." A cheaper pattern:
- Route by difficulty. Send the easy 80% to a cheap model; escalate only the hard 20% to an expensive one. A small classifier (or even a heuristic) decides which is which.
- Cap output tokens. Output is usually 2–4x the price of input. If you only need a one-word label, set max_tokens low: an unbounded response can cost more than the entire input.
- Trim input. Don't resend an entire document when a relevant chunk will do. RAG that retrieves 2K relevant tokens beats stuffing 100K tokens into the prompt every call.
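The routing and output-cap ideas above can be sketched as a small planning step that runs before any API call. The model ids, the per-task caps, and the looks_hard heuristic are all hypothetical placeholders; a real router would be tuned against your own eval, and the exact name of the output-limit parameter varies by provider.

```python
from dataclasses import dataclass

CHEAP_MODEL = "cheap-model"        # hypothetical model id
FRONTIER_MODEL = "frontier-model"  # hypothetical model id

# Hypothetical output caps per task type, in tokens.
OUTPUT_CAPS = {"classify": 8, "extract": 256, "summarize": 400, "reason": 2000}
DEFAULT_CAP = 256


@dataclass
class RequestPlan:
    model: str
    max_tokens: int


def looks_hard(task: str, prompt: str) -> bool:
    """Crude stand-in for a difficulty classifier (an assumption, not a tuned rule)."""
    return task == "reason" or len(prompt) > 20_000


def plan_request(task: str, prompt: str) -> RequestPlan:
    """Pick the model by difficulty and always set an explicit output cap."""
    model = FRONTIER_MODEL if looks_hard(task, prompt) else CHEAP_MODEL
    return RequestPlan(model=model, max_tokens=OUTPUT_CAPS.get(task, DEFAULT_CAP))
```

With these placeholder values, a short classification prompt is planned for the cheap model with an 8-token cap, while anything tagged "reason" escalates to the expensive model.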
3. Use batch APIs for anything not real-time
If a request doesn't need an answer right now, don't pay real-time prices. Most major providers offer a batch / asynchronous tier at roughly half price. As of June 2026, a small GPT nano-class model with the 50% batch discount drops to around $0.05 / $0.20 per million tokens, about as cheap as capable inference gets, and Gemini's batch tier is reported at half the standard rate across its model lineup.
Batch is ideal for: nightly enrichment, document processing pipelines, embeddings backfills, eval runs, and any "process this queue by morning" workload. The only cost is latency (minutes to hours), which most async jobs can absorb happily.
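A minimal sketch of the batch decision, assuming a flat 50% discount and a worst-case batch turnaround of 24 hours. Both numbers are assumptions to check against your provider; the function names are illustrative.

```python
BATCH_DISCOUNT = 0.5  # "roughly half price" per the text; confirm with your provider


def choose_tier(deadline_seconds: float, batch_turnaround_seconds: float = 24 * 3600) -> str:
    """Use batch only when the job can wait out the assumed worst-case turnaround."""
    return "batch" if deadline_seconds >= batch_turnaround_seconds else "realtime"


def job_cost(in_tokens, out_tokens, in_price, out_price, tier):
    """Cost in USD; prices are per million tokens at the standard (real-time) rate."""
    base = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return base * BATCH_DISCOUNT if tier == "batch" else base
```

For example, a nightly job with a 48-hour deadline is tagged "batch", and at a standard rate of $0.10 / $0.40 a job of one million input and one million output tokens costs about $0.25 instead of $0.50.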
4. Cache aggressively — prompt caching is the quiet 10x
If your prompts share a large, stable prefix — a long system prompt, a fixed instruction block, a reference document reused across calls — prompt caching lets the provider charge a fraction of the input price for the cached portion. For agentic loops that resend the same instructions every turn, this is enormous: Gemini's cached input rate, for example, has been reported around $0.15/M for a Flash-class model, and Anthropic's prompt caching similarly slashes the cost of repeated context.
Two layers to run:
- Provider-side prompt caching for the repeated prefix.
- Application-side response caching for identical requests — the cheapest call is the one you never make. Coalesce duplicate in-flight requests too.
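The application-side layer can be sketched with asyncio: one dictionary for finished responses and one for requests still in flight, so identical concurrent requests share a single upstream call. ResponseCache, call_model, and fake_model are hypothetical names. The sketch assumes identical requests should return identical answers (for example, deterministic settings) and leaves out eviction and expiry.

```python
import asyncio
import hashlib
import json


class ResponseCache:
    """Caches completed responses and coalesces identical in-flight requests."""

    def __init__(self, call_model):
        self._call_model = call_model  # async callable: (model, prompt) -> str
        self._done = {}                # key -> cached response
        self._in_flight = {}           # key -> asyncio.Task for a pending call

    @staticmethod
    def _key(model, prompt):
        raw = json.dumps([model, prompt]).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    async def get(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._done:
            return self._done[key]
        task = self._in_flight.get(key)
        if task is None:
            # First caller starts the upstream request; later callers await the same task.
            task = asyncio.create_task(self._call_model(model, prompt))
            self._in_flight[key] = task
        try:
            result = await task
        finally:
            self._in_flight.pop(key, None)
        self._done[key] = result  # failures raise above and are never cached
        return result


async def demo():
    calls = 0

    async def fake_model(model, prompt):  # stand-in for a real API call
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"echo:{prompt}"

    cache = ResponseCache(fake_model)
    results = await asyncio.gather(*(cache.get("m", "hi") for _ in range(5)))
    again = await cache.get("m", "hi")
    return calls, results, again
```

In the demo, five concurrent identical requests plus one later repeat trigger a single call to fake_model. A production version would add a size limit and a TTL, and would likely include sampling parameters in the cache key.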
5. Combine free tiers with paid as a cost floor
Production doesn't have to mean 100% paid. A tiered routing strategy uses free quota first and only pays for overflow:
- Serve from a free tier (e.g. Gemini Flash, or a free Llama host) up to its daily limit.
- Fall over to a cheap paid model when the free quota is exhausted or the request is latency-critical.
- Reserve the expensive model for the small slice that truly needs it.
This is the same fallback-ladder idea from our guide to avoiding rate limits on free LLM APIs — here it doubles as a cost strategy, not just a reliability one. Done well, a meaningful fraction of production traffic can ride free tiers indefinitely.
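The three-step ladder above can be sketched as a small selection function. Tier, pick_tier, and the quota numbers are hypothetical, and the sketch leaves out resetting the counters at the provider's quota boundary and handling real rate-limit errors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tier:
    name: str
    daily_limit: Optional[int]  # None means no quota (a paid tier)
    used_today: int = 0

    def has_quota(self) -> bool:
        return self.daily_limit is None or self.used_today < self.daily_limit


def pick_tier(free: Tier, cheap: Tier, frontier: Tier, *,
              latency_critical: bool = False, needs_frontier: bool = False) -> str:
    """Free quota first, cheap paid for overflow or latency, frontier only on demand."""
    if needs_frontier:
        chosen = frontier
    elif not latency_critical and free.has_quota():
        chosen = free
    else:
        chosen = cheap
    chosen.used_today += 1
    return chosen.name
```

Tracking used_today per tier also gives you the share of traffic each rung is absorbing, which is the number to watch when deciding whether the free tier is still pulling its weight.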
6. Measure cost per outcome, not per token
The cheapest model per token isn't always the cheapest in practice. If a cheap model needs three retries, a longer prompt, or a validation pass to get usable output, a slightly pricier model that nails it first try can be cheaper per successful result. Track:
- Cost per completed task (including retries).
- Failure/retry rate by model.
- Tokens per task (input + output).
Optimize that number, not the headline per-token price.
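As a rough sketch of how the three tracked numbers might be computed from a log of attempts: the Attempt record and summarize function are hypothetical, and the $0.25 / $1.00 "pricier" rate used in the example is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    task_id: str
    in_tokens: int
    out_tokens: int
    succeeded: bool


def summarize(attempts, in_price, out_price):
    """Aggregate per-outcome metrics; prices are USD per million tokens."""
    total_cost = sum(a.in_tokens * in_price + a.out_tokens * out_price for a in attempts) / 1_000_000
    total_tokens = sum(a.in_tokens + a.out_tokens for a in attempts)
    tasks = {a.task_id for a in attempts}
    completed = {a.task_id for a in attempts if a.succeeded}
    failures = sum(1 for a in attempts if not a.succeeded)
    return {
        "cost_per_completed_task": total_cost / len(completed) if completed else float("inf"),
        "failure_rate": failures / len(attempts) if attempts else 0.0,
        "tokens_per_task": total_tokens / len(tasks) if tasks else 0.0,
    }


# A cheap model that needs three attempts versus a pricier one that succeeds first try.
cheap_log = [Attempt("t1", 1000, 200, False), Attempt("t1", 1000, 200, False), Attempt("t1", 1000, 200, True)]
pricier_log = [Attempt("t1", 1000, 200, True)]
cheap_stats = summarize(cheap_log, in_price=0.10, out_price=0.40)
pricier_stats = summarize(pricier_log, in_price=0.25, out_price=1.00)  # assumed price
```

In this example the cheap model costs about $0.00054 per completed task against about $0.00045 for the pricier one, so the model with the higher per-token price wins once retries are counted.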
The cost-cutting checklist
- Right model — cheapest that passes your eval; route by difficulty.
- Cap outputs, trim inputs — output is the pricey side; don't overstuff context.
- Batch the non-real-time work — ~50% off.
- Cache the repeated prefix and identical responses.
- Tier free quota → cheap paid → frontier.
- Measure cost per outcome, not per token.
Stack all six and a bill that would've been hundreds of dollars a month often lands in the low tens. For the free end of the spectrum, see which LLMs have a free tier in 2026 and browse current providers in the FreeAIRouter directory.
