Introduction
Running LLMs in production is expensive. Sora reportedly burns through roughly $1M per day in compute. ScaleOps just raised $130M at an $800M valuation largely because enterprises are hemorrhaging money on mismanaged GPU workloads — not because they lack hardware, but because they're wasting what they have.
The ROI gap is real: inference costs scale faster than revenue for most AI products. If you're serving any meaningful volume of LLM requests, compute efficiency isn't an infrastructure detail. It's a business constraint.
This guide covers the five highest-leverage techniques to cut inference costs at scale: quantization, speculative decoding, smart batching, KV-cache strategies, and model routing. These aren't theoretical: production teams at Ring, NVIDIA, and elsewhere are shipping them right now.

Five Techniques That Actually Move the Needle
1. Quantization: Shrink the Model, Keep the Signal
Full-precision (FP32) weights are rarely necessary at inference time, and most models already ship in FP16 or BF16. Quantizing further to INT8 or INT4 shrinks the memory footprint, cuts memory bandwidth, and speeds up token generation, often with less than 1% quality degradation on most benchmarks.
How to apply it: Use GPTQ or AWQ for post-training quantization. For deployment, vLLM loads GPTQ and AWQ checkpoints, while llama.cpp uses its own GGUF quantization formats. Start with INT8; move to INT4 only if memory pressure demands it and you've measured quality on your specific task.
Unlike pruning, which permanently removes model capacity, quantization is easy to roll back and easier to validate. The rounding itself is lossy, but as long as you keep the original checkpoint you can revert or re-quantize at a different bit width if evals regress. Run your evals against a golden dataset before promoting a quantized model to production. Ring uses the same pattern with Amazon Bedrock Knowledge Bases to compare knowledge base versions before serving customers.
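To make the trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not how GPTQ or AWQ work (both use calibration data and finer-grained scales), and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats; the rounding error is not recoverable."""
    return [v * scale for v in q]


# Hypothetical weights, chosen so the largest magnitude maps to 127.
weights = [0.42, -1.27, 0.003, 0.9, -0.55]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case round-trip error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Here the smallest weight (0.003) rounds to zero, which is the kind of loss a golden-dataset eval is meant to catch. The error stays within scale / 2, but whether that matters depends on your task.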
2. Speculative Decoding: Parallelize the Autoregressive Bottleneck
Autoregressive generation is inherently sequential — each token depends on the last. Speculative decoding breaks this constraint by using a small "draft" model to propose several tokens at once, then verifying them in parallel with the large model.
The payoff: often a 2–3x speedup with zero change to output quality. When the large model rejects a draft token, it discards the rest of the draft and resamples that position from a corrected distribution, so the large model's output distribution is preserved exactly. Gains depend on how often the draft is accepted and tend to shrink when the GPU is already saturated by large batches.
What you need: A draft model much smaller than your target, often by 10x or more (e.g., a 1B model drafting for a 70B). The draft model must share the same tokenizer and vocabulary. This pairs naturally with vLLM's speculative_config parameter.
This differs from distillation, where you permanently transfer knowledge into a smaller model. Speculative decoding keeps the full model in the loop — critical when output quality is non-negotiable.
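The accept-or-resample rule behind that exactness guarantee can be sketched in a few lines. This is a toy under a loud assumption: both "models" are fixed probability lists that ignore context, which no real LLM does. The function names are hypothetical and this is not vLLM's implementation.

```python
import random


def sample(dist, rng):
    """Draw one token index from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(target, draft, k, rng):
    """Propose up to k draft tokens; return the tokens that end up emitted.

    Toy assumption: `target` and `draft` are context-free probability
    lists over the same vocabulary.
    """
    out = []
    for _ in range(k):
        tok = sample(draft, rng)
        # Accept the draft token with probability min(1, p/q).
        if rng.random() < min(1.0, target[tok] / draft[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, p - q) for p, q in zip(target, draft)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: the target adds one bonus token.
    out.append(sample(target, rng))
    return out


rng = random.Random(0)
target = [0.6, 0.3, 0.1]   # hypothetical large-model distribution
draft = [0.4, 0.4, 0.2]    # hypothetical draft-model distribution
tokens = [t for _ in range(20000) for t in speculative_step(target, draft, 4, rng)]
freqs = [tokens.count(i) / len(tokens) for i in range(3)]
```

Over many steps, freqs should track target rather than draft, even though most tokens were proposed by the draft. Each step also emits more than one token on average, which is where the speedup comes from.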
3. Batching Strategies: Don't Leave GPU Cycles on the Table
Static batching waits for a fixed batch size before processing. At variable traffic loads, this either wastes GPU cycles (batch too small) or adds latency (waiting to fill the batch). Continuous batching — now the default in vLLM and TGI — solves this by inserting new requests into an in-flight batch as slots free up.
Practical steps:
- Enable continuous batching in your inference server
- Tune your server's batch limits (for example, max_num_seqs and max_num_batched_tokens in vLLM) based on your GPU memory and latency SLO
- Monitor batch utilization metrics; under-utilization is often the first sign of a misconfigured deployment
NVIDIA's ProRL Agent infrastructure takes this further for RL training workloads, using asynchronous three-stage pipelines (init, run, eval) with independent worker pools to prevent slow evaluation jobs from stalling throughput. The principle applies broadly: decouple stages with different resource profiles.
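A small simulation shows why refilling slots matters when output lengths vary. This is a sketch under simplifying assumptions: one decode step produces one token per active request, prefill cost is ignored, and the request lengths are made up. Neither function reflects how vLLM or TGI schedule internally.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Requests with one token left finish on this step and free a slot.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [2, 10, 2, 2, 10, 2, 2, 2]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
```

With these lengths and two slots, static batching needs 24 decode steps and continuous batching needs 16, because short requests no longer wait for the long one sharing their batch.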

4. KV-Cache Optimization: Stop Recomputing What You Already Know
The key-value cache stores the attention keys and values for previously processed tokens. Without it, every decoding step recomputes attention over the full context from scratch. With it, only new tokens require computation.
Where teams leave money on the table: Shared system prompts. By default the cache lives only for the duration of one request, so if your system prompt is 500 tokens and you're serving 10,000 requests per hour, you're re-running prefill on 5 million identical tokens every hour. Prefix caching pins shared prompt prefixes in memory and reuses them across requests.
NVIDIA's ProRL Agent explicitly implements this — routing all turns of a given task to the same inference backend to maximize prefix cache hits. The same principle applies to any application with a fixed or semi-fixed preamble.
Implementation: vLLM supports automatic prefix caching via enable_prefix_caching=True. Caching matches from the start of the prompt, so for RAG workloads like Ring's support chatbot, put the static prompt template first and the dynamic retrieved context after it to keep the shared prefix reusable.
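The arithmetic is easy to check with a toy counter. The sketch below is not how vLLM implements prefix caching (which works on hashed KV blocks); PrefixCache is a hypothetical class that only counts how many tokens would need prefill, and the 20-token user message is an assumption added for illustration.

```python
class PrefixCache:
    """Toy prefix cache: tracks which prompt prefixes were already prefilled."""

    def __init__(self):
        self.seen = set()
        self.computed_tokens = 0

    def prefill(self, prefix_tokens, suffix_tokens):
        """Count the tokens that need computation for one request."""
        key = tuple(prefix_tokens)
        if key not in self.seen:
            # First time this prefix appears: pay for it once.
            self.seen.add(key)
            self.computed_tokens += len(prefix_tokens)
        # The dynamic part of the prompt is always computed.
        self.computed_tokens += len(suffix_tokens)


system_prompt = list(range(500))   # stand-in for 500 token IDs
user_message = list(range(20))     # assumed 20-token question
requests = 10_000

cache = PrefixCache()
for _ in range(requests):
    cache.prefill(system_prompt, user_message)

without_cache = requests * (len(system_prompt) + len(user_message))
saved = without_cache - cache.computed_tokens
```

Under these assumptions, an hour of traffic drops from 5,200,000 prefill tokens to 200,500. The saving of 4,999,500 tokens is the shared system prompt, paid once instead of 10,000 times.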
5. Model Routing: Match Task Complexity to Model Size
Not every request needs GPT-4-class reasoning. A question like "What are your store hours?" doesn't warrant the same compute as "Summarize these 50 legal documents and flag liability clauses."
Build a two-tier router:
1. Classify incoming requests by complexity (token length, presence of reasoning keywords, task type)
2. Route simple requests to a small, fast model (7B–13B); escalate complex ones to your large model
This differs from pure distillation or quantization because routing preserves full capability for hard tasks while slashing cost on easy ones. A well-tuned router can reduce large-model invocations by 40–60% in customer-facing applications.
Practical starting point: Train a lightweight classifier on a sample of your production traffic, labeling each request with the smallest model tier that handled it correctly.
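Before you have labeled traffic, a rule-based router is a reasonable placeholder. The sketch below uses the signals listed above (length and reasoning keywords); the keyword list, the 40-word threshold, and the tier names are all assumptions to replace with values tuned on your own data.

```python
# Hypothetical keyword hints that suggest multi-step reasoning.
REASONING_HINTS = ("summarize", "analyze", "compare", "explain why", "step by step")


def route(prompt, max_simple_words=40):
    """Return 'small' or 'large' based on word count and keyword hints."""
    text = prompt.lower()
    word_count = len(text.split())
    if word_count > max_simple_words:
        return "large"
    if any(hint in text for hint in REASONING_HINTS):
        return "large"
    return "small"


simple = route("What are your store hours?")
complex_task = route("Summarize these 50 legal documents and flag liability clauses.")
```

The two example prompts from this section land on "small" and "large" respectively. Log the router's decisions alongside answer quality so that mistakes become training data for the classifier that replaces it.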

What This Means
- For developers: These aren't sequential steps — combine them. Quantize your models, enable prefix caching, configure continuous batching, and add a routing layer. Each technique compounds. Start with batching and caching (lowest implementation cost, immediate gains) before moving to quantization and routing.
- For founders: Inference cost is now a product decision. Architecture choices made at prototype stage (fixed vs. dynamic routing, model tier selection) directly shape your unit economics at scale. Build cost observability into your stack from day one.
- For platform teams: ScaleOps' core insight — that Kubernetes static configurations are fundamentally mismatched to dynamic AI workloads — applies equally to LLM serving infrastructure. If your inference configuration doesn't adapt to traffic patterns automatically, you're leaving efficiency on the table regardless of which optimizations you've implemented at the model level.
The deeper shift here is treating inference as a systems engineering problem, not just an ML problem. The teams closing the ROI gap aren't finding cheaper GPUs — they're stopping the waste on the ones they already have.