Prompt Caching Is Silently Inflating Your LLM API Costs
You added prompt caching because the docs made it sound like a straightforward cost-saver. Now your monthly bill is higher than before you turned it on, and you're not sure why. You're not alone: prompt caching has a billing model that punishes misuse quietly and rewards careful use generously.
The problem isn't caching itself. The problem is that most developers enable it, assume it's working, and never audit how it's actually behaving against their real traffic patterns.
This article covers:
- How prompt caching billing actually works under the hood
- The specific patterns that make caching cost more instead of less
- How to calculate whether caching is helping or hurting your workload
- Concrete strategies to get the savings caching promises
What Prompt Caching Actually Does
When you send a request to an LLM API, the provider has to process every token in your prompt: system instructions, context, examples, and the user's actual message. Without caching, that full token count is billed at the standard input rate every single time.
Prompt caching lets the provider store a processed version of your prompt's prefix. On subsequent requests that start with that same prefix, the stored version is reused instead of reprocessed. The idea is that your system prompt and static context, which rarely change, get computed once and reused across many calls.
The pricing split is the key detail. Cached token reads are significantly cheaper than fresh input tokens. But writing a new cache entry costs more than a regular input token. Anthropic's Claude API, for example, charges a premium for cache write tokens and a substantial discount for cache read tokens compared to standard input pricing.
The math only works in your favor when enough reads follow each write to recover the write premium. If your cache isn't being hit consistently, you're paying the write cost repeatedly with no payoff.
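A rough way to see where the payoff comes from is to price the cached prefix in units of the base input price. The sketch below assumes a write multiplier of 1.25 and a read multiplier of 0.10, which correspond to Anthropic's published rates for its five-minute cache at the time of writing; treat both as placeholders and check your provider's current price list. The function names and the 2,000-token prefix are illustrative.

```python
# Relative prices, in units of the base input-token price.
# Assumed values; verify them against your provider's current pricing.
WRITE_MULTIPLIER = 1.25  # cache write premium
READ_MULTIPLIER = 0.10   # cache read discount


def cached_cost(prefix_tokens: int, writes: int, reads: int) -> float:
    """Prefix cost when `writes` requests create the cache and `reads` hit it."""
    return prefix_tokens * (writes * WRITE_MULTIPLIER + reads * READ_MULTIPLIER)


def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of sending the same prefix with caching disabled."""
    return float(prefix_tokens * requests)


# Ten requests: one write followed by nine hits, versus ten writes and no hits.
warm = cached_cost(2000, writes=1, reads=9)
cold = cached_cost(2000, writes=10, reads=0)
baseline = uncached_cost(2000, requests=10)
print(f"warm cache: {warm:.0f}, cold cache: {cold:.0f}, no caching: {baseline:.0f}")
```

Under these assumed multipliers, the warm case costs a fraction of the uncached baseline, while the all-writes case costs more than not caching at all.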
The Cache Lifetime Problem
Cache entries don't live forever. Most providers set a cache TTL (time-to-live) in the range of a few minutes to a few hours. Anthropic's default TTL is five minutes, refreshed each time the cached prefix is used; OpenAI's prompt caching activates automatically for prompts of roughly 1,024 tokens or more and has its own retention window.
If your request volume is low or bursty, the cache expires between requests. Every call becomes a cache write, and you pay the write premium every time. You get the cost of caching with none of the benefit.
This is the most common trap for teams with moderate traffic. A staging environment, an internal tool used only during business hours, or a batch job that runs once a day will almost never see a cache hit. You pay the write surcharge on every run.
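To check a traffic pattern against a TTL before enabling caching, you can replay request arrival times and label each one as a write or a read. This is a minimal sketch with a hypothetical helper name; it assumes a single shared prefix, a 300-second TTL, and that every hit refreshes the TTL, which you should confirm for your provider.

```python
def classify_requests(timestamps, ttl_seconds=300.0):
    """Label each request 'write' or 'read' from its arrival time in seconds.

    Assumes one shared prefix and that each request resets the expiry clock.
    """
    labels = []
    expires_at = None
    for t in sorted(timestamps):
        if expires_at is not None and t < expires_at:
            labels.append("read")   # cache still warm
        else:
            labels.append("write")  # first request, or the entry has expired
        expires_at = t + ttl_seconds
    return labels


# Steady traffic keeps the cache warm after the first write.
print(classify_requests([0, 10, 20, 30]))
# One request every ten minutes never finds a live entry.
print(classify_requests([0, 600, 1200]))
```

Feeding in real timestamps from your request logs gives a quick estimate of the write share you would see at a given TTL.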
Anatomy of a Costly Cache Miss
Consider a simple pattern: a customer support bot with a long system prompt explaining the product, return policies, and tone guidelines. That system prompt is around 2,000 tokens. The user message averages 50 tokens. The team enables prompt caching, marking the system prompt as the cached block.
At peak hours, say between 9 AM and 5 PM, requests come in every few seconds. The cache stays warm. Cache reads are cheap, and the savings are real. But overnight, between 11 PM and 7 AM, there are only occasional off-hours tickets. The cache has long expired by the time the first overnight request arrives. That request writes a new cache entry at the premium rate. If no further request arrives before the entry expires again, that premium is never recovered, and the next isolated ticket repeats the cycle.
Over a month, those overnight cache-write cycles quietly add up. The team sees the daytime savings on the dashboard, but the off-hours penalty is buried in the totals.
When Your Prompt Isn't Actually Stable
Cache hits require an exact prefix match. If any token in your cached prefix changes between requests, the cache is invalidated and a new entry is written.
Several common patterns break cache stability without developers noticing:
- Dynamic timestamps or dates injected into the system prompt. Even a field like Today is {date} at the top of your prompt resets the cache every day, or far more often if you're using a finer-grained timestamp.
- User-specific data in the system prompt. Injecting the user's name, account tier, or preferences into the system prompt means every unique user produces a unique cache key. You're paying cache write rates with little or no reuse across users.
- A/B test variants shuffled at the prompt level. Rotating two or three system prompt versions multiplies your distinct cache entries and splits traffic among them, so each entry is written separately and sees only a half or a third of the requests.
- Whitespace and formatting inconsistencies. If your prompt is assembled by string concatenation with subtle spacing differences across code paths, you may be generating many near-identical but distinct prompts, each cached separately.
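One way to surface these problems is to fingerprint the exact prefix text your code sends and count how many distinct fingerprints appear in a sample of requests. The helper names below are hypothetical, and a string hash is only a proxy: the provider matches on its own tokenized prefix, so this catches variation you introduce, not every possible miss.

```python
import hashlib
from collections import Counter


def prefix_fingerprint(prefix: str) -> str:
    """Short SHA-256 digest of the exact prefix text, whitespace included."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def count_prefix_variants(prefixes):
    """Count how many requests used each distinct prefix."""
    return Counter(prefix_fingerprint(p) for p in prefixes)


# Illustrative prefixes: two differ by an injected date, two are identical.
samples = [
    "You are a support agent.\nToday is 2024-01-01.",
    "You are a support agent.\nToday is 2024-01-02.",
    "You are a support agent.\n",
    "You are a support agent.\n",
]
variants = count_prefix_variants(samples)
print(f"{len(variants)} distinct prefixes across {len(samples)} requests")
```

If the number of distinct fingerprints grows with the number of requests, something dynamic is leaking into the block you meant to cache.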
How to Audit Your Caching Behavior
Before you optimize, measure. Most provider APIs return cache metadata in their response objects. On Anthropic's API, the usage field in the response includes cache_creation_input_tokens and cache_read_input_tokens. Log both fields on every request.
```python
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful customer support agent...",
            # Marks the end of the prefix that should be cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I return an item?"}],
)

# Log all three counters on every request.
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```

Once you have this data flowing into your logs or APM tool, calculate your cache hit ratio over rolling time windows. A useful formula:
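```python
# Over a sample of N requests (illustrative entries; substitute your logged usage)
requests = [
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 2000},
    {"cache_read_input_tokens": 2000, "cache_creation_input_tokens": 0},
]
total_cache_reads = sum(r["cache_read_input_tokens"] for r in requests)
total_cache_writes = sum(r["cache_creation_input_tokens"] for r in requests)
cached_total = total_cache_reads + total_cache_writes
hit_ratio = total_cache_reads / cached_total if cached_total else 0.0  # avoid dividing by zero
print(f"Cache hit ratio: {hit_ratio:.2%}")
```

A hit ratio below roughly 80% is a warning sign that your traffic pattern or prompt structure is working against you. The exact break-even point depends on the specific pricing ratio between write and read tokens for your provider and model tier.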
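The break-even point itself can be derived from the two price multipliers. If W is the number of cache write tokens, R the number of cache read tokens, w the write multiplier and r the read multiplier, caching costs w*W + r*R against W + R without it, which puts the break-even token hit ratio at (w - 1) / (w - r). The sketch below uses hypothetical function names, and the 1.25 and 0.10 values are assumed example multipliers rather than figures from your provider.

```python
def break_even_hit_ratio(write_multiplier: float, read_multiplier: float) -> float:
    """Token hit ratio at which cached and uncached prefix costs are equal.

    Derived from w*W + r*R = W + R with hit ratio h = R / (R + W).
    """
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


def relative_cost(hit_ratio: float, write_multiplier: float, read_multiplier: float) -> float:
    """Cached prefix cost as a fraction of the uncached cost for the same tokens."""
    return hit_ratio * read_multiplier + (1.0 - hit_ratio) * write_multiplier


# Assumed example multipliers; replace with your provider's actual ratios.
w, r = 1.25, 0.10
threshold = break_even_hit_ratio(w, r)
print(f"Break-even hit ratio: {threshold:.1%}")
print(f"Relative cost at an 80% hit ratio: {relative_cost(0.80, w, r):.2f}")
```

Below the break-even ratio, caching costs more than no caching at all; between that ratio and your target, it saves money but leaves part of the discount unclaimed.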
Structuring Your Prompt for Consistent Hits
The cached prefix must be the stable, static part of your prompt. Everything dynamic goes after the cache boundary. This sounds obvious, but it requires deliberate prompt architecture.
A few rules that hold across providers:
- Put your system instructions, persona definition, and static knowledge base before any dynamic content in the prompt.
- Never inject real-time data (dates, session IDs, user attributes) into the cached block. Move those to the user turn or to a dynamic section after the cache boundary marker.
- If you have multiple stable prompt variants (for example, two product lines with different personas), treat each as a separate cache entry and make sure each gets enough traffic to justify its own write cost.
- Audit your prompt assembly code for sources of accidental variation: extra spaces, inconsistent newlines, or conditionally appended sentences that trigger on most but not all requests.
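Applied to the Anthropic request shape used earlier, these rules might look like the following sketch. The build_request helper and its parameters are hypothetical; the point is only that the static prompt is the sole cached block and the date travels in the user turn.

```python
def build_request(static_prompt: str, user_message: str, today: str) -> dict:
    """Return keyword arguments for client.messages.create().

    The static prompt is the only cached block; dynamic data goes after it.
    """
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": static_prompt,  # identical on every request
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            # The date sits after the cache boundary, so it cannot break the prefix.
            {"role": "user", "content": f"Today is {today}.\n\n{user_message}"}
        ],
    }


monday = build_request("You are a support agent...", "Where is my order?", "2024-01-01")
tuesday = build_request("You are a support agent...", "Where is my order?", "2024-01-02")
print(monday["system"] == tuesday["system"])
```

Building the request in one function also gives you a single place to audit for accidental variation in the cached block.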
The Right Workloads for Prompt Caching
Caching pays off most reliably when your workload has a specific shape: high request volume, a large stable prefix, and relatively consistent inter-request timing that keeps the cache warm.
Good fits include:
- High-traffic chatbots with a long, unchanging system prompt
- Document Q&A tools where the document is fixed and many users query it in a short window
- Code assistants where the codebase context is injected as a large static block and multiple queries arrive in quick succession
Poor fits include:
- Low-volume internal tools with unpredictable usage patterns
- Batch jobs with long gaps between runs
- Pipelines where each request has unique system-level context
- Any prompt where dynamic data is unavoidably mixed into the static prefix
Common Pitfalls to Watch For
Assuming caching is on by default and working correctly. Some providers require you to explicitly mark cache points in your prompt structure. Check whether your integration is actually sending cache control markers, not just hoping the provider detects a reusable prefix.
Caching short prompts. Most providers only activate caching above a minimum token threshold (commonly around 1,024 tokens, and higher for some models). Caching a 200-token system prompt may silently do nothing while you assume it's active.
Not accounting for model version changes. When you upgrade to a new model version, existing cache entries are typically invalidated. If you roll out a model upgrade during peak hours, you'll pay write costs on a flood of incoming requests. Plan upgrades for low-traffic windows.
Ignoring the output side. Caching reduces input token costs. Output tokens are always priced at full rate. If your prompts are optimized but your completions are verbose and uncontrolled, output costs will dominate regardless of how well your cache performs.
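The first two pitfalls can be caught with a small check on each response's usage counters. This is a sketch with a hypothetical helper; it assumes you log usage as a dictionary with the Anthropic field names shown earlier, and that a request whose prefix was not cached reports zero for both cache counters.

```python
def caching_status(usage: dict) -> str:
    """Classify one response as 'hit', 'write', or 'inactive'."""
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0:
        return "hit"       # at least part of the prefix came from the cache
    if written > 0:
        return "write"     # a new cache entry was created
    return "inactive"      # no cache activity: missing marker or prompt too short


print(caching_status({"input_tokens": 250, "cache_creation_input_tokens": 0,
                      "cache_read_input_tokens": 0}))
```

A steady stream of "inactive" results on an endpoint you believe is cached is worth investigating before anything else.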
Next Steps
Prompt caching is a real cost lever, but only if you treat it like infrastructure rather than a checkbox. Here's where to start:
- Add cache metadata logging today. You can't optimize what you can't measure. Log cache_creation_input_tokens and cache_read_input_tokens on every API response and compute hit ratios per endpoint or use case.
- Audit your prompt assembly code for sources of accidental variation in the cached prefix. Even small inconsistencies kill your hit rate.
- Restructure prompts so all static content leads the prefix and all dynamic content follows the cache boundary. Move dates, user IDs, and session context to the user turn.
- Evaluate your traffic patterns against the cache TTL. If your workload is bursty or low-volume, model out the write-cost penalty to see whether caching is net positive at your actual request rate.
- Set up a cost alert on your LLM API spend that fires if cache write tokens exceed a threshold percentage of total input tokens in any given hour. This catches regressions when prompt changes silently break your cache structure.
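The alert in the last step can be prototyped over logged usage records before wiring it into a monitoring tool. In this sketch the record layout, the hour key, and the 20% threshold are all assumptions for illustration. It also assumes, as on Anthropic's API, that input_tokens excludes cached tokens, so total input is the sum of the three counters.

```python
from collections import defaultdict


def hours_over_threshold(records, max_write_share=0.2):
    """Return the hours whose cache-write share of input tokens exceeds the limit."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [write tokens, total input tokens]
    for rec in records:
        writes = rec["cache_creation_input_tokens"]
        total = rec["input_tokens"] + writes + rec["cache_read_input_tokens"]
        totals[rec["hour"]][0] += writes
        totals[rec["hour"]][1] += total
    return sorted(
        hour for hour, (writes, total) in totals.items()
        if total and writes / total > max_write_share
    )


def record(hour, writes, reads):
    """Build one illustrative usage record with a 50-token user message."""
    return {"hour": hour, "input_tokens": 50,
            "cache_creation_input_tokens": writes, "cache_read_input_tokens": reads}


# A busy hour with one write and 19 hits, and a quiet hour with two cold writes.
records = [record("2024-01-01T09", 2000, 0)]
records += [record("2024-01-01T09", 0, 2000) for _ in range(19)]
records += [record("2024-01-01T23", 2000, 0) for _ in range(2)]
print(hours_over_threshold(records))
```

Tune the threshold against a week of known-good traffic so the alert fires on regressions rather than on normal off-hours writes.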