Claude Prompt Caching Best Practices (2026)

If you're building anything on the Claude API — a chatbot, an AI agent, a document processing pipeline — you're probably sending the same system prompt, the same tool definitions, and the same context on every single API call. And you're paying full price for it every time.

Prompt caching fixes that. It lets the API remember content you've already sent, so subsequent requests only process what's new. The result: up to 90% lower costs and 85% faster responses for long prompts.

We use prompt caching in our own AI chatbot and in client projects. This guide covers exactly how it works, what the common mistakes are, and how to set it up correctly.

What Is Prompt Caching?

Every time you call the Claude API, you send a prompt — system instructions, conversation history, tool definitions, documents, examples. The API processes all of that input before generating a response. For a chatbot with a long system prompt and 10 turns of conversation, that's a lot of repeated work.

Prompt caching stores the processed version of your prompt prefix (the KV cache representation) so Claude doesn't have to reprocess it from scratch on every call. When you send a request with caching enabled, the system checks if it's already seen this exact content. If it has, it skips the processing and reads from cache.

Two important things to know upfront:

It doesn't store your actual text. The cache holds cryptographic hashes and KV representations, not your raw prompts. This matters for compliance and data retention policies.
It doesn't change the output. Responses are identical whether the prompt was cached or not. Caching only affects cost and speed.

How It Works

The flow is straightforward:

You send a request with caching enabled
The system checks if a matching prompt prefix exists in cache
If yes → it reads the cached prefix (fast and cheap), then processes only the new content
If no → it processes the full prompt and writes it to cache for next time

The cache has a 5-minute TTL (time-to-live) by default. Every time the cached content gets used, the timer resets. So if your chatbot is getting messages every few minutes, the cache stays warm indefinitely. If nobody talks to it for 5 minutes, the cache expires and the next request pays the full write cost again.

There's also a 1-hour TTL option at 2x the base input token price. Worth it if your usage is bursty — say, a batch processing job that runs every 30 minutes.

The cache follows a strict hierarchy: tools → system → messages. Everything is prefix-matched, meaning content must be identical from the start. You can't cache the middle of a prompt.

The Two Caching Modes

Automatic Caching

The easiest way to start. Add one field to your request, and the system handles everything:

// Add cache_control at the top level of your request
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "cache_control": { "type": "ephemeral" },
  "system": "Your system prompt here...",
  "messages": [
    { "role": "user", "content": "First message" },
    { "role": "assistant", "content": "First response" },
    { "role": "user", "content": "Second message" }
  ]
}

The system automatically caches everything up to the last cacheable block. As the conversation grows, the cache point moves forward. Previous content gets read from cache, and only the new turns get written.

Best for: multi-turn conversations, chatbots, any scenario where context grows over time.

Explicit Cache Breakpoints

For more control, you place cache_control markers on specific content blocks. This lets you cache different sections independently — useful when parts of your prompt change at different rates.

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a legal document assistant...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "tools": [
    {
      "name": "search_documents",
      "description": "Search legal database",
      "input_schema": { ... },
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [...]
}

You can set up to 4 cache breakpoints per request. The system checks for cache hits by working backwards from each breakpoint, checking up to 20 blocks before it.

Best for: complex prompts with stable tool definitions, long system prompts with dynamic message content, or any scenario where you need fine-grained cache control.

You can also combine both approaches — explicit breakpoints on your system prompt and tools, plus automatic caching on the conversation. Just remember that automatic caching uses one of your 4 available breakpoint slots.

Pricing Breakdown (With Real Math)

Here's the pricing structure. Cache writes cost 25% more than normal input. Cache reads cost 90% less. That's where the savings come from.

Model	Base Input	Cache Write (5m)	Cache Read
Claude Opus 4.5/4.6	$5/MTok	$6.25/MTok	$0.50/MTok
Claude Sonnet 4/4.5/4.6	$3/MTok	$3.75/MTok	$0.30/MTok
Claude Haiku 4.5	$1/MTok	$1.25/MTok	$0.10/MTok

Let's run the numbers on a real scenario.

Scenario: AI chatbot with a 4,000-token system prompt, handling 50 conversations/day with an average of 8 turns each.

Without caching, every turn sends the full system prompt. That's 4,000 tokens × 8 turns × 50 conversations = 1.6M input tokens/day just for the system prompt. On Sonnet 4.6, that's $4.80/day or about $144/month — just for the system prompt alone.

With caching, the system prompt gets written to cache once per conversation (50 writes/day = 200K tokens at $3.75/MTok = $0.75) and read from cache for the other 7 turns (350 reads/day = 1.4M tokens at $0.30/MTok = $0.42). Total: $1.17/day or about $35/month.

That's a 76% cost reduction on just the system prompt. Factor in conversation history that also gets cached, and savings climb higher with longer conversations.

What Breaks the Cache

This is where most people waste money. The cache requires exact prefix matches. Any change to cached content invalidates everything downstream. Here's the full breakdown:

Change	Tools Cache	System Cache	Messages Cache
Modify tool definitions	Invalidated	Invalidated	Invalidated
Toggle web search	OK	Invalidated	Invalidated
Toggle citations	OK	Invalidated	Invalidated
Change speed setting	OK	Invalidated	Invalidated
Change tool_choice	OK	OK	Invalidated
Add/remove images	OK	OK	Invalidated
Change thinking settings	OK	OK	Invalidated

The most common mistakes we see:

Timestamps in system prompts. If your system prompt includes The current date is February 28, 2026, that changes every day. The entire cache invalidates at midnight. Move dynamic content to the user message instead.
Randomized JSON key ordering. Some languages (Go, Swift) randomize object key order during JSON serialization. If your tool definitions serialize differently each time, the cache never hits. Pin your key ordering.
Adding/removing tools between calls. Tool definitions sit at the top of the cache hierarchy. Changing them invalidates everything. Define your full tool set once and keep it consistent.
Not meeting the minimum token threshold. System prompts under the minimum simply won't cache, and you'll never see it in the response — the API just quietly skips caching. Check cache_creation_input_tokens in the response to verify.

Best Practices by Use Case

Chatbots and Conversational Agents

Use automatic caching. It handles the growing conversation history for you. Structure your prompt so the system instructions and tool definitions come first (they're the most stable), then conversation messages.

Keep your system prompt consistent across all conversations
Move dynamic data (user name, date, session ID) into the first user message, not the system prompt
If conversations pause for more than 5 minutes, expect one cache write on the next message

Document Processing

Embed the full document in your prompt and cache it. Users can then ask multiple questions about the same document without reprocessing it each time.

Use explicit breakpoints — cache the document separately from the questions
For large documents that push past 100K tokens, consider the 1-hour TTL if processing takes time
Anthropic's own benchmarks show a 100K-token cached document going from 11.5s to 2.4s latency

Agentic Tool Use

AI agents make many sequential API calls — each step requires the full context plus new tool results. Caching is critical here because each tool call iteration would otherwise resend everything.

Cache your tool definitions with an explicit breakpoint (they almost never change mid-session)
Use automatic caching for the growing conversation
Be careful with extended thinking — non-tool-result user messages strip all previous thinking blocks from cache

Many-Shot Prompting

If you're including 20+ examples in your prompt to improve quality, caching pays for itself immediately. Those examples are identical across calls — they should only be processed once.

Put all examples in the system prompt with a cache breakpoint after them
Anthropic reports 86% cost savings on a 10K-token many-shot prompt

Getting Started

Here's the fastest path to saving money with prompt caching:

Step 1: Add One Line of Code

If you're building a chatbot or anything with multi-turn conversations, just add cache_control at the top level:

# Python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="Your system prompt...",
    messages=conversation_history
)

# TypeScript
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  cache_control: { type: "ephemeral" },
  system: "Your system prompt...",
  messages: conversationHistory,
});

Step 2: Check the Response

Every response includes cache metrics in the usage object:

{
  "usage": {
    "input_tokens": 50,
    "cache_creation_input_tokens": 4000,
    "cache_read_input_tokens": 0,
    "output_tokens": 200
  }
}

On the first call, you'll see cache_creation_input_tokens (tokens written to cache). On subsequent calls, you should see cache_read_input_tokens instead. If you're always seeing cache writes and never reads, something is breaking your cache — check the table above.

Step 3: Monitor and Optimize

Track your cache hit rate over time. A healthy implementation should see 80-95% cache reads after the first request in each conversation. If you're below that:

Check if your system prompt has dynamic content that changes between calls
Verify your tool definitions are stable and consistently ordered
Make sure your prompt meets the minimum token threshold for the model you're using
Confirm requests are happening within the 5-minute TTL window

Minimum Token Requirements

Caching only kicks in above these thresholds:

Model	Minimum Tokens
Claude Opus 4.5 / 4.6	4,096
Claude Sonnet 4.6	2,048
Claude Sonnet 4 / 4.5	1,024
Claude Haiku 4.5	4,096

If your system prompt is under 1,024 tokens, it won't cache regardless of settings. Pad it with examples or additional context if needed — the caching savings will more than offset the extra input tokens.

Bottom Line

Prompt caching is one of the highest-ROI optimizations you can make on the Claude API. One line of code, no behavior changes, immediate cost savings. If you're spending more than $50/month on Claude API calls and you're not using caching, you're leaving money on the table.

For the full technical reference, see Anthropic's official prompt caching docs.

Need help implementing prompt caching or optimizing your Claude API integration? Reach out — this is exactly the kind of thing we do for clients.

Written by

Elevated AI Consulting

Sam Irizarry is the founder of Elevated AI Consulting, helping businesses grow through strategic marketing and AI-powered solutions. With 12+ years of experience, Sam specializes in local SEO, web design, AI integration, and marketing strategy.

Learn more about us →

Tags:ai claude api developer-tools cost-optimization

Share:in 𝕏 f

Claude Prompt Caching: How to Cut Your API Costs by 90%

What Is Prompt Caching?

How It Works

The Two Caching Modes

Automatic Caching

Explicit Cache Breakpoints

Pricing Breakdown (With Real Math)

What Breaks the Cache

Best Practices by Use Case

Chatbots and Conversational Agents

Document Processing

Agentic Tool Use

Many-Shot Prompting

Getting Started

Step 1: Add One Line of Code

Step 2: Check the Response

Step 3: Monitor and Optimize

Minimum Token Requirements

Bottom Line

Elevated AI Consulting

Ready to put AI to work for your business?