Prompt Caching - Concepts and Optimizations

Updated on: 04 Oct 2025 - Vivek Singh

Welcome to the comprehensive notes from our LLM Engineering session. This document transforms raw discussion points into a structured, educational resource for intermediate-level LLM developers and researchers. We focus on optimizing LLM applications for performance, cost, and quality.

Table of Contents
1. Core LLM Engineering Concepts
- 1.1 Prompt Engineering: Key Techniques
2. 🚀 Advanced Optimization: Prompt Caching
3. Context Augmentation: Fine-Tuning vs. RAG
4. Quality Assurance: Evaluation Metrics
- Classical Language Generation Metrics
- Modern LLM Evaluation
5. Ecosystem Tools
6. Operational Challenges
Key Takeaways

1. Core LLM Engineering Concepts

LLM Engineering is the discipline of effectively applying and optimizing Large Language Models for real-world applications. It bridges the gap between foundational model research and scalable product deployment.

1.1 Prompt Engineering: Key Techniques

Prompt Engineering involves carefully designing input text (the prompt) to guide an LLM to produce a desired and high-quality output. It is the most accessible and often most cost-effective way to improve a model’s performance without changing its weights.

Technique	Description	Simple Example
Chain-of-Thought (CoT)	Instructing the model to output a step-by-step reasoning path before the final answer. This improves the model’s complex reasoning and math problem-solving.	“First, break down the problem. Second, show your math. Finally, state the answer. Problem: If a widget costs $5 and you buy 3…”
Few-Shot Prompting	Providing the model with a few examples of input-output pairs to teach it the desired task format, tone, or style before the final query.	`Input: "Happy" -> Sentiment: Positive. Input: "Upset" -> Sentiment: Negative. Input: "Indifferent" -> Sentiment:`
Zero-Shot Prompting	Giving the model an instruction with no examples, relying entirely on its pre-trained knowledge.	`Classify the sentiment of the following text: "The weather is okay."`

2. 🚀 Advanced Optimization: Prompt Caching

Prompt Caching is a critical optimization technique for reducing the latency and cost of LLM applications by reusing the computational work done on common prompt segments. It is the “lowest hanging fruit” for scaling high-traffic LLM services.

How Prompt Caching Works

LLMs process tokens sequentially. The most expensive part of processing a long prompt, especially a static one (like system instructions or a document), is the initial computation of the internal Key and Value (KV) state for the model’s self-attention mechanism.

Cache Miss (First Request): A request with a long, static prefix (e.g., a system instruction) is sent. The model computes the KV state for the entire prefix. This KV state (the internal memory/context of the prefix) is then stored in a fast cache, associated with a unique hash of the prefix.
Cache Hit (Subsequent Request): A new request arrives with the exact same static prefix and new, dynamic content (the user’s query).
The system identifies the prefix, retrieves the pre-computed KV cache for that segment, and loads it directly into the model’s memory.
The LLM then only needs to process and compute the KV state for the new, dynamic tokens, dramatically reducing latency and lowering the cost of the cached tokens (often 50-90% discount).

Drawbacks:

Exact Match Requirement: Caching typically requires an exact string prefix match for a cache hit, making it sensitive to slight formatting changes (e.g., a single space).
Time-to-Live (TTL): Caches are ephemeral and expire after a short period (e.g., 5-60 minutes) of inactivity, limiting its benefit for low-frequency or highly variable requests.
Complexity: Managing cache keys and expirations for non-API-managed caches adds infrastructure complexity.

Provider Implementations: OpenAI, Gemini, and Others

Leading providers have built-in caching, though their approaches differ:

Provider	Mechanism	Minimum Prompt Size	Configuration	Key Detail
OpenAI (GPT-4o, etc.)	Implicit Caching	$\ge 1,024$ tokens	Automatic (No code changes)	Caches typically last 5-10 minutes of inactivity; applies a cost discount and latency reduction to the cached prefix.
Google Gemini (Vertex AI)	Implicit and Explicit Caching	$\ge 2,048$ tokens (Implicit)	Implicit is automatic. Explicit requires API calls (e.g., `CachedContent.create()`) to manage the cache lifetime and content.	Explicit caching offers guaranteed cache hits for a configurable TTL, offering more control for large, long-lived contexts.
Amazon Bedrock (Claude, etc.)	Explicit Caching (Context Caching)	Model-dependent (e.g., $\ge 1,024$ tokens)	Requires setting a `cache_control` parameter or checkpoint markers in the prompt to define the prefix to be cached.	Gives granular control over which part of a prompt is cached using markers.

Implicit vs. Explicit Caching

Feature	Implicit Caching	Explicit Caching
Mechanism	Automatic system-level detection of common prefixes.	Manual declaration of content to be cached via API.
Control	Low control; hit rate is not guaranteed.	High control; guaranteed hit if cache is referenced.
Cost	Automatic discount on cached tokens (e.g., 50-90% off).	Initial write cost may be higher; subsequent read cost is significantly discounted.
Best For	High-volume, short-burst traffic with a static system prompt.	Large, long-lived, and expensive-to-process contexts (e.g., a 100-page policy document in a RAG system).

How to Effectively Use Prompt Caching

To maximize your cache hit rate and cost savings, follow this structured prompt design rule:

Place Static Content First: The cache hit depends on an exact prefix match. System instructions, tool definitions, few-shot examples, and RAG-retrieved documents that are common to all users should go at the very start.
- Example (Good):[System Prompt] [Tool Schema] [RAG Context] [User Query]
- Example (Bad):[User Query] [System Prompt] [RAG Context] (The dynamic user query breaks the common prefix).
Use Explicit Caching for Long Contexts: For lengthy, static documents you query frequently (e.g., a code repository or a PDF), use explicit caching (if available) to ensure the expensive processing is done once and the KV state is available for its full TTL.
Monitor Metrics: Track the cached_tokens field (available in API usage metadata for most providers) and your cache-hit percentage. This ensures your prompt structuring efforts are yielding the intended cost and latency benefits.

3. Context Augmentation: Fine-Tuning vs. RAG

When an LLM lacks the specific, up-to-date, or proprietary knowledge required for a task, we use context augmentation techniques. The two primary methods are Retrieval-Augmented Generation (RAG) and Fine-Tuning.

Feature	Fine-Tuning	Retrieval-Augmented Generation (RAG)
Goal	Teach the model new skills, style, or format.	Give the model new external knowledge for the current query.
Process	Update a subset of the model’s weights using a dataset of (prompt, completion) pairs.	Retrieve relevant chunks of text from an external knowledge base (Vector Database) and inject them into the prompt’s context window.
Knowledge	Baked into the model’s internal weights.	Stored in an external database.
Cost & Time	High cost, long training time (hours/days).	Low cost, low latency retrieval time (milliseconds).
Flexibility	Low: Requires re-training for every knowledge update.	High: Knowledge base can be updated in real-time without model changes.
Ideal Use	Customizing the model’s persona, enforcing a specific output structure (e.g., JSON), or learning a new programming language syntax.	Providing up-to-date facts, company policies, or private documents to ground the response.

Key Takeaway: Start with RAG for new facts/data. Use Fine-Tuning to improve the model’s ability to use those facts in a specific format or tone.

4. Quality Assurance: Evaluation Metrics

Measuring the performance of LLM outputs is challenging due to the open-ended nature of generation. We rely on a mix of classical and modern metrics.

Classical Language Generation Metrics

These compare the generated output (candidate) to a human-written target (reference).

Perplexity ($\text{PPL}$): Measures how well a probability model (the LLM) predicts a sample. A lower PPL indicates the model is more confident in its generated text sequence. It is often used to evaluate model quality after pre-training or fine-tuning, but less so for final application quality.
BLEU (Bilingual Evaluation Understudy): Measures the precision of $n$-grams (sequences of $n$ words) in the candidate text against the reference text. It is excellent for translation tasks but often a poor metric for open-ended generation, as a fluent and correct answer can still receive a low BLEU score if it uses different phrasing than the reference.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Focuses on the recall of $n$-grams (i.e., how many $n$-grams in the reference are present in the candidate). It is the standard metric for summarization.

Modern LLM Evaluation

Human Evaluation: The gold standard. Evaluators rate responses on dimensions like Helpfulness, Factuality, and Safety.
LLM-as-a-Judge: Using a powerful LLM (e.g., GPT-4 or Gemini Advanced) to evaluate the output of a less powerful model. This is faster and cheaper than human evaluation and has shown high correlation with human scores, particularly for coherence and helpfulness.

5. Ecosystem Tools

Effective LLM engineering relies on a robust toolchain.

LangChain: A framework for developing applications powered by LLMs. It standardizes the interfaces for components like Prompt Templates, Models, Chains (sequences of calls), and Agents (models that can decide which tool to use). It significantly reduces the boilerplate required to build complex RAG or agentic systems.
HuggingFace: The central hub for open-source AI models, datasets, and tools.
- Hugging Face Hub: Hosts millions of open-source models (including Llama, Mixtral, etc.) for direct use or fine-tuning.
- transformers Library: Provides a unified API for using most major models, making model swapping easy.
- PEFT (Parameter-Efficient Fine-Tuning): Tools like LoRA (Low-Rank Adaptation) are available via HuggingFace for efficient fine-tuning of massive models on consumer-grade GPUs.

6. Operational Challenges

Deploying and maintaining LLM applications introduces unique challenges beyond traditional software development.

Hallucination: The model generates factually incorrect or nonsensical information, often presented with high confidence.
- Mitigation:RAG (to ground the model in external data) and aggressive Prompt Engineering (to instruct the model to state “I don’t know” rather than guess).
Bias: The model reflects and perpetuates biases present in its massive training data.
- Mitigation:Guardrails (system prompts or external filters) to define acceptable behavior, and careful Fine-Tuning on de-biased datasets.
Cost Management: The token-based pricing model can lead to unpredictable costs, especially with long-context models.
- Mitigation:Prompt Caching (see Section 2), strict token-limit enforcement, and use of smaller, task-specific models where possible.

Key Takeaways

Prioritize Prompt Engineering: It’s the cheapest, fastest way to improve quality. Use CoT for reasoning and Few-Shot for formatting.
Cache for Scale: Implement Prompt Caching immediately for any application with long, repeated system prompts or RAG contexts to cut costs and latency by a significant margin. Structure your prompts to put static content first.
RAG over Fine-Tuning for Facts: Use RAG for rapid, real-time factual updates. Reserve Fine-Tuning for teaching the model a new, specific skill or style.
Evaluate Holistically: Move beyond simple metrics like BLEU. Use LLM-as-a-Judge or dedicated human evaluation for application-level quality.

Would you like me to generate a simple Python example showing how to structure a prompt to maximize the chances of a Prompt Cache hit for a RAG system?