CacheCore

Caching for AI Agents

Semantic caching works exceptionally well for AI agent workloads, but only for specific types of LLM calls. The key distinction is stateless tasks versus stateful conversations.

Stateless tasks cache well. Stateful conversations do not.

An AI agent makes many types of LLM calls. Some are structurally stateless: the request encodes a complete, self-contained task. Others are stateful: the request includes the full conversation history, which changes on every turn.

Stateless calls produce high cache hit rates because semantically similar inputs yield the same output. Stateful calls are almost never cacheable because the message array is unique on every turn.
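To make the distinction concrete, consider what an exact-match (L1-style) cache key looks like. The `cache_key` helper below is illustrative, not part of CacheCore: it hashes the model name plus a canonical serialisation of the message array.

```python
import hashlib
import json

def cache_key(model: str, messages: list) -> str:
    """Illustrative exact-match key: hash of model + canonical messages."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Stateless: two identical classification requests produce the same key.
stateless = [{"role": "user", "content": "Classify: I can't log in"}]
assert cache_key("gpt-5.4-mini", stateless) == cache_key("gpt-5.4-mini", stateless)

# Stateful: appending a single turn changes the entire key.
turn1 = [{"role": "user", "content": "Hello"}]
turn2 = turn1 + [
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need help with my order"},
]
assert cache_key("gpt-5.4-mini", turn1) != cache_key("gpt-5.4-mini", turn2)
```

Every turn of a conversation rewrites the whole payload, so the stateful key never repeats; the stateless task's key repeats every time the same request arrives.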

High-value caching targets

Classification and routing

messages = [
    {"role": "system", "content": "Classify the ticket as: billing, technical, or general."},
    {"role": "user", "content": ticket_text},
]

If 1,000 users submit variations of "I can't log in", a single cached response covers them all at L2. Classification nodes are the highest-ROI caching target in most agent architectures.

Tool selection

messages = [
    {"role": "system", "content": "You have tools: search_orders, check_status, escalate."},
    {"role": "user", "content": "Where is my order?"},
]

Tool routing queries map a limited set of user intents to a fixed set of tools. High repetition, high semantic overlap.

Document summarisation

messages = [
    {"role": "user", "content": f"Summarise: {document_chunk}"},
]

The same document chunk summarised for different users produces the same output. Cache the summary, not the retrieval.
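One simple way to achieve this is to key the cached summary on a hash of the chunk itself, so the requesting user and retrieval path never affect the key. A minimal sketch, with an in-memory dict and an injectable `llm_call` standing in for CacheCore's store and your real LLM wrapper:

```python
import asyncio
import hashlib

_summary_cache: dict[str, str] = {}  # stand-in for a real cache backend

async def summarise_chunk(chunk: str, llm_call) -> str:
    """llm_call: any async str -> str function (e.g. a thin OpenAI wrapper)."""
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in _summary_cache:
        _summary_cache[key] = await llm_call(f"Summarise: {chunk}")
    return _summary_cache[key]

# Demo with a stub model: the second, identical chunk never reaches the LLM.
calls = 0
async def stub_llm(prompt: str) -> str:
    global calls
    calls += 1
    return "a short summary"

asyncio.run(summarise_chunk("Q3 revenue grew 12% year over year.", stub_llm))
asyncio.run(summarise_chunk("Q3 revenue grew 12% year over year.", stub_llm))
print(calls)  # 1 -- one LLM call served both requests
```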

Entity extraction

messages = [
    {"role": "system", "content": "Extract: company name, date, amount."},
    {"role": "user", "content": contract_paragraph},
]

Structured extraction from repeated document types hits L2 frequently.

What does not cache well

# Multi-turn: the message array grows and changes on every turn
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "I need help with my order"},
    {"role": "assistant", "content": "Sure, what's the order number?"},
    {"role": "user", "content": "It's #12345"},
]

The full history is different on every turn. L1 never hits. L2 might occasionally match the last message, but accuracy is unpredictable. Do not rely on caching for stateful multi-turn flows.

Agent architecture for maximum cache hits

Separate stateful orchestration from stateless task execution:

async def classify_intent(user_message: str) -> str:
    """Stateless. Caches well."""
    with cc.request_context(deps=[Dep("intent-model:v1")]):
        response = await openai.chat.completions.create(
            model="gpt-5.4-mini",
            messages=[
                {"role": "system", "content": "Classify as: order_status, billing, or general."},
                {"role": "user", "content": user_message},
            ],
        )
    return response.choices[0].message.content

async def generate_reply(conversation_history: list) -> str:
    """Stateful. Do not expect cache hits."""
    response = await openai.chat.completions.create(
        model="gpt-5.4-mini",
        messages=conversation_history,
    )
    return response.choices[0].message.content

Route cacheable work through dedicated functions with fixed system prompts and no conversation history. Use caching-unaware paths for stateful generation.
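A minimal orchestration loop on top of this split might look like the sketch below. The callables are injected so the example is self-contained; in practice `classify` and `generate` would be the `classify_intent` and `generate_reply` functions above, and the intent handling is illustrative.

```python
import asyncio

async def handle_turn(message: str, history: list, classify, generate) -> str:
    """Route cacheable and non-cacheable work through separate paths.

    classify: stateless call (like classify_intent) -- sees only the new message.
    generate: stateful call (like generate_reply) -- sees the full history.
    """
    history.append({"role": "user", "content": message})
    intent = await classify(message)  # high cache hit rate
    history.append({"role": "system", "content": f"Detected intent: {intent}"})
    reply = await generate(history)   # no cache hits expected
    history.append({"role": "assistant", "content": reply})
    return reply

# Demo with stubs standing in for the LLM-backed functions.
async def fake_classify(msg): return "order_status"
async def fake_generate(hist): return f"({len(hist)} messages in context)"

history = []
print(asyncio.run(handle_turn("Where is my order?", history, fake_classify, fake_generate)))
```

The orchestrator owns the conversation state; the classification step never sees it, which is what keeps that call cacheable.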

Expected hit rates

A well-designed agent workload typically achieves 40-70% cache hit rates on classification and routing nodes. Overall system hit rates depend on the ratio of cacheable to non-cacheable calls.
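A back-of-envelope calculation shows how the mix drives the overall number. The traffic shares and per-node hit rates below are illustrative, not benchmarks:

```python
# Illustrative mix: 60% of calls hit cacheable nodes at a 50% hit rate;
# 40% are stateful generation with effectively 0% hits.
cacheable_share, cacheable_hit_rate = 0.6, 0.5
stateful_share, stateful_hit_rate = 0.4, 0.0

overall = cacheable_share * cacheable_hit_rate + stateful_share * stateful_hit_rate
print(f"{overall:.0%}")  # 30%
```

Raising the share of traffic routed through stateless nodes moves the overall rate far more than tuning any single node.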

Debugging low hit rates

If your hit rate is lower than expected, audit which calls return misses:

| Cause | Fix |
| :--- | :--- |
| System prompt contains dynamic data (timestamps, user names) | Move dynamic data to the user message |
| Tool definitions change frequently | Stabilise tools or use policy-version invalidation |
| Requests are genuinely unique (creative generation) | These will not cache. That is expected. |
| Different models across calls | L2 gates on model name. Standardise. |
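The dynamic-data row is the most common culprit. A before/after sketch (the timestamp field and prompt wording are illustrative):

```python
from datetime import datetime, timezone

user_message = "Where is my order?"
now = datetime.now(timezone.utc).isoformat()

# Before: the timestamp makes every system prompt unique, so exact-match
# lookups never repeat and L2 keys drift on every request.
bad = [
    {"role": "system", "content": f"You are a support bot. Current time: {now}."},
    {"role": "user", "content": user_message},
]

# After: a stable system prompt; dynamic data rides in the user message,
# where L2 semantic matching can tolerate the variation.
good = [
    {"role": "system", "content": "You are a support bot."},
    {"role": "user", "content": f"[{now}] {user_message}"},
]
```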

Dependency invalidation for agents

Tag cacheable calls with version deps to keep the cache fresh when your prompts or models evolve:

with cc.request_context(deps=[Dep("intent-model:v3")]):
    response = await openai.chat.completions.create(...)

# After updating the classifier:
await cc.invalidate("intent-model:v3", new_hash="v4")