What is Cachecore?
Cachecore is a semantic caching gateway for LLM APIs. It sits between your application and OpenAI, caching responses at two levels to reduce cost and latency for repeated or similar requests.
The problem
LLM API calls are slow (1-3 seconds) and expensive. In agent workloads, many of these calls are redundant: the same classification prompt runs thousands of times, the same tool-selection query repeats across sessions, the same document gets summarised for different users. Without caching, you pay full price every time.
What Cachecore does
You point your OpenAI SDK at Cachecore's gateway instead of OpenAI directly. Cachecore intercepts every request and checks two cache tiers before forwarding:
L1 (exact match): The request body is hashed. If an identical request was seen before in the same tenant namespace, the cached response is returned in ~5ms.
L2 (semantic match): The prompt is embedded using bge-small-en-v1.5 and searched against a Redis HNSW index. If a semantically similar request scores above 0.92 cosine similarity, the cached response is returned in ~15ms.
Miss: No match found. The request is forwarded to OpenAI, the response is cached, and both L1 and L2 entries are written.
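The lookup order above can be sketched with an in-memory stand-in for both tiers. Everything here is illustrative, not Cachecore's actual implementation: the dict/list stores replace Redis, and the `embed` callback stands in for the bge-small-en-v1.5 model.

```python
import hashlib
import json
import math

SIM_THRESHOLD = 0.92   # L2 cosine-similarity cutoff

l1_cache = {}          # exact-match tier: request hash -> response
l2_cache = []          # semantic tier: (embedding, response) pairs

def request_key(body: dict) -> str:
    # L1 key: hash of the canonicalised request body
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup(body: dict, embed):
    # 1. L1: exact match on the hashed request body
    key = request_key(body)
    if key in l1_cache:
        return l1_cache[key]
    # 2. L2: similarity search over prompt embeddings
    vec = embed(body["messages"][-1]["content"])
    for cached_vec, response in l2_cache:
        if cosine(vec, cached_vec) >= SIM_THRESHOLD:
            return response
    # 3. Miss: the caller forwards to OpenAI, then writes both tiers
    return None

def store(body: dict, response, embed):
    l1_cache[request_key(body)] = response
    l2_cache.append((embed(body["messages"][-1]["content"]), response))
```

In the real gateway, the L2 search runs against a Redis HNSW index rather than a linear scan, but the hit/miss decision logic is the same.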
Integration
The integration is a one-line change: point base_url at Cachecore's gateway.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cachecore.it/v1",  # the only change
    api_key="cc_live_xxxxx.eyJ...",
)
```
Your existing code works unchanged. Cachecore is a transparent proxy: same API contract, same response format.
Core capabilities
Two-tier caching. L1 handles identical requests. L2 handles paraphrased or semantically equivalent requests. Together, they cover both exact repetition and natural language variation.
Tenant isolation. Every cache entry is scoped to a cryptographic namespace derived from your tenant ID, system prompt, tool definitions, and policy version. Tenants never share cache entries.
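One way to picture the namespace derivation: hash all four components together, so that changing any one of them yields a disjoint cache. This is a minimal sketch; the function name and hashing scheme are assumptions, not Cachecore's documented internals.

```python
import hashlib
import json

def cache_namespace(tenant_id: str, system_prompt: str,
                    tool_definitions: list, policy_version: str) -> str:
    """Derive an isolation namespace from the four inputs listed above.

    Any change to tenant, system prompt, tools, or policy version produces
    a different namespace, so entries written under one combination can
    never be read under another.
    """
    material = json.dumps(
        [tenant_id, system_prompt, tool_definitions, policy_version],
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()
```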
Dependency invalidation. Tag cached responses with data dependencies (e.g., doc:contract-123). When the underlying data changes, invalidate the tag and all associated cache entries are deleted.
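The tag mechanism amounts to a reverse index from tags to cache keys. A minimal in-memory sketch (the `put`/`invalidate` names and dict stores are illustrative, not Cachecore's API):

```python
from collections import defaultdict

cache = {}                      # cache key -> response
tag_index = defaultdict(set)    # tag -> keys of entries depending on it

def put(key, response, tags):
    # Cache a response and record which data it depends on
    cache[key] = response
    for tag in tags:
        tag_index[tag].add(key)

def invalidate(tag):
    # When the underlying data changes, delete every dependent entry
    for key in tag_index.pop(tag, set()):
        cache.pop(key, None)
```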
Stale-while-revalidate. After the fresh TTL expires, stale entries are served immediately while a background job refreshes the cache from OpenAI. Latency stays low even as entries age.
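The serve-stale logic can be sketched as follows. The TTL values and the synchronous `refresh` callback are assumptions for illustration; in the gateway the refresh runs as a background job against OpenAI rather than inline.

```python
import time

FRESH_TTL = 60     # seconds an entry is served without any refresh (assumed)
STALE_TTL = 3600   # seconds after which an entry is evicted outright (assumed)

cache = {}         # key -> (response, stored_at)

def get(key, refresh):
    """Return a cached response, triggering a refresh if it is stale."""
    entry = cache.get(key)
    if entry is None:
        return None                  # miss: caller fetches upstream
    response, stored_at = entry
    age = time.time() - stored_at
    if age > STALE_TTL:
        del cache[key]
        return None                  # too old to serve: treat as a miss
    if age > FRESH_TTL:
        # Stale: serve the old response immediately and kick off a refresh.
        # (A synchronous call stands in for the background job here.)
        refresh(key)
    return response
```

The caller still sees cache-hit latency on stale entries; only the background refresh pays the full OpenAI round trip.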
When to use Cachecore
Cachecore works best for task-oriented LLM calls that repeat across users or sessions: classification, routing, entity extraction, document summarisation, tool selection. These workloads typically achieve 40-70% cache hit rates.
Cachecore is less effective for stateful multi-turn conversations where the full message history changes on every turn. See Caching for AI Agents for a detailed breakdown.