L1 & L2 Caching

Cachecore uses two cache tiers. L1 handles identical requests via exact hash matching. L2 handles semantically similar requests via vector search. Together, they cover both exact repetition and natural language variation.

L1: exact match

Latency: ~5ms

The full request body is hashed with SHA-256. If the hash matches a stored entry in the same namespace, the cached response is returned from Redis.

L1 hits when the request is byte-identical to a previous one: same model, same messages, same parameters. This catches retries, cron jobs, and identical code paths.

L1 does not match paraphrased requests, whitespace differences, or any parameter change.

L2: semantic match

Latency: ~15ms

The user's messages are embedded using bge-small-en-v1.5 (384-dimensional dense vectors). The embedding is searched against a Redis HNSW index using cosine similarity, filtered by namespace.

Threshold: 0.92 cosine similarity. A cached response is only returned if the match exceeds this threshold.

L2 catches paraphrased prompts:

"Summarise contract #123" matches "Please summarize contract number 123"
"Classify as billing or technical" matches "Is this a billing issue or a technical issue?"

L2 does not match requests with substantially different meaning, even if they share some keywords.

Acceptance gates

Before an L2 hit is served, five conditions are checked:

Cosine similarity ≥ 0.92
Same namespace (enforces tenant isolation)
Entry has not expired
All declared dependency hashes are still valid
Model name matches (a gpt-5.4-mini request cannot hit a gpt-5.4 cache entry)

If any gate fails, the request is treated as a miss.

TTL and stale-while-revalidate

Cache entries have two time windows:

| Window | Default | Behaviour | |--------|---------|-----------| | Fresh | 3,000s (50 min) | Served directly from cache | | Stale | 600s (10 min) | Served from cache; background job refreshes from OpenAI | | Expired | after 3,600s | Redis key deleted, request is a miss |

Redis SETEX TTL = fresh_ttl_secs + stale_window_secs.

Per-tenant overrides are available via JWT claims (fresh_ttl_secs, stale_window_secs), capped by server-side maximums (max_fresh_ttl_secs: 86,400s, max_stale_window_secs: 3,600s).

When each tier applies

| Scenario | Expected tier | Reason | |----------|---------------|--------| | Cron job re-running the same prompt | L1 | Identical request body | | Users asking similar questions | L2 | Semantic match across paraphrases | | Multi-turn chat with history | Miss | Message array changes every turn | | Same document summarised by different users | L2 | Same content, different phrasing | | Retry after transient failure | L1 | Identical request |