How It Works
Cachecore is a transparent reverse proxy. Every request to /v1/chat/completions passes through a multi-stage pipeline that checks the cache, manages concurrency, and writes new entries on a miss.
Request pipeline
POST /v1/chat/completions
│
├─ Rate limit (per-tenant or per-IP bypass)
├─ JWT validation → namespace derivation
│
├─ L1: exact-match lookup
│ ├─ Fresh hit → return cached response (~5ms) [X-Cache: HIT_L1]
│ ├─ Stale hit → return cached response + background refresh (SWR) [X-Cache: HIT_L1_STALE]
│ └─ Miss → continue
│
├─ L2: semantic search
│ ├─ Hit (cosine ≥ 0.92) → return cached response (~15ms) [X-Cache: HIT_L2]
│ └─ Miss → continue
│
├─ Singleflight (distributed lock)
│ ├─ Leader → forward to OpenAI
│ └─ Follower → wait for leader's result
│
└─ Write L1 + L2 entries, register deps, return response [X-Cache: MISS]
L1: exact match
The gateway computes SHA-256 of the full request body. If the hash matches a stored entry in the same namespace, the cached response is returned from Redis in ~5ms.
L1 catches identical API calls: retries, cron jobs, and the same code path executing with the same inputs.
L1 does not match paraphrased requests, whitespace changes, or any parameter difference.
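A minimal sketch of the L1 key computation, illustrating why whitespace changes miss: the hash covers the raw bytes of the body. The `l1:` key prefix is an assumption for illustration, not Cachecore's actual key schema.

```python
import hashlib

def l1_key(namespace: str, body: bytes) -> str:
    # Exact-match key: SHA-256 of the raw request body, scoped to the
    # caller's namespace so tenants never share entries.
    digest = hashlib.sha256(body).hexdigest()
    return f"l1:{namespace}:{digest}"

a = l1_key("ns1", b'{"model":"gpt-4o","messages":[]}')
b = l1_key("ns1", b'{"model":"gpt-4o","messages":[]}')
c = l1_key("ns1", b'{"model": "gpt-4o","messages":[]}')  # one extra space
```

Here `a == b` (byte-identical bodies collapse to one entry) but `c` produces a different key, which is exactly why L1 cannot match paraphrases or formatting differences.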
L2: semantic match
The user's messages are embedded using bge-small-en-v1.5 (384-dimensional vectors). The embedding is searched in a Redis HNSW index using cosine similarity, filtered by namespace.
Five gates must all pass before an L2 hit is served:
- Cosine similarity ≥ 0.92
- Same namespace (tenant isolation)
- Entry not expired
- All dependency hashes still valid
- Model name matches
L2 catches paraphrased requests: "summarise contract #123" and "please summarize contract number 123" match the same entry.
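The five gates can be sketched as a single admission check. The `Entry` shape and function names here are illustrative assumptions; only the gate conditions come from the list above.

```python
import time
from dataclasses import dataclass

SIM_THRESHOLD = 0.92

@dataclass
class Entry:
    namespace: str
    model: str
    expires_at: float
    dep_hashes: dict  # dependency name -> hash recorded at write time

def l2_admissible(entry, similarity, namespace, model, current_deps, now=None):
    """Return True only if all five L2 gates pass."""
    now = time.time() if now is None else now
    return (
        similarity >= SIM_THRESHOLD             # 1. cosine similarity
        and entry.namespace == namespace        # 2. tenant isolation
        and now < entry.expires_at              # 3. not expired
        and all(current_deps.get(k) == v        # 4. dependencies still valid
                for k, v in entry.dep_hashes.items())
        and entry.model == model                # 5. model name matches
    )
```

Because the gates are conjunctive, a near-perfect similarity score still misses if, say, a dependency hash has changed since the entry was written.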
Stale-while-revalidate (SWR)
Cache entries have two time windows:
| Window  | Default          | Behaviour                                                |
|---------|------------------|----------------------------------------------------------|
| Fresh   | 3,000s (50 min)  | Served directly from cache                               |
| Stale   | 600s (10 min)    | Served from cache; background job re-fetches from OpenAI |
| Expired | after 3,600s     | Key deleted from Redis                                   |
SWR keeps latency low as entries age. The background refresh runs at most once per stale entry (guarded by a distributed lock). SWR is skipped if the originating JWT has expired or the gateway is shutting down.
Per-tenant TTL overrides are supported via JWT claims (`fresh_ttl_secs`, `stale_window_secs`), capped by server-side maximums.
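The window logic reduces to classifying an entry by its age. A minimal sketch using the default windows, assuming age is measured from the entry's write time:

```python
def swr_state(age_secs: float, fresh_ttl: int = 3000, stale_window: int = 600) -> str:
    # Classify a cache entry by age:
    #   fresh   -> serve directly
    #   stale   -> serve from cache, trigger background refresh
    #   expired -> treat as a miss (key is deleted)
    if age_secs < fresh_ttl:
        return "fresh"
    if age_secs < fresh_ttl + stale_window:
        return "stale"
    return "expired"
```

Per-tenant overrides would simply change the `fresh_ttl` and `stale_window` arguments, subject to the server-side caps.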
Singleflight
When multiple identical requests arrive concurrently and all miss cache, only one (the leader) is forwarded to OpenAI. The others (followers) wait up to 5 seconds for the leader to write the result, then read it from L1. This prevents thundering-herd scenarios on cold starts.
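The leader/follower control flow can be sketched in-process with a condition event; Cachecore uses a distributed Redis lock, but the shape is the same. The class and method names are illustrative.

```python
import threading

class Singleflight:
    """In-process sketch: one leader per key does the work, followers wait."""

    def __init__(self, follower_timeout: float = 5.0):
        self._lock = threading.Lock()
        self._calls = {}  # key -> (done event, result holder)
        self._timeout = follower_timeout

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            if call is None:
                # No in-flight call for this key: we become the leader.
                event, holder = threading.Event(), {}
                self._calls[key] = (event, holder)
                is_leader = True
            else:
                event, holder = call
                is_leader = False
        if is_leader:
            try:
                holder["result"] = fn()  # only the leader hits upstream
            finally:
                event.set()
                with self._lock:
                    del self._calls[key]
            return holder["result"], True
        # Follower: wait (up to the timeout) for the leader's result.
        if not event.wait(self._timeout):
            raise TimeoutError("leader did not finish in time")
        return holder["result"], False
```

In the real gateway the followers read the leader's result from L1 rather than from shared memory, but the invariant is identical: one upstream call per key, no matter how many concurrent identical misses.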
Namespace derivation
A namespace is a SHA-256 hash of: tenant ID, policy version, system prompt fingerprint, toolset fingerprint, and permission scope. Two requests can only share cache entries if they produce the same namespace. See Namespaces.
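A sketch of the derivation under stated assumptions: the system prompt and toolset fingerprints are themselves SHA-256 hashes, and the five components are joined with a separator before the final hash. The exact encoding is an assumption; only the five inputs come from the text.

```python
import hashlib

def _fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def derive_namespace(tenant_id: str, policy_version: str,
                     system_prompt: str, toolset: str, scope: str) -> str:
    # Hash the five components together; any difference in any one of
    # them yields a different namespace, and therefore a disjoint cache.
    material = "\x1f".join([
        tenant_id,
        policy_version,
        _fingerprint(system_prompt),  # system prompt fingerprint
        _fingerprint(toolset),        # toolset fingerprint
        scope,
    ])
    return hashlib.sha256(material.encode()).hexdigest()
```

Two tenants with identical prompts and tools still land in different namespaces, while the same tenant gets a fresh namespace the moment its system prompt or policy version changes.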