CacheCore

How It Works

CacheCore is a transparent reverse proxy. Every request to /v1/chat/completions passes through a multi-stage pipeline that checks the cache, manages concurrency, and writes new entries on a miss.

Request pipeline

POST /v1/chat/completions
│
├─ Rate limit (per-tenant or per-IP bypass)
├─ JWT validation → namespace derivation
│
├─ L1: exact-match lookup
│   ├─ Fresh hit  → return cached response (~5ms)   [X-Cache: HIT_L1]
│   ├─ Stale hit  → return cached response + background refresh (SWR)  [X-Cache: HIT_L1_STALE]
│   └─ Miss       → continue
│
├─ L2: semantic search
│   ├─ Hit (cosine ≥ 0.92)  → return cached response (~15ms)  [X-Cache: HIT_L2]
│   └─ Miss                 → continue
│
├─ Singleflight (distributed lock)
│   ├─ Leader  → forward to OpenAI
│   └─ Follower → wait for leader's result
│
└─ Write L1 + L2 entries, register deps, return response  [X-Cache: MISS]

L1: exact match

The gateway computes SHA-256 of the full request body. If the hash matches a stored entry in the same namespace, the cached response is returned from Redis in ~5ms.

L1 catches identical API calls: retries, cron jobs, and the same code path executing with the same inputs.

L1 does not match paraphrased requests, whitespace changes, or any parameter difference.
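A minimal sketch of the L1 key computation described above. The key layout and function name are illustrative assumptions, not CacheCore's actual code; the two facts from the text are the SHA-256 of the full request body and the namespace scoping:

```python
import hashlib

def l1_cache_key(namespace: str, request_body: bytes) -> str:
    # SHA-256 of the raw request body: any byte-level difference
    # (whitespace, parameter order, paraphrasing) yields a different key.
    body_hash = hashlib.sha256(request_body).hexdigest()
    # Scope the key by namespace so tenants never share entries.
    # The "l1:" prefix is a hypothetical key layout.
    return f"l1:{namespace}:{body_hash}"
```

Because the hash covers the entire body, only byte-for-byte identical requests collide, which is exactly why retries and cron jobs hit L1 while paraphrases fall through to L2.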

L2: semantic match

The user's messages are embedded using bge-small-en-v1.5 (384-dimensional vectors). The embedding is searched in a Redis HNSW index using cosine similarity, filtered by namespace.

A cached response is returned if similarity is ≥ 0.92. Five gates must pass before an L2 hit is served:

  1. Cosine similarity ≥ 0.92
  2. Same namespace (tenant isolation)
  3. Entry not expired
  4. All dependency hashes still valid
  5. Model name matches

L2 catches paraphrased requests: "summarise contract #123" and "please summarize contract number 123" match the same entry.

Stale-while-revalidate (SWR)

Cache entries have two time windows:

| Window  | Default          | Behaviour                                                  |
|---------|------------------|------------------------------------------------------------|
| Fresh   | 3,000s (50 min)  | Served directly from cache                                 |
| Stale   | 600s (10 min)    | Served from cache; background job re-fetches from OpenAI   |
| Expired | after 3,600s     | Key deleted from Redis                                     |

SWR keeps latency low as entries age. The background refresh runs at most once per stale entry (guarded by a distributed lock). SWR is skipped if the originating JWT has expired or the gateway is shutting down.

Per-tenant TTL overrides are supported via JWT claims (fresh_ttl_secs, stale_window_secs), capped by server-side maximums.
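As a sketch, the window classification and the TTL clamping might look like the following. The default windows are from the table above; the server-side maximums (3,600s fresh, 900s stale) are illustrative assumptions, as the document does not state them:

```python
FRESH_TTL_SECS = 3_000    # default fresh window (50 min)
STALE_WINDOW_SECS = 600   # default stale window (10 min)

def swr_state(age_secs: float,
              fresh_ttl: int = FRESH_TTL_SECS,
              stale_window: int = STALE_WINDOW_SECS) -> str:
    """Classify a cache entry by age: fresh, stale, or expired."""
    if age_secs < fresh_ttl:
        return "fresh"      # served directly from cache
    if age_secs < fresh_ttl + stale_window:
        return "stale"      # served from cache; background refresh triggered
    return "expired"        # key deleted from Redis

def clamp_tenant_ttls(claims: dict,
                      max_fresh: int = 3_600,     # assumed server-side cap
                      max_stale: int = 900) -> tuple[int, int]:
    """Apply per-tenant JWT overrides, capped by server-side maximums."""
    fresh = min(int(claims.get("fresh_ttl_secs", FRESH_TTL_SECS)), max_fresh)
    stale = min(int(claims.get("stale_window_secs", STALE_WINDOW_SECS)), max_stale)
    return fresh, stale
```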

Singleflight

When multiple identical requests arrive concurrently and all miss cache, only one (the leader) is forwarded to OpenAI. The others (followers) wait up to 5 seconds for the leader to write the result, then read it from L1. This prevents thundering-herd scenarios on cold starts.
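The leader/follower flow can be sketched with an in-process lock. CacheCore uses a distributed lock (Redis), so this single-process version only illustrates the control flow; the 5-second follower wait is from the text, everything else is an assumption:

```python
import threading

FOLLOWER_TIMEOUT_SECS = 5.0  # followers wait this long for the leader

class Singleflight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, object] = {}

    def do(self, key: str, fetch):
        with self._lock:
            event = self._inflight.get(key)
            if event is None:
                # Leader: claim the key; only this caller goes upstream.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            try:
                self._results[key] = fetch()  # forward to OpenAI
            finally:
                event.set()
                with self._lock:
                    del self._inflight[key]
            return self._results[key]
        # Follower: wait for the leader's write, then read the result.
        if not event.wait(timeout=FOLLOWER_TIMEOUT_SECS):
            raise TimeoutError("leader did not complete in time")
        return self._results[key]
```

In the real gateway the followers read the leader's result out of L1 rather than a local dict, but the shape is the same: one upstream call per key, no thundering herd.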

Namespace derivation

A namespace is a SHA-256 hash of: tenant ID, policy version, system prompt fingerprint, toolset fingerprint, and permission scope. Two requests can only share cache entries if they produce the same namespace. See Namespaces.
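A sketch of the derivation, assuming the five components listed above. The serialization (separator, encoding) is a hypothetical choice, not CacheCore's wire format:

```python
import hashlib

def derive_namespace(tenant_id: str, policy_version: str,
                     system_prompt_fp: str, toolset_fp: str,
                     permission_scope: str) -> str:
    # Join components with an unambiguous separator before hashing,
    # so ("ab", "c") and ("a", "bc") cannot collide. The separator
    # choice is an assumption made for this sketch.
    material = "\x1f".join(
        [tenant_id, policy_version, system_prompt_fp, toolset_fp, permission_scope]
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Changing any single component, such as bumping the policy version, yields a new namespace and therefore an effectively empty cache for that tenant.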