# Semantic Caching
Reduce LLM costs and latency by caching semantically similar inputs.
## How Semantic Caching Works
Traditional caching matches inputs exactly. Semantic caching matches inputs by meaning.
"How do I fix a Python bug?" → Cache MISS → Execute → 2.0s
"Tell me how to debug Python." → Cache HIT → Return → 0.02s
VectorWave converts function inputs to embedding vectors and compares them using cosine similarity. If a new input is similar enough to a cached one, the stored result is returned instantly.
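The similarity check itself is just a vector comparison. Below is a minimal sketch of that idea (not VectorWave's internal code); `embed()` is a hypothetical stand-in for whatever embedding model is configured:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical usage: embed() stands in for the configured embedding model.
# new_vec = embed("Tell me how to debug Python code.")
# cached_vec = embed("How do I fix a Python bug?")
# hit = cosine_similarity(new_vec, cached_vec) >= 0.95  # cache_threshold
```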
## Basic Usage

```python
import time
from vectorwave import vectorize, initialize_database

initialize_database()

@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    capture_return_value=True,  # Required for caching (see Configuration)
    auto=True,
)
def expensive_llm_task(query: str):
    time.sleep(2)  # Simulates LLM API call
    return f"Processed result for: {query}"

# First call: Cache Miss → executes normally (2.0s)
print(expensive_llm_task("How do I fix a Python bug?"))

# Second call: Cache Hit → returns instantly (0.02s!)
print(expensive_llm_task("Tell me how to debug Python code."))
```
## Configuration

### cache_threshold

The cosine similarity threshold for considering two inputs as "similar enough."

```python
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,  # 0.0 to 1.0
)
```
| Threshold | Behavior | Use Case |
|---|---|---|
| 0.99 | Very strict — nearly identical inputs only | Financial calculations |
| 0.95 | Recommended default — similar meanings | General LLM caching |
| 0.90 | Lenient — broader matches | FAQ / Knowledge base |
| 0.85 | Very lenient — loose semantic matches | Creative / exploratory |
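In practice you tune the threshold per function. A sketch of what that might look like with the decorator shown above (the function bodies and `llm` client are placeholders, as in the other examples):

```python
# Strict matching: only near-identical questions share a cached answer
@vectorize(semantic_cache=True, cache_threshold=0.99, capture_return_value=True)
def explain_fee_calculation(query: str):
    return llm.complete(query)

# Lenient matching: FAQ-style questions with many phrasings of the same intent
@vectorize(semantic_cache=True, cache_threshold=0.90, capture_return_value=True)
def answer_faq(query: str):
    return llm.complete(query)
```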
### capture_return_value

Required for caching. Without this, VectorWave can't return a cached result:

```python
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    capture_return_value=True,  # Stores the return value
)
def my_function(query: str):
    return llm.complete(query)
```
## Cache Scope (Multi-tenancy)
By default, the cache is global. Use semantic_cache_scope to isolate caches by dynamically extracting filter values from function arguments at runtime:
```python
# Per-user cache isolation
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    semantic_cache_scope=["user_id"],  # Extract from function args
)
def personalized_query(user_id: str, query: str):
    return llm.complete(query, user_context=get_user(user_id))

# Multi-dimension cache isolation
@vectorize(
    semantic_cache=True,
    semantic_cache_scope=["user_id", "region"],  # Multiple scope keys
)
def regional_query(user_id: str, region: str, query: str):
    return llm.complete(query, region=region)
```
When semantic_cache_scope=["user_id"] and the function is called with fn(user_id="U123", query="..."), the cache lookup automatically filters by {"user_id": "U123"}. This ensures each user gets their own isolated cache.
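As a rough illustration of that extraction step (conceptual only, not VectorWave's implementation), the scope keys are simply looked up in the bound call arguments:

```python
import inspect

def scope_filters(func, scope_keys, *args, **kwargs):
    """Illustration only: bind the call and pull out the scope keys as cache filters."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return {key: bound.arguments[key] for key in scope_keys}

def personalized_query(user_id: str, query: str):  # same signature as above, body omitted
    ...

print(scope_filters(personalized_query, ["user_id"], user_id="U123", query="..."))
# {'user_id': 'U123'}  applied as an extra filter on the cache lookup
```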
For static filters (not derived from function args), use semantic_cache_filters:
```python
@vectorize(
    semantic_cache=True,
    semantic_cache_filters={"environment": "production"},
)
def production_query(query: str):
    return llm.complete(query)
```
## Golden Dataset Priority
When semantic caching is enabled, VectorWave uses a 2-tier cache lookup:
- Priority 1: Golden Dataset — Searched first. These are manually curated, verified results.
- Priority 2: Standard Executions — Searched if no Golden match. Uses all successful execution logs.
Golden matches are deterministic — they ensure consistent, high-quality results for known input patterns. See Golden Dataset for management details.
## How Cache Lookup Works

```
Input: "How do I debug Python?"
        │
        ▼
Embedding Vector
[0.12, -0.45, 0.78, ...]
        │
        ▼
Search Golden Dataset (Priority 1)
        │
   ┌────┴────┐
  Miss      Hit → Return Golden Result
   │
   ▼
Search Standard Executions (Priority 2)
        │
        ▼
Cosine Similarity Check
┌──────────────────────────┐
│ Cached: "Fix Python bug" │
│ Similarity: 0.97         │
│ Threshold: 0.95          │
│ Result: HIT ✓            │
└──────────────────────────┘
        │
        ▼
Return Cached Result
(0.02s vs 2.5s)
```
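Expressed as code, the two-tier lookup amounts to the sketch below. This is conceptual pseudocode, not VectorWave's internals; `embed()`, `search_golden()`, and `search_standard()` are hypothetical helpers:

```python
def semantic_lookup(query: str, threshold: float = 0.95):
    vector = embed(query)  # hypothetical embedding call

    # Priority 1: manually curated Golden Dataset
    golden = search_golden(vector, threshold=threshold)
    if golden is not None:
        return golden.result  # deterministic, verified answer

    # Priority 2: past successful executions
    match = search_standard(vector, threshold=threshold)
    if match is not None:
        return match.return_value  # standard cache hit

    return None  # cache miss: execute the function and log the result
```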
## Performance Impact
| Metric | Without Caching | With Caching (Hit) |
|---|---|---|
| Latency | ~2.5s (LLM API) | ~0.02s |
| Cost per call | ~$0.03 | $0.00 |
| Token usage | Full | Zero |
For applications with repetitive queries (customer support, FAQ bots, search), caching typically achieves a 60-90% hit rate, reducing LLM costs proportionally.
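For example, at an 80% hit rate, 100,000 calls that would otherwise cost about $3,000 (at ~$0.03 per call) drop to roughly $600, since only the ~20,000 misses actually reach the LLM API.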
## Monitoring Cache Performance

Check cache hit rates via VectorWave's execution logs:

```python
from vectorwave import search_executions

# Find all cached executions for a function
cached = search_executions(
    limit=100,
    filters={"function_name": "expensive_llm_task", "cache_hit": True},
    sort_by="timestamp_utc",
)
print(f"Cache hits: {len(cached)}")
```
VectorSurfer: Real-time cache hit rates, performance metrics, and execution timelines are visualized in the VectorSurfer dashboard — monitor caching performance at a glance.
## External Resources
- OpenAI API Documentation — Embedding API used for semantic similarity
## Next Steps
- Golden Dataset — Curate high-quality data for cache priority
- Self-Healing — Automatic error diagnosis
- Drift Detection — Monitor input quality