# Semantic Caching
Reduce LLM costs and latency by caching semantically similar inputs.
## How Semantic Caching Works
Traditional caching matches inputs exactly. Semantic caching matches inputs by meaning.
"How do I fix a Python bug?" → Cache MISS → Execute → 2.0s
"Tell me how to debug Python." → Cache HIT → Return → 0.02s
VectorWave converts function inputs to embedding vectors and compares them using cosine similarity. If a new input is similar enough to a cached one, the stored result is returned instantly.
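The similarity check itself is just a vector comparison. Below is a minimal sketch of that idea (not VectorWave's internal code); `embed()` is a hypothetical stand-in for whatever embedding model is configured:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical usage: embed() stands in for the configured embedding model.
# new_vec = embed("Tell me how to debug Python code.")
# cached_vec = embed("How do I fix a Python bug?")
# hit = cosine_similarity(new_vec, cached_vec) >= 0.95  # cache_threshold
```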
## Basic Usage

```python
import time
from vectorwave import vectorize, initialize_database

initialize_database()

@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    capture_return_value=True,  # Required for caching (see Configuration)
    auto=True,
)
def expensive_llm_task(query: str):
    time.sleep(2)  # Simulates LLM API call
    return f"Processed result for: {query}"

# First call: Cache Miss → executes normally (2.0s)
print(expensive_llm_task("How do I fix a Python bug?"))

# Second call: Cache Hit → returns instantly (0.02s!)
print(expensive_llm_task("Tell me how to debug Python code."))
```
## Configuration

### cache_threshold

The cosine similarity threshold for considering two inputs as "similar enough."

```python
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,  # 0.0 to 1.0
)
```
| Threshold | Behavior | Use Case |
|---|---|---|
| 0.99 | Very strict — nearly identical inputs only | Financial calculations |
| 0.95 | Recommended default — similar meanings | General LLM caching |
| 0.90 | Lenient — broader matches | FAQ / Knowledge base |
| 0.85 | Very lenient — loose semantic matches | Creative / exploratory |
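In practice you tune the threshold per function. A sketch of what that might look like with the decorator shown above (the function bodies and `llm` client are placeholders, as in the other examples):

```python
# Strict matching: only near-identical questions share a cached answer
@vectorize(semantic_cache=True, cache_threshold=0.99, capture_return_value=True)
def explain_fee_calculation(query: str):
    return llm.complete(query)

# Lenient matching: FAQ-style questions with many phrasings of the same intent
@vectorize(semantic_cache=True, cache_threshold=0.90, capture_return_value=True)
def answer_faq(query: str):
    return llm.complete(query)
```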
### capture_return_value

Required for caching. Without this, VectorWave can't return a cached result:

```python
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    capture_return_value=True,  # Stores the return value
)
def my_function(query: str):
    return llm.complete(query)
```
## Cache Scope (Multi-tenancy)
By default, the cache is global. Use semantic_cache_scope to isolate caches by dynamically extracting filter values from function arguments at runtime:
```python
# Per-user cache isolation
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    semantic_cache_scope=["user_id"],  # Extract from function args
)
def personalized_query(user_id: str, query: str):
    return llm.complete(query, user_context=get_user(user_id))

# Multi-dimension cache isolation
@vectorize(
    semantic_cache=True,
    semantic_cache_scope=["user_id", "region"],  # Multiple scope keys
)
def regional_query(user_id: str, region: str, query: str):
    return llm.complete(query, region=region)
```
When semantic_cache_scope=["user_id"] and the function is called with fn(user_id="U123", query="..."), the cache lookup automatically filters by {"user_id": "U123"}. This ensures each user gets their own isolated cache.
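As a rough illustration of that extraction step (conceptual only, not VectorWave's implementation), the scope keys are simply looked up in the bound call arguments:

```python
import inspect

def scope_filters(func, scope_keys, *args, **kwargs):
    """Illustration only: bind the call and pull out the scope keys as cache filters."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return {key: bound.arguments[key] for key in scope_keys}

def personalized_query(user_id: str, query: str):  # same signature as above, body omitted
    ...

print(scope_filters(personalized_query, ["user_id"], user_id="U123", query="..."))
# {'user_id': 'U123'}  applied as an extra filter on the cache lookup
```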
For static filters (not derived from function args), use semantic_cache_filters:
```python
@vectorize(
    semantic_cache=True,
    semantic_cache_filters={"environment": "production"},
)
def production_query(query: str):
    return llm.complete(query)
```
## Golden Dataset Priority
When semantic caching is enabled, VectorWave uses a 2-tier cache lookup:
- Priority 1: Golden Dataset — Searched first. These are manually curated, verified results.
- Priority 2: Standard Executions — Searched if no Golden match. Uses all successful execution logs.
Golden matches are deterministic — they ensure consistent, high-quality results for known input patterns. See Golden Dataset for management details.
## How Cache Lookup Works

```
Input: "How do I debug Python?"
        │
        ▼
Embedding Vector
[0.12, -0.45, 0.78, ...]
        │
        ▼
Search Golden Dataset (Priority 1)
        │
   ┌────┴────┐
  Miss      Hit → Return Golden Result
   │
   ▼
Search Standard Executions (Priority 2)
        │
        ▼
Cosine Similarity Check
┌──────────────────────────┐
│ Cached: "Fix Python bug" │
│ Similarity: 0.97         │
│ Threshold: 0.95          │
│ Result: HIT ✓            │
└──────────────────────────┘
        │
        ▼
Return Cached Result
(0.02s vs 2.5s)
```
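Expressed as code, the two-tier lookup amounts to the sketch below. This is conceptual pseudocode, not VectorWave's internals; `embed()`, `search_golden()`, and `search_standard()` are hypothetical helpers:

```python
def semantic_lookup(query: str, threshold: float = 0.95):
    vector = embed(query)  # hypothetical embedding call

    # Priority 1: manually curated Golden Dataset
    golden = search_golden(vector, threshold=threshold)
    if golden is not None:
        return golden.result  # deterministic, verified answer

    # Priority 2: past successful executions
    match = search_standard(vector, threshold=threshold)
    if match is not None:
        return match.return_value  # standard cache hit

    return None  # cache miss: execute the function and log the result
```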
## Performance Impact
| Metric | Without Caching | With Caching (Hit) |
|---|---|---|
| Latency | ~2.5s (LLM API) | ~0.02s |
| Cost per call | ~$0.03 | $0.00 |
| Token usage | Full | Zero |
For applications with repetitive queries (customer support, FAQ bots, search), caching typically achieves a 60-90% hit rate, reducing LLM costs proportionally.
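For example, at an 80% hit rate, 100,000 calls that would otherwise cost about $3,000 (at ~$0.03 per call) drop to roughly $600, since only the ~20,000 misses actually reach the LLM API.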
## Monitoring Cache Performance

Check cache hit rates via VectorWave's execution logs:

```python
from vectorwave import search_executions

# Find all cached executions for a function
cached = search_executions(
    limit=100,
    filters={"function_name": "expensive_llm_task", "cache_hit": True},
    sort_by="timestamp_utc",
)
print(f"Cache hits: {len(cached)}")
```
VectorSurfer: Real-time cache hit rates, performance metrics, and execution timelines are visualized in the VectorSurfer dashboard — monitor caching performance at a glance.
## External Resources
- OpenAI API Documentation — Embedding API used for semantic similarity
## Next Steps
- Golden Dataset — Curate high-quality data for cache priority
- Self-Healing — Automatic error diagnosis
- Drift Detection — Monitor input quality