VectorWave

Semantic Caching

Reduce LLM costs and latency by reusing cached results for semantically similar inputs.

How Semantic Caching Works

Traditional caching matches inputs exactly. Semantic caching matches inputs by meaning.

"How do I fix a Python bug?"     → Cache MISS → Execute → 2.0s
"Tell me how to debug Python."   → Cache HIT  → Return  → 0.02s

VectorWave converts function inputs to embedding vectors and compares them using cosine similarity. If a new input is similar enough to a cached one, the stored result is returned instantly.
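
Under the hood, the check is a cosine-similarity comparison between embedding vectors. The snippet below is a minimal, illustrative sketch of that comparison, not VectorWave's actual implementation; the embedding step itself is handled by whatever embedding model is configured.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the two vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vector: np.ndarray, cached_entries, threshold: float = 0.95):
    # cached_entries: list of (embedding, result) pairs from previous executions
    best_score, best_result = 0.0, None
    for vector, result in cached_entries:
        score = cosine_similarity(query_vector, vector)
        if score >= threshold and score > best_score:
            best_score, best_result = score, result
    return best_result  # None means a cache miss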

Basic Usage

import time
from vectorwave import vectorize, initialize_database

initialize_database()

@vectorize(semantic_cache=True, cache_threshold=0.95, auto=True)
def expensive_llm_task(query: str):
    time.sleep(2)  # Simulates LLM API call
    return f"Processed result for: {query}"

# First call: Cache Miss → executes normally (2.0s)
print(expensive_llm_task("How do I fix a Python bug?"))

# Second call: Cache Hit → returns instantly (0.02s!)
print(expensive_llm_task("Tell me how to debug Python code."))

Configuration

cache_threshold

The cosine similarity threshold for considering two inputs as "similar enough."

@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,  # 0.0 to 1.0
)

Threshold    Behavior                                        Use Case
0.99         Very strict — nearly identical inputs only      Financial calculations
0.95         Recommended default — similar meanings          General LLM caching
0.90         Lenient — broader matches                       FAQ / Knowledge base
0.85         Very lenient — loose semantic matches           Creative / exploratory

capture_return_value

Required for caching. Without this, VectorWave can't return a cached result:

@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    capture_return_value=True,  # Stores the return value
)
def my_function(query: str):
    return llm.complete(query)

Cache Scope (Multi-tenancy)

By default, the cache is global. Use semantic_cache_scope to isolate caches by dynamically extracting filter values from function arguments at runtime:

# Per-user cache isolation
@vectorize(
    semantic_cache=True,
    cache_threshold=0.95,
    semantic_cache_scope=["user_id"],  # Extract from function args
)
def personalized_query(user_id: str, query: str):
    return llm.complete(query, user_context=get_user(user_id))

# Multi-dimension cache isolation
@vectorize(
    semantic_cache=True,
    semantic_cache_scope=["user_id", "region"],  # Multiple scope keys
)
def regional_query(user_id: str, region: str, query: str):
    return llm.complete(query, region=region)

When semantic_cache_scope=["user_id"] and the function is called with fn(user_id="U123", query="..."), the cache lookup automatically filters by {"user_id": "U123"}. This ensures each user gets their own isolated cache.

For static filters (not derived from function args), use semantic_cache_filters:

@vectorize(
    semantic_cache=True,
    semantic_cache_filters={"environment": "production"},
)
def production_query(query: str):
    return llm.complete(query)

Golden Dataset Priority

When semantic caching is enabled, VectorWave uses a 2-tier cache lookup:

  1. Priority 1: Golden Dataset — Searched first. These are manually curated, verified results.
  2. Priority 2: Standard Executions — Searched if no Golden match. Uses all successful execution logs.

Golden matches are deterministic — they ensure consistent, high-quality results for known input patterns. See Golden Dataset for management details.
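
Conceptually, the two-tier lookup can be sketched as follows, reusing the lookup() helper from the earlier similarity sketch; this illustrates the priority order, not the library's internal code.

def two_tier_lookup(query_vector, golden_entries, execution_entries, threshold=0.95):
    # Priority 1: curated Golden Dataset entries are always checked first
    golden_result = lookup(query_vector, golden_entries, threshold)
    if golden_result is not None:
        return golden_result

    # Priority 2: fall back to past successful execution logs
    # (a None result here means a full cache miss: the function executes normally)
    return lookup(query_vector, execution_entries, threshold)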

How Cache Lookup Works

Input: "How do I debug Python?"
         │
         ▼
   Embedding Vector
   [0.12, -0.45, 0.78, ...]
         │
         ▼
   Search Golden Dataset (Priority 1)
         │
    ┌────┴────┐
   Miss      Hit → Return Golden Result
    │
    ▼
   Search Standard Executions (Priority 2)
         │
         ▼
   Cosine Similarity Check
   ┌──────────────────────────┐
   │ Cached: "Fix Python bug" │
   │ Similarity: 0.97         │
   │ Threshold: 0.95          │
   │ Result: HIT ✓            │
   └──────────────────────────┘
         │
         ▼
   Return Cached Result
   (0.02s vs 2.5s)

Performance Impact

Metric          Without Caching      With Caching (Hit)
Latency         ~2.5s (LLM API)      ~0.02s
Cost per call   ~$0.03               $0.00
Token usage     Full                 Zero

For applications with repetitive queries (customer support, FAQ bots, search), caching typically achieves a 60-90% hit rate, reducing LLM costs proportionally.
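
As a rough illustration using the per-call figures above (the volume, price, and hit rate here are assumptions; your numbers will differ):

calls_per_day = 10_000        # assumed daily query volume
cost_per_llm_call = 0.03      # ~$0.03 per uncached call, from the table above
hit_rate = 0.75               # within the typical 60-90% range for repetitive workloads

uncached_cost = calls_per_day * cost_per_llm_call
cached_cost = calls_per_day * (1 - hit_rate) * cost_per_llm_call

print(f"Without caching: ${uncached_cost:,.2f}/day")  # $300.00/day
print(f"With caching:    ${cached_cost:,.2f}/day")    # $75.00/day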

Monitoring Cache Performance

Check cache hit rates via VectorWave's execution logs:

from vectorwave import search_executions

# Find all cached executions for a function
cached = search_executions(
    limit=100,
    filters={"function_name": "expensive_llm_task", "cache_hit": True},
    sort_by="timestamp_utc",
)

print(f"Cache hits: {len(cached)}")

VectorSurfer: Real-time cache hit rates, performance metrics, and execution timelines are visualized in the VectorSurfer dashboard — monitor caching performance at a glance.

External Resources

  • OpenAI API Documentation — Embedding API used for semantic similarity

Next Steps

  • Golden Dataset — Curate high-quality data for cache priority
  • Self-Healing — Automatic error diagnosis
  • Drift Detection — Monitor input quality