# Replay Testing

Regression testing by replaying past successful executions.
## Why Replay Testing?

A traditional `assert a == b` check fails for generative AI: the same prompt can produce different valid outputs. VectorWave's replay testing solves this by comparing new outputs against known-good executions, using both exact match and semantic similarity.
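To make the contrast concrete, here is a toy comparison sketch (not VectorWave's internals), assuming the `sentence-transformers` package and its `all-MiniLM-L6-v2` model are available:

```python
from sentence_transformers import SentenceTransformer, util

# Two valid answers to the same prompt, worded differently.
a = "Your order will arrive within 3 business days."
b = "Expect delivery in about three working days."

# Exact match rejects the pair outright...
print(a == b)  # False

# ...while embedding similarity recognizes them as near-equivalent.
model = SentenceTransformer("all-MiniLM-L6-v2")
score = util.cos_sim(model.encode(a), model.encode(b)).item()
print(f"similarity: {score:.2f}")  # typically high for paraphrases like these
```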
## Enabling Replay

Add `replay=True` to capture inputs and outputs for future replay:
```python
@vectorize(
    replay=True,
    capture_return_value=True,
    capture_inputs=True,
    auto=True,
)
async def generate_response(query: str):
    return await llm.complete(query)
```
Every successful execution is now stored as a potential test case.
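For example, a few ordinary calls to the decorated function above are enough to start building a replay corpus (the queries below are hypothetical):

```python
import asyncio

async def main():
    # Each successful call is captured by @vectorize(replay=True) and
    # becomes a candidate test case for later replay.
    await generate_response("Summarize our refund policy in one sentence.")
    await generate_response("What languages does the support team cover?")

asyncio.run(main())
```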
> **VectorSurfer:** Replay testing is available through the VectorSurfer dashboard: select functions, run replays with real-time progress, and compare expected vs. actual outputs side by side.
## Running Replay Tests

### Basic Replay
```python
from vectorwave import VectorWaveReplayer

replayer = VectorWaveReplayer()

# Replay the last 20 executions of a function (Golden Data prioritized)
results = replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
)

print(f"Passed: {results['passed']}")
print(f"Failed: {results['failed']}")
# → { function: "app.generate_response", total: 20, passed: 18, failed: 2, updated: 0 }
```
The replayer automatically prioritizes Golden Dataset entries, then falls back to standard execution logs.
### Semantic Comparison

For AI/generative functions where outputs may vary, use `SemanticReplayer`:
```python
from vectorwave import SemanticReplayer

semantic_replayer = SemanticReplayer()

results = semantic_replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
    similarity_threshold=0.85,  # vector similarity threshold
    semantic_eval=True,         # LLM-based semantic evaluation
)
```
**Best for:** generative AI, NLP outputs, summarization.

For deterministic functions (math, data transformations), use the basic `VectorWaveReplayer`, which does exact-match comparison.
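For instance, an exact-match replay over a deterministic helper might look like the sketch below (`app.normalize_invoice` is a hypothetical placeholder name):

```python
from vectorwave import VectorWaveReplayer

replayer = VectorWaveReplayer()

# Exact-match replay: any output that differs from the stored result is a failure.
results = replayer.replay(
    function_full_name="app.normalize_invoice",  # hypothetical deterministic function
    limit=50,
)

assert results["failed"] == 0
```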
## Updating Baselines

When your function legitimately changes behavior (new model, updated prompts), update the Golden Dataset:
```python
results = replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
    update_baseline=True,  # update the Golden Dataset with the new outputs
)
```
> **Warning:** Only use `update_baseline=True` when you've verified the new outputs are correct. This overwrites existing Golden entries.
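One way to keep that safe is to gate the update behind an explicit opt-in; the environment variable below is a project convention, not a VectorWave feature:

```python
import os

from vectorwave import VectorWaveReplayer

replayer = VectorWaveReplayer()

# Only rewrite the Golden Dataset when explicitly requested, e.g. after manual review.
if os.getenv("VW_UPDATE_BASELINE") == "1":  # hypothetical opt-in flag
    results = replayer.replay(
        function_full_name="app.generate_response",
        limit=20,
        update_baseline=True,
    )
    print(f"Golden entries updated: {results['updated']}")
```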
## CLI Testing with VectorCheck

For CI/CD integration, use the VectorCheck CLI:
```bash
# Install
pip install vectorcheck

# Run all replay tests
vw test --target all

# Semantic comparison mode
vw test --target all --semantic --threshold 0.85

# Test a specific function
vw test --target app.generate_response --semantic

# Export test data for offline analysis
vw export --target app.generate_response --output data.jsonl
```
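The exported JSONL can then be inspected offline; the exact fields depend on your capture settings, so the sketch below only lists whatever keys are present rather than assuming a schema:

```python
import json

# Load the export produced by `vw export ... --output data.jsonl`.
with open("data.jsonl", encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh if line.strip()]

print(f"exported records: {len(records)}")
if records:
    print("fields:", sorted(records[0].keys()))
```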
## CI/CD Integration

### GitHub Actions
```yaml
name: VectorWave Regression Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      weaviate:
        image: semitechnologies/weaviate:1.26.1
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install vectorwave vectorcheck
      - run: vw test --target all --semantic --threshold 0.85
```
## Test Strategies

### Smoke Test

Quick check that functions still work:
```python
from vectorwave import SemanticReplayer

semantic_replayer = SemanticReplayer()

results = semantic_replayer.replay(
    function_full_name="app.generate_response",
    limit=5,
    similarity_threshold=0.70,  # low threshold, just check the output is reasonable
)

assert results["failed"] == 0
```
### Full Regression

Thorough test with a strict threshold:
```python
results = semantic_replayer.replay(
    function_full_name="app.generate_response",
    limit=100,
    similarity_threshold=0.90,  # strict threshold
    semantic_eval=True,
)

assert results["failed"] / results["total"] < 0.05  # max 5% failure rate
```
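In practice, both strategies can live side by side as ordinary pytest tests; the wrapper below is a sketch, not a VectorWave-provided fixture:

```python
import pytest

from vectorwave import SemanticReplayer

@pytest.fixture(scope="module")
def semantic_replayer():
    return SemanticReplayer()

def test_smoke(semantic_replayer):
    # Fast, loose check: the last few outputs are still in the right ballpark.
    results = semantic_replayer.replay(
        function_full_name="app.generate_response",
        limit=5,
        similarity_threshold=0.70,
    )
    assert results["failed"] == 0

def test_full_regression(semantic_replayer):
    # Slow, strict check over a larger sample, tolerating at most 5% drift.
    results = semantic_replayer.replay(
        function_full_name="app.generate_response",
        limit=100,
        similarity_threshold=0.90,
        semantic_eval=True,
    )
    assert results["failed"] / results["total"] < 0.05
```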
## Next Steps
- Golden Dataset — Manage test baselines
- RAG Search — Search your codebase with AI
- Advanced Configuration — Custom properties and tagging