VectorWave

Replay Testing

Regression testing by replaying past successful executions.

Why Replay Testing?

Traditional assertion-based testing (assert a == b) fails for generative AI: the same prompt can produce different valid outputs. VectorWave's replay testing solves this by comparing new outputs against known-good executions using both exact match and semantic similarity.
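
As a quick illustration of the problem (the strings below are made up), two equally correct completions of the same prompt can still fail a strict equality check:

# Two valid answers to the same prompt: both correct, but not identical.
expected = "Paris is the capital of France."
actual = "The capital of France is Paris."

print(actual == expected)  # False, even though the answer is right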

Enabling Replay

Add replay=True to capture inputs and outputs for future replay:

from vectorwave import vectorize  # import path assumed, following the other examples in these docs

@vectorize(
    replay=True,
    capture_return_value=True,
    capture_inputs=True,
    auto=True,
)
async def generate_response(query: str):
    # `llm` stands in for your own LLM client
    return await llm.complete(query)

Every successful execution is now stored as a potential test case.
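
A minimal sketch of seeding replay data, assuming the decorated generate_response from above; the queries are only illustrative:

import asyncio

async def seed_replay_data():
    # Each successful call is recorded as a potential replay test case.
    for query in ["Summarize our refund policy", "Translate 'good morning' into French"]:
        await generate_response(query)

asyncio.run(seed_replay_data())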

VectorSurfer: Replay testing is available through the VectorSurfer dashboard — select functions, run replays with real-time progress, and compare expected vs actual outputs side-by-side.

Running Replay Tests

Basic Replay

from vectorwave import VectorWaveReplayer

replayer = VectorWaveReplayer()

# Replay last 20 executions of a function (Golden Data prioritized)
results = replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
)

print(f"Passed: {results['passed']}")
print(f"Failed: {results['failed']}")
# → {'function': 'app.generate_response', 'total': 20, 'passed': 18, 'failed': 2, 'updated': 0}

The replayer automatically prioritizes Golden Dataset entries, then falls back to standard execution logs.
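
Because the result is a plain dictionary of counts, you can gate a script on it directly; a small sketch that uses only the keys shown in the example output above:

pass_rate = results["passed"] / results["total"] if results["total"] else 1.0
print(f"Pass rate: {pass_rate:.0%}")

if results["failed"] > 0:
    raise SystemExit(f"{results['failed']} replayed execution(s) no longer match their baseline")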

Semantic Comparison

For AI/generative functions where outputs may vary, use SemanticReplayer:

from vectorwave import SemanticReplayer

semantic_replayer = SemanticReplayer()

results = semantic_replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
    similarity_threshold=0.85,  # Vector similarity threshold
    semantic_eval=True,          # LLM-based semantic evaluation
)

Best for: generative AI, NLP outputs, summarization.

For deterministic functions (math, data transformations), use the basic VectorWaveReplayer, which performs exact-match comparison.
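
For example, a deterministic helper can be checked with the exact-match replayer from the Basic Replay section; the function name here is hypothetical:

from vectorwave import VectorWaveReplayer

replayer = VectorWaveReplayer()

# Exact match is appropriate: the same inputs must always produce the same output.
results = replayer.replay(
    function_full_name="app.calculate_invoice_total",  # hypothetical deterministic function
    limit=50,
)
assert results["failed"] == 0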

Updating Baselines

When your function legitimately changes behavior (new model, updated prompts), update the Golden Dataset:

results = replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
    update_baseline=True,  # Update Golden Dataset with new outputs
)

Warning: Only use update_baseline=True when you've verified the new outputs are correct. This overwrites existing Golden entries.
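One way to stay safe is to run a read-only replay first and only re-run with update_baseline=True once you have reviewed the new outputs; a sketch of that two-step flow, reusing the replayer from above:

# Step 1: dry run with no baseline changes to see how much behaviour has diverged.
preview = replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
)
print(f"{preview['failed']} of {preview['total']} executions differ from the current baseline")

# Step 2: after manually reviewing the new outputs, promote them to the Golden Dataset.
confirmed = replayer.replay(
    function_full_name="app.generate_response",
    limit=20,
    update_baseline=True,
)
print(f"Updated baselines: {confirmed['updated']}")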

CLI Testing with VectorCheck

For CI/CD integration, use the VectorCheck CLI:

# Install
pip install vectorcheck

# Run all replay tests
vw test --target all

# Semantic comparison mode
vw test --target all --semantic --threshold 0.85

# Test a specific function
vw test --target app.generate_response --semantic

# Export test data for offline analysis
vw export --target app.generate_response --output data.jsonl
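
If you prefer to keep everything in your Python test suite, the CLI can also be wrapped in a regular test. This sketch assumes that vw test exits with a non-zero status code when any replay fails; confirm that behaviour for your version:

import subprocess

def test_replay_regression():
    # Run the semantic replay suite through the VectorCheck CLI.
    result = subprocess.run(
        ["vw", "test", "--target", "all", "--semantic", "--threshold", "0.85"],
        capture_output=True,
        text=True,
    )
    # Assumption: a non-zero exit code signals at least one failed replay.
    assert result.returncode == 0, result.stdout + result.stderr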

CI/CD Integration

GitHub Actions

name: VectorWave Regression Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      weaviate:
        image: semitechnologies/weaviate:1.26.1
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install vectorwave vectorcheck
      - run: vw test --target all --semantic --threshold 0.85

Test Strategies

Smoke Test

Quick check that functions still work:

from vectorwave import SemanticReplayer

semantic_replayer = SemanticReplayer()
results = semantic_replayer.replay(
    function_full_name="app.generate_response",
    limit=5,
    similarity_threshold=0.70,  # Low threshold, just check it's reasonable
)
assert results["failed"] == 0

Full Regression

Thorough test with strict threshold:

results = semantic_replayer.replay(
    function_full_name="app.generate_response",
    limit=100,
    similarity_threshold=0.90,  # Strict threshold
    semantic_eval=True,
)
assert results["failed"] / results["total"] < 0.05  # Require a failure rate below 5%
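
To cover several functions in one run, the same check can be parametrized with pytest; the extra function name below is a placeholder:

import pytest
from vectorwave import SemanticReplayer

@pytest.mark.parametrize("function_name", [
    "app.generate_response",
    "app.summarize_document",  # placeholder: replace with your own functions
])
def test_full_regression(function_name):
    replayer = SemanticReplayer()
    results = replayer.replay(
        function_full_name=function_name,
        limit=100,
        similarity_threshold=0.90,
        semantic_eval=True,
    )
    assert results["failed"] / results["total"] < 0.05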

Next Steps

  • Golden Dataset — Manage test baselines
  • RAG Search — Search your codebase with AI
  • Advanced Configuration — Custom properties and tagging