# VectorCheck

CLI regression testing framework for AI/LLM applications.

## What is VectorCheck?

VectorCheck is a CLI regression testing framework designed for AI/LLM applications. Traditional `assert a == b` testing fails for generative AI: the same prompt can produce different valid outputs. VectorCheck solves this with vector similarity and LLM judge evaluation.

```bash
pip install vectorcheck
```

## Why VectorCheck?
| Approach | How It Works | Notes |
|---|---|---|
| `assert a == b` | Exact string match | Fails for AI outputs: same meaning, different words |
| VectorCheck Exact | Character-by-character comparison | For deterministic functions only |
| VectorCheck Semantic | Embedding cosine similarity | Handles paraphrasing and variation |
| VectorCheck LLM Judge | GPT-4 evaluates equivalence | Most flexible; handles complex outputs |
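
To make the failure mode concrete, here is a minimal plain-Python illustration (no VectorCheck APIs involved): two answers with identical meaning that an exact comparison rejects.

```python
# Two semantically equivalent answers to the same prompt.
a = "The capital of France is Paris."
b = "Paris is France's capital city."

print(a == b)  # False: exact comparison fails despite identical meaning
```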
## CLI Commands

### Run Tests

```bash
# Test all tracked functions
vw test --target all

# Test a specific function
vw test --target app.generate_response

# Semantic comparison mode
vw test --target all --semantic --threshold 0.85

# LLM judge mode
vw test --target all --judge --model gpt-4-turbo
```
### Export Data

```bash
# Export execution logs to JSONL
vw export --target app.generate_response --output data.jsonl

# Export with filters
vw export --target all --status success --output successes.jsonl

# Export Golden Dataset only
vw export --target all --golden --output golden.jsonl
```
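
The exported JSONL is plain line-delimited JSON, so it is easy to feed into downstream tooling. Here is a small sketch of consuming an export; the `input` and `output` field names are an assumption for illustration, so inspect a line of your export to confirm the actual schema:

```python
import json

# Read the exported execution log (JSONL: one JSON object per line).
with open("data.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

# "input" and "output" are assumed field names; adjust to the real schema.
for rec in records[:5]:
    print(rec.get("input"), "->", rec.get("output"))
```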
### Inspect Functions

```bash
# List all tracked functions
vw list

# Show details for a specific function
vw inspect app.generate_response

# Show recent executions
vw history app.generate_response --limit 20
```
## Testing Modes

### Exact Match

Compares outputs character by character. Best for deterministic functions.

```bash
vw test --target app.calculate_total --exact
```

Pass criteria: outputs must be identical.
### Semantic Comparison

Compares outputs using embedding cosine similarity. Best for AI/NLP outputs.

```bash
vw test --target app.generate_response --semantic --threshold 0.85
```
| Threshold | Strictness | Use Case |
|---|---|---|
| 0.95 | Very strict | Factual Q&A, summaries |
| 0.85 | Recommended | General LLM outputs |
| 0.75 | Lenient | Creative writing, open-ended |
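
To build intuition for these thresholds, the sketch below reproduces the core check by hand: embed both outputs, compare them with cosine similarity, and pass when the score meets or exceeds the threshold. The embedding model shown (`sentence-transformers`) is an illustrative assumption; VectorCheck's actual embedding backend may differ, so treat the scores as indicative rather than identical to `vw test` results.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # illustrative model choice

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

baseline = model.encode("The capital of France is Paris.")
candidate = model.encode("Paris is France's capital city.")

score = cosine_similarity(baseline, candidate)
print(f"similarity={score:.3f}, passes at 0.85: {score >= 0.85}")
```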
### LLM Judge

Uses GPT-4 to evaluate whether two outputs are semantically equivalent.

```bash
vw test --target app.generate_response --judge --model gpt-4-turbo
```
The LLM judge considers:
- Semantic meaning
- Factual accuracy
- Completeness
- Tone and style (configurable)
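
As a rough illustration of the judge pattern (not VectorCheck's internal prompt, which is not documented here), an LLM-as-judge check can be sketched with the OpenAI client as follows; the prompt wording and the strict EQUIVALENT/DIFFERENT reply protocol are assumptions for the example:

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge_equivalent(baseline: str, candidate: str, model: str = "gpt-4-turbo") -> bool:
    """Ask an LLM whether two outputs are semantically equivalent."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Do these two answers convey the same meaning, facts, and "
                "completeness? Reply with exactly EQUIVALENT or DIFFERENT.\n\n"
                f"Answer A: {baseline}\n\nAnswer B: {candidate}"
            ),
        }],
    )
    return response.choices[0].message.content.strip() == "EQUIVALENT"
```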
## CI/CD Integration

### GitHub Actions

```yaml
name: AI Regression Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      weaviate:
        image: semitechnologies/weaviate:1.26.1
        ports:
          - 8080:8080
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install vectorwave vectorcheck
      - run: vw test --target all --semantic --threshold 0.85
```
### Exit Codes

| Code | Meaning |
|---|---|
| 0 | All tests passed |
| 1 | One or more tests failed |
| 2 | Configuration error |
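
These codes make `vw test` straightforward to script around. A small sketch driving the CLI from Python with `subprocess`, branching on the codes in the table above:

```python
import subprocess
import sys

result = subprocess.run(["vw", "test", "--target", "all"])

if result.returncode == 0:
    print("All tests passed")
elif result.returncode == 1:
    print("One or more tests failed")
    sys.exit(1)
else:  # 2: configuration error
    print("Configuration error: check vectorcheck.yml")
    sys.exit(2)
```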
## Configuration File

Create `vectorcheck.yml` in your project root for persistent configuration:

```yaml
# vectorcheck.yml
default_mode: semantic
default_threshold: 0.85

targets:
  - name: app.generate_response
    mode: semantic
    threshold: 0.90
  - name: app.calculate_total
    mode: exact
  - name: app.creative_writer
    mode: judge
    model: gpt-4-turbo
```

Then simply run:

```bash
vw test  # Uses configuration from vectorcheck.yml
```
## Relationship to VectorWave

VectorCheck reads from the same Weaviate instance as VectorWave:

```
Your App + @vectorize → Weaviate ← VectorCheck CLI
```

The `@vectorize(replay=True)` decorator stores inputs and outputs that VectorCheck uses as test cases. The Golden Dataset provides verified baselines.
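
For context, instrumenting a function on the application side typically looks like the sketch below. The import path (`from vectorwave import vectorize`) is an assumption; check the VectorWave docs for the exact module layout:

```python
from vectorwave import vectorize  # import path assumed; see VectorWave docs

def call_llm(prompt: str) -> str:
    """Placeholder for your actual model call."""
    return f"Response to: {prompt}"

@vectorize(replay=True)  # records inputs and outputs to Weaviate for replay
def generate_response(prompt: str) -> str:
    # The decorator captures the prompt and the returned text; VectorCheck
    # later replays these recorded executions as regression tests.
    return call_llm(prompt)
```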
## Next Steps
- VectorWave Replay Testing — Programmatic replay API
- VectorSurfer Replay UI — Visual testing interface
- Contributing — How to contribute