Reproducibility commitment
We publish the methodology, the code, and the failures.
AI products earn trust by showing the work, not by quoting a number. Here is how we operate, what we publish, and what we refuse to claim.
We publish the runner code
Every number on the landing page comes from a script we ship with the SDK. Same script, your API key, your numbers. No proprietary harness.
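One way to make "same script" checkable is to record a fingerprint of the runner next to every number it produces. The sketch below is an illustration under assumptions, not the SDK's actual runner: the function names, the record fields, and the `EXAMPLE_API_KEY` environment variable are all hypothetical.

```python
import hashlib
import os


def script_fingerprint(path: str) -> str:
    """Return the SHA-256 hex digest of the runner script at `path`."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def build_result_record(script_path: str, metric: str, value: float) -> dict:
    """Bundle a measured value with the fingerprint of the script that produced it.

    The field names here are hypothetical, chosen only for illustration.
    """
    return {
        "metric": metric,
        "value": value,
        "runner_sha256": script_fingerprint(script_path),
    }


def load_api_key(env_var: str = "EXAMPLE_API_KEY") -> str:
    """Read the caller's own API key from the environment.

    The variable name is an assumption; a real SDK will document its own.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"set {env_var} before running the benchmark")
    return key
```

If two people report different numbers, comparing `runner_sha256` shows at a glance whether they ran the same script.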
We publish the raw judge logs
When a measurement uses an LLM as judge, the full transcripts and verdicts ship with the result. You can audit every call, and the pass/fail verdict for each question is visible.
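Auditing a judge log can be as simple as recomputing the headline pass rate from the per-question verdicts. The sketch below assumes a hypothetical JSON Lines format in which each line carries a `question_id` and a `verdict` of `"pass"` or `"fail"`; the published logs may use a different layout.

```python
import json
from typing import Iterable


def audit_judge_log(lines: Iterable[str]) -> dict:
    """Recompute the pass rate from per-question judge verdicts.

    Assumes one JSON object per line with hypothetical fields
    `question_id` and `verdict` ("pass" or "fail").
    """
    passed, failed = [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the log
        record = json.loads(raw)
        verdict = record["verdict"]
        if verdict == "pass":
            passed.append(record["question_id"])
        elif verdict == "fail":
            failed.append(record["question_id"])
        else:
            # Refuse to guess: an unknown verdict should not count either way.
            raise ValueError(f"unexpected verdict {verdict!r}")
    total = len(passed) + len(failed)
    return {
        "total": total,
        "pass_rate": len(passed) / total if total else None,
        "failed_ids": failed,
    }
```

If the recomputed `pass_rate` does not match the published number, the list in `failed_ids` is the place to start reading transcripts.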
We publish what failed
When a measurement contradicts an earlier claim we made, the contradiction goes up next to the favorable number, not in a footnote. Failure transparency over hero numbers.
We don't claim what we haven't run
If we haven't measured it, we won't quote it, and we'd rather owe you the number than ship a partial one. When we run it, the raw output goes out the same day.
What to be suspicious of
Five red flags in agent-memory benchmarks. Apply them to us too.
- 100% accuracy on a public benchmark, which usually means the eval was bypassed
- Perfect-score claim shipped without runner code in the repo
- Marketing pages with metrics absent from the project's own BENCHMARKS.md
- Headline metric that secretly measures the underlying vector store, not the system
- Per-question fix patches counted as 'architectural improvements'
If you find one of these in our repo, open a GitHub issue. We update or retract within 48 hours.
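Some of these flags can be checked mechanically. As one example, the sketch below tests the third flag: given the metric names quoted on a marketing page and the text of a project's BENCHMARKS.md, it lists the names that never appear in the file. The function name and the case-insensitive substring match are assumptions made for illustration; a real check would likely need to handle renamed or reformatted metrics.

```python
from typing import Iterable, List


def metrics_missing_from_benchmarks(
    quoted_metrics: Iterable[str], benchmarks_text: str
) -> List[str]:
    """Return quoted metric names that do not appear in the benchmarks file text.

    Matching is a plain case-insensitive substring test, which is an
    assumption of this sketch, not a rule from any project.
    """
    haystack = benchmarks_text.casefold()
    return [m for m in quoted_metrics if m.casefold() not in haystack]
```

A non-empty result does not prove anything on its own, but each name in it is a number worth asking about.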