Reproducibility commitment

We publish the methodology,
the code, and the failures.

AI products earn trust by showing the work, not by quoting a number. Here is how we operate, what we publish, and what we refuse to claim.

We publish the runner code

Every number on the landing page comes from a script we ship with the SDK. Same script, your API key, your numbers. No proprietary harness.

We publish the raw judge logs

When a measurement uses an LLM as judge, the full transcripts and verdicts ship with the result. You can audit every call. Per-question pass/fail visible.

We publish what failed

When a measurement contradicts an earlier claim we made, the contradiction goes up next to the favorable number, not in a footnote. Failure transparency over hero numbers.

We don't claim what we haven't run

If we haven't measured it, we won't quote it, and we'd rather owe you the number than ship a partial one. When we run it, the raw output goes out same day.

What to be suspicious of

Five red flags in agent-memory benchmarks. Apply them to us too.

100% accuracy on a public benchmark, usually means the eval was bypassed
Perfect-score claim shipped without runner code in repo
Marketing pages with metrics absent from the project's own BENCHMARKS.md
Headline metric that secretly measures the underlying vector store, not the system
Per-question fix patches counted as 'architectural improvements'

If you find one of these in our repo, open a GitHub issue. We update or retract within 48 hours.

Get the partner →Back to landing