Reproducibility commitment

We publish the methodology, the code, and the failures.

AI products earn trust by showing the work, not by quoting a number. Here is how we operate, what we publish, and what we refuse to claim.

We publish the runner code

Every number on the landing page comes from a script we ship with the SDK. Same script, your API key, your numbers. No proprietary harness.

We publish the raw judge logs

When a measurement uses an LLM as judge, the full transcripts and verdicts ship with the result. You can audit every call, down to the per-question pass/fail verdict.
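
To illustrate what auditing a judge log could look like, here is a minimal sketch that recomputes a pass rate from per-question verdicts. The log layout (one JSON record per line) and the field names question_id and verdict are assumptions made for this example, not a description of the format we ship.

```python
import json
from collections import Counter


def summarize_judge_log(lines):
    """Recompute the pass rate from per-question judge verdicts.

    `lines` is an iterable of JSON strings, one record per question.
    The field names `question_id` and `verdict` are hypothetical.
    """
    verdicts = Counter()
    failed = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the log
        record = json.loads(raw)
        verdict = record["verdict"]
        verdicts[verdict] += 1
        if verdict != "pass":
            failed.append(record["question_id"])
    total = sum(verdicts.values())
    pass_rate = verdicts["pass"] / total if total else 0.0
    return {"total": total, "pass_rate": pass_rate, "failed": failed}


# Illustrative records only, not real results.
sample_log = [
    '{"question_id": "q1", "verdict": "pass"}',
    '{"question_id": "q2", "verdict": "fail"}',
    '{"question_id": "q3", "verdict": "pass"}',
    '{"question_id": "q4", "verdict": "pass"}',
]
summary = summarize_judge_log(sample_log)
print(summary)
```

If the recomputed pass rate does not match the published number, the list of failed question IDs shows where to start reading transcripts.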

We publish what failed

When a measurement contradicts an earlier claim we made, the contradiction goes up next to the favorable number, not in a footnote. Failure transparency over hero numbers.

We don't claim what we haven't run

If we haven't measured it, we won't quote it, and we'd rather owe you the number than ship a partial one. When we run it, the raw output goes out the same day.

What to be suspicious of

Five red flags in agent-memory benchmarks. Apply them to us too.

  • 100% accuracy on a public benchmark: this usually means the eval was bypassed
  • Perfect-score claim shipped without runner code in the repo
  • Marketing pages with metrics absent from the project's own BENCHMARKS.md
  • Headline metric that secretly measures the underlying vector store, not the system
  • Per-question fix patches counted as 'architectural improvements'

If you find one of these in our repo, open a GitHub issue. We update or retract within 48 hours.
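
One of the red flags above, marketing metrics missing from BENCHMARKS.md, can be checked roughly by machine. The sketch below compares the percentage figures quoted in a marketing page against those in a benchmarks file. The function name, the sample strings, and the choice to match only percentages are assumptions for illustration; a real check would also need to cover latencies, counts, and other units.

```python
import re

# Matches figures such as "94.2%" or "100%".
PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def unbacked_metrics(marketing_text, benchmarks_text):
    """Return percentages quoted in marketing copy but absent from the benchmarks file."""
    claimed = set(PERCENT.findall(marketing_text))
    documented = set(PERCENT.findall(benchmarks_text))
    return sorted(claimed - documented)


# Hypothetical inputs, not taken from any real project.
marketing = "Recall of 94.2% with 100% accuracy on the public set."
benchmarks = "| recall | 94.2% |\n| latency p50 | 120 ms |"

print(unbacked_metrics(marketing, benchmarks))
```

A non-empty result is a prompt for a closer look, not proof of a problem: the figure may be documented elsewhere or rounded differently.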