We're building the H4 for AI workflows.
In the 18th century, navigation was a real problem. Everyone had theories. No one had truth. Until Harrison built H4 — a clock that didn't drift. That changed everything.
Two stories. Same idea.
The clock that didn't drift (1759)
Through most of the 1700s, ships couldn't reliably tell where they were at sea. Latitude was easy. Longitude was the killer — it required a clock that survived rolling, temperature, and salt air. None did. Empires lost fleets.
John Harrison, a self-taught carpenter, spent forty years building four clocks. H4 was the one that worked — accurate to within seconds across an Atlantic crossing. Suddenly captains had truth instead of theories. Navigation stopped being a debate.
The street that wasn't there (1930s)
Mapmakers had a different but related problem: copycats. Their answer was the trap street — a tiny, plausible-looking street drawn into the map that didn't actually exist on the ground. If a competitor's map showed your trap street, the copy was exposed. The fake feature is the proof.
The most famous example is Agloe, New York — a paper town invented in 1930. Decades later a general store opened on the exact spot and named itself after the fake town on the map. The fiction had become real.
Why this matters for AI
Today's AI workflow market has both problems at once. Drift — vendors quote benchmarks that don't survive a real task. Fabrication — AI tools confidently invent work history, cite papers that don't exist, hallucinate API endpoints.
Trap Street is the H4 for AI workflows, with cartographer's traps built in. We plant verifiable truths inside real tasks. Workflows that finish the job pass through cleanly. Workflows that fabricate or copy trip the trap.
What we do, in four lines
We don't fine-tune models.We test claims.We run real tasks.We expose what works, what fails, and what lies.
What we will not do
- We will not run a vibes leaderboard.
- We will not let vendors self-grade without audit.
- We will not lock the eval logic behind a paywall — the harness is open source, the public task pool is open data, the audit methodology is public.
- We will not pretend a held-out test set stays held-out forever; we rotate.
- We will not score what we cannot reproduce.
The whole point is that someone, somewhere, can clone our repo and verify any score on the leaderboard. That is the H4 standard. That is the trap street test.