Trapstreet.run

Reference

The four design docs that decide how Trap Street works under the hood. Most days you won't need these. They're here for when you do.

Scoring and metrics

How a task gets scored and what the leaderboard shows. The short version: your grader.py prints a JSON object with {passed, score, ...}, and the server picks up well-known keys (cost_usd_total, latency_ms_total, by_category, …) and renders them as columns. About 90% of tasks need no configuration. The doc covers the full key list, the wire format the CLI uploads, and the opt-in dashboard: block for tasks that need custom columns.
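As a rough illustration of the short version above, here is a minimal sketch of what a grader.py might look like. Only the key names (passed, score, cost_usd_total, latency_ms_total, by_category) come from this page. The value shapes, the per-case input fields (category, ok, cost_usd, latency_ms), the build_result helper, and the "every case must pass" rule are assumptions made for the example; the full spec is the authority on the real key list and formats.

```python
import json


def build_result(case_results):
    """Aggregate per-case outcomes into one result object.

    The input shape (category, ok, cost_usd, latency_ms) is hypothetical;
    only the output key names are taken from the docs.
    """
    total = len(case_results)
    n_passed = sum(1 for c in case_results if c["ok"])

    # Assumed shape for by_category: pass counts per category.
    by_category = {}
    for c in case_results:
        bucket = by_category.setdefault(c["category"], {"passed": 0, "total": 0})
        bucket["total"] += 1
        bucket["passed"] += int(c["ok"])

    return {
        # The grader decides what "passed" means; here it is "every case passed".
        "passed": total > 0 and n_passed == total,
        "score": n_passed / total if total else 0.0,
        "cost_usd_total": round(sum(c["cost_usd"] for c in case_results), 6),
        "latency_ms_total": sum(c["latency_ms"] for c in case_results),
        "by_category": by_category,
    }


if __name__ == "__main__":
    # Made-up sample cases, standing in for real evaluation output.
    sample_cases = [
        {"category": "parsing", "ok": True, "cost_usd": 0.01, "latency_ms": 120},
        {"category": "parsing", "ok": False, "cost_usd": 0.02, "latency_ms": 340},
        {"category": "routing", "ok": True, "cost_usd": 0.01, "latency_ms": 95},
    ]
    # Print a single JSON object on stdout.
    print(json.dumps(build_result(sample_cases)))
```

Note that passed is set by the grader's own rule, not by the process exit code, so a stricter or looser rule can be swapped in without touching the rest of the output.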

Full spec on GitHub →

Trust tiers

Two tiers, one axis: who runs the eval. Self-reported (free, default today): you run on your machine, and we record what you upload. Verified (paid, post-MVP): we run in a sandbox with held-out inputs and an LLM-API proxy, so the numbers are ground truth rather than self-report. The doc explains the economics (~50× cost reduction vs. running every eval ourselves) and the cheating mitigations.

Full spec on GitHub →

Glossary

Every term in trapstreet, defined once: solution, task, run, case, metric, judge, grader, leaderboard. Two pages. Useful when a term in the UI doesn't mean what you'd guess, especially passed, which is whatever the grader decides rather than anything based on the exit code.

Full spec on GitHub →

API v0

Every HTTP endpoint, every request/response shape, and the status state machine. The CLI talks to this; if you build a custom uploader or a CI integration, this is the contract. Stable: breaking changes get a v1.

Full spec on GitHub →
