Reference

Config fields, CLI commands, the manifest contract, and the metrics the platform recognises — for looking up, not reading through.

For the exhaustive configs with every field annotated, read the two files in the CLI repo: `trap.yaml` and `traptask.yaml`.

CLI commands

Command	Does
`tp auth login`	one-time browser OAuth → stores your CLI token (`tp auth status` / `logout` to inspect / clear)
`tp run`	run every case locally, judge, write `.trap/<task>/<ts>/report.json`
`tp run --no-cost`	same, with cost tracking off
`tp report`	re-render a stored run without re-executing it
`tp submit`	upload the latest report (provenance locates the task)
`tp submit --run <ts>`	upload a specific stored run (`-w` picks the workspace)
`--allow-unanchored`	on run/submit: skip the confirmation for a checkout with no git provenance (dirty / no remote — lands as a local try-out, never ranks). Env: `TRAP_ALLOW_UNANCHORED=1`

Point the CLI at another server with `TRAPSTREET_URL=https://…`; in CI, skip the browser with `TRAPSTREET_API_KEY` (or `tp auth login --with-token`).
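For a CI job, the commands and environment variables above can be wired together in a small driver. This is a minimal sketch, not part of the CLI: it assumes `tp` is on `PATH`, that the API key comes from a CI secret, and that `tp` exits non-zero on failure. The helper names `ci_env`, `ci_commands` and `run_ci` are hypothetical.

```python
import os
import subprocess


def ci_env(api_key, server_url=None, allow_unanchored=False):
    """Build the environment for a non-interactive `tp` invocation."""
    env = dict(os.environ)
    env["TRAPSTREET_API_KEY"] = api_key  # skips the browser login
    if server_url:
        env["TRAPSTREET_URL"] = server_url  # point the CLI at another server
    if allow_unanchored:
        env["TRAP_ALLOW_UNANCHORED"] = "1"  # env form of --allow-unanchored
    return env


def ci_commands(track_cost=True):
    """Return the documented commands in the order a CI job would run them."""
    run = ["tp", "run"] if track_cost else ["tp", "run", "--no-cost"]
    return [run, ["tp", "submit"]]


def run_ci(api_key, **kwargs):
    """Run, then submit; stops at the first command that fails.

    Assumption: `tp` signals failure with a non-zero exit code.
    """
    env = ci_env(api_key, **kwargs)
    for argv in ci_commands():
        subprocess.run(argv, env=env, check=True)
```

A job would then call something like `run_ci(os.environ["TRAPSTREET_API_KEY"])` from the solution's checkout.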

trap.yaml — solution config

Field	Req	Meaning
`cmd`	✓	shell command that runs your solution
`profile.model` / `.framework`		self-reported engine identity (scalar or list) → run page
`stdin`		input filename piped to your solution's stdin
`setup_cmd`		run once after clone (e.g. `uv sync`)
`name`		leaderboard identity (else the server auto-assigns one)
`timeout`		per-case ceiling, seconds (default 600)
`tasks.<alias>.source`	✓	the task: a local path or `git+https://…@rev`

`<alias>` should equal the task's id on trapstreet so that `tp submit` resolves the task without a flag. Cost tracking is controlled by a CLI flag (`--no-cost`), not a YAML field.
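The required fields above can be checked before a run. This is a rough sketch that assumes the YAML has already been parsed into a Python mapping (for example by a YAML library). The function `check_trap_config` and the sample values (`python solve.py`, the alias `my-task`, the path `../tasks/my-task`) are hypothetical, and treating an empty `tasks` mapping as an error is an assumption.

```python
def check_trap_config(config):
    """Return a list of problems found in a parsed trap.yaml mapping."""
    problems = []
    if not config.get("cmd"):
        problems.append("cmd is required")
    tasks = config.get("tasks") or {}
    if not tasks:
        # Assumption: a config with no task entries is not useful.
        problems.append("at least one tasks.<alias> entry is expected")
    for alias, task in tasks.items():
        if not (task or {}).get("source"):
            problems.append(f"tasks.{alias}.source is required")
    return problems


example = {
    "cmd": "python solve.py",  # hypothetical solution command
    "setup_cmd": "uv sync",
    "timeout": 600,  # the documented default, in seconds
    "tasks": {
        "my-task": {"source": "../tasks/my-task"},  # hypothetical alias and path
    },
}
```

Calling `check_trap_config(example)` returns an empty list; dropping `cmd` or a task's `source` adds one message per missing field.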

traptask.yaml — task config

Field	Req	Meaning
`name`		human-readable task title
`dirs.inputs` / `.expected`		fixture dirs (default `inputs/` / `expected/`)
`cases[].id`	✓	one per `inputs/<id>/` directory
`cases[].tags` / `.skip` / `.description`		filter / skip / annotate a case
`judge.cmd`		scores one case; omit to leave cases unscored
`grader.cmd`		aggregates to a run summary; omit → server averages at ≥ 0.8
`declared_outputs`		advisory list of output filenames / `stdout`
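Because each `cases[].id` corresponds to one `inputs/<id>/` directory, a task author can compare the two before publishing. This is a sketch under stated assumptions: the config is already parsed into a Python mapping, `dirs.inputs` is relative to the directory that holds `traptask.yaml`, and the helper name `compare_cases` is hypothetical.

```python
from pathlib import Path


def compare_cases(task_config, task_root):
    """Compare declared case ids with the directories under the inputs dir.

    Returns (ids with no directory, directories with no case entry).
    """
    dirs = task_config.get("dirs") or {}
    # Assumption: dirs.inputs is relative to the task root.
    inputs_dir = Path(task_root) / dirs.get("inputs", "inputs/")
    declared = {case["id"] for case in task_config.get("cases") or []}
    on_disk = set()
    if inputs_dir.is_dir():
        on_disk = {p.name for p in inputs_dir.iterdir() if p.is_dir()}
    return sorted(declared - on_disk), sorted(on_disk - declared)
```

Both returned lists are empty when the declared cases and the fixture directories line up.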

The manifest your code reads

Consumer	Env var	JSON
solution	`TRAP_MANIFEST`	`{inputs_dir, outputs_dir}`
judge	`TRAPTASK_MANIFEST`	`{inputs_dir, expected_dir, outputs_dir, run:{stdout,stderr,meta}}`
grader	`TRAPTASK_MANIFEST`	list of per-case results

All path values are absolute: directories, or files in the case of `run.*`. To locate a file, join a directory with the filename you authored.
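A minimal sketch of a solution reading its manifest. The table above does not say whether the environment variable holds the JSON text itself or the path to a JSON file, so the loader below accepts either; that dual handling is an assumption. The filenames `input.txt` and `answer.txt` are hypothetical stand-ins for the files you authored, and the uppercase transform is a placeholder for real work.

```python
import json
import os
from pathlib import Path


def load_manifest(env_name):
    """Load the manifest named by an environment variable.

    Assumption: the value is either inline JSON or a path to a JSON file.
    """
    raw = os.environ[env_name]
    if raw.lstrip().startswith(("{", "[")):
        return json.loads(raw)
    return json.loads(Path(raw).read_text(encoding="utf-8"))


def solve():
    manifest = load_manifest("TRAP_MANIFEST")
    inputs_dir = Path(manifest["inputs_dir"])
    outputs_dir = Path(manifest["outputs_dir"])
    # Join each directory with the filename you authored (hypothetical names).
    text = (inputs_dir / "input.txt").read_text(encoding="utf-8")
    result = text.upper()  # placeholder for the actual solution logic
    (outputs_dir / "answer.txt").write_text(result, encoding="utf-8")
```

The same `load_manifest` helper would serve a judge or grader by passing `"TRAPTASK_MANIFEST"` instead.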

Metrics dictionary

Whatever `judge.py` / `grader.py` print becomes `cases.metrics` and the run's grader output, respectively. The platform interprets exactly one key, `score`; three judge keys also get dedicated rendering in the per-case view:

Key	From	Effect
`score`	judge + grader	0–1. The grader's is the run score (fallback: mean of case scores); a case counts as passed at 1.0
`agent_answer` / `expected` / `reason`	judge	rendered in the per-case view

Everything else is stored and shown verbatim on the run page. There is no run-level `passed`, and `correct` is not interpreted. Cost and latency are not grader keys: the CLI records per-case spend (`cost.by_model`) and run timestamps in the report, and the server derives the leaderboard columns from those.
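A sketch of a judge and a grader that emit the keys in the table. Several details are assumptions this page does not confirm: that metrics are printed as a single JSON object on stdout, that the manifest variable holds inline JSON or a path to a JSON file, and that each per-case result handed to the grader carries the judge's output under a `metrics` key. The filename `answer.txt` and the exact-match rule are hypothetical.

```python
import json
import os
from pathlib import Path


def load_manifest(env_name="TRAPTASK_MANIFEST"):
    """Assumption: the variable holds inline JSON or a path to a JSON file."""
    raw = os.environ[env_name]
    if raw.lstrip().startswith(("{", "[")):
        return json.loads(raw)
    return json.loads(Path(raw).read_text(encoding="utf-8"))


def judge_case(manifest, filename="answer.txt"):
    """Score one case by exact match between output and expected file."""
    expected_path = Path(manifest["expected_dir"]) / filename
    produced_path = Path(manifest["outputs_dir"]) / filename
    expected = expected_path.read_text(encoding="utf-8").strip()
    produced = ""
    if produced_path.exists():
        produced = produced_path.read_text(encoding="utf-8").strip()
    ok = produced == expected
    return {
        "score": 1.0 if ok else 0.0,  # the one key the platform interprets
        "agent_answer": produced,  # rendered in the per-case view
        "expected": expected,
        "reason": "exact match" if ok else "output differs from expected",
    }


def grade_run(case_results):
    """Aggregate per-case results into a run score (mean of case scores).

    Assumption: each result keeps the judge's output under "metrics".
    """
    scores = [float((r.get("metrics") or {}).get("score", 0.0)) for r in case_results]
    return {"score": sum(scores) / len(scores) if scores else 0.0}
```

Under these assumptions, `judge.py` would end with `print(json.dumps(judge_case(load_manifest())))` and `grader.py` with `print(json.dumps(grade_run(load_manifest())))`.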

Repos

trap-cli on PyPI — the tp command
trapstreet — the single repo: this website + the CLI (cli/, with the annotated examples above)