financebench
ranked by score ↓financebench
5 closed-book numeric questions on SEC 10-K filings — Netflix 2017, AES 2022, 3M 2018, Walmart 2018, Block 2016. Each case ships the question **plus the relevant 10-K excerpt inline** as `doc.txt`, so solvers don't need to fetch PDFs or hit external services.
Layout
financebench/
├── README.md
├── traptask.yaml # 5 cases
├── judge.py # per-case scorer (numeric ≤1% tol / string match)
├── pyproject.toml
├── inputs/<case_id>/
│ ├── question.txt # the question — what the solver must answer
│ └── doc.txt # the relevant 10-K excerpt (closed-book context)
└── expected/<case_id>/
└── answer.json # {financebench_id, company, doc, gold}
The 5 cases
| id | company | year | type |
|---|---|---|---|
netflix_2017_current_liab | Netflix | 2017 | single-statement extraction (current liabilities) |
aes_2022_roa | AES | 2022 | derived (ROA = net income / avg assets) |
threem_2018_net_ppne | 3M | 2018 | single-statement + unit conversion (M → B) |
walmart_2018_dpo | Walmart | 2018 | derived (DPO formula) |
block_2016_working_capital | Block (Square) | 2016 | ratio (TCA / TCL) |
Solution contract
Solver reads:
inputs/<case_id>/question.txt— the questioninputs/<case_id>/doc.txt— the 10-K excerpt (your "closed book")
Solver writes its answer to stdout. That's it. No file output required.
Format for the stdout answer: just the number / string itself, no
prose. Numeric tolerances handle units / scale / accounting parentheses /
% / commas / $. Example acceptable formats for Walmart DPO:
42.6942.69 days$42.69(42.69)— accounting negative
But not a paragraph that contains the number after a wall of reasoning — the judge picks up the first number in the response. Keep it terse.
Scoring
judge.py runs once per case and emits:
{
"score": 1.0 or 0.0,
"correct": true | false,
"agent_answer": "<truncated to 500 chars>",
"expected_answer": "<gold>",
"reason": "numeric match (pred=42.69 gold=42.69)",
"company": "Walmart",
"doc": "WALMART_2018_10K",
"financebench_id": "financebench_id_06247"
}
Numeric comparison uses 1% relative tolerance (REL_TOL = 0.01). Strings
fall back to case-insensitive exact / substring match.
No grader.py — trap auto-aggregates (pass if avg ≥ 0.8, etc.).
Submit
Either path works:
Via tp CLI (any solver)
# in your solver's directory, with trap.yaml pointing at this task:
tp run
tp submit financebench
Via Claude Code skill (closed-book in-session model)
bash <(curl -fsSL https://raw.githubusercontent.com/trapstreet/trapstreet/main/skill/install.sh)
# then in any Claude Code session:
/trapstreet-eval
See https://trapstreet.run/tasks/financebench for the leaderboard.
Provenance
Questions + evidence + gold answers are from PatronusAI's FinanceBench via Ruqii/trapstreet-cases. This task packages 5 representative questions into the trap task format.
The matching logic in judge.py is adapted from the original
grade.py
in the trapstreet-eval-demo skill.
