Trap Street
trapstreet.run
Preview · this is a sketch of how a Financial Extraction track lands on trapstreet.run. Not on the main IA yet — see /vision for context.
financebench-t1a · Financial Extraction · public benchmark · fully transparent

FinanceBench T1-A · 5 SEC 10-K extraction questions

Five real questions from PatronusAI's FinanceBench. Numeric extraction from real 10-Ks — total current liabilities, ROA, days payable outstanding, working capital ratio. Every grader, every gold answer, every tolerance is public on this page.

FinanceBench's headline result: GPT-4-Turbo gets 81% of the original 150-question dataset wrong in closed-book PDF mode. That number is what made the dataset a benchmark; we use the T1-A subset to turn it into a 30-second demo.

Questions in subset: 5
Full dataset: 15+
Tolerance: 1% relative
License: CC-BY-NC-4.0 (dataset only · code MIT)
LLM judge: fallback only
Run it · 30 seconds

One curl line, then /trapstreet-eval.

curl -fsSL https://raw.githubusercontent.com/AntiNoise-ai/trapstreet-eval-demo/main/skill/install.sh | bash

Installs three files into ~/.claude/skills/trapstreet-eval/. Then in any Claude Code session, type /trapstreet-eval. Claude reads each question, answers it from the bundled evidence (no web search), grades against the gold key with grade.py, and prints a Markdown leaderboard. ~30 seconds, $0, no API key.

The five questions · transparent

Every question. Every gold answer. Every tolerance.

This is what "transparent" means in our trust model: nothing is hidden. Anyone can read the question, the gold answer, and the grader code that judges it. No surprises, no "trust us" — just receipts.

Q1 · financebench_id_03282 · Netflix · NETFLIX_2017_10K
gold: $5466.00

What is Netflix's year-end FY2017 total current liabilities (USD millions)?

Evidence digest

Total current liabilities row: 5,466,312 (2017) / 4,586,657 (2016). Filing is in thousands — answer is 5466 USD M.
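The only trap in Q1 is the unit conversion; a one-line sketch (the helper name is illustrative, not from grade.py):

```python
def thousands_to_millions(value_thousands: float) -> float:
    """SEC filings often report in thousands of USD; gold answers here are USD millions."""
    return value_thousands / 1000.0

# Netflix FY2017 total current liabilities, as filed: 5,466,312 (thousands)
netflix_tcl_millions = thousands_to_millions(5_466_312)  # 5466.312 -> gold $5466.00
```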

Q2 · financebench_id_10420 · AES Corporation · AES_2022_10K
gold: -0.02

What is AES's FY2022 return on assets (ROA = FY2022 net income / average total assets between FY2021–FY2022)? Round to two decimals.

Evidence digest

Net income attributable to AES Corp: $-546M. Total assets: 38,363 (2022) / 32,963 (2021). Avg = 35,663. ROA = -546 / 35,663 ≈ -0.0153 → −0.02.
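The ROA arithmetic above can be sketched as follows (the helper is illustrative, not the grader's code):

```python
def roa(net_income: float, assets_prior: float, assets_current: float) -> float:
    """ROA = net income / average total assets across the two fiscal year-ends."""
    return net_income / ((assets_prior + assets_current) / 2.0)

# AES FY2022: net income -$546M; total assets 32,963 (2021) and 38,363 (2022), USD M
aes_roa = round(roa(-546.0, 32_963.0, 38_363.0), 2)  # -546 / 35,663 = -0.0153 -> -0.02
```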

Q3 · financebench_id_04672 · 3M · 3M_2018_10K
gold: $8.70

What is 3M's year-end FY2018 net property, plant, and equipment (PP&E)? Answer in USD billions.

Evidence digest

Property, plant and equipment, net: 8,738 (2018) / 8,866 (2017). Filing in millions — answer rounds to $8.74B; gold is $8.70 within the grader's 1% tolerance.
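Why $8.74B passes against an $8.70 gold: the grader's 1% relative tolerance. A minimal sketch of that check (function name illustrative):

```python
def within_rel_tol(answer: float, gold: float, tol: float = 0.01) -> bool:
    """Accept if the answer is within tol of the gold value, relative to gold."""
    return abs(answer - gold) <= tol * abs(gold)

# 8,738 USD M as filed -> $8.74B vs the $8.70 gold: |8.74 - 8.70| / 8.70 = 0.46% -> accepted
ok = within_rel_tol(8.74, 8.70)
```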

Q4 · financebench_id_06247 · Walmart · WALMART_2018_10K
gold: 42.69

What is Walmart's FY2018 days payable outstanding (DPO = 365 × avg AP / (COGS + Δ inventory))? Round to two decimals.

Evidence digest

AP: 46,092 (2018) / 41,433 (2017) → avg 43,762.5. COGS: 373,396. ΔInventory: 43,783 − 43,046 = 737. DPO = 365 × 43,762.5 / 374,133 = 42.69.
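The DPO formula in the question, sketched with the digest's figures (helper name illustrative):

```python
def days_payable_outstanding(ap_prior: float, ap_current: float, cogs: float,
                             inv_prior: float, inv_current: float) -> float:
    """DPO = 365 * average accounts payable / (COGS + change in inventory)."""
    avg_ap = (ap_prior + ap_current) / 2.0
    return 365.0 * avg_ap / (cogs + (inv_current - inv_prior))

# Walmart FY2018, USD millions, per the digest above
walmart_dpo = round(days_payable_outstanding(41_433, 46_092, 373_396, 43_046, 43_783), 2)
```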

Q5 · financebench_id_04660 · Block (Square) · BLOCK_2016_10K
gold: 1.73

What is Block's FY2016 working capital ratio (total current assets / total current liabilities)? Round to two decimals.

Evidence digest

Total current assets: 1,001,425. Total current liabilities: 577,464. Ratio = 1,001,425 / 577,464 = 1.7341 → 1.73.
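The last calculation is a single division; a sketch for completeness (helper name illustrative):

```python
def working_capital_ratio(current_assets: float, current_liabilities: float) -> float:
    """Working capital ratio = total current assets / total current liabilities."""
    return current_assets / current_liabilities

# Block FY2016, figures in USD thousands (units cancel in the ratio)
block_wcr = round(working_capital_ratio(1_001_425, 577_464), 2)  # 1.7341 -> 1.73
```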

How a question is graded

Three checks, in order. The first one that decides, decides.

01 · numeric_normalize
deterministic

Parses values like $1.2 billion, 1,200,000,000, 12.5%, (123) → canonical floats, then compares with 1% relative tolerance. Most numeric questions resolve here.
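A minimal sketch of this kind of normalisation, covering the four formats named above; the real grade.py likely handles more cases, and all names here are illustrative:

```python
import re

_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}

def normalize(text: str) -> float:
    """Parse '$1.2 billion', '1,200,000,000', '12.5%', '(123)' into a float."""
    s = text.strip().lower().replace("$", "").replace(",", "")
    negative = s.startswith("(") and s.endswith(")")   # accounting negatives
    s = s.strip("()")
    percent = s.endswith("%")
    s = s.rstrip("%").strip()
    scale = 1.0
    for word, factor in _SCALE.items():                # scale words: 1.2 billion -> 1.2e9
        if s.endswith(word):
            s = s[: -len(word)].strip()
            scale = factor
            break
    value = float(s) * scale
    if percent:
        value /= 100.0
    return -value if negative else value

def numeric_match(answer: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Canonicalise both sides, then apply the 1% relative tolerance."""
    a, g = normalize(answer), normalize(gold)
    return abs(a - g) <= rel_tol * abs(g)
```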

02 · string_match
deterministic

Used for short qualitative answers (yes/no, single phrase). Case-insensitive normalised string comparison.

03 · llm_judge_fallback
LLM judge

Only fires when the deterministic checks can't decide. Cost: fractions of a cent per question via Claude Haiku. Disable with --no-judge.
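The three-check cascade can be sketched like this; check names, signatures, and the "None means can't decide" convention are assumptions for illustration, not the real grade.py:

```python
from typing import Callable, Optional

Check = Callable[[str, str], Optional[bool]]  # None = "can't decide", pass to next check

def numeric_check(answer: str, gold: str, rel_tol: float = 0.01) -> Optional[bool]:
    try:
        a, g = float(answer.replace(",", "")), float(gold.replace(",", ""))
    except ValueError:
        return None                         # not numeric: defer to the next check
    return abs(a - g) <= rel_tol * abs(g)

def string_check(answer: str, gold: str) -> Optional[bool]:
    a, g = answer.strip().lower(), gold.strip().lower()
    return a == g if g else None            # case-insensitive normalised comparison

def grade(answer: str, gold: str, judge: Optional[Check] = None) -> bool:
    for check in (numeric_check, string_check):
        verdict = check(answer, gold)
        if verdict is not None:             # the first check that decides, decides
            return verdict
    # LLM fallback only when the deterministic checks abstain; judge=None
    # corresponds to running with --no-judge
    return bool(judge and judge(answer, gold))
```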

Test your AI on these five questions.

Install the skill, type /trapstreet-eval, read your score in 30 seconds. Or fork the repo and add your own agent to the head-to-head leaderboard.