Legal Contract Review
Test how well AI agents understand, extract, and reason over real-world legal contracts.
What this task tests
Can the model read a real-world contract PDF and answer questions about it accurately, without hedging or hallucinating?
Every question has a single right answer that's literally in the document — a number, a date, a yes/no, an Act of Parliament, or a calculation derived from clauses in Section 6. The judge is harsh: agents that hedge ("I cannot determine from the document…"), skip parts of multi-part questions, or use the wrong format fail the case outright. No partial credit.
The 19 cases break down as:
| Category | Count | Example |
|---|---|---|
| money | 6 | Monthly rent in year 2; total rent over fixed term |
| dates | 1 | Tenancy start date (DD/MM/YYYY) |
| clauses | 8 | Break clause present? Deposit scheme name? Governing Act? |
| deposit | 1 | What happens if a deposit dispute remains unresolved? |
| scenario | 1 | Early surrender 22 months in: compute the total cost owed |
| scenario_reasoning | 1 | If replacement tenant pays higher rent, does the original tenant benefit? |
Input
Per case the agent receives:
- INPUTS["question.txt"]: a single-line natural-language question
- INPUTS["document.pdf"]: the AST PDF (~1.8 MB, identical across all 19 cases)
Expected output
A plain answer printed to stdout. Plain text or {"answer": "..."} JSON — both work. The agent must:
- Commit to a single answer (no "it depends", no "as an AI…")
- Match the requested format when one is specified (DD/MM/YYYY, yes/no, 'N/A' if not specified, etc.)
- For multi-part questions, answer all parts; one-word answers that skip the explanation are rejected
- For scenario questions, show the calculation and give the final number
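The output rules above can be checked before printing. The following is a minimal sketch, not part of the task harness: `is_ddmmyyyy` and `emit_answer` are hypothetical helper names, and the date used is a made-up placeholder rather than a value from the contract.

```python
import json
from datetime import datetime


def is_ddmmyyyy(text: str) -> bool:
    """Return True if text is a valid calendar date written as DD/MM/YYYY."""
    try:
        parsed = datetime.strptime(text, "%d/%m/%Y")
    except ValueError:
        return False
    # strptime tolerates unpadded input such as "1/9/2023",
    # so require that the value round-trips to the same string.
    return parsed.strftime("%d/%m/%Y") == text


def emit_answer(answer: str) -> str:
    """Serialise a committed answer in the {"answer": "..."} JSON shape."""
    return json.dumps({"answer": answer.strip()})


if __name__ == "__main__":
    candidate = "01/09/2023"  # placeholder value, not taken from the document
    if is_ddmmyyyy(candidate):
        print(emit_answer(candidate))
```

Printing the bare string instead of the JSON wrapper would also satisfy the stated output contract; the wrapper is used here only because it makes the single committed answer explicit.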
The judge scores each case 1.0 (pass) or 0.0 (fail). A run passes if ≥80% of scored cases pass.
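As a rough illustration of the run-level rule, the sketch below computes pass or fail from per-case scores. It assumes unscored cases are represented as `None`; that representation and the function name `run_passes` are assumptions for illustration, not the harness's actual interface.

```python
from typing import Optional, Sequence


def run_passes(case_scores: Sequence[Optional[float]],
               min_pass_percent: int = 80) -> bool:
    """Return True if at least min_pass_percent of scored cases passed.

    A case score is 1.0 (pass) or 0.0 (fail); None marks a case that was
    not scored (an assumption of this sketch) and is excluded.
    """
    scored = [s for s in case_scores if s is not None]
    if not scored:
        return False  # nothing scored, so nothing to pass on
    passed = sum(1 for s in scored if s == 1.0)
    # Integer comparison avoids floating-point edge cases at exactly 80%.
    return passed * 100 >= min_pass_percent * len(scored)


if __name__ == "__main__":
    print(run_passes([1.0] * 16 + [0.0] * 3))  # 16 of 19 passed
    print(run_passes([1.0] * 15 + [0.0] * 4))  # 15 of 19 passed
```

With all 19 cases scored, 16 passes (about 84%) clear the bar and 15 passes (about 79%) do not.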
Purpose
This eval exists to answer a practical question: which model can read a contract PDF reliably, and among the ones that can, which is the cheapest?
Contract review is a real-world task where accuracy is non-negotiable but cost adds up fast — you're paying per page, per document, per tenant. A model that's 95% accurate at 1/10 the price of a frontier model is the better business choice. This task surfaces exactly that trade-off: the leaderboard reports both the score and the cost per run, so you can pick the cheapest model that still clears the accuracy bar you need.
If two models score identically, the cheaper one wins.
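The ranking and selection logic described here can be sketched as follows. The `Result` type, the function names, and the model names and figures in the example are all hypothetical; the leaderboard's real data model is not shown in this document.

```python
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    model: str
    score: float         # fraction of cases passed, 0.0 to 1.0
    cost_per_run: float  # cost of one full run, in whatever unit is reported


def rank(results: List[Result]) -> List[Result]:
    """Sort by score descending; on a tie, the cheaper model comes first."""
    return sorted(results, key=lambda r: (-r.score, r.cost_per_run))


def cheapest_above(results: List[Result], min_score: float) -> Optional[Result]:
    """Return the cheapest model whose score meets min_score, or None."""
    eligible = [r for r in results if r.score >= min_score]
    return min(eligible, key=lambda r: r.cost_per_run) if eligible else None


if __name__ == "__main__":
    # Made-up figures for illustration only.
    results = [
        Result("model-a", 0.95, 1.20),
        Result("model-b", 0.95, 0.30),
        Result("model-c", 1.00, 4.00),
    ]
    for r in rank(results):
        print(r.model, r.score, r.cost_per_run)
    print(cheapest_above(results, 0.90))
```

The two functions answer different questions: `rank` reproduces the leaderboard order, while `cheapest_above` picks the model to buy once an accuracy bar has been fixed.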
