Tasks

Pick a task. Run it locally. Submit your score.

All tasks · 16 boards

▲ 100% · community · 38 runs
Karpathy's jagged questions — the 50 m car wash
A trap task built around an anecdote Andrej Karpathy told at the Sequoia AI Ascent fireside chat (April 2026): a state-of-the-art model can refactor a 100k-line codebase, yet advises you to walk to the car wash 50 meters away
by xiaotianhan9110d ago
▲ 92% · minerals · 14 runs
Mineral Identification from Field Observations
Given a field geologist's hand-specimen observations of an unknown mineral — crystal system, Mohs hardness, streak color, body color, luster, and specific gravity — name the single most likely mineral species out of 98 candidates. This task measures how well a model identifies mi
by Zhuaiz19d ago
1 profile · personality · 10 runs
MBTI Self-Profile
A trap-compatible task that asks each model to take a 32-question Likert MBTI questionnaire from its own point of view. The judge then computes the 4-letter type and per-axis percentages from the model's responses.
by Ruqii · 2mo ago
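As a rough illustration of how a judge might turn Likert answers into a 4-letter type and per-axis percentages, here is a minimal sketch. The 1 to 5 scale, the keying of each question to one pole, and the names `score_mbti` and `AXES` are assumptions made for this example; the task's actual questionnaire and scoring rules may differ.

```python
from typing import Dict, List, Tuple

# The four MBTI axes as (first pole, second pole).
AXES = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]
# Map every pole letter to its axis, e.g. "N" -> ("S", "N").
POLE_TO_AXIS = {pole: axis for axis in AXES for pole in axis}


def score_mbti(responses: List[Tuple[str, int]]) -> Tuple[str, Dict[str, float]]:
    """Score (pole, likert) pairs.

    Assumption: `pole` is the letter that agreement favors and `likert`
    runs from 1 (strongly disagree) to 5 (strongly agree).
    Returns the type plus, per axis, the percentage toward the first pole.
    """
    totals = {a + b: 0.0 for a, b in AXES}
    counts = {a + b: 0 for a, b in AXES}
    for pole, likert in responses:
        if pole not in POLE_TO_AXIS:
            raise ValueError(f"unknown pole: {pole!r}")
        if not 1 <= likert <= 5:
            raise ValueError(f"likert out of range: {likert}")
        first, second = POLE_TO_AXIS[pole]
        lean = (likert - 1) / 4  # 0.0 .. 1.0 toward the keyed pole
        totals[first + second] += lean if pole == first else 1 - lean
        counts[first + second] += 1

    letters = ""
    percentages: Dict[str, float] = {}
    for first, second in AXES:
        key = first + second
        # An axis with no answers is treated as an even split.
        pct_first = 100 * totals[key] / counts[key] if counts[key] else 50.0
        percentages[key] = pct_first
        letters += first if pct_first >= 50 else second  # ties go to the first pole
    return letters, percentages
```

Breaking ties toward the first pole is an arbitrary choice in this sketch; a real judge would need to document its own rule.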
▲ 95% · pdf-reader · 7 runs
Legal Contract Review
Test how well AI agents understand, extract, and reason over real-world legal contracts.
by Ruqii · 1mo ago
▲ 89% · skill · 6 runs
influencer_marketing_disclosure
Given a self-contained influencer/creator-partnership scenario, the solution must give correct guidance on the parts of influencer marketing that have actual right answers: whether FTC disclosure is required, how to set up attribution when links aren't clickable, whether to write
by Ruqii · 4d ago
▲ 100% · debugging · 4 runs
debug_subscription_billing_pipeline — Multi-Report Consistency Debugging
An open-source evaluation task for cross-file, cross-report consistency debugging — when a ticket asks for a change to a SaaS billing pipeline, does the agent identify ALL the places that need updating so that FOUR reports (each resolving the same underlying facts through a d
by Ruqii · 1d ago
▲ 100% · code-review · 3 runs
Debugging in a vendor payout scenario - not many cases, but hard 😬
An open-source evaluation task for cross-file consistency debugging — when a ticket asks for a change to a data pipeline, does the agent identify ALL the places that need updating so that TWO reports (with DIFFERENT lookup paths) both come out correct?
by Ruqii · 4d ago
▲ 80% · vision · 2 runs
Can your model identify animals?
This repository contains a vision test where the job is simple: look at a wildlife photo and say which animal species it shows.
by Alfred19d ago
▲ 100% · video-game · 2 runs
LLM Plays REAL Minecraft - Can Your AI Mine a Diamond? 💎
Can your model actually play Minecraft and come out holding a diamond? This is the classic long-horizon agent benchmark: from an empty inventory, your agent must climb the entire tech tree, every step gated by the last: 🪵 punch wood → 🛠️ crafting table + wooden pickaxe → ⛏️ mine cobblestone → stone pickaxe → ⚙️ find & mine iron ore → 🔥 smelt it into an iron ingot → iron pickaxe → 🕳️ dig deep and mine diamond ore 💎
by Ruqii · 20d ago
▲ 50% · code-review-skill · 1 run
👩‍💻 Which code review skill works best?
A code-review Claude Skill (SKILL.md) is shown one real source file, frozen at the moment just before a real historical bug was fixed, and must find the bug. Ground truth is the actual fix commit — not a synthetic injected bug.
by Ruqii · 10h ago
▲ 38% · tasks · 1 run
CUAD — legal contract clause extraction
Real commercial contracts (SEC EDGAR material agreements) go in; the model must return the exact clause span for one of 41 clause types — or correctly say the clause is absent. See ATTRIBUTION.md (CUAD, CC BY 4.0) for provenance and license hygiene.
by Ruqii · 1mo ago
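To make the "exact clause span, or correctly absent" requirement concrete, here is a minimal sketch of an exact-match grader. The function names, the use of `None` to mean "clause absent", and the whitespace normalization are assumptions for illustration only; the task's real judge may score differently.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not affect matching."""
    return re.sub(r"\s+", " ", text).strip()


def grade_span(predicted: Optional[str], gold: Optional[str], contract: str) -> bool:
    """Return True when the prediction matches the gold annotation.

    Assumption: None stands for "this clause type is absent from the contract".
    """
    if gold is None:
        # The clause is absent, so the only correct answer is "absent".
        return predicted is None
    if predicted is None:
        return False
    pred = normalize(predicted)
    # The span must equal the gold span and actually occur in the contract text.
    return pred == normalize(gold) and pred in normalize(contract)
```

Requiring the prediction to occur in the contract guards against a model paraphrasing a clause instead of quoting it.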
open board · foundation · 0 runs
Can your agent read a scanned PDF?
An open-source evaluation task for PDF OCR / vision-based document reading. Useful as a basic sanity check when building AI agents that ingest PDFs — invoice processors, legal-doc reviewers, receipt extractors, research assistants, accessibility tools.
core-pdf-ocr · 19d ago
open board · foundation · 0 runs
Can your agent find the right fact in a long doc?
An open-source evaluation task for fact retrieval from context — in the shapes AI agents actually encounter: RAG returns, tool outputs, and long documents.
core-needle-in-haystack · 19d ago
open board · foundation · 0 runs
Does your model follow a function-call schema?
An open-source evaluation task for structured output / function-call compliance. Useful as a basic sanity check when building AI agents that rely on models producing valid, schema-conforming JSON — the foundation of any tool-use, function-calling, or structured-response pipel
core-json-schema-output · 19d ago
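A compliance check of this kind can be sketched with the standard library alone. The call shape (`name` plus `arguments`), the `WEATHER_ARGS` schema, and the `check_call` helper below are hypothetical and chosen only for illustration; the task itself may validate against full JSON Schema documents.

```python
import json
from typing import Dict, List

# Hypothetical schema: required argument names mapped to expected Python types.
WEATHER_ARGS: Dict[str, type] = {"city": str, "days": int}


def check_call(raw: str, name: str, arg_types: Dict[str, type]) -> List[str]:
    """Return a list of problems found in a model's function-call output."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    problems: List[str] = []
    if call.get("name") != name:
        problems.append(f"expected function {name!r}, got {call.get('name')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        problems.append("'arguments' is missing or not an object")
        return problems

    for key, expected in arg_types.items():
        if key not in args:
            problems.append(f"missing argument {key!r}")
        elif type(args[key]) is not expected:
            # type() is used instead of isinstance so True is not accepted as an int.
            problems.append(f"argument {key!r} should be {expected.__name__}")
    for key in args:
        if key not in arg_types:
            problems.append(f"unexpected argument {key!r}")
    return problems
```

An empty list means the output parsed and matched the expected shape; anything else names the specific violation, which is more useful for debugging than a bare pass/fail.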
open board · foundation · 0 runs
Can your agent do date + time math?
An open-source evaluation task for temporal arithmetic — the everyday "3 days from Tuesday" / "what year did X end" / "what time is it in Tokyo when it's 9am in NYC" computations that any scheduling, booking, calendar, or finance agent needs to get right.
core-date-arithmetic · 19d ago
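Two of the computations quoted above can be worked out with Python's standard library, which is one way to produce reference answers. This is a minimal sketch; the helper names `weekday_after` and `convert_wall_time` are made up for this example, and `zoneinfo` needs a time zone database on the system (or the `tzdata` package).

```python
from datetime import datetime
from zoneinfo import ZoneInfo

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_after(start: str, days: int) -> str:
    """Weekday name `days` days after `start` (negative values go backwards)."""
    return WEEKDAYS[(WEEKDAYS.index(start) + days) % 7]


def convert_wall_time(year: int, month: int, day: int, hour: int,
                      from_zone: str, to_zone: str) -> datetime:
    """Interpret a wall-clock hour in `from_zone` and express it in `to_zone`."""
    local = datetime(year, month, day, hour, tzinfo=ZoneInfo(from_zone))
    return local.astimezone(ZoneInfo(to_zone))
```

`weekday_after("Tuesday", 3)` returns `"Friday"`. The time zone question has no single answer without a date: calling `convert_wall_time` with `"America/New_York"` and `"Asia/Tokyo"` gives 22:00 in Tokyo while New York observes daylight saving time and 23:00 otherwise, which is the kind of detail a scheduling agent has to get right.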
open board · foundation · 0 runs
Does your agent know what it doesn't know?
An open-source evaluation task for calibrated uncertainty — does the model honestly decline questions it can't answer, or does it confidently make something up?
core-calibrated-answer · 19d ago