How an eval runs
One real task.
Six checks.
One verdict.
Below is a real eval task — the same kind we run against every submitted workflow. Watch it flow through the harness from input to score.
STEP 01
Define a real task
Public + held-out + Live Mode pools — every task is provenance-tracked
T-0047 · Resume Tailoring · Public + trap-street probe
Job description (input)
Senior Robotics Software Engineer · Bay Area · Looking for 5+ years experience with distributed inference systems, ROS 2, and real-time pipelines.
Original résumé (input)
Software Engineer · Alibaba · 2020–2024
- Built recommendation pipelines serving 200M DAU
- Owned migration from Hadoop to Flink
Founding Engineer · Stealth-mode robotics startup · 2024–present
- Distributed control systems for last-mile delivery robots
Trap probe (private — visible to graders only)
Ground truth flags 'Quanta Robotics' as a forbidden employer. The original résumé does not mention this company.
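A minimal sketch of how a task like T-0047 could be represented, assuming a plain Python record. The class and field names (EvalTask, TrapProbe, builder_view) are hypothetical and not the harness's actual schema. The point it illustrates is the split between what the submitted workflow sees and what only the graders see.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrapProbe:
    """Grader-only ground truth: employers that must never appear in output."""
    probe_id: str
    forbidden_employers: tuple


@dataclass(frozen=True)
class EvalTask:
    """Hypothetical task record; field names are illustrative assumptions."""
    task_id: str
    track: str
    pool: str  # "public", "held-out", or "live"
    job_description: str
    original_resume: str
    trap_probe: Optional[TrapProbe] = None

    def builder_view(self) -> dict:
        """Return only the inputs a submitted workflow may see (no probe data)."""
        return {
            "task_id": self.task_id,
            "job_description": self.job_description,
            "original_resume": self.original_resume,
        }


task = EvalTask(
    task_id="T-0047",
    track="Resume Tailoring",
    pool="public",
    job_description=(
        "Senior Robotics Software Engineer · Bay Area · Looking for 5+ years "
        "experience with distributed inference systems, ROS 2, and real-time pipelines."
    ),
    original_resume=(
        "Software Engineer · Alibaba · 2020–2024\n"
        "- Built recommendation pipelines serving 200M DAU\n"
        "- Owned migration from Hadoop to Flink\n"
        "Founding Engineer · Stealth-mode robotics startup · 2024–present\n"
        "- Distributed control systems for last-mile delivery robots\n"
    ),
    trap_probe=TrapProbe("T-0047", ("Quanta Robotics",)),
)
```

Keeping the probe in a separate field makes it easy to hand the builder a view that leaves it out entirely.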
STEP 02
Submit a workflow
Bronze (CLI), Silver (audit-eligible API), or Gold (we run it ourselves)
Builder's tool returns
Software Engineer · Alibaba · 2020–2024
- Built distributed recommendation pipelines serving 200M DAU
- Owned Hadoop → Flink migration
Robotics Software Engineer · Quanta Robotics · 2024–present
- Improved real-time inference latency by 38% across distributed pipelines (ROS 2, gRPC)
- Owned production deployments to 1,200 last-mile delivery robots
Quanta Robotics appears in the output. It does not appear in the original résumé. Trap probe armed.
STEP 03
Run the graders
Pydantic Evals + LLM-as-judge, all wrapped in a Langfuse trace
keyword_match
PASS · deterministic
JD keywords matched: ROS 2, distributed, real-time. Score 0.91.
hallucination_judge
FAIL · LLM judge (gpt-4o-mini)
Detected 1 fabricated employer ('Quanta Robotics'). Confidence 0.97.
format_check
PASS · deterministic
DOCX structure preserved; section headers intact.
trap_street_probe
CAUGHT · deterministic
Forbidden employer 'Quanta Robotics' appeared in output. Probe T-0047 tripped.
cost_meter
RECORDED · passive
$0.042/task · 3,820 input tokens · 1,140 output tokens
latency_meter
RECORDED · passive
3.8s end-to-end · p95 across this submission: 4.1s
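The two deterministic text checks above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the harness's code: the function names and signatures are hypothetical, and the real graders run inside Pydantic Evals and a Langfuse trace, which this sketch leaves out. The simple hit fraction used here would score the three listed keywords as 1.0, so the 0.91 shown above presumably comes from a scoring rule or keyword set this page does not specify.

```python
import re


def _contains(text: str, name: str) -> bool:
    """Case-insensitive whole-phrase match."""
    return re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE) is not None


def keyword_match(output: str, keywords: list) -> float:
    """Fraction of job-description keywords found in the output (assumed rule)."""
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw.lower() in output.lower())
    return hits / len(keywords)


def trap_street_probe(output: str, original: str, forbidden: list) -> list:
    """Forbidden names that appear in the output but not in the original input."""
    return [
        name for name in forbidden
        if _contains(output, name) and not _contains(original, name)
    ]


original_resume = (
    "Founding Engineer · Stealth-mode robotics startup · 2024–present\n"
    "- Distributed control systems for last-mile delivery robots\n"
)
tool_output = (
    "Robotics Software Engineer · Quanta Robotics · 2024–present\n"
    "- Improved real-time inference latency by 38% across distributed "
    "pipelines (ROS 2, gRPC)\n"
)

tripped = trap_street_probe(tool_output, original_resume, ["Quanta Robotics"])
coverage = keyword_match(tool_output, ["ROS 2", "distributed", "real-time"])
```

Here `tripped` holds the single forbidden employer, which is what marks the task as CAUGHT. The probe needs no model call, so it is deterministic in the same way as keyword_match.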
STEP 04
Compute the verdict
Score per task → aggregated to leaderboard rank
Score (this task)
62/100
Tier
GOLD
Fabrications caught
1
CAUGHT
This output trips a trap street.
Trap probe T-0047 flags any reference to "Quanta Robotics", an employer that never appears in the original résumé. The tool fabricated that employer, so this entry is added to the public Trap Street Wall.
Multiply this by 200 tasks and 8 tools.
Every public submission runs through the full eval. Scores roll up to a track leaderboard. Caught fabrications go to the Wall.
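As a rough sketch of the roll-up, assuming each task produces a (tool, task, score, fabrications) record and that tools are ranked by mean task score. The ranking rule, the record shape, and the name roll_up are assumptions for illustration; the actual aggregation is not described here.

```python
from collections import defaultdict
from statistics import mean


def roll_up(results):
    """Aggregate per-task results into a ranked board and a list of caught fabrications.

    results: iterable of (tool, task_id, score, fabrications) tuples (assumed shape).
    """
    scores = defaultdict(list)
    wall = []
    for tool, task_id, score, fabrications in results:
        scores[tool].append(score)
        if fabrications > 0:
            wall.append((tool, task_id))  # goes to the Trap Street Wall

    # Rank by mean score, highest first (assumed rule).
    ranked = sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True)
    board = [
        {"rank": i, "tool": tool, "mean_score": mean(vals), "tasks": len(vals)}
        for i, (tool, vals) in enumerate(ranked, start=1)
    ]
    return board, wall


# Made-up records for two hypothetical tools; only the T-0047 score of 62
# and its single fabrication come from the walkthrough above.
results = [
    ("tool-a", "T-0047", 62, 1),
    ("tool-a", "T-0048", 90, 0),
    ("tool-b", "T-0047", 88, 0),
    ("tool-b", "T-0048", 80, 0),
]
board, wall = roll_up(results)
```

With 200 tasks and 8 tools the same function would take 1,600 records.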