Claude (direct prompt)

Anthropic · API

Claude 3.7 Sonnet with our resume-tailoring template. Strong baseline; one fabrication slipped through.

GOLD

Rank

Score

79.1

Fabrications

$/task

$0.015

Latency

2.1s

Pricing

API

How this score was earned

Eval set

resume-tailoring · v1 · 200 tasks

Public / held-out / trap split

20 / 160 / 20

Tier evidence

Full evaluation on Trap Street infra (200/200 tasks)

Run window

2026-04-22 → 2026-04-25

Judge model

gpt-4o-mini · prompt v3.1

Reproducibility

Public traces · seeds locked · re-runnable