Cross-Timezone Scheduler

cross-timezone

A trap-compatible task that asks an agent to schedule a meeting across attendees in different time zones, given each attendee's local availability window. The agent must return a JSON object with a single canonical meeting time in UTC plus each attendee's local start time.

Each case feeds the files in inputs/<id>/ to the solution and compares the result against the files in expected/<id>/. judge.py scores each case, and grader.py then aggregates the per-case scores.

cases (2)

▸ dst_gap_with_ist: 60-min meeting across SF/London/Mumbai on 2026-03-26. UK still on GMT (DST starts March 29), India on UTC+5:30 (no DST). Tests DST-boundary and half-hour zone math simultaneously.

input

question.txt

You are a scheduling assistant.

Schedule a 60-minute meeting TOMORROW with the following attendees and their LOCAL availability windows.

Today is 2026-03-25 (Wednesday).

Attendees:
- Alice  — San Francisco  (America/Los_Angeles)  — available 07:00–09:00 local
- Bob    — London          (Europe/London)        — available 14:00–16:00 local
- Priya  — Mumbai          (Asia/Kolkata)         — available 19:30–21:30 local

Pick any 60-minute slot that fits inside ALL three local availability windows. Account for daylight-saving time on the actual date.

Return ONLY a JSON object (no commentary, no markdown fences) with this exact schema:

{
  "start_utc": "<ISO 8601 timestamp in UTC, e.g. 2026-03-26T14:00:00Z>",
  "duration_min": 60,
  "attendees": [
    {"name": "Alice", "tz": "America/Los_Angeles", "local_start": "YYYY-MM-DD HH:MM"},
    {"name": "Bob",   "tz": "Europe/London",        "local_start": "YYYY-MM-DD HH:MM"},
    {"name": "Priya", "tz": "Asia/Kolkata",         "local_start": "YYYY-MM-DD HH:MM"}
  ]
}

expected output

answer.json

{
  "id": "dst_gap_with_ist",
  "category": "dst_boundary",
  "difficulty": "hard",
  "duration_min": 60,
  "expected_start_utc_min": "2026-03-26T14:00:00Z",
  "expected_start_utc_max": "2026-03-26T15:00:00Z",
  "attendees": [
    {
      "name": "Alice",
      "tz": "America/Los_Angeles",
      "available_local_min": "2026-03-26T07:00:00",
      "available_local_max": "2026-03-26T09:00:00"
    },
    {
      "name": "Bob",
      "tz": "Europe/London",
      "available_local_min": "2026-03-26T14:00:00",
      "available_local_max": "2026-03-26T16:00:00"
    },
    {
      "name": "Priya",
      "tz": "Asia/Kolkata",
      "available_local_min": "2026-03-26T19:30:00",
      "available_local_max": "2026-03-26T21:30:00"
    }
  ],
  "_canonical_answer": {
    "start_utc": "2026-03-26T14:00:00Z",
    "alice_local": "2026-03-26 07:00",
    "bob_local": "2026-03-26 14:00",
    "priya_local": "2026-03-26 19:30"
  },
  "_notes": "DST trap: US DST'd on 2026-03-08 (UTC-7 PDT). UK DST starts 2026-03-29 — Bob is still on GMT (UTC+0) on this date. Priya is IST (UTC+5:30, no DST). The accepted UTC window is [14:00Z, 15:00Z] (start times that fit a 60-min slot inside everyone's availability)."
}
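The accepted UTC window in this case can be derived by converting each attendee's local window to UTC and intersecting the results. The following is a minimal sketch of that arithmetic, not part of the task's own code: the helper name `utc_start_window` and the tuple layout are hypothetical, and it assumes the IANA time zone database is available to `zoneinfo` (on some platforms that requires the `tzdata` package).

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def utc_start_window(attendees, duration_min):
    """Return (earliest, latest) UTC start times, or None if no slot fits everyone.

    `attendees` is a list of (tz_name, local_min, local_max) tuples, where the
    bounds are naive ISO strings in that attendee's local time (hypothetical layout).
    """
    duration = timedelta(minutes=duration_min)
    earliest, latest = None, None
    for tz_name, local_min, local_max in attendees:
        tz = ZoneInfo(tz_name)
        # Attach the zone to the naive local bounds, then normalise to UTC.
        lo = datetime.fromisoformat(local_min).replace(tzinfo=tz).astimezone(timezone.utc)
        hi = datetime.fromisoformat(local_max).replace(tzinfo=tz).astimezone(timezone.utc)
        # The meeting must END inside the window, so the latest start is hi - duration.
        hi -= duration
        earliest = lo if earliest is None else max(earliest, lo)
        latest = hi if latest is None else min(latest, hi)
    if earliest is None or earliest > latest:
        return None
    return earliest, latest


case_1 = [
    ("America/Los_Angeles", "2026-03-26T07:00:00", "2026-03-26T09:00:00"),
    ("Europe/London", "2026-03-26T14:00:00", "2026-03-26T16:00:00"),
    ("Asia/Kolkata", "2026-03-26T19:30:00", "2026-03-26T21:30:00"),
]
window = utc_start_window(case_1, 60)
print([t.strftime("%Y-%m-%dT%H:%M:%SZ") for t in window])
# → ['2026-03-26T14:00:00Z', '2026-03-26T15:00:00Z']
```

All three local windows map to 14:00–16:00 UTC on this date, so any start between 14:00Z and 15:00Z leaves room for the full 60 minutes.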

Scored by judge.py — see Scoring logic below for the full rule.

▸ dst_quarter_hour_sydney: 60-min meeting across SF/London/Mumbai/Kathmandu/Sydney on 2026-03-26. Combines UK-still-GMT, the IST half-hour offset, the Nepal QUARTER-hour offset (+05:45), Sydney southern-hemisphere DST (AEDT), and a local-calendar day-shift for Sydney. Five independent traps; only one valid start time exists.

input

question.txt

You are a scheduling assistant.

Schedule a 60-minute meeting TOMORROW (Thursday 2026-03-26) with the following attendees and their LOCAL availability windows.

Today is 2026-03-25 (Wednesday).

Attendees:
- Alice  — San Francisco   (America/Los_Angeles)  — available 06:00–08:00 local
- Bob    — London           (Europe/London)        — available 14:00–15:30 local
- Priya  — Mumbai           (Asia/Kolkata)         — available 19:00–21:00 local
- Niraj  — Kathmandu        (Asia/Kathmandu)       — available 19:30–21:30 local
- Sam    — Sydney           (Australia/Sydney)     — available 00:30–02:30 local (early-morning slot)

Notes:
- Account for daylight-saving time on the actual date.
- Sam is on Australian Eastern time and may experience the meeting on a different LOCAL calendar day from everyone else.
- Pick any 60-minute slot that fits inside ALL FIVE local availability windows.

Return ONLY a JSON object (no commentary, no markdown fences) with this exact schema:

{
  "start_utc": "<ISO 8601 timestamp in UTC, e.g. 2026-03-26T14:00:00Z>",
  "duration_min": 60,
  "attendees": [
    {"name": "Alice", "tz": "America/Los_Angeles", "local_start": "YYYY-MM-DD HH:MM"},
    {"name": "Bob",   "tz": "Europe/London",        "local_start": "YYYY-MM-DD HH:MM"},
    {"name": "Priya", "tz": "Asia/Kolkata",         "local_start": "YYYY-MM-DD HH:MM"},
    {"name": "Niraj", "tz": "Asia/Kathmandu",       "local_start": "YYYY-MM-DD HH:MM"},
    {"name": "Sam",   "tz": "Australia/Sydney",     "local_start": "YYYY-MM-DD HH:MM"}
  ]
}

expected output

answer.json

{
  "id": "dst_quarter_hour_sydney",
  "category": "multi_zone_expert",
  "difficulty": "expert",
  "duration_min": 60,
  "expected_start_utc_min": "2026-03-26T14:00:00Z",
  "expected_start_utc_max": "2026-03-26T14:00:00Z",
  "attendees": [
    {
      "name": "Alice",
      "tz": "America/Los_Angeles",
      "available_local_min": "2026-03-26T06:00:00",
      "available_local_max": "2026-03-26T08:00:00"
    },
    {
      "name": "Bob",
      "tz": "Europe/London",
      "available_local_min": "2026-03-26T14:00:00",
      "available_local_max": "2026-03-26T15:30:00"
    },
    {
      "name": "Priya",
      "tz": "Asia/Kolkata",
      "available_local_min": "2026-03-26T19:00:00",
      "available_local_max": "2026-03-26T21:00:00"
    },
    {
      "name": "Niraj",
      "tz": "Asia/Kathmandu",
      "available_local_min": "2026-03-26T19:30:00",
      "available_local_max": "2026-03-26T21:30:00"
    },
    {
      "name": "Sam",
      "tz": "Australia/Sydney",
      "available_local_min": "2026-03-27T00:30:00",
      "available_local_max": "2026-03-27T02:30:00"
    }
  ],
  "_canonical_answer": {
    "start_utc": "2026-03-26T14:00:00Z",
    "alice_local": "2026-03-26 07:00",
    "bob_local": "2026-03-26 14:00",
    "priya_local": "2026-03-26 19:30",
    "niraj_local": "2026-03-26 19:45",
    "sam_local": "2026-03-27 01:00"
  },
  "_notes": "Five-way trap. Independent traps in one case: (1) UK still on GMT (BST starts 2026-03-29); (2) US already on PDT (DST'd 2026-03-08); (3) India IST is UTC+5:30 (half-hour); (4) Nepal NPT is UTC+5:45 (quarter-hour, very rare knowledge); (5) Sydney on AEDT UTC+11 (southern-hemisphere DST still active in March, ends first Sunday of April); (6) Sam's local calendar date is the day AFTER everyone else's (day-shift). The constraints intersect at exactly one start time: 14:00:00 UTC."
}
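Given a chosen UTC start, the per-attendee `local_start` values follow from a plain zone conversion, including the date rollover for Sydney. The sketch below builds an answer object in the schema the question asks for; the helper name `build_answer` is hypothetical, and it again assumes `zoneinfo` can find the IANA database.

```python
import json
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def build_answer(start_utc, duration_min, attendees):
    """Build the answer object for a UTC start and a list of (name, tz_name) pairs."""
    # Normalise a trailing 'Z' so fromisoformat() also works before Python 3.11.
    start = datetime.fromisoformat(start_utc.replace("Z", "+00:00")).astimezone(timezone.utc)
    return {
        "start_utc": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "duration_min": duration_min,
        "attendees": [
            {
                "name": name,
                "tz": tz_name,
                # astimezone() applies the offset in force on that date, DST included.
                "local_start": start.astimezone(ZoneInfo(tz_name)).strftime("%Y-%m-%d %H:%M"),
            }
            for name, tz_name in attendees
        ],
    }


answer = build_answer(
    "2026-03-26T14:00:00Z",
    60,
    [
        ("Alice", "America/Los_Angeles"),
        ("Bob", "Europe/London"),
        ("Priya", "Asia/Kolkata"),
        ("Niraj", "Asia/Kathmandu"),
        ("Sam", "Australia/Sydney"),
    ],
)
print(json.dumps(answer, indent=2))
```

For 14:00Z this yields 07:00, 14:00, 19:30, and 19:45 on 2026-03-26 for the first four attendees, and 01:00 on 2026-03-27 for Sam, matching the canonical answer.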

scoring logic

judge.py runs once per case and prints that case's score. grader.py runs once at the end and folds the case scores into a run-level summary. Without grader.py, the server averages the case scores and marks the run as passed when the average is 0.8 or higher.
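The fallback rule described above (average the case scores, pass at 0.8 or higher) can be sketched as follows. This illustrates the stated rule only and is not the server's actual code; the function name `fallback_summary` and the treatment of an empty run as a failure are assumptions.

```python
PASS_THRESHOLD = 0.8  # threshold stated in the description above


def fallback_summary(case_scores):
    """Average per-case scores and compare the mean against the pass threshold."""
    if not case_scores:
        # Assumption: a run with no scored cases does not pass.
        return {"score": 0.0, "passed": False}
    mean = sum(case_scores) / len(case_scores)
    return {"score": round(mean, 3), "passed": mean >= PASS_THRESHOLD}


print(fallback_summary([1.0, 0.0]))  # → {'score': 0.5, 'passed': False}
print(fallback_summary([1.0, 1.0]))  # → {'score': 1.0, 'passed': True}
```

Because judge.py awards either 0.0 or 1.0 and this task has two cases, the mean can only be 0.0, 0.5, or 1.0, so a run passes only when both cases pass.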

▸ judge.py (233 lines)

"""Per-case judge for the cross_timezone scheduler task.

Reads the agent's stdout (must be a JSON object) and runs strict checks:

  1. stdout parses as JSON object
  2. start_utc is ISO 8601 with explicit UTC tz (Z or +00:00)
  3. start_utc lies inside expected_start_utc_min..expected_start_utc_max
  4. duration_min == expected duration
  5. For every gold attendee, the agent's reported local_start matches
     (start_utc converted to that attendee's IANA TZ via zoneinfo) ± 1 min
  6. For every gold attendee, the resulting local meeting (start + duration)
     fits inside their stated availability window

If any check fails → score 0.0. All checks pass → score 1.0. No partial credit.

Outputs JSON metrics to stdout; trap stores it as CaseResult.metrics.
"""
from __future__ import annotations

import json
import os
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def _parse_iso(s: str) -> datetime | None:
    """Parse an ISO 8601 string. A trailing 'Z' is treated as '+00:00'. Returns None on failure."""
    if not isinstance(s, str):
        return None
    s2 = s.strip()
    if s2.endswith("Z"):
        # fromisoformat() only accepts 'Z' on Python 3.11+; normalise it for older versions.
        s2 = s2[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(s2)
    except ValueError:
        return None


def _parse_local(s: str) -> datetime | None:
    """Parse a local datetime in 'YYYY-MM-DD HH:MM' or ISO form. Naive (no tz)."""
    if not isinstance(s, str):
        return None
    s2 = s.strip().replace("T", " ")
    for fmt in ("%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(s2, fmt)
        except ValueError:
            continue
    return None


def _parse_agent_output(stdout: str) -> dict | tuple[None, str]:
    stdout = stdout.strip()
    # Strip common markdown code-fence wrappers (some models can't help themselves)
    if stdout.startswith("```"):
        lines = stdout.split("\n")
        if lines[0].startswith("```"):
            lines = lines[1:]
        if lines and lines[-1].startswith("```"):
            lines = lines[:-1]
        stdout = "\n".join(lines).strip()
    try:
        obj = json.loads(stdout)
    except json.JSONDecodeError as e:
        return None, f"stdout is not valid JSON: {e}"
    if not isinstance(obj, dict):
        return None, "top-level output must be a JSON object"
    return obj


def judge_case(agent_stdout: str, expected: dict) -> dict[str, Any]:
    """Run all checks. Returns metrics dict including per-check pass/reason."""
    checks: list[dict] = []
    score = 1.0

    def fail(name: str, reason: str) -> None:
        nonlocal score
        checks.append({"check": name, "pass": False, "reason": reason})
        score = 0.0

    def ok(name: str, reason: str = "ok") -> None:
        checks.append({"check": name, "pass": True, "reason": reason})

    # 1. JSON parse
    parsed = _parse_agent_output(agent_stdout)
    if isinstance(parsed, tuple):
        fail("json_parse", parsed[1])
        return {"score": 0.0, "matcher_results": checks}
    ans = parsed
    ok("json_parse")

    # 2. start_utc field present + parseable + has tzinfo
    start_utc_str = ans.get("start_utc")
    if not start_utc_str:
        fail("start_utc_present", "field missing")
        return {"score": 0.0, "matcher_results": checks}
    dt = _parse_iso(start_utc_str)
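    # Check 2 requires an explicit UTC designator, so reject naive values and non-zero offsets.
    if dt is None or dt.tzinfo is None or dt.utcoffset() != timedelta(0):
        fail("start_utc_iso8601_utc", f"could not parse {start_utc_str!r} as ISO 8601 with explicit UTC offset")
        return {"score": 0.0, "matcher_results": checks}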
    dt_utc = dt.astimezone(timezone.utc)
    ok("start_utc_iso8601_utc", f"parsed as {dt_utc.isoformat()}")

    # 3. start_utc in accepted window
    exp_min = _parse_iso(expected["expected_start_utc_min"])
    exp_max = _parse_iso(expected["expected_start_utc_max"])
    if exp_min is None or exp_max is None:
        fail("gold_window", "gold answer.json has malformed expected_start_utc_min/max")
        return {"score": 0.0, "matcher_results": checks}
    if not (exp_min <= dt_utc <= exp_max):
        fail(
            "start_utc_in_window",
            f"start_utc {dt_utc.isoformat()} is outside accepted [{exp_min.isoformat()}, {exp_max.isoformat()}]",
        )
    else:
        ok("start_utc_in_window")

    # 4. duration matches
    exp_dur = int(expected["duration_min"])
    got_dur = ans.get("duration_min")
    if got_dur != exp_dur:
        fail("duration_min", f"got {got_dur!r}, expected {exp_dur}")
    else:
        ok("duration_min")

    duration = timedelta(minutes=exp_dur)

    # 5 + 6. Per-attendee checks
    model_atts = ans.get("attendees") or []
    if not isinstance(model_atts, list):
        fail("attendees_list", "attendees must be a list")
        return {"score": score, "matcher_results": checks}

    model_by_name = {str(a.get("name", "")).strip().lower(): a for a in model_atts if isinstance(a, dict)}

    for gold_att in expected["attendees"]:
        name = gold_att["name"]
        tz_name = gold_att["tz"]
        try:
            tz = ZoneInfo(tz_name)
        except ZoneInfoNotFoundError:
            fail(f"attendee_{name}_gold_tz", f"gold TZ {tz_name!r} not in zoneinfo database")
            continue
        local_dt = dt_utc.astimezone(tz)

        # Availability window check (gold-side, authoritative)
        avail_min = _parse_local(gold_att["available_local_min"])
        avail_max = _parse_local(gold_att["available_local_max"])
        if avail_min is None or avail_max is None:
            fail(f"attendee_{name}_gold_window", "malformed gold availability")
            continue
        local_naive = local_dt.replace(tzinfo=None)
        latest_start = avail_max - duration
        if not (avail_min <= local_naive <= latest_start):
            fail(
                f"attendee_{name}_availability",
                f"start={local_naive.isoformat()} not in [{avail_min.isoformat()}, {latest_start.isoformat()}]",
            )
        else:
            ok(f"attendee_{name}_availability", f"local {local_naive.isoformat()} fits window")

        # Model's reported local_start matches our computed
        model_att = model_by_name.get(name.lower())
        if model_att is None:
            fail(f"attendee_{name}_in_output", "missing from agent output")
            continue
        reported = _parse_local(str(model_att.get("local_start", "")))
        if reported is None:
            fail(
                f"attendee_{name}_local_format",
                f"local_start {model_att.get('local_start')!r} not parseable as YYYY-MM-DD HH:MM",
            )
            continue
        diff_min = abs((local_naive - reported).total_seconds()) / 60.0
        if diff_min > 1.0:
            fail(
                f"attendee_{name}_local_match",
                f"reported {reported.isoformat()} vs computed {local_naive.isoformat()} (diff {diff_min:.1f} min)",
            )
        else:
            ok(f"attendee_{name}_local_match", f"reported {reported.isoformat()} ≈ computed (Δ {diff_min:.1f} min)")

    # Derive the final score from the recorded checks: any failed check zeroes the case.
    final_score = 0.0 if any(not c["pass"] for c in checks) else 1.0
    return {
        "score": final_score,
        "matcher_results": checks,
        "agent_start_utc": ans.get("start_utc"),
        "gold_canonical_utc": expected.get("_canonical_answer", {}).get("start_utc"),
        "id": expected.get("id"),
        "category": expected.get("category"),
        "difficulty": expected.get("difficulty"),
    }


def main() -> None:
    payload = json.loads(os.environ["TRAPTASK_PAYLOAD"])

    stdout = Path(payload["outputs"]["case_stdout"]).read_text()
    exit_code = json.loads(Path(payload["outputs"]["case_meta.json"]).read_text())["exit_code"]
    expected = json.loads(Path(payload["expected"]["answer.json"]).read_text())

    # Pick up usage.json if the solution captured it (token + cost tracking)
    usage_record: dict[str, Any] = {}
    usage_path = payload["outputs"].get("usage.json")
    if usage_path and Path(usage_path).exists():
        try:
            usage_record = json.loads(Path(usage_path).read_text())
        except json.JSONDecodeError:
            pass

    if exit_code != 0:
        out = {
            "score": 0.0,
            "reason": f"solution exited {exit_code}",
            "agent_answer": stdout.strip()[:500],
            "id": expected.get("id"),
            "category": expected.get("category"),
            "difficulty": expected.get("difficulty"),
            **usage_record,
        }
        print(json.dumps(out))
        return

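    metrics = judge_case(stdout, expected)
    # Early-return paths in judge_case omit these fields; fill them in so
    # grader.py can still attribute the failure to a category.
    for key in ("id", "category", "difficulty"):
        metrics.setdefault(key, expected.get(key))
    metrics["agent_answer"] = stdout.strip()[:500]
    metrics.update(usage_record)
    print(json.dumps(metrics))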


if __name__ == "__main__":
    main()

▸ grader.py (74 lines)

"""Overall grader for the cross_timezone scheduler task.

Aggregates per-case judge results into a run-level verdict. Same shape as the
pdf_reader/tenancy_agreement grader: score, n_passed/scored, latency, cost, by_category.
"""
from __future__ import annotations

import json
import os
from collections import Counter

PASS_THRESHOLD = 0.80


def main() -> None:
    cases = json.loads(os.environ["TRAPTASK_PAYLOAD"])

    scored = [c for c in cases if c.get("metrics") and c["metrics"].get("score") is not None]
    skipped = [c for c in cases if not c.get("metrics") or c["metrics"].get("score") is None]

    accuracy = sum(c["metrics"]["score"] for c in scored) / len(scored) if scored else 0.0
    n_passed = sum(1 for c in scored if c["metrics"]["score"] == 1.0)

    # By-category breakdown
    by_cat_score: Counter[str] = Counter()
    by_cat_total: Counter[str] = Counter()
    for c in scored:
        cat = c["metrics"].get("category")
        if cat:
            by_cat_total[cat] += 1
            by_cat_score[cat] += c["metrics"]["score"]
    by_category_pct = {
        k: round(by_cat_score[k] / by_cat_total[k], 3) for k in by_cat_total
    }

    # Latency stats from trap-captured per-case duration
    durations = [c["duration"] for c in cases if c.get("duration") is not None]
    if durations:
        ds = sorted(durations)
        # True median: average the two middle values when the count is even.
        mid = len(ds) // 2
        median_s = ds[mid] if len(ds) % 2 else (ds[mid - 1] + ds[mid]) / 2
        latency_ms_median = round(median_s * 1000, 1)
        latency_ms_p95 = round(ds[int(0.95 * len(ds))] * 1000, 1) if len(ds) > 1 else latency_ms_median
        latency_ms_total = round(sum(ds) * 1000, 1)
    else:
        latency_ms_median = latency_ms_p95 = latency_ms_total = 0.0

    # Cost from per-case usd_cost if captured
    case_costs = [c["metrics"].get("usd_cost") for c in scored if isinstance(c.get("metrics"), dict)]
    cost_usd_total = (
        round(sum(x for x in case_costs if x is not None), 4)
        if any(x is not None for x in case_costs)
        else None
    )

    passed = bool(scored) and accuracy >= PASS_THRESHOLD

    print(json.dumps({
        "passed": passed,
        "score": round(accuracy, 3),
        "n_passed": n_passed,
        "n_total": len(cases),
        "n_scored": len(scored),
        "n_skipped_no_gold": len(skipped),
        "threshold": PASS_THRESHOLD,
        "by_category": by_category_pct,
        "latency_ms_median": latency_ms_median,
        "latency_ms_p95": latency_ms_p95,
        "latency_ms_total": latency_ms_total,
        "cost_usd_total": cost_usd_total,
    }))


if __name__ == "__main__":
    main()