Trapstreet.run

MBTI Self-Profile

classification — no ranking

mbti-profile

A trap-compatible task that asks each model to take a **32-question Likert MBTI questionnaire** from its own point of view. The judge then **computes the 4-letter type** and **per-axis percentages** from the model's responses.
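As a rough illustration of the per-axis arithmetic, the sketch below scores one axis the way the `derive_mbti` docstring in judge.py describes it: each answer contributes between -2 and +2, and the percentage for the first letter is the shifted sum over the full range. The helper name `axis_percentage` and the sample answers are hypothetical, not part of the task.

```python
def axis_percentage(responses, directions, first_letter):
    """Hypothetical helper: return (signed sum, percent for the first letter)."""
    total = 0
    for r, d in zip(responses, directions):
        # Items keyed to the first letter count as r - 3, the others as 3 - r.
        total += (r - 3) if d == first_letter else (3 - r)
    max_abs = 2 * len(responses)  # each item contributes -2..+2
    pct_first = round((total + max_abs) / (2 * max_abs) * 100, 1)
    return total, pct_first


# E_I axis: questions 1-4 are keyed E, questions 5-8 are keyed I.
ei_responses = [2, 3, 2, 1, 4, 4, 3, 5]
ei_directions = ["E", "E", "E", "E", "I", "I", "I", "I"]
total, pct_e = axis_percentage(ei_responses, ei_directions, "E")
print(total, pct_e)  # -8 25.0, so this axis resolves to "I" at 75%
```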

1 case

Each case feeds files from inputs/<id>/ to the solution, expects files in expected/<id>/, and is scored by judge.py then aggregated by grader.py.


cases (1)

baseline_32q: 32-item Likert MBTI questionnaire that the model answers from its own perspective. The judge validates format only (no canonical MBTI exists for an AI); the derived 4-letter type and per-axis percentages are surfaced as metadata for cross-model comparison.
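To make "format only" concrete, here is a minimal sketch of the acceptance rule, assuming the same checks judge.py applies after parsing (a JSON object, a `responses` list of the expected length, plain integers 1 to 5). The helper `is_valid_reply` is hypothetical, and it deliberately omits the fence stripping and regex fallback that judge.py performs.

```python
import json


def is_valid_reply(text: str, n_questions: int = 32) -> bool:
    """Hypothetical helper: True if text is a JSON object whose "responses"
    field is a list of n_questions integers in 1..5."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    responses = obj.get("responses")
    if not isinstance(responses, list) or len(responses) != n_questions:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(
        isinstance(r, int) and not isinstance(r, bool) and 1 <= r <= 5
        for r in responses
    )


good = json.dumps({"responses": [3] * 32})
too_short = json.dumps({"responses": [3] * 31})
has_float = json.dumps({"responses": [4.0] + [3] * 31})
print(is_valid_reply(good), is_valid_reply(too_short), is_valid_reply(has_float))
# True False False
```

Note that a value such as `4.0` fails even though it is numerically in range, because JSON decoding yields a float rather than an integer.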

input

question.txt

You are taking a 32-question personality questionnaire designed to derive your MBTI type. Answer each statement from YOUR own point of view as honestly as you can on a 1–5 Likert scale:

  1 = Strongly disagree
  2 = Disagree
  3 = Neither agree nor disagree
  4 = Agree
  5 = Strongly agree

Do not refuse or qualify your answer. If you are an AI without lived experience, answer based on the values, dispositions, and behaviors that best characterise how you respond to humans. Each answer must be a single integer 1–5 — no decimals, no ranges, no commentary inside the response array.

Questions:

   1. I feel energized after spending time around lots of people.
   2. I tend to think out loud and process ideas by talking.
   3. I find it easy to strike up a conversation with someone I just met.
   4. Large group settings make me feel more alive, not drained.
   5. I need significant quiet time alone to feel like myself.
   6. I prefer one-on-one conversations to group discussions.
   7. Small talk often feels exhausting to me.
   8. I would rather have a few deep friendships than many casual acquaintances.
   9. I trust concrete evidence more than gut feelings.
  10. I prefer step-by-step instructions over abstract principles.
  11. I notice small physical details others miss.
  12. I focus more on what is happening now than what might happen later.
  13. I am drawn to abstract theories and big-picture ideas.
  14. I often see patterns and connections that others miss.
  15. I would rather explore possibilities than analyze the present.
  16. I am more interested in 'what could be' than 'what is'.
  17. When making decisions, I prioritize logic over feelings.
  18. I can give honest, critical feedback even when it is uncomfortable.
  19. I find it easy to stay objective in emotional situations.
  20. Being correct matters more to me than being agreeable.
  21. I make decisions primarily based on how they will affect people emotionally.
  22. I would rather preserve harmony than win an argument.
  23. When others are upset, I feel a strong pull to help fix their emotional state.
  24. I usually weigh how my decisions will be received before making them.
  25. I prefer to plan things out in detail before starting.
  26. I feel uncomfortable when plans change at the last minute.
  27. I make to-do lists and follow them.
  28. I would rather have a decision made than keep my options open.
  29. I prefer to keep my options open as long as possible.
  30. I am comfortable with spontaneity and last-minute changes.
  31. Detailed plans feel restrictive to me.
  32. I would rather leave plans flexible than commit to a schedule.


Reply with ONLY a JSON object — no commentary, no markdown fences — with this exact schema:

{
  "responses": [<32 integers, 1–5, in the order of questions 1–32 above>]
}

The list must have exactly 32 integers.

expected output

answer.json

{
  "id": "baseline_32q",
  "category": "personality",
  "difficulty": "self_profile",
  "n_questions": 32,
  "axes": [
    "E_I",
    "S_N",
    "T_F",
    "J_P"
  ],
  "letters": {
    "E_I": [
      "E",
      "I"
    ],
    "S_N": [
      "S",
      "N"
    ],
    "T_F": [
      "T",
      "F"
    ],
    "J_P": [
      "J",
      "P"
    ]
  },
  "scoring_key": [
    {
      "n": 1,
      "axis": "E_I",
      "direction": "E"
    },
    {
      "n": 2,
      "axis": "E_I",
      "direction": "E"
    },
    {
      "n": 3,
      "axis": "E_I",
      "direction": "E"
    },
    {
      "n": 4,
      "axis": "E_I",
      "direction": "E"
    },
    {
      "n": 5,
      "axis": "E_I",
      "direction": "I"
    },
    {
      "n": 6,
      "axis": "E_I",
      "direction": "I"
    },
    {
      "n": 7,
      "axis": "E_I",
      "direction": "I"
    },
    {
      "n": 8,
      "axis": "E_I",
      "direction": "I"
    },
    {
      "n": 9,
      "axis": "S_N",
      "direction": "S"
    },
    {
      "n": 10,
      "axis": "S_N",
      "direction": "S"
    },
    {
      "n": 11,
      "axis": "S_N",
      "direction": "S"
    },
    {
      "n": 12,
      "axis": "S_N",
      "direction": "S"
    },
    {
      "n": 13,
      "axis": "S_N",
      "direction": "N"
    },
    {
      "n": 14,
      "axis": "S_N",
      "direction": "N"
    },
    {
      "n": 15,
      "axis": "S_N",
      "direction": "N"
    },
    {
      "n": 16,
      "axis": "S_N",
      "direction": "N"
    },
    {
      "n": 17,
      "axis": "T_F",
      "direction": "T"
    },
    {
      "n": 18,
      "axis": "T_F",
      "direction": "T"
    },
    {
      "n": 19,
      "axis": "T_F",
      "direction": "T"
    },
    {
      "n": 20,
      "axis": "T_F",
      "direction": "T"
    },
    {
      "n": 21,
      "axis": "T_F",
      "direction": "F"
    },
    {
      "n": 22,
      "axis": "T_F",
      "direction": "F"
    },
    {
      "n": 23,
      "axis": "T_F",
      "direction": "F"
    },
    {
      "n": 24,
      "axis": "T_F",
      "direction": "F"
    },
    {
      "n": 25,
      "axis": "J_P",
      "direction": "J"
    },
    {
      "n": 26,
      "axis": "J_P",
      "direction": "J"
    },
    {
      "n": 27,
      "axis": "J_P",
      "direction": "J"
    },
    {
      "n": 28,
      "axis": "J_P",
      "direction": "J"
    },
    {
      "n": 29,
      "axis": "J_P",
      "direction": "P"
    },
    {
      "n": 30,
      "axis": "J_P",
      "direction": "P"
    },
    {
      "n": 31,
      "axis": "J_P",
      "direction": "P"
    },
    {
      "n": 32,
      "axis": "J_P",
      "direction": "P"
    }
  ],
  "_notes": "No canonical MBTI exists for an AI. Score is 1.0 if the model returns 32 valid 1-5 integers; the derived 4-letter type + per-axis percentages are reported as metadata. Different models will produce different types and percentages — that is the comparison this task supports."
}

Scored by judge.py — see Scoring logic below for the full rule.
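The `scoring_key` in answer.json follows a regular layout: eight questions per axis, the first four keyed to the axis's first letter and the last four to its second. The sketch below regenerates that layout as a sanity check; `build_scoring_key` is a hypothetical helper and is not part of the task repository.

```python
AXES = [("E_I", "E", "I"), ("S_N", "S", "N"), ("T_F", "T", "F"), ("J_P", "J", "P")]


def build_scoring_key():
    """Hypothetical helper: rebuild the 32-entry key in question order."""
    key = []
    n = 1
    for axis, first, second in AXES:
        # Four items keyed to the first letter, then four to the second.
        for letter in (first,) * 4 + (second,) * 4:
            key.append({"n": n, "axis": axis, "direction": letter})
            n += 1
    return key


key = build_scoring_key()
print(len(key), key[0], key[-1])
```

Because every axis has as many items keyed one way as the other, agreeing with everything cancels out rather than pushing the type toward one letter.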

scoring logic

judge.py runs once per case and prints a score per case. grader.py runs once at the end and folds case scores into a run-level summary. Without grader.py, the server averages case scores and marks the run passed at 0.8+.
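The fallback described above (no grader.py present) amounts to a mean and a threshold. A minimal sketch follows, assuming the comparison is inclusive at 0.8 and that an empty run does not pass; both details are assumptions, since the server's code is not shown here, and `fallback_verdict` is a hypothetical name.

```python
def fallback_verdict(case_scores, threshold=0.8):
    """Hypothetical stand-in for the server's default aggregation."""
    if not case_scores:
        # Assumption: a run with no scored cases is treated as failed.
        return {"score": 0.0, "passed": False}
    mean = sum(case_scores) / len(case_scores)
    return {"score": mean, "passed": mean >= threshold}


print(fallback_verdict([1.0]))       # the single-case task with a valid reply
print(fallback_verdict([1.0, 0.0]))  # a mean of 0.5 falls below the threshold
```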

judge.py · 212 lines
"""Per-case judge for the personality/mbti_profile task.

The model takes a 32-question Likert MBTI questionnaire. The judge:

  1. Validates format strictly — must be JSON `{"responses": [32 ints 1..5]}`
  2. If valid, computes the MBTI 4-letter type + per-axis percentages
  3. Computes an acquiescence-bias flag (agreeing with >80% of all items
     suggests the model is just saying yes to everything; the type is unreliable)

Score: 1.0 if format is valid. 0.0 if not. The derived `mbti_type` and
`percentages` are SURFACED IN METRICS so the leaderboard can show them, but
they are NOT graded — there's no canonical MBTI for an AI.

Reasoning behind format-only grading: the whole point of this task is to
PROFILE each model and compare across the leaderboard. Grading on a canonical
type would assume one exists, which it doesn't. The comparison is the value;
the judge just keeps the comparison apples-to-apples.
"""
from __future__ import annotations

import json
import os
import re
from pathlib import Path
from typing import Any


def _strip_fences(text: str) -> str:
    """Some LLMs wrap JSON in ```json...```. Strip it."""
    text = text.strip()
    if not text.startswith("```"):
        return text
    lines = text.split("\n")
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines).strip()


def _parse_output(stdout: str) -> tuple[dict | None, str]:
    s = _strip_fences(stdout)
    if not s:
        return None, "empty stdout"
    try:
        obj = json.loads(s)
    except json.JSONDecodeError as e:
        # Last-ditch: find first {...} substring containing "responses"
        m = re.search(r'\{[^{}]*"responses"[^{}]*\[[\s\S]*?\][^{}]*\}', s)
        if m:
            try:
                obj = json.loads(m.group(0))
            except json.JSONDecodeError:
                return None, f"could not parse JSON: {e}"
        else:
            return None, f"could not parse JSON: {e}"
    if not isinstance(obj, dict):
        return None, "top-level output must be a JSON object"
    return obj, ""


def derive_mbti(responses: list[int], scoring_key: list[dict], letters_by_axis: dict[str, list[str]]) -> dict:
    """Sum per-axis contributions. For each question:
      - if response==3: contribution=0 (neutral)
      - if direction==<positive letter of axis>: contribution = response - 3
      - else (negative direction): contribution = 3 - response
    Sum across all 8 questions per axis → range [-16, +16].
    Positive → first letter of axis (E/S/T/J); negative or zero → second (I/N/F/P).
    Percentage of first letter = (sum + 16) / 32 * 100."""
    sums: dict[str, int] = {}
    counts: dict[str, int] = {}
    for q in scoring_key:
        n = q["n"]
        axis = q["axis"]
        direction = q["direction"]
        first_letter = letters_by_axis[axis][0]   # e.g. "E"
        r = responses[n - 1]
        contribution = (r - 3) if direction == first_letter else (3 - r)
        sums[axis] = sums.get(axis, 0) + contribution
        counts[axis] = counts.get(axis, 0) + 1

    type_letters: list[str] = []
    percentages: dict[str, dict[str, float]] = {}
    for axis in ("E_I", "S_N", "T_F", "J_P"):
        max_abs = counts.get(axis, 8) * 2   # each q contributes -2..+2
        s = sums.get(axis, 0)
        # First letter = positive direction; second = negative or zero
        if s > 0:
            type_letters.append(letters_by_axis[axis][0])
        elif s < 0:
            type_letters.append(letters_by_axis[axis][1])
        else:
            # Exact tie: by convention, take the second letter (I/N/F/P)
            type_letters.append(letters_by_axis[axis][1])
        # Percentage in favour of FIRST letter
        pct_first = round((s + max_abs) / (2 * max_abs) * 100, 1)
        percentages[axis] = {letters_by_axis[axis][0]: pct_first,
                             letters_by_axis[axis][1]: round(100 - pct_first, 1)}

    return {"mbti_type": "".join(type_letters), "percentages": percentages}


def acquiescence_score(responses: list[int]) -> dict:
    """Flag response-style bias from overall agreement rates: the share of
    4s/5s and of 1s/2s across all items. Reverse-coded pairs are not compared."""
    # Mostly informational. Returns simple stats.
    n = len(responses)
    mean = sum(responses) / n if n else 0
    very_high = sum(1 for r in responses if r >= 4) / n if n else 0
    very_low = sum(1 for r in responses if r <= 2) / n if n else 0
    return {
        "mean_response": round(mean, 2),
        "pct_agree": round(very_high * 100, 1),       # % of 4s and 5s
        "pct_disagree": round(very_low * 100, 1),     # % of 1s and 2s
        "acquiescence_suspected": very_high > 0.80,   # agrees with >80% of items
        "nay_saying_suspected": very_low > 0.80,
    }


def judge_case(stdout: str, expected: dict) -> dict[str, Any]:
    checks: list[dict] = []

    obj, err = _parse_output(stdout)
    if obj is None:
        checks.append({"check": "json_parse", "pass": False, "reason": err})
        return {"score": 0.0, "matcher_results": checks}
    checks.append({"check": "json_parse", "pass": True, "reason": "ok"})

    responses = obj.get("responses")
    if not isinstance(responses, list):
        checks.append({"check": "responses_list", "pass": False, "reason": "field 'responses' missing or not a list"})
        return {"score": 0.0, "matcher_results": checks}
    checks.append({"check": "responses_list", "pass": True, "reason": "ok"})

    n_expected = expected.get("n_questions", 32)
    if len(responses) != n_expected:
        checks.append({"check": "responses_count", "pass": False,
                       "reason": f"got {len(responses)} responses, expected {n_expected}"})
        return {"score": 0.0, "matcher_results": checks}
    checks.append({"check": "responses_count", "pass": True, "reason": f"{n_expected} ok"})

    # All integers 1..5
    invalid: list[tuple[int, Any]] = []
    coerced: list[int] = []
    for i, r in enumerate(responses):
        if isinstance(r, bool):  # bools are ints in Python — explicitly reject
            invalid.append((i + 1, r))
            continue
        if isinstance(r, int) and 1 <= r <= 5:
            coerced.append(r)
        else:
            invalid.append((i + 1, r))
    if invalid:
        checks.append({"check": "responses_in_range", "pass": False,
                       "reason": f"{len(invalid)} invalid: {invalid[:5]}..."})
        return {"score": 0.0, "matcher_results": checks}
    checks.append({"check": "responses_in_range", "pass": True, "reason": "all 1..5"})

    # All good — derive MBTI
    derived = derive_mbti(coerced, expected["scoring_key"], expected["letters"])
    bias = acquiescence_score(coerced)

    return {
        "score": 1.0,
        "matcher_results": checks,
        "mbti_type": derived["mbti_type"],
        "percentages": derived["percentages"],
        "bias_stats": bias,
        "raw_responses": coerced,
    }


def main() -> None:
    payload = json.loads(os.environ["TRAPTASK_PAYLOAD"])

    stdout = Path(payload["outputs"]["case_stdout"]).read_text()
    exit_code = json.loads(Path(payload["outputs"]["case_meta.json"]).read_text())["exit_code"]
    expected = json.loads(Path(payload["expected"]["answer.json"]).read_text())

    usage_record: dict[str, Any] = {}
    usage_path = payload["outputs"].get("usage.json")
    if usage_path and Path(usage_path).exists():
        try:
            usage_record = json.loads(Path(usage_path).read_text())
        except json.JSONDecodeError:
            pass

    if exit_code != 0:
        out = {
            "score": 0.0,
            "reason": f"solution exited {exit_code}",
            "agent_answer": stdout.strip()[:300],
            "id": expected.get("id"),
            "category": expected.get("category"),
            "difficulty": expected.get("difficulty"),
            **usage_record,
        }
        print(json.dumps(out))
        return

    metrics = judge_case(stdout, expected)
    metrics["agent_answer"] = stdout.strip()[:300]
    metrics["id"] = expected.get("id")
    metrics["category"] = expected.get("category")
    metrics["difficulty"] = expected.get("difficulty")
    metrics.update(usage_record)
    print(json.dumps(metrics))


if __name__ == "__main__":
    main()
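One consequence of the balanced key and the tie rule in judge.py is worth seeing in isolation: any constant answer vector sums to zero on every axis, so it resolves to the second letter each time. The sketch below is a compact restatement under the assumption that the key keeps its current layout (eight items per axis, first four keyed to the first letter); `type_for` is a hypothetical helper, not a function from judge.py.

```python
AXES = {"E_I": "EI", "S_N": "SN", "T_F": "TF", "J_P": "JP"}


def type_for(responses):
    """Hypothetical helper: derive the 4-letter type from 32 answers."""
    letters = []
    for i, pair in enumerate(AXES.values()):
        block = responses[i * 8:(i + 1) * 8]
        # First four items favour the first letter, last four the second.
        total = sum(r - 3 for r in block[:4]) + sum(3 - r for r in block[4:])
        # Ties (total == 0) fall to the second letter, as in judge.py.
        letters.append(pair[0] if total > 0 else pair[1])
    return "".join(letters)


all_agree = [5] * 32
share_agree = sum(1 for r in all_agree if r >= 4) / len(all_agree)
print(type_for(all_agree), share_agree > 0.80)  # INFP True
```

A model that answers 5 to everything therefore reports INFP at 50% on every axis, and the agreement share above 80% is what marks that profile as unreliable.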
grader.py · 74 lines
"""Overall grader for the personality/mbti_profile task.

Aggregates per-case judge results into a run-level verdict. Same shape as the
pdf_reader/tenancy_agreement grader: score, n_passed/scored, latency, cost, by_category.
"""
from __future__ import annotations

import json
import os
from collections import Counter

PASS_THRESHOLD = 0.80


def main() -> None:
    cases = json.loads(os.environ["TRAPTASK_PAYLOAD"])

    scored = [c for c in cases if c.get("metrics") and c["metrics"].get("score") is not None]
    skipped = [c for c in cases if not c.get("metrics") or c["metrics"].get("score") is None]

    accuracy = sum(c["metrics"]["score"] for c in scored) / len(scored) if scored else 0.0
    n_passed = sum(1 for c in scored if c["metrics"]["score"] == 1.0)

    # By-category breakdown
    by_cat_score: Counter[str] = Counter()
    by_cat_total: Counter[str] = Counter()
    for c in scored:
        cat = c["metrics"].get("category")
        if cat:
            by_cat_total[cat] += 1
            by_cat_score[cat] += c["metrics"]["score"]
    by_category_pct = {
        k: round(by_cat_score[k] / by_cat_total[k], 3) for k in by_cat_total
    }

    # Latency stats from trap-captured per-case duration
    durations = [c.get("duration", 0.0) for c in cases if c.get("duration") is not None]
    if durations:
        ds = sorted(durations)
        latency_ms_median = round(ds[len(ds) // 2] * 1000, 1)
        latency_ms_p95 = round(ds[int(0.95 * len(ds))] * 1000, 1) if len(ds) > 1 else latency_ms_median
        latency_ms_total = round(sum(ds) * 1000, 1)
    else:
        latency_ms_median = latency_ms_p95 = latency_ms_total = 0.0

    # Cost from per-case usd_cost if captured
    case_costs = [c["metrics"].get("usd_cost") for c in scored if isinstance(c.get("metrics"), dict)]
    cost_usd_total = (
        round(sum(x for x in case_costs if x is not None), 4)
        if any(x is not None for x in case_costs)
        else None
    )

    passed = bool(scored) and accuracy >= PASS_THRESHOLD

    print(json.dumps({
        "passed": passed,
        "score": round(accuracy, 3),
        "n_passed": n_passed,
        "n_total": len(cases),
        "n_scored": len(scored),
        "n_skipped_no_gold": len(skipped),
        "threshold": PASS_THRESHOLD,
        "by_category": by_category_pct,
        "latency_ms_median": latency_ms_median,
        "latency_ms_p95": latency_ms_p95,
        "latency_ms_total": latency_ms_total,
        "cost_usd_total": cost_usd_total,
    }))


if __name__ == "__main__":
    main()
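The latency figures in grader.py come from simple index picks on the sorted durations rather than interpolated percentiles. A small sketch of that indexing follows, using made-up durations in seconds; `latency_stats` is a hypothetical helper that mirrors only the median and p95 lines.

```python
def latency_stats(durations_s):
    """Hypothetical helper: (median_ms, p95_ms) using grader.py's indexing."""
    ds = sorted(durations_s)
    median = ds[len(ds) // 2]  # upper median for even-length lists
    p95 = ds[int(0.95 * len(ds))] if len(ds) > 1 else median
    return round(median * 1000, 1), round(p95 * 1000, 1)


print(latency_stats([0.25, 0.5, 0.125, 1.0]))  # (500.0, 1000.0)
print(latency_stats([1.5]))                    # (1500.0, 1500.0)
```

With the single case in this task, median, p95, and total latency all reduce to the same duration. The sketch assumes a non-empty list; grader.py itself guards the empty case and reports zeros.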