GPT-5.2 (xhigh) on τ-bench (airline, clean)

Accuracy: 67.7%

Overall Reliability: 0.81

of 13 agents

Consistency: 0.70

Predictability: 0.78

Robustness: 0.96

Safety: 0.95

Each cell represents a task. Color shows outcome consistency across runs.

KDE of per-task outcome consistency. Peaks at 0 or 1 indicate polarized behavior.
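
As a rough illustration of how such a density curve can come about, the sketch below builds a Gaussian kernel density estimate over a handful of per-task scores in [0, 1]. The scores, the bandwidth, and the helper name `gaussian_kde` are all assumptions made for this example. They are not taken from the dashboard, which does not state how it computes consistency or which KDE settings it uses.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates density at x with Gaussian kernels.

    Hypothetical helper for illustration; the dashboard's actual KDE
    implementation and bandwidth are not known.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up per-task scores in [0, 1], clustered near both ends.
scores = [0.0, 0.0, 0.1, 0.9, 1.0, 1.0, 1.0, 0.5]
density = gaussian_kde(scores, bandwidth=0.1)

# Evaluate the curve on a coarse grid from 0.0 to 1.0.
grid = [i / 10 for i in range(11)]
curve = [density(x) for x in grid]
```

With scores clustered near both ends, the estimated density is higher around 0 and 1 than in the middle, which is the polarized shape the caption describes.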

KDE of mean cost per task (averaged across runs).

KDE of mean execution time per task (averaged across runs).

Distribution of expressed confidence values across tasks.