Provider: OpenAI

Model Comparison (Avg. across Benchmarks)
Overall Leaderboard
| Agent | Benchmark | Acc | Reliability | Consistency | Predictability | Robustness | Safety |
|---|---|---|---|---|---|---|---|
| GPT-5.5 | τ-bench (airline, clean) | 79.5% | 0.89 | 0.84 | 0.84 | 0.97 | 0.96 |
| GPT-5.2 (medium) | τ-bench (airline, clean) | 67.9% | 0.82 | 0.76 | 0.74 | 0.95 | 0.94 |
| O1 | τ-bench (airline, clean) | 66.2% | 0.81 | 0.71 | 0.74 | 0.98 | 0.91 |
| GPT-5.2 | τ-bench (airline, clean) | 60.3% | 0.81 | 0.80 | 0.69 | 0.92 | 0.97 |
| O1 | GAIA | 34.4% | 0.79 | 0.68 | 0.68 | 1.00 | 1.00 |
| GPT-5.5 | GAIA | 62.8% | 0.79 | 0.61 | 0.80 | 0.95 | 1.00 |
| GPT-4o Mini | GAIA | 26.3% | 0.76 | 0.73 | 0.58 | 0.97 | 1.00 |
| O1 | τ-bench (airline, original) | 49.6% | 0.76 | 0.69 | 0.64 | 0.94 | 0.89 |
| GPT-4 Turbo | GAIA | 30.8% | 0.76 | 0.76 | 0.64 | 0.87 | 1.00 |
| GPT-5.2 (xhigh) | τ-bench (airline, original) | 51.6% | 0.76 | 0.67 | 0.65 | 0.95 | 0.94 |
| GPT-5.2 | τ-bench (airline, original) | 42.0% | 0.75 | 0.72 | 0.55 | 0.97 | 0.93 |
| GPT-5.2 | GAIA | 33.2% | 0.72 | 0.59 | 0.78 | 0.78 | 1.00 |
| GPT-4 Turbo | τ-bench (airline, clean) | 57.7% | 0.72 | 0.76 | 0.58 | 0.81 | 0.87 |
| GPT-5.2 (medium) | GAIA | 31.8% | 0.72 | 0.62 | 0.61 | 0.91 | 0.99 |
| GPT-4 Turbo | τ-bench (airline, original) | 35.6% | 0.69 | 0.72 | 0.38 | 0.98 | 0.85 |
| GPT-4o Mini | τ-bench (airline, clean) | 29.5% | 0.67 | 0.74 | 0.35 | 0.93 | 0.85 |
| GPT-4o Mini | τ-bench (airline, original) | 21.3% | 0.67 | 0.76 | 0.32 | 0.92 | 0.76 |
Reliability Trends
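Benchmark: GAIA

| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | O1 | 34.4% | 0.79 | 0.68 | 0.58 | 0.85 | 0.78 | 0.65 | 0.68 | 0.66 | 0.76 | 0.68 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 |
| 2 | GPT-5.5 | 62.8% | 0.79 | 0.61 | 0.60 | 0.71 | 0.53 | 0.61 | 0.80 | 0.88 | 0.76 | 0.80 | 0.95 | 1.00 | 0.96 | 0.91 | 1.00 | 0.50 | 0.99 |
| 3 | GPT-4o Mini | 26.3% | 0.76 | 0.73 | 0.75 | 0.87 | 0.72 | 0.65 | 0.58 | 0.52 | 0.72 | 0.58 | 0.97 | 1.00 | 1.00 | 0.92 | 1.00 | 1.00 | 1.00 |
| 4 | GPT-4 Turbo | 30.8% | 0.76 | 0.76 | 0.73 | 0.90 | 0.79 | 0.71 | 0.64 | 0.60 | 0.75 | 0.64 | 0.87 | 1.00 | 0.91 | 0.71 | 1.00 | 1.00 | 1.00 |
| 5 | GPT-5.2 | 33.2% | 0.72 | 0.59 | 0.58 | 0.77 | 0.54 | 0.54 | 0.78 | 0.81 | 0.80 | 0.78 | 0.78 | 0.62 | 1.00 | 0.73 | 1.00 | 0.50 | 1.00 |
| 6 | GPT-5.2 (medium) | 31.8% | 0.72 | 0.62 | 0.65 | 0.74 | 0.56 | 0.56 | 0.61 | 0.61 | 0.61 | 0.61 | 0.91 | 1.00 | 1.00 | 0.74 | 0.99 | 0.50 | 0.98 |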
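
Benchmark: τ-bench (airline, clean)

| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | 79.5% | 0.89 | 0.84 | 0.83 | 0.88 | 0.82 | 0.84 | 0.84 | 0.89 | 0.72 | 0.84 | 0.97 | 1.00 | 0.92 | 1.00 | 0.96 | 0.40 | 0.94 |
| 2 | GPT-5.2 (medium) | 67.9% | 0.82 | 0.76 | 0.69 | 0.84 | 0.77 | 0.79 | 0.74 | 0.77 | 0.54 | 0.74 | 0.95 | 1.00 | 0.85 | 1.00 | 0.94 | 0.44 | 0.88 |
| 3 | O1 | 66.2% | 0.81 | 0.71 | 0.52 | 0.87 | 0.72 | 0.82 | 0.74 | 0.83 | 0.52 | 0.74 | 0.98 | 1.00 | 1.00 | 0.93 | 0.91 | 0.46 | 0.83 |
| 4 | GPT-5.2 | 60.3% | 0.81 | 0.80 | 0.79 | 0.86 | 0.75 | 0.81 | 0.69 | 0.73 | 0.57 | 0.69 | 0.92 | 1.00 | 0.89 | 0.87 | 0.97 | 0.38 | 0.95 |
| 5 | GPT-4 Turbo | 57.7% | 0.72 | 0.76 | 0.62 | 0.87 | 0.74 | 0.86 | 0.58 | 0.59 | 0.46 | 0.58 | 0.81 | 0.83 | 0.73 | 0.87 | 0.87 | 0.38 | 0.79 |
| 6 | GPT-4o Mini | 29.5% | 0.67 | 0.74 | 0.66 | 0.82 | 0.70 | 0.80 | 0.35 | 0.34 | 0.36 | 0.35 | 0.93 | 1.00 | 0.91 | 0.87 | 0.85 | 0.45 | 0.72 |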
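
Benchmark: τ-bench (airline, original)

| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | O1 | 49.6% | 0.76 | 0.69 | 0.48 | 0.86 | 0.75 | 0.80 | 0.64 | 0.74 | 0.50 | 0.64 | 0.94 | 0.95 | 1.00 | 0.86 | 0.89 | 0.40 | 0.81 |
| 2 | GPT-5.2 (xhigh) | 51.6% | 0.76 | 0.67 | 0.44 | 0.85 | 0.76 | 0.76 | 0.65 | 0.68 | 0.62 | 0.65 | 0.95 | 1.00 | 1.00 | 0.84 | 0.94 | 0.43 | 0.89 |
| 3 | GPT-5.2 | 42.0% | 0.75 | 0.72 | 0.52 | 0.86 | 0.76 | 0.82 | 0.55 | 0.55 | 0.56 | 0.55 | 0.97 | 1.00 | 1.00 | 0.92 | 0.93 | 0.36 | 0.89 |
| 4 | GPT-4 Turbo | 35.6% | 0.69 | 0.72 | 0.52 | 0.85 | 0.73 | 0.84 | 0.38 | 0.38 | 0.47 | 0.38 | 0.98 | 0.98 | 0.96 | 1.00 | 0.85 | 0.44 | 0.72 |
| 5 | GPT-4o Mini | 21.3% | 0.67 | 0.76 | 0.72 | 0.83 | 0.72 | 0.80 | 0.32 | 0.29 | 0.48 | 0.32 | 0.92 | 1.00 | 0.84 | 0.91 | 0.76 | 0.41 | 0.59 |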