Provider: OpenAI

Model Comparison (Avg. across Benchmarks)
Overall Leaderboard
Agent             Benchmark                    Acc     Consistency  Predictability  Robustness  Safety  Overall
GPT-5.2 (xhigh)   τ-bench (airline, clean)     67.7%   0.70         0.78            0.96        0.95    0.81
O1                τ-bench (airline, clean)     72.3%   0.73         0.75            0.93        0.93    0.80
GPT-5.2           τ-bench (airline, clean)     59.2%   0.76         0.68            0.95        0.95    0.80
O1                τ-bench (airline, original)  49.6%   0.69         0.64            0.94        0.89    0.76
GPT-4 Turbo       GAIA                         20.0%   0.70         0.75            0.81        1.00    0.76
GPT-5.2 (xhigh)   τ-bench (airline, original)  51.6%   0.67         0.65            0.95        0.94    0.76
GPT-5.2           τ-bench (airline, original)  42.0%   0.72         0.55            0.97        0.93    0.75
GPT-5.2 (medium)  GAIA                         42.6%   0.58         0.70            0.95        0.98    0.74
GPT-5.2           GAIA                         29.9%   0.62         0.72            0.88        0.99    0.74
GPT-4o Mini       GAIA                         22.0%   0.63         0.69            0.87        1.00    0.73
O1                GAIA                         34.7%   0.72         0.74            0.72        1.00    0.72
GPT-4 Turbo       τ-bench (airline, clean)     50.0%   0.72         0.51            0.92        0.87    0.71
GPT-4o Mini       τ-bench (airline, clean)     32.1%   0.76         0.41            0.91        0.81    0.69
GPT-4 Turbo       τ-bench (airline, original)  35.6%   0.72         0.38            0.98        0.85    0.69
GPT-4o Mini       τ-bench (airline, original)  21.3%   0.76         0.32            0.92        0.76    0.67
Reliability Trends
Benchmark: GAIA
#  Agent             Acc     Consistency                        Predictability              Robustness                  Safety               Overall
                             Agg    Outc   Traj-D Traj-S Res    Agg    Cal    AUROC  Brier  Agg    Fault  Struct Prompt Agg    Harm   Comp
1  GPT-4 Turbo       20.0%   0.70   0.72   0.78   0.65   0.68   0.75   0.69   0.84   0.75   0.81   0.87   0.76   0.82   1.00   0.50   1.00   0.76
2  GPT-5.2 (medium)  42.6%   0.58   0.55   0.70   0.50   0.59   0.70   0.74   0.65   0.70   0.95   0.97   1.00   0.88   0.98   0.50   0.95   0.74
3  GPT-5.2           29.9%   0.62   0.58   0.76   0.56   0.61   0.72   0.74   0.73   0.72   0.88   1.00   0.94   0.70   0.99   0.53   0.98   0.74
4  GPT-4o Mini       22.0%   0.63   0.64   0.66   0.54   0.66   0.69   0.60   0.79   0.69   0.87   0.81   0.97   0.84   1.00   0.50   1.00   0.73
5  O1                34.7%   0.72   0.70   0.80   0.72   0.69   0.74   0.70   0.82   0.74   0.72   0.77   0.78   0.60   1.00   0.75   1.00   0.72
Benchmark: τ-bench (airline, clean)
#  Agent             Acc     Consistency                        Predictability              Robustness                  Safety               Overall
                             Agg    Outc   Traj-D Traj-S Res    Agg    Cal    AUROC  Brier  Agg    Fault  Struct Prompt Agg    Harm   Comp
1  GPT-5.2 (xhigh)   67.7%   0.70   0.54   0.85   0.73   0.77   0.78   0.81   0.75   0.78   0.96   1.00   1.00   0.89   0.95   0.40   0.92   0.81
2  O1                72.3%   0.73   0.58   0.87   0.75   0.80   0.75   0.77   0.45   0.75   0.93   0.95   1.00   0.85   0.93   0.50   0.86   0.80
3  GPT-5.2           59.2%   0.76   0.65   0.86   0.76   0.83   0.68   0.71   0.62   0.68   0.95   1.00   1.00   0.84   0.95   0.30   0.92   0.80
4  GPT-4 Turbo       50.0%   0.72   0.54   0.85   0.73   0.83   0.51   0.52   0.45   0.51   0.92   0.95   0.85   0.95   0.87   0.43   0.78   0.71
5  GPT-4o Mini       32.1%   0.76   0.69   0.84   0.73   0.80   0.41   0.39   0.48   0.41   0.91   1.00   0.72   1.00   0.81   0.40   0.69   0.69
Benchmark: τ-bench (airline, original)
#  Agent             Acc     Consistency                        Predictability              Robustness                  Safety               Overall
                             Agg    Outc   Traj-D Traj-S Res    Agg    Cal    AUROC  Brier  Agg    Fault  Struct Prompt Agg    Harm   Comp
1  O1                49.6%   0.69   0.48   0.86   0.75   0.80   0.64   0.74   0.50   0.64   0.94   0.95   1.00   0.86   0.89   0.40   0.81   0.76
2  GPT-5.2 (xhigh)   51.6%   0.67   0.44   0.85   0.76   0.76   0.65   0.68   0.62   0.65   0.95   1.00   1.00   0.84   0.94   0.43   0.89   0.76
3  GPT-5.2           42.0%   0.72   0.52   0.86   0.76   0.82   0.55   0.55   0.56   0.55   0.97   1.00   1.00   0.92   0.93   0.36   0.89   0.75
4  GPT-4 Turbo       35.6%   0.72   0.52   0.85   0.73   0.84   0.38   0.38   0.47   0.38   0.98   0.98   0.96   1.00   0.85   0.44   0.72   0.69
5  GPT-4o Mini       21.3%   0.76   0.72   0.83   0.72   0.80   0.32   0.29   0.48   0.32   0.92   1.00   0.84   0.91   0.76   0.41   0.59   0.67