Provider: Google

Model Comparison (Avg. across Benchmarks)
Overall Leaderboard
Agent             Benchmark                    Acc    Consistency  Predictability  Robustness  Safety  Overall
Gemini 3.0 Pro    τ-bench (airline, clean)     80.8%  0.76         0.81            0.98        0.98    0.85
Gemini 2.5 Pro    τ-bench (airline, clean)     73.8%  0.68         0.77            0.93        0.93    0.79
Gemini 2.5 Pro    GAIA                         50.1%  0.62         0.74            0.97        1.00    0.78
Gemini 3.0 Pro    τ-bench (airline, original)  58.8%  0.76         0.60            0.94        0.96    0.77
Gemini 2.5 Flash  GAIA                         37.8%  0.58         0.72            0.98        1.00    0.76
Gemini 2.5 Pro    τ-bench (airline, original)  52.8%  0.69         0.56            0.96        0.89    0.74
Gemini 2.5 Flash  τ-bench (airline, clean)     65.4%  0.62         0.61            0.95        0.93    0.73
Gemini 2.5 Flash  τ-bench (airline, original)  47.2%  0.64         0.52            0.95        0.89    0.70
Gemini 2.0 Flash  τ-bench (airline, clean)     44.6%  0.65         0.48            0.98        0.87    0.70
Gemini 2.0 Flash  GAIA                         27.9%  0.61         0.72            0.76        0.99    0.70
Gemini 2.0 Flash  τ-bench (airline, original)  32.0%  0.68         0.38            0.94        0.82    0.67
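
The page does not say how the Overall column is computed. As a sanity check on the leaderboard rows, the sketch below tests one plausible reading: that Overall is the unweighted mean of the Consistency, Predictability, and Robustness aggregates, with Safety left out. This is an assumption inferred from the published numbers, not a documented formula, and the names ROWS, implied_overall, and TOLERANCE are illustrative only.

```python
# Leaderboard rows copied from the Overall Leaderboard:
# (agent, benchmark, consistency, predictability, robustness, safety, overall)
# "tau-bench" is an ASCII spelling of the page's benchmark name.
ROWS = [
    ("Gemini 3.0 Pro", "tau-bench (airline, clean)", 0.76, 0.81, 0.98, 0.98, 0.85),
    ("Gemini 2.5 Pro", "tau-bench (airline, clean)", 0.68, 0.77, 0.93, 0.93, 0.79),
    ("Gemini 2.5 Pro", "GAIA", 0.62, 0.74, 0.97, 1.00, 0.78),
    ("Gemini 3.0 Pro", "tau-bench (airline, original)", 0.76, 0.60, 0.94, 0.96, 0.77),
    ("Gemini 2.5 Flash", "GAIA", 0.58, 0.72, 0.98, 1.00, 0.76),
    ("Gemini 2.5 Pro", "tau-bench (airline, original)", 0.69, 0.56, 0.96, 0.89, 0.74),
    ("Gemini 2.5 Flash", "tau-bench (airline, clean)", 0.62, 0.61, 0.95, 0.93, 0.73),
    ("Gemini 2.5 Flash", "tau-bench (airline, original)", 0.64, 0.52, 0.95, 0.89, 0.70),
    ("Gemini 2.0 Flash", "tau-bench (airline, clean)", 0.65, 0.48, 0.98, 0.87, 0.70),
    ("Gemini 2.0 Flash", "GAIA", 0.61, 0.72, 0.76, 0.99, 0.70),
    ("Gemini 2.0 Flash", "tau-bench (airline, original)", 0.68, 0.38, 0.94, 0.82, 0.67),
]


def implied_overall(consistency, predictability, robustness):
    """Assumed formula: unweighted mean of three aggregates (Safety excluded)."""
    return (consistency + predictability + robustness) / 3


# Reported values are rounded to two decimals, so allow a small tolerance.
TOLERANCE = 0.006

# Collect every row whose reported Overall disagrees with the assumed formula.
mismatches = [
    (agent, bench)
    for agent, bench, cons, pred, rob, _safety, overall in ROWS
    if abs(implied_overall(cons, pred, rob) - overall) > TOLERANCE
]
print("rows inconsistent with the assumed formula:", mismatches)
```

For the eleven rows listed, no row falls outside the tolerance. A four-way mean that also includes Safety does not reproduce the reported values: for the first row it gives about 0.88 against a reported 0.85.
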
Reliability Trends
Benchmark: GAIA
#  Agent             Acc    Consistency                             Predictability                  Robustness                      Safety                  Overall
                            Agg     Outc    Traj-D  Traj-S  Res     Agg     Cal     AUROC   Brier   Agg     Fault   Struct  Prompt  Agg     Harm    Comp
1  Gemini 2.5 Pro    50.1%  0.62    0.60    0.74    0.57    0.62    0.74    0.77    0.75    0.74    0.97    1.00    1.00    0.90    1.00    0.33    0.99    0.78
2  Gemini 2.5 Flash  37.8%  0.58    0.52    0.69    0.57    0.60    0.72    0.72    0.83    0.72    0.98    1.00    1.00    0.93    1.00    0.50    1.00    0.76
3  Gemini 2.0 Flash  27.9%  0.61    0.60    0.76    0.60    0.54    0.72    0.66    0.83    0.72    0.76    0.88    0.67    0.73    0.99    0.50    0.99    0.70

Benchmark: τ-bench (airline, clean)
#  Agent             Acc    Consistency                             Predictability                  Robustness                      Safety                  Overall
                            Agg     Outc    Traj-D  Traj-S  Res     Agg     Cal     AUROC   Brier   Agg     Fault   Struct  Prompt  Agg     Harm    Comp
1  Gemini 3.0 Pro    80.8%  0.76    0.65    0.85    0.76    0.82    0.81    0.82    0.52    0.81    0.98    1.00    1.00    0.95    0.98    0.25    0.97    0.85
2  Gemini 2.5 Pro    73.8%  0.68    0.46    0.84    0.73    0.81    0.77    0.77    0.70    0.77    0.93    0.98    0.83    0.97    0.93    0.40    0.88    0.79
3  Gemini 2.5 Flash  65.4%  0.62    0.31    0.85    0.72    0.77    0.61    0.62    0.53    0.61    0.95    1.00    1.00    0.86    0.93    0.37    0.88    0.73
4  Gemini 2.0 Flash  44.6%  0.65    0.35    0.89    0.73    0.80    0.48    0.46    0.56    0.48    0.98    0.98    1.00    0.98    0.87    0.41    0.78    0.70

Benchmark: τ-bench (airline, original)
#  Agent             Acc    Consistency                             Predictability                  Robustness                      Safety                  Overall
                            Agg     Outc    Traj-D  Traj-S  Res     Agg     Cal     AUROC   Brier   Agg     Fault   Struct  Prompt  Agg     Harm    Comp
1  Gemini 3.0 Pro    58.8%  0.76    0.62    0.86    0.77    0.84    0.60    0.60    0.52    0.60    0.94    1.00    0.92    0.91    0.96    0.21    0.95    0.77
2  Gemini 2.5 Pro    52.8%  0.69    0.48    0.84    0.72    0.80    0.56    0.56    0.58    0.56    0.96    0.97    0.98    0.92    0.89    0.40    0.81    0.74
3  Gemini 2.5 Flash  47.2%  0.64    0.38    0.85    0.70    0.76    0.52    0.52    0.54    0.52    0.95    1.00    0.97    0.88    0.89    0.39    0.82    0.70
4  Gemini 2.0 Flash  32.0%  0.68    0.44    0.87    0.73    0.81    0.38    0.36    0.61    0.38    0.94    0.98    1.00    0.85    0.82    0.37    0.72    0.67
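
Each detailed table above corresponds to one benchmark in the Overall Leaderboard. As a hedged sketch of how that correspondence can be cross-checked, the snippet below matches each table's (agent, Acc, Overall) values against the leaderboard rows. The names LEADERBOARD, DETAIL_TABLES, and matching_benchmarks are illustrative and not part of any published tooling; the numbers are copied from the tables on this page.

```python
# (agent, benchmark) -> (accuracy in percent, overall score), from the leaderboard.
# "tau-bench" is an ASCII spelling of the page's benchmark name.
LEADERBOARD = {
    ("Gemini 3.0 Pro", "tau-bench (airline, clean)"): (80.8, 0.85),
    ("Gemini 2.5 Pro", "tau-bench (airline, clean)"): (73.8, 0.79),
    ("Gemini 2.5 Pro", "GAIA"): (50.1, 0.78),
    ("Gemini 3.0 Pro", "tau-bench (airline, original)"): (58.8, 0.77),
    ("Gemini 2.5 Flash", "GAIA"): (37.8, 0.76),
    ("Gemini 2.5 Pro", "tau-bench (airline, original)"): (52.8, 0.74),
    ("Gemini 2.5 Flash", "tau-bench (airline, clean)"): (65.4, 0.73),
    ("Gemini 2.5 Flash", "tau-bench (airline, original)"): (47.2, 0.70),
    ("Gemini 2.0 Flash", "tau-bench (airline, clean)"): (44.6, 0.70),
    ("Gemini 2.0 Flash", "GAIA"): (27.9, 0.70),
    ("Gemini 2.0 Flash", "tau-bench (airline, original)"): (32.0, 0.67),
}

# One list per detailed table, in page order: (agent, accuracy, overall).
DETAIL_TABLES = [
    [("Gemini 2.5 Pro", 50.1, 0.78), ("Gemini 2.5 Flash", 37.8, 0.76),
     ("Gemini 2.0 Flash", 27.9, 0.70)],
    [("Gemini 3.0 Pro", 80.8, 0.85), ("Gemini 2.5 Pro", 73.8, 0.79),
     ("Gemini 2.5 Flash", 65.4, 0.73), ("Gemini 2.0 Flash", 44.6, 0.70)],
    [("Gemini 3.0 Pro", 58.8, 0.77), ("Gemini 2.5 Pro", 52.8, 0.74),
     ("Gemini 2.5 Flash", 47.2, 0.70), ("Gemini 2.0 Flash", 32.0, 0.67)],
]


def matching_benchmarks(table):
    """Return the benchmarks for which every row of the table matches the leaderboard."""
    benchmarks = sorted({bench for _agent, bench in LEADERBOARD})
    return [
        bench
        for bench in benchmarks
        if all(
            LEADERBOARD.get((agent, bench)) == (acc, overall)
            for agent, acc, overall in table
        )
    ]


for index, table in enumerate(DETAIL_TABLES, start=1):
    print(index, matching_benchmarks(table))
```

With the values as listed, each table matches exactly one benchmark: the first matches GAIA, the second τ-bench (airline, clean), and the third τ-bench (airline, original).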