Provider: Google

Model Comparison (Avg. across Benchmarks)
Overall Leaderboard

| Agent | Benchmark | Acc | Reliability | Consistency | Predictability | Robustness | Safety |
|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | τ-bench (airline, clean) | 82.1% | 0.86 | 0.81 | 0.83 | 0.94 | 0.97 |
| Gemini 3.5 Flash | τ-bench (airline, clean) | 80.8% | 0.86 | 0.78 | 0.82 | 0.97 | 0.99 |
| Gemini 3.5 Flash | GAIA | 79.2% | 0.84 | 0.75 | 0.80 | 0.96 | 1.00 |
| Gemini 3.1 Pro | GAIA | 76.2% | 0.82 | 0.76 | 0.78 | 0.94 | 1.00 |
| Gemini 2.5 Pro | τ-bench (airline, clean) | 71.8% | 0.81 | 0.74 | 0.78 | 0.93 | 0.92 |
| Gemini 2.5 Pro | GAIA | 52.5% | 0.78 | 0.72 | 0.71 | 0.91 | 1.00 |
| Gemini 3.0 Pro | τ-bench (airline, original) | 58.8% | 0.77 | 0.76 | 0.60 | 0.94 | 0.96 |
| Gemini 2.5 Flash | GAIA | 46.7% | 0.74 | 0.70 | 0.64 | 0.87 | 1.00 |
| Gemini 2.5 Pro | τ-bench (airline, original) | 52.8% | 0.74 | 0.69 | 0.56 | 0.96 | 0.89 |
| Gemini 2.5 Flash | τ-bench (airline, clean) | 59.0% | 0.72 | 0.75 | 0.60 | 0.82 | 0.90 |
| Gemini 2.5 Flash | τ-bench (airline, original) | 47.2% | 0.70 | 0.64 | 0.52 | 0.95 | 0.89 |
| Gemini 2.0 Flash | τ-bench (airline, original) | 32.0% | 0.67 | 0.68 | 0.38 | 0.94 | 0.82 |

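
The per-model comparison averaged across benchmarks can be approximated from the leaderboard rows above. The sketch below assumes a plain unweighted mean over whichever benchmarks each model was run on; the page does not state how its own average is computed, so this is an illustration rather than the site's method. `ROWS` and `average_by_agent` are hypothetical names, and only the Acc and Reliability columns are copied in.

```python
from collections import defaultdict
from statistics import mean

# (agent, benchmark, accuracy in %, reliability), copied from the Overall Leaderboard.
ROWS = [
    ("Gemini 3.1 Pro", "τ-bench (airline, clean)", 82.1, 0.86),
    ("Gemini 3.5 Flash", "τ-bench (airline, clean)", 80.8, 0.86),
    ("Gemini 3.5 Flash", "GAIA", 79.2, 0.84),
    ("Gemini 3.1 Pro", "GAIA", 76.2, 0.82),
    ("Gemini 2.5 Pro", "τ-bench (airline, clean)", 71.8, 0.81),
    ("Gemini 2.5 Pro", "GAIA", 52.5, 0.78),
    ("Gemini 3.0 Pro", "τ-bench (airline, original)", 58.8, 0.77),
    ("Gemini 2.5 Flash", "GAIA", 46.7, 0.74),
    ("Gemini 2.5 Pro", "τ-bench (airline, original)", 52.8, 0.74),
    ("Gemini 2.5 Flash", "τ-bench (airline, clean)", 59.0, 0.72),
    ("Gemini 2.5 Flash", "τ-bench (airline, original)", 47.2, 0.70),
    ("Gemini 2.0 Flash", "τ-bench (airline, original)", 32.0, 0.67),
]


def average_by_agent(rows):
    """Return {agent: (mean accuracy, mean reliability, number of benchmarks)}.

    Assumption: unweighted mean over the benchmarks present for each agent.
    """
    grouped = defaultdict(list)
    for agent, _benchmark, acc, reliability in rows:
        grouped[agent].append((acc, reliability))
    return {
        agent: (
            mean(acc for acc, _ in scores),
            mean(rel for _, rel in scores),
            len(scores),
        )
        for agent, scores in grouped.items()
    }


if __name__ == "__main__":
    averages = average_by_agent(ROWS)
    # Sort by mean reliability, highest first, mirroring the leaderboard ordering.
    for agent, (acc, rel, n) in sorted(averages.items(), key=lambda kv: -kv[1][1]):
        print(f"{agent:<18} acc={acc:5.1f}%  reliability={rel:.2f}  benchmarks={n}")
```

Because the models were not all run on the same benchmarks (Gemini 3.0 Pro and Gemini 2.0 Flash appear only on the original airline split), these averages are not strictly like-for-like; the benchmark count is returned alongside each mean so that difference stays visible.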
Reliability Trends
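
Benchmark: GAIA

| # | Agent | Acc | Reliability | Consistency (Agg) | Outc | Traj-D | Traj-S | Res | Predictability (Agg) | Cal | AUROC | Brier | Robustness (Agg) | Fault | Struct | Prompt | Safety (Agg) | Harm | Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 79.2% | 0.84 | 0.75 | 0.84 | 0.82 | 0.61 | 0.69 | 0.80 | 0.79 | 0.57 | 0.80 | 0.96 | 1.00 | 1.00 | 0.88 | 1.00 | 1.00 | 1.00 |
| 2 | Gemini 3.1 Pro | 76.2% | 0.82 | 0.76 | 0.83 | 0.81 | 0.61 | 0.73 | 0.78 | 0.79 | 0.72 | 0.78 | 0.94 | 0.99 | 0.96 | 0.86 | 1.00 | 0.50 | 1.00 |
| 3 | Gemini 2.5 Pro | 52.5% | 0.78 | 0.72 | 0.73 | 0.87 | 0.71 | 0.65 | 0.71 | 0.71 | 0.75 | 0.71 | 0.91 | 0.95 | 0.97 | 0.82 | 1.00 | 0.50 | 1.00 |
| 4 | Gemini 2.5 Flash | 46.7% | 0.74 | 0.70 | 0.73 | 0.82 | 0.64 | 0.64 | 0.64 | 0.63 | 0.77 | 0.64 | 0.87 | 0.97 | 0.96 | 0.69 | 1.00 | 0.50 | 1.00 |
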
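
Benchmark: τ-bench (airline, clean)

| # | Agent | Acc | Reliability | Consistency (Agg) | Outc | Traj-D | Traj-S | Res | Predictability (Agg) | Cal | AUROC | Brier | Robustness (Agg) | Fault | Struct | Prompt | Safety (Agg) | Harm | Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 82.1% | 0.86 | 0.81 | 0.79 | 0.82 | 0.74 | 0.85 | 0.83 | 0.83 | 0.57 | 0.83 | 0.94 | 0.98 | 1.00 | 0.84 | 0.97 | 0.50 | 0.94 |
| 2 | Gemini 3.5 Flash | 80.8% | 0.86 | 0.78 | 0.86 | 0.75 | 0.65 | 0.78 | 0.82 | 0.81 | 0.52 | 0.82 | 0.97 | 1.00 | 1.00 | 0.92 | 0.99 | 0.75 | 0.97 |
| 3 | Gemini 2.5 Pro | 71.8% | 0.81 | 0.74 | 0.66 | 0.85 | 0.72 | 0.77 | 0.78 | 0.77 | 0.67 | 0.78 | 0.93 | 1.00 | 1.00 | 0.79 | 0.92 | 0.33 | 0.88 |
| 4 | Gemini 2.5 Flash | 59.0% | 0.72 | 0.75 | 0.73 | 0.83 | 0.70 | 0.76 | 0.60 | 0.60 | 0.53 | 0.60 | 0.82 | 0.78 | 0.85 | 0.83 | 0.90 | 0.27 | 0.86 |
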
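
Benchmark: τ-bench (airline, original)

| # | Agent | Acc | Reliability | Consistency (Agg) | Outc | Traj-D | Traj-S | Res | Predictability (Agg) | Cal | AUROC | Brier | Robustness (Agg) | Fault | Struct | Prompt | Safety (Agg) | Harm | Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.0 Pro | 58.8% | 0.77 | 0.76 | 0.62 | 0.86 | 0.77 | 0.84 | 0.60 | 0.60 | 0.52 | 0.60 | 0.94 | 1.00 | 0.92 | 0.91 | 0.96 | 0.21 | 0.95 |
| 2 | Gemini 2.5 Pro | 52.8% | 0.74 | 0.69 | 0.48 | 0.84 | 0.72 | 0.80 | 0.56 | 0.56 | 0.58 | 0.56 | 0.96 | 0.97 | 0.98 | 0.92 | 0.89 | 0.40 | 0.81 |
| 3 | Gemini 2.5 Flash | 47.2% | 0.70 | 0.64 | 0.38 | 0.85 | 0.70 | 0.76 | 0.52 | 0.52 | 0.54 | 0.52 | 0.95 | 1.00 | 0.97 | 0.88 | 0.89 | 0.39 | 0.82 |
| 4 | Gemini 2.0 Flash | 32.0% | 0.67 | 0.68 | 0.44 | 0.87 | 0.73 | 0.81 | 0.38 | 0.36 | 0.61 | 0.38 | 0.94 | 0.98 | 1.00 | 0.85 | 0.82 | 0.37 | 0.72 |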