Provider: Anthropic

Model Comparison (Avg. across Benchmarks)
Overall Leaderboard
Agent              Benchmark                    Acc    Consistency  Predictability  Robustness  Safety  Overall
Claude Opus 4.5    τ-bench (airline, clean)     83.1%  0.82         0.87            0.94        0.99    0.88
Claude Sonnet 4.5  τ-bench (airline, clean)     78.5%  0.72         0.84            0.99        1.00    0.85
Claude Opus 4.5    GAIA                         71.5%  0.67         0.82            0.96        1.00    0.82
Claude Opus 4.5    τ-bench (airline, original)  58.4%  0.79         0.69            0.93        0.97    0.80
Claude Sonnet 4.5  GAIA                         74.7%  0.63         0.81            0.95        1.00    0.80
Claude Sonnet 4.5  τ-bench (airline, original)  54.4%  0.73         0.66            0.98        0.97    0.79
Claude 3.7 Sonnet  GAIA                         62.4%  0.62         0.77            0.92        1.00    0.77
Claude 3.7 Sonnet  τ-bench (airline, clean)     56.2%  0.65         0.63            0.97        0.90    0.75
Claude 3.5 Haiku   GAIA                         28.1%  0.70         0.72            0.79        1.00    0.74
Claude 3.7 Sonnet  τ-bench (airline, original)  43.6%  0.67         0.54            0.96        0.88    0.72
Claude 3.5 Haiku   τ-bench (airline, original)  29.6%  0.70         0.46            0.88        0.77    0.68
Claude 3.5 Haiku   τ-bench (airline, clean)     42.3%  0.66         0.53            0.85        0.81    0.68
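
In every leaderboard row above, the Overall score matches the unweighted mean of Consistency, Predictability, and Robustness, rounded to two decimals; Safety does not appear to enter it. This is an observation from the published figures, not a documented formula. A minimal Python sketch of the check follows, where `ROWS` and `overall_from_dimensions` are illustrative names and the averaging rule is an assumption:

```python
from decimal import Decimal, ROUND_HALF_UP

# (agent, benchmark, consistency, predictability, robustness, safety, overall)
# Values are copied from the leaderboard; kept as strings for exact decimal math.
ROWS = [
    ("Claude Opus 4.5", "τ-bench (airline, clean)", "0.82", "0.87", "0.94", "0.99", "0.88"),
    ("Claude Sonnet 4.5", "τ-bench (airline, clean)", "0.72", "0.84", "0.99", "1.00", "0.85"),
    ("Claude Opus 4.5", "GAIA", "0.67", "0.82", "0.96", "1.00", "0.82"),
    ("Claude Opus 4.5", "τ-bench (airline, original)", "0.79", "0.69", "0.93", "0.97", "0.80"),
    ("Claude Sonnet 4.5", "GAIA", "0.63", "0.81", "0.95", "1.00", "0.80"),
    ("Claude Sonnet 4.5", "τ-bench (airline, original)", "0.73", "0.66", "0.98", "0.97", "0.79"),
    ("Claude 3.7 Sonnet", "GAIA", "0.62", "0.77", "0.92", "1.00", "0.77"),
    ("Claude 3.7 Sonnet", "τ-bench (airline, clean)", "0.65", "0.63", "0.97", "0.90", "0.75"),
    ("Claude 3.5 Haiku", "GAIA", "0.70", "0.72", "0.79", "1.00", "0.74"),
    ("Claude 3.7 Sonnet", "τ-bench (airline, original)", "0.67", "0.54", "0.96", "0.88", "0.72"),
    ("Claude 3.5 Haiku", "τ-bench (airline, original)", "0.70", "0.46", "0.88", "0.77", "0.68"),
    ("Claude 3.5 Haiku", "τ-bench (airline, clean)", "0.66", "0.53", "0.85", "0.81", "0.68"),
]


def overall_from_dimensions(consistency, predictability, robustness):
    """Assumed rule: mean of the three reliability dimensions, rounded to 2 decimals."""
    mean = (Decimal(consistency) + Decimal(predictability) + Decimal(robustness)) / 3
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Collect any row whose published Overall disagrees with the assumed rule.
mismatches = [
    (agent, bench)
    for agent, bench, cons, pred, rob, _safety, overall in ROWS
    if overall_from_dimensions(cons, pred, rob) != Decimal(overall)
]
print("rows that disagree with the assumed rule:", mismatches)
```

Because the dimension scores shown are themselves rounded, agreement here does not rule out a different underlying formula.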
Reliability Trends
Benchmark: GAIA
#  Agent              Acc    Consistency                        Predictability              Robustness                  Safety               Overall
                             Agg    Outc   Traj-D Traj-S Res    Agg    Cal    AUROC  Brier  Agg    Fault  Struct Prompt Agg    Harm   Comp
1  Claude Opus 4.5    71.5%  0.67   0.70   0.72   0.54   0.68   0.82   0.92   0.72   0.82   0.96   1.00   1.00   0.89   1.00   1.00   1.00   0.82
2  Claude Sonnet 4.5  74.7%  0.63   0.64   0.69   0.49   0.66   0.81   0.91   0.66   0.81   0.95   0.99   0.93   0.94   1.00   1.00   1.00   0.80
3  Claude 3.7 Sonnet  62.4%  0.62   0.64   0.71   0.54   0.60   0.77   0.87   0.67   0.77   0.92   0.93   0.95   0.87   1.00   0.50   1.00   0.77
4  Claude 3.5 Haiku   28.1%  0.70   0.64   0.83   0.73   0.68   0.72   0.70   0.72   0.72   0.79   0.96   0.78   0.63   1.00   1.00   1.00   0.74

Benchmark: τ-bench (airline, clean)
#  Agent              Acc    Consistency                        Predictability              Robustness                  Safety               Overall
                             Agg    Outc   Traj-D Traj-S Res    Agg    Cal    AUROC  Brier  Agg    Fault  Struct Prompt Agg    Harm   Comp
1  Claude Opus 4.5    83.1%  0.82   0.77   0.88   0.79   0.85   0.87   0.93   0.68   0.87   0.94   0.97   0.93   0.93   0.99   0.50   0.98   0.88
2  Claude Sonnet 4.5  78.5%  0.72   0.50   0.85   0.77   0.85   0.84   0.90   0.68   0.84   0.99   1.00   1.00   0.98   1.00   0.50   0.99   0.85
3  Claude 3.7 Sonnet  56.2%  0.65   0.35   0.84   0.74   0.81   0.63   0.65   0.48   0.63   0.97   1.00   1.00   0.91   0.90   0.46   0.82   0.75
4  Claude 3.5 Haiku   42.3%  0.66   0.42   0.82   0.70   0.80   0.53   0.53   0.42   0.53   0.85   0.81   1.00   0.74   0.81   0.42   0.67   0.68

Benchmark: τ-bench (airline, original)
#  Agent              Acc    Consistency                        Predictability              Robustness                  Safety               Overall
                             Agg    Outc   Traj-D Traj-S Res    Agg    Cal    AUROC  Brier  Agg    Fault  Struct Prompt Agg    Harm   Comp
1  Claude Opus 4.5    58.4%  0.79   0.68   0.88   0.78   0.85   0.69   0.71   0.67   0.69   0.93   0.95   0.96   0.89   0.97   0.46   0.94   0.80
2  Claude Sonnet 4.5  54.4%  0.73   0.54   0.86   0.77   0.85   0.66   0.68   0.61   0.66   0.98   0.97   1.00   0.97   0.97   0.50   0.94   0.79
3  Claude 3.7 Sonnet  43.6%  0.67   0.40   0.82   0.72   0.82   0.54   0.54   0.53   0.54   0.96   1.00   1.00   0.89   0.88   0.45   0.79   0.72
4  Claude 3.5 Haiku   29.6%  0.70   0.54   0.83   0.71   0.80   0.46   0.45   0.44   0.46   0.88   0.86   1.00   0.77   0.77   0.41   0.61   0.68
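
Each detailed row above carries 17 scores after the accuracy: five for Consistency, four each for Predictability and Robustness, three for Safety, and a final Overall. A minimal Python sketch of mapping one whitespace-separated row onto that grouped header is given below. `GROUPS` and `parse_detail_row` are hypothetical names, and the sub-metric abbreviations are copied from the header without expansion:

```python
# Sub-metric names per dimension, in the column order of the detail tables.
GROUPS = {
    "Consistency": ["Agg", "Outc", "Traj-D", "Traj-S", "Res"],
    "Predictability": ["Agg", "Cal", "AUROC", "Brier"],
    "Robustness": ["Agg", "Fault", "Struct", "Prompt"],
    "Safety": ["Agg", "Harm", "Comp"],
}


def parse_detail_row(line):
    """Split one detail-table row into rank, agent, accuracy, and grouped scores."""
    tokens = line.split()
    n_scores = sum(len(cols) for cols in GROUPS.values()) + 1  # +1 for Overall
    if len(tokens) < n_scores + 3:  # rank + at least one name token + accuracy
        raise ValueError("row has too few columns: %r" % line)

    scores = [float(t) for t in tokens[-n_scores:]]
    row = {
        "rank": int(tokens[0]),
        # The agent name contains spaces, so take everything between the
        # rank and the accuracy column.
        "agent": " ".join(tokens[1:-(n_scores + 1)]),
        "acc_percent": float(tokens[-(n_scores + 1)].rstrip("%")),
    }
    start = 0
    for group, cols in GROUPS.items():
        row[group] = dict(zip(cols, scores[start:start + len(cols)]))
        start += len(cols)
    row["Overall"] = scores[-1]
    return row


row = parse_detail_row(
    "1 Claude Opus 4.5 71.5% 0.67 0.70 0.72 0.54 0.68 0.82 0.92 0.72 0.82 "
    "0.96 1.00 1.00 0.89 1.00 1.00 1.00 0.82"
)
print(row["agent"], row["Consistency"], row["Overall"])
```

Counting columns from the right keeps the parser independent of how many words the agent name contains.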