Provider: Anthropic

Model Comparison (Avg. across Benchmarks)
Overall Leaderboard
| Agent | Benchmark | Acc | Reliability | Consistency | Predictability | Robustness | Safety |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | τ-bench (airline, clean) | 84.6% | 0.89 | 0.88 | 0.87 | 0.91 | 0.94 |
| Claude Opus 4.5 | τ-bench (airline, clean) | 80.8% | 0.88 | 0.84 | 0.85 | 0.97 | 0.98 |
| Claude Sonnet 4 | τ-bench (airline, clean) | 78.2% | 0.86 | 0.76 | 0.85 | 0.96 | 0.96 |
| Claude Opus 4.5 | GAIA | 68.5% | 0.85 | 0.81 | 0.84 | 0.91 | 1.00 |
| Claude Opus 4.7 | GAIA | 73.3% | 0.84 | 0.81 | 0.73 | 0.99 | 1.00 |
| Claude 3.5 Haiku | GAIA | 25.7% | 0.82 | 0.80 | 0.72 | 0.94 | 1.00 |
| Claude Sonnet 4 | GAIA | 54.7% | 0.82 | 0.78 | 0.73 | 0.94 | 1.00 |
| Claude Opus 4.5 | τ-bench (airline, original) | 58.4% | 0.80 | 0.79 | 0.69 | 0.93 | 0.97 |
| Claude Sonnet 4.5 | τ-bench (airline, original) | 54.4% | 0.79 | 0.73 | 0.66 | 0.98 | 0.97 |
| Claude 3.7 Sonnet | τ-bench (airline, original) | 43.6% | 0.72 | 0.67 | 0.54 | 0.96 | 0.88 |
| Claude 3.5 Haiku | τ-bench (airline, clean) | 29.5% | 0.71 | 0.73 | 0.47 | 0.93 | 0.81 |
| Claude 3 Haiku | τ-bench (airline, clean) | 20.5% | 0.70 | 0.74 | 0.38 | 0.98 | 0.55 |
| Claude 3 Haiku | GAIA | 12.9% | 0.69 | 0.71 | 0.49 | 0.86 | 0.99 |
| Claude 3.5 Haiku | τ-bench (airline, original) | 29.6% | 0.68 | 0.70 | 0.46 | 0.88 | 0.77 |
Reliability Trends
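Benchmark: GAIA
| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 68.5% | 0.85 | 0.81 | 0.84 | 0.89 | 0.75 | 0.77 | 0.84 | 0.97 | 0.80 | 0.84 | 0.91 | 0.99 | 0.98 | 0.77 | 1.00 | 1.00 | 1.00 |
| 2 | Claude Opus 4.7 | 73.3% | 0.84 | 0.81 | 0.84 | 0.87 | 0.74 | 0.77 | 0.73 | 0.76 | 0.63 | 0.73 | 0.99 | 1.00 | 1.00 | 0.96 | 1.00 | 0.50 | 0.99 |
| 3 | Claude 3.5 Haiku | 25.7% | 0.82 | 0.80 | 0.82 | 0.89 | 0.76 | 0.75 | 0.72 | 0.67 | 0.76 | 0.72 | 0.94 | 1.00 | 1.00 | 0.82 | 1.00 | 0.50 | 1.00 |
| 4 | Claude Sonnet 4 | 54.7% | 0.82 | 0.78 | 0.76 | 0.87 | 0.75 | 0.76 | 0.73 | 0.77 | 0.71 | 0.73 | 0.94 | 1.00 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 |
| 5 | Claude 3 Haiku | 12.9% | 0.69 | 0.71 | 0.84 | 0.78 | 0.55 | 0.61 | 0.49 | 0.38 | 0.73 | 0.49 | 0.86 | 1.00 | 0.94 | 0.66 | 0.99 | 0.50 | 0.99 |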
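Benchmark: τ-bench (airline, clean)
| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | 84.6% | 0.89 | 0.88 | 0.90 | 0.94 | 0.82 | 0.86 | 0.87 | 0.93 | 0.70 | 0.87 | 0.91 | 0.91 | 1.00 | 0.83 | 0.94 | 0.44 | 0.88 |
| 2 | Claude Opus 4.5 | 80.8% | 0.88 | 0.84 | 0.83 | 0.86 | 0.81 | 0.84 | 0.85 | 0.90 | 0.70 | 0.85 | 0.97 | 1.00 | 0.98 | 0.92 | 0.98 | 0.50 | 0.96 |
| 3 | Claude Sonnet 4 | 78.2% | 0.86 | 0.76 | 0.62 | 0.86 | 0.78 | 0.84 | 0.85 | 0.90 | 0.67 | 0.85 | 0.96 | 0.99 | 0.98 | 0.92 | 0.96 | 0.42 | 0.92 |
| 4 | Claude 3.5 Haiku | 29.5% | 0.71 | 0.73 | 0.56 | 0.86 | 0.75 | 0.82 | 0.47 | 0.46 | 0.42 | 0.47 | 0.93 | 1.00 | 0.78 | 1.00 | 0.81 | 0.40 | 0.68 |
| 5 | Claude 3 Haiku | 20.5% | 0.70 | 0.74 | 0.76 | 0.76 | 0.61 | 0.78 | 0.38 | 0.33 | 0.41 | 0.38 | 0.98 | 1.00 | 0.94 | 1.00 | 0.55 | 0.30 | 0.36 |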
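Benchmark: τ-bench (airline, original)
| # | Agent | Acc | Reliability | Consistency Agg | Consistency Outc | Consistency Traj-D | Consistency Traj-S | Consistency Res | Predictability Agg | Predictability Cal | Predictability AUROC | Predictability Brier | Robustness Agg | Robustness Fault | Robustness Struct | Robustness Prompt | Safety Agg | Safety Harm | Safety Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 58.4% | 0.80 | 0.79 | 0.68 | 0.88 | 0.78 | 0.85 | 0.69 | 0.71 | 0.67 | 0.69 | 0.93 | 0.95 | 0.96 | 0.89 | 0.97 | 0.46 | 0.94 |
| 2 | Claude Sonnet 4.5 | 54.4% | 0.79 | 0.73 | 0.54 | 0.86 | 0.77 | 0.85 | 0.66 | 0.68 | 0.61 | 0.66 | 0.98 | 0.97 | 1.00 | 0.97 | 0.97 | 0.50 | 0.94 |
| 3 | Claude 3.7 Sonnet | 43.6% | 0.72 | 0.67 | 0.40 | 0.82 | 0.72 | 0.82 | 0.54 | 0.54 | 0.53 | 0.54 | 0.96 | 1.00 | 1.00 | 0.89 | 0.88 | 0.45 | 0.79 |
| 4 | Claude 3.5 Haiku | 29.6% | 0.68 | 0.70 | 0.54 | 0.83 | 0.71 | 0.80 | 0.46 | 0.45 | 0.44 | 0.46 | 0.88 | 0.86 | 1.00 | 0.77 | 0.77 | 0.41 | 0.61 |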
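
Each detail row carries 17 scores after the accuracy: one Reliability value followed by the Consistency, Predictability, Robustness, and Safety groups in header order. The following is a minimal Python sketch of how such a row could be mapped onto named fields. The names `GROUPS` and `parse_detail_row` are hypothetical helpers introduced for illustration, and the column grouping is an assumption read off the two header rows, not something the source states.

```python
# Sub-metric names copied from the header rows of the detail tables.
# The grouping into a dict is an illustrative assumption, not a published schema.
GROUPS = {
    "Consistency": ["Agg", "Outc", "Traj-D", "Traj-S", "Res"],
    "Predictability": ["Agg", "Cal", "AUROC", "Brier"],
    "Robustness": ["Agg", "Fault", "Struct", "Prompt"],
    "Safety": ["Agg", "Harm", "Comp"],
}


def parse_detail_row(line):
    """Split one detail-table row into rank, agent, accuracy, and grouped scores."""
    # Treat pipes as whitespace so both space- and pipe-delimited rows work.
    tokens = line.replace("|", " ").split()
    # The accuracy is the first token ending in "%"; the agent name precedes it.
    acc_idx = next(i for i, t in enumerate(tokens) if t.endswith("%"))
    values = [float(t) for t in tokens[acc_idx + 1:]]
    expected = 1 + sum(len(metrics) for metrics in GROUPS.values())
    if len(values) != expected:
        raise ValueError(f"expected {expected} scores, got {len(values)}")
    row = {
        "rank": int(tokens[0]),
        "agent": " ".join(tokens[1:acc_idx]),
        "accuracy_pct": float(tokens[acc_idx].rstrip("%")),
        "Reliability": values[0],
    }
    pos = 1
    for group, metrics in GROUPS.items():
        row[group] = dict(zip(metrics, values[pos:pos + len(metrics)]))
        pos += len(metrics)
    return row


row = parse_detail_row(
    "1 Claude Opus 4.5 58.4% 0.80 0.79 0.68 0.88 0.78 0.85 0.69 0.71 0.67 0.69 "
    "0.93 0.95 0.96 0.89 0.97 0.46 0.94"
)
print(row["agent"], row["Reliability"], row["Safety"])
```

The length check is there because a row with a missing or extra score would otherwise shift every later column silently.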