Provider: Anthropic
Model Comparison (Avg. across Benchmarks)
Overall Leaderboard
| Agent | Benchmark | Acc | Consistency | Predictability | Robustness | Safety | Overall |
|---|---|---|---|---|---|---|---|
| | τ-bench (airline, clean) | 83.1% | 0.82 | 0.87 | 0.94 | 0.99 | 0.88 |
| | τ-bench (airline, clean) | 78.5% | 0.72 | 0.84 | 0.99 | 1.00 | 0.85 |
| | GAIA | 71.5% | 0.67 | 0.82 | 0.96 | 1.00 | 0.82 |
| | τ-bench (airline, original) | 58.4% | 0.79 | 0.69 | 0.93 | 0.97 | 0.80 |
| | GAIA | 74.7% | 0.63 | 0.81 | 0.95 | 1.00 | 0.80 |
| | τ-bench (airline, original) | 54.4% | 0.73 | 0.66 | 0.98 | 0.97 | 0.79 |
| | GAIA | 62.4% | 0.62 | 0.77 | 0.92 | 1.00 | 0.77 |
| | τ-bench (airline, clean) | 56.2% | 0.65 | 0.63 | 0.97 | 0.90 | 0.75 |
| | GAIA | 28.1% | 0.70 | 0.72 | 0.79 | 1.00 | 0.74 |
| | τ-bench (airline, original) | 43.6% | 0.67 | 0.54 | 0.96 | 0.88 | 0.72 |
| | τ-bench (airline, original) | 29.6% | 0.70 | 0.46 | 0.88 | 0.77 | 0.68 |
| | τ-bench (airline, clean) | 42.3% | 0.66 | 0.53 | 0.85 | 0.81 | 0.68 |
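The page does not say how the Overall column is derived. As an observation only, every row in the leaderboard above is consistent with Overall being the unweighted mean of Consistency, Predictability, and Robustness, with Safety left out. The sketch below checks that assumption against the published (rounded) values; the function names are illustrative and the formula is inferred, not confirmed by the source.

```python
# Each tuple is (consistency, predictability, robustness, safety, overall),
# copied from the leaderboard rows in their listed order.
ROWS = [
    (0.82, 0.87, 0.94, 0.99, 0.88),
    (0.72, 0.84, 0.99, 1.00, 0.85),
    (0.67, 0.82, 0.96, 1.00, 0.82),
    (0.79, 0.69, 0.93, 0.97, 0.80),
    (0.63, 0.81, 0.95, 1.00, 0.80),
    (0.73, 0.66, 0.98, 0.97, 0.79),
    (0.62, 0.77, 0.92, 1.00, 0.77),
    (0.65, 0.63, 0.97, 0.90, 0.75),
    (0.70, 0.72, 0.79, 1.00, 0.74),
    (0.67, 0.54, 0.96, 0.88, 0.72),
    (0.70, 0.46, 0.88, 0.77, 0.68),
    (0.66, 0.53, 0.85, 0.81, 0.68),
]


def mean_without_safety(consistency, predictability, robustness):
    """Assumed formula: unweighted mean of the three non-safety aggregates."""
    return (consistency + predictability + robustness) / 3


def max_deviation(rows):
    """Largest absolute gap between the assumed formula and the listed Overall."""
    return max(
        abs(mean_without_safety(c, p, r) - overall)
        for c, p, r, _safety, overall in rows
    )


if __name__ == "__main__":
    # Inputs are rounded to two decimals, so small gaps are expected.
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```

Under this assumption the gap stays below 0.005 for all twelve rows, which is what rounding to two decimals would produce. Including Safety in the mean does not fit: the first row would give about 0.91 rather than the listed 0.88.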
Reliability Trends
GAIA
| # | Agent | Acc | Consistency (Agg) | Consistency (Outc) | Consistency (Traj-D) | Consistency (Traj-S) | Consistency (Res) | Predictability (Agg) | Predictability (Cal) | Predictability (AUROC) | Predictability (Brier) | Robustness (Agg) | Robustness (Fault) | Robustness (Struct) | Robustness (Prompt) | Safety (Agg) | Safety (Harm) | Safety (Comp) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 71.5% | 0.67 | 0.70 | 0.72 | 0.54 | 0.68 | 0.82 | 0.92 | 0.72 | 0.82 | 0.96 | 1.00 | 1.00 | 0.89 | 1.00 | 1.00 | 1.00 | 0.82 |
| 2 | | 74.7% | 0.63 | 0.64 | 0.69 | 0.49 | 0.66 | 0.81 | 0.91 | 0.66 | 0.81 | 0.95 | 0.99 | 0.93 | 0.94 | 1.00 | 1.00 | 1.00 | 0.80 |
| 3 | | 62.4% | 0.62 | 0.64 | 0.71 | 0.54 | 0.60 | 0.77 | 0.87 | 0.67 | 0.77 | 0.92 | 0.93 | 0.95 | 0.87 | 1.00 | 0.50 | 1.00 | 0.77 |
| 4 | | 28.1% | 0.70 | 0.64 | 0.83 | 0.73 | 0.68 | 0.72 | 0.70 | 0.72 | 0.72 | 0.79 | 0.96 | 0.78 | 0.63 | 1.00 | 1.00 | 1.00 | 0.74 |
τ-bench (airline, clean)
| # | Agent | Acc | Consistency (Agg) | Consistency (Outc) | Consistency (Traj-D) | Consistency (Traj-S) | Consistency (Res) | Predictability (Agg) | Predictability (Cal) | Predictability (AUROC) | Predictability (Brier) | Robustness (Agg) | Robustness (Fault) | Robustness (Struct) | Robustness (Prompt) | Safety (Agg) | Safety (Harm) | Safety (Comp) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 83.1% | 0.82 | 0.77 | 0.88 | 0.79 | 0.85 | 0.87 | 0.93 | 0.68 | 0.87 | 0.94 | 0.97 | 0.93 | 0.93 | 0.99 | 0.50 | 0.98 | 0.88 |
| 2 | | 78.5% | 0.72 | 0.50 | 0.85 | 0.77 | 0.85 | 0.84 | 0.90 | 0.68 | 0.84 | 0.99 | 1.00 | 1.00 | 0.98 | 1.00 | 0.50 | 0.99 | 0.85 |
| 3 | | 56.2% | 0.65 | 0.35 | 0.84 | 0.74 | 0.81 | 0.63 | 0.65 | 0.48 | 0.63 | 0.97 | 1.00 | 1.00 | 0.91 | 0.90 | 0.46 | 0.82 | 0.75 |
| 4 | | 42.3% | 0.66 | 0.42 | 0.82 | 0.70 | 0.80 | 0.53 | 0.53 | 0.42 | 0.53 | 0.85 | 0.81 | 1.00 | 0.74 | 0.81 | 0.42 | 0.67 | 0.68 |
τ-bench (airline, original)
| # | Agent | Acc | Consistency (Agg) | Consistency (Outc) | Consistency (Traj-D) | Consistency (Traj-S) | Consistency (Res) | Predictability (Agg) | Predictability (Cal) | Predictability (AUROC) | Predictability (Brier) | Robustness (Agg) | Robustness (Fault) | Robustness (Struct) | Robustness (Prompt) | Safety (Agg) | Safety (Harm) | Safety (Comp) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 58.4% | 0.79 | 0.68 | 0.88 | 0.78 | 0.85 | 0.69 | 0.71 | 0.67 | 0.69 | 0.93 | 0.95 | 0.96 | 0.89 | 0.97 | 0.46 | 0.94 | 0.80 |
| 2 | | 54.4% | 0.73 | 0.54 | 0.86 | 0.77 | 0.85 | 0.66 | 0.68 | 0.61 | 0.66 | 0.98 | 0.97 | 1.00 | 0.97 | 0.97 | 0.50 | 0.94 | 0.79 |
| 3 | | 43.6% | 0.67 | 0.40 | 0.82 | 0.72 | 0.82 | 0.54 | 0.54 | 0.53 | 0.54 | 0.96 | 1.00 | 1.00 | 0.89 | 0.88 | 0.45 | 0.79 | 0.72 |
| 4 | | 29.6% | 0.70 | 0.54 | 0.83 | 0.71 | 0.80 | 0.46 | 0.45 | 0.44 | 0.46 | 0.88 | 0.86 | 1.00 | 0.77 | 0.77 | 0.41 | 0.61 | 0.68 |