Safety on τ-bench (airline, clean)
How often does the agent violate safety constraints (harmful content, policy non-compliance)?
Sub-metric Comparison
Per-Constraint Compliance
Agent Leaderboard — Safety
| # | Agent | Acc | Reliability | Safety Agg | Harm | Comp |
|---|---|---|---|---|---|---|
| 1 | | 80.8% | 0.86 | 0.99 | 0.75 | 0.97 |
| 2 | | 80.8% | 0.88 | 0.98 | 0.50 | 0.96 |
| 3 | | 82.1% | 0.86 | 0.97 | 0.50 | 0.94 |
| 4 | | 60.3% | 0.81 | 0.97 | 0.38 | 0.95 |
| 5 | | 79.5% | 0.89 | 0.96 | 0.40 | 0.94 |
| 6 | | 78.2% | 0.86 | 0.96 | 0.42 | 0.92 |
| 7 | | 84.6% | 0.89 | 0.94 | 0.44 | 0.88 |
| 8 | | 67.9% | 0.82 | 0.94 | 0.44 | 0.88 |
| 9 | | 71.8% | 0.81 | 0.92 | 0.33 | 0.88 |
| 10 | | 66.2% | 0.81 | 0.91 | 0.46 | 0.83 |
| 11 | | 59.0% | 0.72 | 0.90 | 0.27 | 0.86 |
| 12 | | 57.7% | 0.72 | 0.87 | 0.38 | 0.79 |
| 13 | | 29.5% | 0.67 | 0.85 | 0.45 | 0.72 |
| 14 | | 29.5% | 0.71 | 0.81 | 0.40 | 0.68 |
| 15 | | 20.5% | 0.70 | 0.55 | 0.30 | 0.36 |
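
The leaderboard appears to be ordered by the Safety Agg column rather than by accuracy. A minimal Python sketch for checking that ordering and re-ranking by another column follows. The row values are copied from the table above; the names `Row`, `is_sorted_desc`, and `rerank` are hypothetical helpers for illustration, and agent names are omitted because they are not available here.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row; field names are illustrative, not an official schema."""
    rank: int
    accuracy: float      # percent
    reliability: float
    safety_agg: float
    harm: float
    comp: float


# Values copied from the leaderboard table, in displayed order.
ROWS = [Row(*values) for values in [
    (1, 80.8, 0.86, 0.99, 0.75, 0.97),
    (2, 80.8, 0.88, 0.98, 0.50, 0.96),
    (3, 82.1, 0.86, 0.97, 0.50, 0.94),
    (4, 60.3, 0.81, 0.97, 0.38, 0.95),
    (5, 79.5, 0.89, 0.96, 0.40, 0.94),
    (6, 78.2, 0.86, 0.96, 0.42, 0.92),
    (7, 84.6, 0.89, 0.94, 0.44, 0.88),
    (8, 67.9, 0.82, 0.94, 0.44, 0.88),
    (9, 71.8, 0.81, 0.92, 0.33, 0.88),
    (10, 66.2, 0.81, 0.91, 0.46, 0.83),
    (11, 59.0, 0.72, 0.90, 0.27, 0.86),
    (12, 57.7, 0.72, 0.87, 0.38, 0.79),
    (13, 29.5, 0.67, 0.85, 0.45, 0.72),
    (14, 29.5, 0.71, 0.81, 0.40, 0.68),
    (15, 20.5, 0.70, 0.55, 0.30, 0.36),
]]


def is_sorted_desc(rows, column):
    """Return True if `column` is non-increasing in the displayed row order."""
    values = [getattr(row, column) for row in rows]
    return all(a >= b for a, b in zip(values, values[1:]))


def rerank(rows, column):
    """Return rows sorted by `column`, highest first; ties keep displayed order."""
    return sorted(rows, key=lambda row: getattr(row, column), reverse=True)


if __name__ == "__main__":
    print(is_sorted_desc(ROWS, "safety_agg"))            # is the table ordered by Safety Agg?
    print(is_sorted_desc(ROWS, "accuracy"))              # ...or by accuracy?
    print(max(ROWS, key=lambda row: row.accuracy).rank)  # displayed rank of the most accurate agent
    print([row.rank for row in rerank(ROWS, "comp")])    # displayed ranks, reordered by Comp
```

Re-ranking by a single column this way makes it easy to see where accuracy and the safety sub-metrics disagree about which agent comes first.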