Safety on τ-bench (airline, clean)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

Panels not reproduced here: Sub-metric Comparison; Per-Constraint Compliance
Agent Leaderboard — Safety
#   Agent              Acc     Reliability  Safety Agg  Harm  Comp
1   Gemini 3.5 Flash   80.8%   0.86         0.99        0.75  0.97
2   Claude Opus 4.5    80.8%   0.88         0.98        0.50  0.96
3   Gemini 3.1 Pro     82.1%   0.86         0.97        0.50  0.94
4   GPT-5.2            60.3%   0.81         0.97        0.38  0.95
5   GPT-5.5            79.5%   0.89         0.96        0.40  0.94
6   Claude Sonnet 4    78.2%   0.86         0.96        0.42  0.92
7   Claude Opus 4.7    84.6%   0.89         0.94        0.44  0.88
8   GPT-5.2 (medium)   67.9%   0.82         0.94        0.44  0.88
9   Gemini 2.5 Pro     71.8%   0.81         0.92        0.33  0.88
10  O1                 66.2%   0.81         0.91        0.46  0.83
11  Gemini 2.5 Flash   59.0%   0.72         0.90        0.27  0.86
12  GPT-4 Turbo        57.7%   0.72         0.87        0.38  0.79
13  GPT-4o Mini        29.5%   0.67         0.85        0.45  0.72
14  Claude 3.5 Haiku   29.5%   0.71         0.81        0.40  0.68
15  Claude 3 Haiku     20.5%   0.70         0.55        0.30  0.36
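
The page does not say how the Safety Agg column is derived from the Harm and Comp sub-metrics. As a rough sanity check, the Python sketch below tests one candidate relation, Safety Agg ≈ 1 - (1 - Comp) * (1 - Harm), against the rows above. This formula is an assumption inferred from the published numbers, not the benchmark's stated definition, and the helper names (assumed_safety_agg, max_abs_gap) are hypothetical. For these 15 rows the candidate stays within 0.01 of the reported value, which is about what two-decimal rounding of the inputs would allow; that makes it plausible, not confirmed.

```python
# Rows copied from the leaderboard above: (agent, safety_agg, harm, comp).
ROWS = [
    ("Gemini 3.5 Flash", 0.99, 0.75, 0.97),
    ("Claude Opus 4.5", 0.98, 0.50, 0.96),
    ("Gemini 3.1 Pro", 0.97, 0.50, 0.94),
    ("GPT-5.2", 0.97, 0.38, 0.95),
    ("GPT-5.5", 0.96, 0.40, 0.94),
    ("Claude Sonnet 4", 0.96, 0.42, 0.92),
    ("Claude Opus 4.7", 0.94, 0.44, 0.88),
    ("GPT-5.2 (medium)", 0.94, 0.44, 0.88),
    ("Gemini 2.5 Pro", 0.92, 0.33, 0.88),
    ("O1", 0.91, 0.46, 0.83),
    ("Gemini 2.5 Flash", 0.90, 0.27, 0.86),
    ("GPT-4 Turbo", 0.87, 0.38, 0.79),
    ("GPT-4o Mini", 0.85, 0.45, 0.72),
    ("Claude 3.5 Haiku", 0.81, 0.40, 0.68),
    ("Claude 3 Haiku", 0.55, 0.30, 0.36),
]


def assumed_safety_agg(harm: float, comp: float) -> float:
    """Candidate aggregation (an assumption, not the source's definition)."""
    return 1.0 - (1.0 - comp) * (1.0 - harm)


def max_abs_gap(rows) -> float:
    """Largest absolute difference between reported and candidate values."""
    return max(abs(assumed_safety_agg(h, c) - s) for _, s, h, c in rows)


if __name__ == "__main__":
    for agent, s, h, c in ROWS:
        print(f"{agent:<18} reported={s:.2f} candidate={assumed_safety_agg(h, c):.3f}")
    print(f"largest gap: {max_abs_gap(ROWS):.4f}")
```

If the full metric definition is available, it should replace assumed_safety_agg; the check itself only needs the three columns per row.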