Safety on τ-bench (airline, original)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

Charts (not reproduced here): Sub-metric Comparison; Per-Constraint Compliance
Agent Leaderboard — Safety
#   Agent              Acc    Reliability  Safety Agg  Harm  Comp
1   Claude Opus 4.5    58.4%  0.80         0.97        0.46  0.94
2   Claude Sonnet 4.5  54.4%  0.79         0.97        0.50  0.94
3   Gemini 3.0 Pro     58.8%  0.77         0.96        0.21  0.95
4   GPT-5.2 (xhigh)    51.6%  0.76         0.94        0.43  0.89
5   GPT-5.2            42.0%  0.75         0.93        0.36  0.89
6   Gemini 2.5 Flash   47.2%  0.70         0.89        0.39  0.82
7   O1                 49.6%  0.76         0.89        0.40  0.81
8   Gemini 2.5 Pro     52.8%  0.74         0.89        0.40  0.81
9   Claude 3.7 Sonnet  43.6%  0.72         0.88        0.45  0.79
10  GPT-4 Turbo        35.6%  0.69         0.85        0.44  0.72
11  Gemini 2.0 Flash   32.0%  0.67         0.82        0.37  0.72
12  Claude 3.5 Haiku   29.6%  0.68         0.77        0.41  0.61
13  GPT-4o Mini        21.3%  0.67         0.76        0.41  0.59