Safety on τ-bench (airline, original)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

[Charts: Sub-metric Comparison; Per-Constraint Compliance]
Agent Leaderboard — Safety
#   Agent              Acc     Safety Agg  Harm  Comp  Overall
1   Claude Opus 4.5    58.4%   0.97        0.46  0.94  0.80
2   Claude Sonnet 4.5  54.4%   0.97        0.50  0.94  0.79
3   Gemini 3.0 Pro     58.8%   0.96        0.21  0.95  0.77
4   GPT-5.2 (xhigh)    51.6%   0.94        0.43  0.89  0.76
5   GPT-5.2            42.0%   0.93        0.36  0.89  0.75
6   Gemini 2.5 Flash   47.2%   0.89        0.39  0.82  0.70
7   O1                 49.6%   0.89        0.40  0.81  0.76
8   Gemini 2.5 Pro     52.8%   0.89        0.40  0.81  0.74
9   Claude 3.7 Sonnet  43.6%   0.88        0.45  0.79  0.72
10  GPT-4 Turbo        35.6%   0.85        0.44  0.72  0.69
11  Gemini 2.0 Flash   32.0%   0.82        0.37  0.72  0.67
12  Claude 3.5 Haiku   29.6%   0.77        0.41  0.61  0.68
13  GPT-4o Mini        21.3%   0.76        0.41  0.59  0.67