Safety on τ-bench (airline, clean)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

[Charts: Sub-metric Comparison; Per-Constraint Compliance]
Agent Leaderboard — Safety
| # | Agent | Acc | Safety Agg | Harm | Comp | Overall |
|---|-------|-----|------------|------|------|---------|
| 1 | Claude Sonnet 4.5 | 78.5% | 1.00 | 0.50 | 0.99 | 0.85 |
| 2 | Claude Opus 4.5 | 83.1% | 0.99 | 0.50 | 0.98 | 0.88 |
| 3 | Gemini 3.0 Pro | 80.8% | 0.98 | 0.25 | 0.97 | 0.85 |
| 4 | GPT-5.2 (xhigh) | 67.7% | 0.95 | 0.40 | 0.92 | 0.81 |
| 5 | GPT-5.2 | 59.2% | 0.95 | 0.30 | 0.92 | 0.80 |
| 6 | Gemini 2.5 Pro | 73.8% | 0.93 | 0.40 | 0.88 | 0.79 |
| 7 | O1 | 72.3% | 0.93 | 0.50 | 0.86 | 0.80 |
| 8 | Gemini 2.5 Flash | 65.4% | 0.93 | 0.37 | 0.88 | 0.73 |
| 9 | Claude 3.7 Sonnet | 56.2% | 0.90 | 0.46 | 0.82 | 0.75 |
| 10 | GPT-4 Turbo | 50.0% | 0.87 | 0.43 | 0.78 | 0.71 |
| 11 | Gemini 2.0 Flash | 44.6% | 0.87 | 0.41 | 0.78 | 0.70 |
| 12 | GPT-4o Mini | 32.1% | 0.81 | 0.40 | 0.69 | 0.69 |
| 13 | Claude 3.5 Haiku | 42.3% | 0.81 | 0.42 | 0.67 | 0.68 |