Safety on τ-bench (airline, clean)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

Agent Leaderboard — Safety
| # | Agent | Accuracy (%) | Reliability | Safety (agg.) | Harm | Compliance |
|---|-------|--------------|-------------|---------------|------|------------|
| 1 | Claude Sonnet 4.5 | 78.5 | 0.85 | 1.00 | 0.50 | 0.99 |
| 2 | Claude Opus 4.5 | 83.1 | 0.88 | 0.99 | 0.50 | 0.98 |
| 3 | Gemini 3.0 Pro | 80.8 | 0.85 | 0.98 | 0.25 | 0.97 |
| 4 | GPT-5.2 (xhigh) | 67.7 | 0.81 | 0.95 | 0.40 | 0.92 |
| 5 | GPT-5.2 | 59.2 | 0.80 | 0.95 | 0.30 | 0.92 |
| 6 | o1 | 72.3 | 0.80 | 0.93 | 0.50 | 0.86 |
| 7 | Gemini 2.5 Pro | 73.8 | 0.79 | 0.93 | 0.40 | 0.88 |
| 8 | Gemini 2.5 Flash | 65.4 | 0.73 | 0.93 | 0.37 | 0.88 |
| 9 | Claude 3.7 Sonnet | 56.2 | 0.75 | 0.90 | 0.46 | 0.82 |
| 10 | GPT-4 Turbo | 50.0 | 0.71 | 0.87 | 0.43 | 0.78 |
| 11 | Gemini 2.0 Flash | 44.6 | 0.70 | 0.87 | 0.41 | 0.78 |
| 12 | GPT-4o Mini | 32.1 | 0.69 | 0.81 | 0.40 | 0.69 |
| 13 | Claude 3.5 Haiku | 42.3 | 0.68 | 0.81 | 0.42 | 0.67 |
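The page does not show the scoring pipeline behind these columns. As a minimal sketch, a violation-rate safety score like the one described above ("how often does the agent violate safety constraints") could be aggregated from per-episode annotations as follows; the `Episode` schema and field names here are hypothetical, not the benchmark's actual format:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    # Hypothetical per-episode safety annotations (not the benchmark's schema).
    harmful_content: bool    # agent emitted harmful content
    policy_violation: bool   # agent broke an airline-domain policy

def safety_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes with no safety violation of any kind."""
    clean = sum(
        1 for e in episodes
        if not (e.harmful_content or e.policy_violation)
    )
    return clean / len(episodes)

runs = [
    Episode(harmful_content=False, policy_violation=False),
    Episode(harmful_content=False, policy_violation=True),
    Episode(harmful_content=False, policy_violation=False),
    Episode(harmful_content=True, policy_violation=False),
]
print(round(safety_rate(runs), 2))  # 2 of 4 episodes are clean -> 0.5
```

A per-constraint compliance column (as in the table above) would apply the same counting per violation type instead of over the disjunction.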