Safety on GAIA

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

[Charts: Sub-metric Comparison; Per-Constraint Compliance]
Agent Leaderboard — Safety
#   Agent              Acc    Safety Agg  Harm  Comp  Overall
1   Claude 3.5 Haiku   28.1%  1.00        1.00  1.00  0.74
2   Claude Opus 4.5    71.5%  1.00        1.00  1.00  0.82
3   Claude Sonnet 4.5  74.7%  1.00        1.00  1.00  0.80
4   O1                 34.7%  1.00        0.75  1.00  0.72
5   Gemini 2.5 Flash   37.8%  1.00        0.50  1.00  0.76
6   GPT-4 Turbo        20.0%  1.00        0.50  1.00  0.76
7   GPT-4o Mini        22.0%  1.00        0.50  1.00  0.73
8   Claude 3.7 Sonnet  62.4%  1.00        0.50  1.00  0.77
9   Gemini 2.5 Pro     50.1%  1.00        0.33  0.99  0.78
10  Gemini 2.0 Flash   27.9%  0.99        0.50  0.99  0.70
11  GPT-5.2            29.9%  0.99        0.53  0.98  0.74
12  GPT-5.2 (medium)   42.6%  0.98        0.50  0.95  0.74
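
For readers who want to work with these numbers directly, here is a minimal Python sketch that loads the rows above and runs two simple checks. The tuple layout and the helper names (`ROWS`, `is_ranked_by_safety_agg`, `perfect_on_all_safety_columns`) are illustrative assumptions, not part of the leaderboard. The sketch also assumes the four trailing numbers in each row map to Safety Agg, Harm, Comp, and Overall, in that order. It does not assume any formula relating those columns.

```python
# Rows copied from the leaderboard above. Assumed layout (not confirmed by the page):
# (rank, agent, acc_percent, safety_agg, harm, comp, overall)
ROWS = [
    (1, "Claude 3.5 Haiku", 28.1, 1.00, 1.00, 1.00, 0.74),
    (2, "Claude Opus 4.5", 71.5, 1.00, 1.00, 1.00, 0.82),
    (3, "Claude Sonnet 4.5", 74.7, 1.00, 1.00, 1.00, 0.80),
    (4, "O1", 34.7, 1.00, 0.75, 1.00, 0.72),
    (5, "Gemini 2.5 Flash", 37.8, 1.00, 0.50, 1.00, 0.76),
    (6, "GPT-4 Turbo", 20.0, 1.00, 0.50, 1.00, 0.76),
    (7, "GPT-4o Mini", 22.0, 1.00, 0.50, 1.00, 0.73),
    (8, "Claude 3.7 Sonnet", 62.4, 1.00, 0.50, 1.00, 0.77),
    (9, "Gemini 2.5 Pro", 50.1, 1.00, 0.33, 0.99, 0.78),
    (10, "Gemini 2.0 Flash", 27.9, 0.99, 0.50, 0.99, 0.70),
    (11, "GPT-5.2", 29.9, 0.99, 0.53, 0.98, 0.74),
    (12, "GPT-5.2 (medium)", 42.6, 0.98, 0.50, 0.95, 0.74),
]


def is_ranked_by_safety_agg(rows):
    """Return True if Safety Agg never increases as rank increases."""
    ordered = sorted(rows, key=lambda row: row[0])
    return all(a[3] >= b[3] for a, b in zip(ordered, ordered[1:]))


def perfect_on_all_safety_columns(rows):
    """Return agents scoring 1.00 on Safety Agg, Harm, and Comp."""
    return [row[1] for row in rows if row[3] == row[4] == row[5] == 1.00]


if __name__ == "__main__":
    print(is_ranked_by_safety_agg(ROWS))
    print(perfect_on_all_safety_columns(ROWS))
```

The first check only confirms that the listed order is consistent with sorting by Safety Agg in descending order. It does not reveal how ties among the 1.00 rows are broken.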