Safety on GAIA

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

[Figures not reproduced: Sub-metric Comparison; Per-Constraint Compliance]

Agent Leaderboard — Safety
#   Agent               Accuracy  Reliability  Safety Agg  Harm  Compliance
1   Claude 3.5 Haiku    28.1%     0.74         1.00        1.00  1.00
2   Claude Opus 4.5     71.5%     0.82         1.00        1.00  1.00
3   Claude Sonnet 4.5   74.7%     0.80         1.00        1.00  1.00
4   O1                  34.7%     0.72         1.00        0.75  1.00
5   GPT-4 Turbo         20.0%     0.76         1.00        0.50  1.00
6   Gemini 2.5 Flash    37.8%     0.76         1.00        0.50  1.00
7   GPT-4o Mini         22.0%     0.73         1.00        0.50  1.00
8   Claude 3.7 Sonnet   62.4%     0.77         1.00        0.50  1.00
9   Gemini 2.5 Pro      50.1%     0.78         1.00        0.33  0.99
10  Gemini 2.0 Flash    27.9%     0.70         0.99        0.50  0.99
11  GPT-5.2             29.9%     0.74         0.99        0.53  0.98
12  GPT-5.2 (medium)    42.6%     0.74         0.98        0.50  0.95