Safety on GAIA

How often does the agent violate safety constraints (harmful content, policy non-compliance)?
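The full metric definition is not reproduced on this page, so the following is only a minimal sketch of one plausible reading: a per-constraint compliance score computed as the fraction of runs with no recorded violation. The `RunResult` type, its two flags, and the `compliance_rate` helper are hypothetical names introduced for illustration. They are not the benchmark's schema or scoring code, and the leaderboard's Harm and Comp columns may be computed differently.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One agent run. Both flags are hypothetical labels, not the benchmark's schema."""
    harmful_content: bool   # True if the run produced harmful content
    policy_violation: bool  # True if the run broke a stated policy constraint


def compliance_rate(runs, flag):
    """Return the fraction of runs in which the named violation flag is False."""
    if not runs:
        raise ValueError("need at least one run")
    clean = sum(1 for run in runs if not getattr(run, flag))
    return clean / len(runs)


# Four illustrative runs (made-up data): one policy violation, no harmful content.
runs = [
    RunResult(harmful_content=False, policy_violation=False),
    RunResult(harmful_content=False, policy_violation=True),
    RunResult(harmful_content=False, policy_violation=False),
    RunResult(harmful_content=False, policy_violation=False),
]

harm = compliance_rate(runs, "harmful_content")
comp = compliance_rate(runs, "policy_violation")
```

Under this reading a score of 1.00 means no violations were observed in any run. How the per-constraint scores combine into an aggregate is not stated in the text, so the sketch leaves that step out.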

[Charts: Sub-metric Comparison; Per-Constraint Compliance]
Agent Leaderboard — Safety
#   Agent              Acc     Reliability  Safety Agg  Harm  Comp
1   Claude Opus 4.5    68.5%   0.85         1.00        1.00  1.00
2   Claude Sonnet 4    54.7%   0.82         1.00        1.00  1.00
3   Gemini 3.5 Flash   79.2%   0.84         1.00        1.00  1.00
4   GPT-4 Turbo        30.8%   0.76         1.00        1.00  1.00
5   GPT-4o Mini        26.3%   0.76         1.00        1.00  1.00
6   O1                 34.4%   0.79         1.00        0.50  1.00
7   Claude 3.5 Haiku   25.7%   0.82         1.00        0.50  1.00
8   Gemini 3.1 Pro     76.2%   0.82         1.00        0.50  1.00
9   Gemini 2.5 Pro     52.5%   0.78         1.00        0.50  1.00
10  Gemini 2.5 Flash   46.7%   0.74         1.00        0.50  1.00
11  GPT-5.2            33.2%   0.72         1.00        0.50  1.00
12  Claude Opus 4.7    73.3%   0.84         1.00        0.50  0.99
13  GPT-5.5            62.8%   0.79         1.00        0.50  0.99
14  Claude 3 Haiku     12.9%   0.69         0.99        0.50  0.99
15  GPT-5.2 (medium)   31.8%   0.72         0.99        0.50  0.98