HAL Reliability Evaluation

Safety on τ-bench (airline, original)

How often does the agent violate safety constraints (harmful content, policy non-compliance)?

#	Agent	Accuracy	Safety (aggregate)	Harm	Compliance	Overall
1	Claude Opus 4.5	58.4%	0.97	0.46	0.94	0.80
2	Claude Sonnet 4.5	54.4%	0.97	0.50	0.94	0.79
3	Gemini 3.0 Pro	58.8%	0.96	0.21	0.95	0.77
4	GPT-5.2 (xhigh)	51.6%	0.94	0.43	0.89	0.76
5	GPT-5.2	42.0%	0.93	0.36	0.89	0.75
6	Gemini 2.5 Flash	47.2%	0.89	0.39	0.82	0.70
7	O1	49.6%	0.89	0.40	0.81	0.76
8	Gemini 2.5 Pro	52.8%	0.89	0.40	0.81	0.74
9	Claude 3.7 Sonnet	43.6%	0.88	0.45	0.79	0.72
10	GPT-4 Turbo	35.6%	0.85	0.44	0.72	0.69
11	Gemini 2.0 Flash	32.0%	0.82	0.37	0.72	0.67
12	Claude 3.5 Haiku	29.6%	0.77	0.41	0.61	0.68
13	GPT-4o Mini	21.3%	0.76	0.41	0.59	0.67
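
The rows above appear to be ordered by the aggregate safety score rather than by the Overall column (for example, O1 at rank 7 has a higher Overall than the two rows before it). A minimal Python sketch for re-ranking the same rows by any column follows. The `Row` type, its field names, and the `rank_by` helper are hypothetical names introduced here for illustration and are not part of HAL; the values are copied from the table, with accuracy stored as a percentage.

```python
from typing import NamedTuple


class Row(NamedTuple):
    # Field names are illustrative assumptions, not HAL identifiers.
    agent: str
    acc: float         # accuracy, in percent
    safety_agg: float  # aggregate safety score
    harm: float
    comp: float
    overall: float


# Values copied from the table above, in the order listed.
ROWS = [
    Row("Claude Opus 4.5", 58.4, 0.97, 0.46, 0.94, 0.80),
    Row("Claude Sonnet 4.5", 54.4, 0.97, 0.50, 0.94, 0.79),
    Row("Gemini 3.0 Pro", 58.8, 0.96, 0.21, 0.95, 0.77),
    Row("GPT-5.2 (xhigh)", 51.6, 0.94, 0.43, 0.89, 0.76),
    Row("GPT-5.2", 42.0, 0.93, 0.36, 0.89, 0.75),
    Row("Gemini 2.5 Flash", 47.2, 0.89, 0.39, 0.82, 0.70),
    Row("O1", 49.6, 0.89, 0.40, 0.81, 0.76),
    Row("Gemini 2.5 Pro", 52.8, 0.89, 0.40, 0.81, 0.74),
    Row("Claude 3.7 Sonnet", 43.6, 0.88, 0.45, 0.79, 0.72),
    Row("GPT-4 Turbo", 35.6, 0.85, 0.44, 0.72, 0.69),
    Row("Gemini 2.0 Flash", 32.0, 0.82, 0.37, 0.72, 0.67),
    Row("Claude 3.5 Haiku", 29.6, 0.77, 0.41, 0.61, 0.68),
    Row("GPT-4o Mini", 21.3, 0.76, 0.41, 0.59, 0.67),
]


def rank_by(rows, column):
    """Return rows sorted by `column`, highest first.

    Python's sort is stable, so tied rows keep their input order.
    """
    return sorted(rows, key=lambda r: getattr(r, column), reverse=True)


by_safety = [r.agent for r in rank_by(ROWS, "safety_agg")]
by_overall = [r.agent for r in rank_by(ROWS, "overall")]

# Sorting by aggregate safety reproduces the listed order.
print(by_safety == [r.agent for r in ROWS])  # True

for position, agent in enumerate(by_overall, start=1):
    print(position, agent)
```

Because all scores are rounded to two decimals, ties such as the three rows at 0.89 cannot be separated from this table alone; the stable sort simply preserves the order in which they are listed.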