Predictability on GAIA

How well does the agent's expressed confidence predict whether it will answer correctly?
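
The leaderboard's AUROC and Brier sub-metrics are both standard scores computed over per-episode (confidence, correct) pairs. Below is a minimal sketch of the two, assuming each episode yields a self-reported confidence in [0, 1] and a binary correctness label; the sample data and the scikit-learn calls are illustrative, not the benchmark harness. Note that the raw Brier score is lower-is-better, so the Brier column below presumably reports a rescaled variant.

```python
# Minimal sketch: scoring predictability from (confidence, correct) pairs.
# The sample data is hypothetical; a real run would use per-episode records.
from sklearn.metrics import brier_score_loss, roc_auc_score

confidences = [0.95, 0.80, 0.60, 0.30, 0.90, 0.20]  # agent's stated confidence per episode
correct     = [1,    1,    0,    0,    1,    0]      # 1 = answered correctly

# AUROC: does confidence rank correct answers above incorrect ones?
auroc = roc_auc_score(correct, confidences)

# Brier: mean squared error between confidence and outcome (lower is better).
brier = brier_score_loss(correct, confidences)

print(f"AUROC: {auroc:.3f}  Brier: {brier:.3f}")
```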

[Charts: Sub-metric Comparison, plus per-agent Calibration Curves and Accuracy-Coverage Curves for Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude Opus 4.5, Claude Sonnet 4.5, GPT-5.2, GPT-5.2 (medium), GPT-4 Turbo, GPT-4o mini, Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, and o1.]
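
Both chart types can be computed from the same per-episode records. A hedged sketch follows; the ten-bin reliability binning and the helper names are assumptions, not the site's exact implementation.

```python
# Sketch of the two plotted curves; bin count and function names are assumptions.
import numpy as np

def calibration_curve(confidences, correct, n_bins=10):
    """Mean confidence vs. empirical accuracy per bin (a reliability diagram)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            points.append((confidences[mask].mean(), correct[mask].mean()))
    return points  # a perfectly calibrated agent lies on the diagonal y = x

def accuracy_coverage_curve(confidences, correct):
    """Accuracy over the top-k most-confident answers, at every coverage k/n."""
    order = np.argsort(confidences)[::-1]  # most confident first
    hits = np.asarray(correct, dtype=float)[order]
    k = np.arange(1, len(hits) + 1)
    return list(zip(k / len(hits), np.cumsum(hits) / k))
```

If the agent's confidence is informative, the accuracy-coverage curve stays high at low coverage and decays toward overall accuracy as coverage approaches 1.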
Agent Leaderboard — Predictability
| # | Agent | Accuracy | Predictability (agg.) | Cal | AUROC | Brier | Overall |
|---|-------|----------|-----------------------|-----|-------|-------|---------|
| 1 | Claude Opus 4.5 | 71.5% | 0.82 | 0.92 | 0.72 | 0.82 | 0.82 |
| 2 | Claude Sonnet 4.5 | 74.7% | 0.81 | 0.91 | 0.66 | 0.81 | 0.80 |
| 3 | Claude 3.7 Sonnet | 62.4% | 0.77 | 0.87 | 0.67 | 0.77 | 0.77 |
| 4 | GPT-4 Turbo | 20.0% | 0.75 | 0.69 | 0.84 | 0.75 | 0.76 |
| 5 | Gemini 2.5 Pro | 50.1% | 0.74 | 0.77 | 0.75 | 0.74 | 0.78 |
| 6 | o1 | 34.7% | 0.74 | 0.70 | 0.82 | 0.74 | 0.72 |
| 7 | GPT-5.2 | 29.9% | 0.72 | 0.74 | 0.73 | 0.72 | 0.74 |
| 8 | Gemini 2.5 Flash | 37.8% | 0.72 | 0.72 | 0.83 | 0.72 | 0.76 |
| 9 | Gemini 2.0 Flash | 27.9% | 0.72 | 0.66 | 0.83 | 0.72 | 0.70 |
| 10 | Claude 3.5 Haiku | 28.1% | 0.72 | 0.70 | 0.72 | 0.72 | 0.74 |
| 11 | GPT-5.2 (medium) | 42.6% | 0.70 | 0.74 | 0.65 | 0.70 | 0.74 |
| 12 | GPT-4o mini | 22.0% | 0.69 | 0.60 | 0.79 | 0.69 | 0.73 |
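
The Cal column is a calibration summary. One common choice is expected calibration error (ECE), sketched below with equal-width bins; whether this leaderboard uses ECE, and how it is rescaled so that higher values are better, is an assumption here rather than a documented fact.

```python
# Hedged sketch of expected calibration error (ECE); equal-width bins are an
# assumption, and the leaderboard's exact "Cal" definition may differ.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weight-averaged gap between mean confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece  # 0.0 = perfectly calibrated; e.g. 1 - ECE makes higher better
```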