HAL Reliability Evaluation

Predictability on GAIA

How well does the agent's expressed confidence predict whether it will answer correctly?

#	Agent	Accuracy	Reliability	Predictability (Agg)	Calibration	AUROC	Brier
1	Claude Opus 4.5	68.5%	0.85	0.84	0.97	0.80	0.84
2	GPT-5.5	62.8%	0.79	0.80	0.88	0.76	0.80
3	Gemini 3.5 Flash	79.2%	0.84	0.80	0.79	0.57	0.80
4	Gemini 3.1 Pro	76.2%	0.82	0.78	0.79	0.72	0.78
5	GPT-5.2	33.2%	0.72	0.78	0.81	0.80	0.78
6	Claude Sonnet 4	54.7%	0.82	0.73	0.77	0.71	0.73
7	Claude Opus 4.7	73.3%	0.84	0.73	0.76	0.63	0.73
8	Claude 3.5 Haiku	25.7%	0.82	0.72	0.67	0.76	0.72
9	Gemini 2.5 Pro	52.5%	0.78	0.71	0.71	0.75	0.71
10	O1	34.4%	0.79	0.68	0.66	0.76	0.68
11	Gemini 2.5 Flash	46.7%	0.74	0.64	0.63	0.77	0.64
12	GPT-4 Turbo	30.8%	0.76	0.64	0.60	0.75	0.64
13	GPT-5.2 (medium)	31.8%	0.72	0.61	0.61	0.61	0.61
14	GPT-4o Mini	26.3%	0.76	0.58	0.52	0.72	0.58
15	Claude 3 Haiku	12.9%	0.69	0.49	0.38	0.73	0.49
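
As a rough illustration of what the calibration, AUROC, and Brier columns measure, the sketch below computes the three underlying quantities from per-task confidences and correctness labels. It is not HAL's implementation: the function names, the 10-bin calibration error, and the sample data are assumptions made for illustration. The table appears to report higher-is-better scores, so the leaderboard may transform the raw errors (for example, 1 minus the error), which is not confirmed here.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer received higher confidence than an incorrect one.

    Ties count as half a win. Needs at least one correct and one incorrect answer.
    """
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both correct and incorrect answers")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical data: four tasks with the agent's stated confidence and whether it was right.
confidences = [0.9, 0.8, 0.3, 0.2]
outcomes = [1, 1, 0, 0]

print("Brier:", brier_score(confidences, outcomes))
print("AUROC:", auroc(confidences, outcomes))
print("ECE:  ", expected_calibration_error(confidences, outcomes))
```

AUROC depends only on how the confidences are ranked, while the Brier score and calibration error depend on their actual values. That is one way an agent can score well on one column and poorly on another.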