HAL Reliability Evaluation

Consistency on GAIA

How reproducible are the agent's answers and trajectories across repeated runs on the same task?
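To make the idea of outcome reproducibility concrete, here is a minimal sketch of one possible per-task measure: the fraction of run pairs that agree on pass/fail. This is an illustrative assumption, not HAL's definition of the Outc or Consistency Agg columns; the function names and the sample data are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(outcomes):
    """Fraction of run pairs on a single task that share the same outcome.

    Illustrative measure only; not the metric used by the leaderboard.
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        raise ValueError("need at least two runs per task")
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_task_agreement(runs_by_task):
    """Average the per-task pairwise agreement over all tasks."""
    scores = [pairwise_agreement(o) for o in runs_by_task.values()]
    return sum(scores) / len(scores)


# Hypothetical pass/fail results for three tasks, four runs each.
runs_by_task = {
    "task-a": [True, True, True, True],      # always passes: fully consistent
    "task-b": [True, False, True, False],    # flips between runs
    "task-c": [False, False, False, True],   # mostly fails
}

print(round(mean_task_agreement(runs_by_task), 3))
```

Under this measure an agent that always fails a task is just as consistent as one that always passes it, which is one way a low-accuracy agent can still score well on consistency.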

Figure: each cell represents a task; color shows outcome consistency across runs.

#	Agent	Acc	Reliability	Consistency Agg	Outc	Traj-D	Traj-S	Res
1	Claude Opus 4.5	68.5%	0.85	0.81	0.84	0.89	0.75	0.77
2	Claude Opus 4.7	73.3%	0.84	0.81	0.84	0.87	0.74	0.77
3	Claude 3.5 Haiku	25.7%	0.82	0.80	0.82	0.89	0.76	0.75
4	Claude Sonnet 4	54.7%	0.82	0.78	0.76	0.87	0.75	0.76
5	GPT-4 Turbo	30.8%	0.76	0.76	0.73	0.90	0.79	0.71
6	Gemini 3.1 Pro	76.2%	0.82	0.76	0.83	0.81	0.61	0.73
7	Gemini 3.5 Flash	79.2%	0.84	0.75	0.84	0.82	0.61	0.69
8	GPT-4o Mini	26.3%	0.76	0.73	0.75	0.87	0.72	0.65
9	Gemini 2.5 Pro	52.5%	0.78	0.72	0.73	0.87	0.71	0.65
10	Claude 3 Haiku	12.9%	0.69	0.71	0.84	0.78	0.55	0.61
11	Gemini 2.5 Flash	46.7%	0.74	0.70	0.73	0.82	0.64	0.64
12	O1	34.4%	0.79	0.68	0.58	0.85	0.78	0.65
13	GPT-5.2 (medium)	31.8%	0.72	0.62	0.65	0.74	0.56	0.56
14	GPT-5.5	62.8%	0.79	0.61	0.60	0.71	0.53	0.61
15	GPT-5.2	33.2%	0.72	0.59	0.58	0.77	0.54	0.54
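
The rows above are ordered by the Consistency Agg column, not by accuracy, so the two rankings can differ. The sketch below copies four rows from the table (agent, Acc, Consistency Agg) and sorts them both ways; the variable and function names are hypothetical.

```python
# (agent, accuracy in percent, Consistency Agg), copied from the table above.
rows = [
    ("Claude Opus 4.5", 68.5, 0.81),
    ("Claude 3.5 Haiku", 25.7, 0.80),
    ("Gemini 3.5 Flash", 79.2, 0.75),
    ("GPT-5.5", 62.8, 0.61),
]


def rank_by(rows, column):
    """Return agent names sorted in descending order of the given column index."""
    return [r[0] for r in sorted(rows, key=lambda r: r[column], reverse=True)]


by_consistency = rank_by(rows, 2)
by_accuracy = rank_by(rows, 1)

print("By Consistency Agg:", by_consistency)
print("By accuracy:       ", by_accuracy)
```

In this subset, Gemini 3.5 Flash has the highest accuracy but ranks third on Consistency Agg, while Claude 3.5 Haiku has the lowest accuracy but ranks second.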