HAL Reliability Evaluation

Consistency on GAIA

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

Each cell represents a task. Color shows outcome consistency across runs.
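
To make the idea of per-task outcome consistency concrete, here is a minimal sketch. It is not HAL's definition: it assumes pass/fail outcomes and scores each task by the fraction of run pairs that agree. The function name `pairwise_agreement`, the task IDs, and the run data are all hypothetical.

```python
from itertools import combinations

def pairwise_agreement(outcomes):
    """Fraction of run pairs with the same outcome (1.0 when all runs agree).

    Assumed metric for illustration only, not the leaderboard's formula.
    """
    pairs = list(combinations(outcomes, 2))
    if not pairs:
        raise ValueError("need at least two runs per task")
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical outcomes (True = solved) for three tasks over five runs each.
runs_by_task = {
    "task-a": [True, True, True, True, True],      # always solved
    "task-b": [True, False, True, False, True],    # flaky
    "task-c": [False, False, False, False, False], # always failed
}

per_task = {task: pairwise_agreement(runs) for task, runs in runs_by_task.items()}
mean_consistency = sum(per_task.values()) / len(per_task)

for task, score in per_task.items():
    print(f"{task}: {score:.2f}")
print(f"mean: {mean_consistency:.2f}")
```

Under this assumed metric a task that always fails is as consistent as one that always succeeds, which is one reason consistency is reported separately from accuracy.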

#	Agent	Accuracy	Consistency (Agg)	Outcome	Traj-D	Traj-S	Resource	Overall
1	O1	34.7%	0.72	0.70	0.80	0.72	0.69	0.72
2	GPT-4 Turbo	20.0%	0.70	0.72	0.78	0.65	0.68	0.76
3	Claude 3.5 Haiku	28.1%	0.70	0.64	0.83	0.73	0.68	0.74
4	Claude Opus 4.5	71.5%	0.67	0.70	0.72	0.54	0.68	0.82
5	GPT-4o Mini	22.0%	0.63	0.64	0.66	0.54	0.66	0.73
6	Claude Sonnet 4.5	74.7%	0.63	0.64	0.69	0.49	0.66	0.80
7	Gemini 2.5 Pro	50.1%	0.62	0.60	0.74	0.57	0.62	0.78
8	Claude 3.7 Sonnet	62.4%	0.62	0.64	0.71	0.54	0.60	0.77
9	GPT-5.2	29.9%	0.62	0.58	0.76	0.56	0.61	0.74
10	Gemini 2.0 Flash	27.9%	0.61	0.60	0.76	0.60	0.54	0.70
11	Gemini 2.5 Flash	37.8%	0.58	0.52	0.69	0.57	0.60	0.76
12	GPT-5.2 (medium)	42.6%	0.58	0.55	0.70	0.50	0.59	0.74
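
The page does not state how the aggregate consistency column is computed. As a sanity check on the table, the sketch below tests one guess: that the aggregate is the mean of three terms, namely the outcome score, the average of Traj-D and Traj-S, and the resource (Res) score. This formula is inferred from the numbers, not taken from HAL's documentation, and the helper name `reconstruct` is hypothetical.

```python
# (agent, aggregate, outcome, traj_d, traj_s, resource), copied from the table.
ROWS = [
    ("O1",                0.72, 0.70, 0.80, 0.72, 0.69),
    ("GPT-4 Turbo",       0.70, 0.72, 0.78, 0.65, 0.68),
    ("Claude 3.5 Haiku",  0.70, 0.64, 0.83, 0.73, 0.68),
    ("Claude Opus 4.5",   0.67, 0.70, 0.72, 0.54, 0.68),
    ("GPT-4o Mini",       0.63, 0.64, 0.66, 0.54, 0.66),
    ("Claude Sonnet 4.5", 0.63, 0.64, 0.69, 0.49, 0.66),
    ("Gemini 2.5 Pro",    0.62, 0.60, 0.74, 0.57, 0.62),
    ("Claude 3.7 Sonnet", 0.62, 0.64, 0.71, 0.54, 0.60),
    ("GPT-5.2",           0.62, 0.58, 0.76, 0.56, 0.61),
    ("Gemini 2.0 Flash",  0.61, 0.60, 0.76, 0.60, 0.54),
    ("Gemini 2.5 Flash",  0.58, 0.52, 0.69, 0.57, 0.60),
    ("GPT-5.2 (medium)",  0.58, 0.55, 0.70, 0.50, 0.59),
]

def reconstruct(outcome, traj_d, traj_s, resource):
    """Assumed aggregate: trajectory scores averaged first, then a 3-way mean."""
    return (outcome + (traj_d + traj_s) / 2 + resource) / 3

# Gap between the published aggregate and the reconstructed one, per agent.
gaps = {agent: abs(agg - reconstruct(*parts)) for agent, agg, *parts in ROWS}
max_gap = max(gaps.values())

for agent, gap in gaps.items():
    print(f"{agent:<18} gap = {gap:.3f}")
print(f"largest gap: {max_gap:.3f}")
```

Every row stays within the two-decimal rounding of the published values, so the guess fits this table. A plain four-way mean of the sub-scores does not fit: for Claude 3.5 Haiku it gives 0.72 against a published 0.70.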