HAL Reliability Evaluation

Consistency on GAIA

How reproducible are the agent's answers and trajectories across repeated runs on the same task?

[Figure: per-task grid. Each cell represents one task; color shows outcome consistency across runs.]

#	Agent	Accuracy	Reliability	Consistency (Agg)	Outcome	Traj-D	Traj-S	Res
1	O1	34.7%	0.72	0.72	0.70	0.80	0.72	0.69
2	GPT-4 Turbo	20.0%	0.76	0.70	0.72	0.78	0.65	0.68
3	Claude 3.5 Haiku	28.1%	0.74	0.70	0.64	0.83	0.73	0.68
4	Claude Opus 4.5	71.5%	0.82	0.67	0.70	0.72	0.54	0.68
5	GPT-4o Mini	22.0%	0.73	0.63	0.64	0.66	0.54	0.66
6	Claude Sonnet 4.5	74.7%	0.80	0.63	0.64	0.69	0.49	0.66
7	Gemini 2.5 Pro	50.1%	0.78	0.62	0.60	0.74	0.57	0.62
8	Claude 3.7 Sonnet	62.4%	0.77	0.62	0.64	0.71	0.54	0.60
9	GPT-5.2	29.9%	0.74	0.62	0.58	0.76	0.56	0.61
10	Gemini 2.0 Flash	27.9%	0.70	0.61	0.60	0.76	0.60	0.54
11	Gemini 2.5 Flash	37.8%	0.76	0.58	0.52	0.69	0.57	0.60
12	GPT-5.2 (medium)	42.6%	0.74	0.58	0.55	0.70	0.50	0.59
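
The table does not say how the aggregate consistency column is derived from its four components. As a rough sanity check, the Python sketch below tests one hypothetical formula against the rows above: average the two trajectory scores first, then take the mean of outcome, trajectory, and Res. The formula and the function names (`candidate_aggregate`, `flat_mean`, `max_abs_gap`) are assumptions for illustration, inferred from the published numbers rather than taken from any documented definition. A plain four-way mean is included for comparison.

```python
# Rows copied from the table above:
# (agent, reported aggregate, outcome, traj_d, traj_s, res)
ROWS = [
    ("O1",                0.72, 0.70, 0.80, 0.72, 0.69),
    ("GPT-4 Turbo",       0.70, 0.72, 0.78, 0.65, 0.68),
    ("Claude 3.5 Haiku",  0.70, 0.64, 0.83, 0.73, 0.68),
    ("Claude Opus 4.5",   0.67, 0.70, 0.72, 0.54, 0.68),
    ("GPT-4o Mini",       0.63, 0.64, 0.66, 0.54, 0.66),
    ("Claude Sonnet 4.5", 0.63, 0.64, 0.69, 0.49, 0.66),
    ("Gemini 2.5 Pro",    0.62, 0.60, 0.74, 0.57, 0.62),
    ("Claude 3.7 Sonnet", 0.62, 0.64, 0.71, 0.54, 0.60),
    ("GPT-5.2",           0.62, 0.58, 0.76, 0.56, 0.61),
    ("Gemini 2.0 Flash",  0.61, 0.60, 0.76, 0.60, 0.54),
    ("Gemini 2.5 Flash",  0.58, 0.52, 0.69, 0.57, 0.60),
    ("GPT-5.2 (medium)",  0.58, 0.55, 0.70, 0.50, 0.59),
]


def candidate_aggregate(outcome, traj_d, traj_s, res):
    """Hypothetical formula: merge the two trajectory scores, then average three parts."""
    trajectory = (traj_d + traj_s) / 2
    return (outcome + trajectory + res) / 3


def flat_mean(outcome, traj_d, traj_s, res):
    """Comparison formula: unweighted mean of all four component columns."""
    return (outcome + traj_d + traj_s + res) / 4


def max_abs_gap(rows, formula):
    """Largest absolute difference between the reported aggregate and a formula."""
    return max(abs(agg - formula(o, d, s, r)) for _, agg, o, d, s, r in rows)


if __name__ == "__main__":
    for name, agg, o, d, s, r in ROWS:
        print(f"{name:<18} reported={agg:.2f} "
              f"candidate={candidate_aggregate(o, d, s, r):.3f} "
              f"flat={flat_mean(o, d, s, r):.3f}")
    print(f"largest gap, candidate: {max_abs_gap(ROWS, candidate_aggregate):.4f}")
    print(f"largest gap, flat mean: {max_abs_gap(ROWS, flat_mean):.4f}")
```

On these twelve rows the candidate formula stays within about 0.005 of every reported aggregate, which is what two-decimal rounding alone would allow, while the flat mean misses by about 0.02 on the Claude 3.5 Haiku row. That makes the candidate plausible but does not confirm it, since the inputs are themselves rounded.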