Scicode Zero Shot Agent

Agent performance overview across all HAL benchmarks

Models Used

o4-mini Low (April 2025)
GPT-4.1 (April 2025)
o4-mini High (April 2025)
o3 Medium (April 2025)
DeepSeek V3 (March 2025)
Claude-3.7 Sonnet High (February 2025)
Gemini 2.0 Flash (February 2025)
DeepSeek R1 (May 2025)
Claude-3.7 Sonnet (February 2025)

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent, paired with the given model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. A run on the Pareto frontier offers the best accuracy-for-cost trade-off among the runs evaluated on that benchmark.

Benchmark	Model	Accuracy	Cost	On the Pareto Frontier?
Scicode	o4-mini Low (April 2025)	9.23%	$1.74	Yes
Scicode	GPT-4.1 (April 2025)	6.15%	$2.82	No
Scicode	o4-mini High (April 2025)	6.15%	$5.37	No
Scicode	o3 Medium (April 2025)	4.62%	$6.03	No
Scicode	DeepSeek V3 (March 2025)	3.08%	$0.79	No
Scicode	Claude-3.7 Sonnet High (February 2025)	3.08%	$4.99	No
Scicode	Gemini 2.0 Flash (February 2025)	1.54%	$0.12	Yes
Scicode	DeepSeek R1 (May 2025)	0.00%	$2.19	No
Scicode	Claude-3.7 Sonnet (February 2025)	0.00%	$5.10	No
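
The frontier flags in the table above can be reproduced with a short sketch. Under a plain dominance rule (no other run is both cheaper and more accurate), DeepSeek V3 would also qualify. The table flags only two runs, which is consistent with taking the upper convex hull of the accuracy-versus-cost points. That the leaderboard computes its frontier this way is an assumption inferred from the numbers, not something the page states. The names `Run`, `dominance_frontier`, and `hull_frontier` are hypothetical helpers introduced for illustration.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent, as listed in the table
    cost: float      # USD, as listed in the table


# Rows copied from the Benchmark Performance table.
RUNS = [
    Run("o4-mini Low (April 2025)", 9.23, 1.74),
    Run("GPT-4.1 (April 2025)", 6.15, 2.82),
    Run("o4-mini High (April 2025)", 6.15, 5.37),
    Run("o3 Medium (April 2025)", 4.62, 6.03),
    Run("DeepSeek V3 (March 2025)", 3.08, 0.79),
    Run("Claude-3.7 Sonnet High (February 2025)", 3.08, 4.99),
    Run("Gemini 2.0 Flash (February 2025)", 1.54, 0.12),
    Run("DeepSeek R1 (May 2025)", 0.00, 2.19),
    Run("Claude-3.7 Sonnet (February 2025)", 0.00, 5.10),
]


def dominance_frontier(runs):
    """Keep runs that are more accurate than every cheaper run."""
    frontier = []
    best_accuracy = float("-inf")
    # Cheapest first; on equal cost, the more accurate run comes first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


def _cross(o, a, b):
    """Cross product of vectors o->a and o->b in the (cost, accuracy) plane."""
    return ((a.cost - o.cost) * (b.accuracy - o.accuracy)
            - (a.accuracy - o.accuracy) * (b.cost - o.cost))


def hull_frontier(runs):
    """Upper convex hull of the dominance frontier (assumed frontier rule)."""
    hull = []
    for run in dominance_frontier(runs):
        # Drop the previous point if it lies on or below the line joining
        # its neighbours, i.e. the turn is not strictly clockwise.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], run) >= 0:
            hull.pop()
        hull.append(run)
    return hull


if __name__ == "__main__":
    for run in hull_frontier(RUNS):
        print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
    # Gemini 2.0 Flash (February 2025): 1.54% at $0.12
    # o4-mini Low (April 2025): 9.23% at $1.74
```

With these inputs, `dominance_frontier` returns three runs (Gemini 2.0 Flash, DeepSeek V3, o4-mini Low), while `hull_frontier` drops DeepSeek V3 because it sits below the straight line joining the other two, which matches the two "Yes" rows in the table.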