USACO Episodic + Semantic

Agent performance overview across all HAL benchmarks

Models Used

GPT-5 Medium (August 2025)
o4-mini High (April 2025)
o4-mini Low (April 2025)
Claude Opus 4.1 High (August 2025)
Claude Opus 4.1 (August 2025)
o3 Medium (April 2025)
GPT-4.1 (April 2025)
DeepSeek V3 (March 2025)
DeepSeek R1 (January 2025)
Claude-3.7 Sonnet (February 2025)
Gemini 2.0 Flash (February 2025)
Claude-3.7 Sonnet High (February 2025)

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether the agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark, that is, whether no other run is both more accurate and cheaper. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.

Benchmark	Model	Accuracy	Cost	On the Pareto Frontier?
USACO	GPT-5 Medium (August 2025)	69.06%	$116.63	Yes
USACO	o4-mini High (April 2025)	64.82%	$77.28	Yes
USACO	o4-mini Low (April 2025)	53.09%	$24.60	Yes
USACO	Claude Opus 4.1 High (August 2025)	51.47%	$267.72	No
USACO	Claude Opus 4.1 (August 2025)	48.21%	$276.19	No
USACO	o3 Medium (April 2025)	46.25%	$57.30	No
USACO	GPT-4.1 (April 2025)	44.95%	$28.10	No
USACO	DeepSeek V3 (March 2025)	39.09%	$12.08	Yes
USACO	DeepSeek R1 (January 2025)	38.11%	$80.04	No
USACO	Claude-3.7 Sonnet (February 2025)	29.32%	$38.70	No
USACO	Gemini 2.0 Flash (February 2025)	27.04%	$1.46	Yes
USACO	Claude-3.7 Sonnet High (February 2025)	26.71%	$56.43	No
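
The Pareto column in the table above can be checked with a short script. The following is a minimal sketch, not the leaderboard's own code: it assumes the standard dominance definition (a run is dominated when another run is at least as accurate and at least as cheap, and strictly better on one of the two), and the names `runs` and `pareto_frontier` are hypothetical. The leaderboard may compute its frontier differently, for example with a convex hull, although both approaches agree on this table.

```python
# (model, accuracy in %, cost in USD), copied from the table above.
runs = [
    ("GPT-5 Medium", 69.06, 116.63),
    ("o4-mini High", 64.82, 77.28),
    ("o4-mini Low", 53.09, 24.60),
    ("Claude Opus 4.1 High", 51.47, 267.72),
    ("Claude Opus 4.1", 48.21, 276.19),
    ("o3 Medium", 46.25, 57.30),
    ("GPT-4.1", 44.95, 28.10),
    ("DeepSeek V3", 39.09, 12.08),
    ("DeepSeek R1", 38.11, 80.04),
    ("Claude-3.7 Sonnet", 29.32, 38.70),
    ("Gemini 2.0 Flash", 27.04, 1.46),
    ("Claude-3.7 Sonnet High", 26.71, 56.43),
]


def pareto_frontier(runs):
    """Return the names of runs that no other run dominates.

    Assumed definition: run B dominates run A when B is at least as
    accurate and at least as cheap as A, and strictly better on one axis.
    """
    frontier = []
    for name, acc, cost in runs:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for _, a, c in runs
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(runs)
print(frontier)
```

Under this definition the script selects the same five runs that the table marks "Yes". For example, GPT-4.1 (44.95%, $28.10) is excluded because o4-mini Low is both more accurate (53.09%) and cheaper ($24.60).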