USACO

The USACO benchmark evaluates AI agents on competitive programming problems from the USA Computing Olympiad. It consists of 307 problems, complete with exhaustive test cases, problem analyses, and difficulty labels.

Paper: Can Language Models Solve Olympiad Programming? (Shi et al., 2024)

Total Tasks: 307
Agents Evaluated: 26

Key Features of USACO

Difficulty Levels

Tasks span four difficulty levels (Bronze, Silver, Gold, and Platinum) and require knowledge of data structures and algorithms.

Real Competition Tasks

Problems are sourced from actual USACO competitions and are challenging for human programmers.

USACO Leaderboard

Rank  Model                              Accuracy              Cost (USD)            Runs
1     o3-mini-2025-01-31 med.            38.76%                $19.68                1
2     o3-mini-2025-01-31 med.            36.16%                $9.58                 1
3     claude-3-7-sonnet-20250219         28.99%                $38.94                1
4     gpt-4.5-preview-2025-02-27         27.04%                $367.27               1
5     o1-mini-2024-09-12                 23.78%                $13.76                1
6     claude-3-7-sonnet-20250219         19.54%                $8.77                 1
7     gpt-4.5-preview-2025-02-27         16.94%                $57.11                1
8     gpt-4o-2024-05-13                  16.94%                $41.28                1
9     gpt-4o-2024-05-13                  16.29% (-1.30/+1.30)  $25.62 (-0.10/+0.10)  2
10    gpt-4o-mini-2024-07-18             15.42% (-0.43/+0.87)  $1.96 (-0.03/+0.02)   3
11    claude-3-5-sonnet-20241022         15.31%                $2.75                 1
12    gpt-4o-2024-05-13                  15.31%                $21.01                1
13    gpt-4o-mini-2024-07-18             14.33%                $1.02                 1
14    gpt-4o-mini-2024-07-18             13.36%                $1.33                 1
15    gpt-4o-2024-05-13                  13.36%                $13.38                1
16    gpt-4o-mini-2024-07-18             13.03%                $1.21                 1
17    gpt-4o-mini-2024-07-18             12.05%                $0.76                 1
18    gpt-4o-mini-2024-07-18             10.75%                $0.81                 1
19    gpt-4o-mini-2024-07-18             9.88% (-0.76/+0.54)   $0.64 (-0.00/+0.00)   3
20    claude-3-5-haiku-20241022          9.45%                 $1.25                 1
21    gpt-4o-2024-05-13                  8.79%                 $4.32                 1
22    gpt-4o-mini-2024-07-18             8.47%                 $1.64                 1
23    gpt-4o-mini-2024-07-18             8.47%                 $0.25                 1
24    gpt-4o-2024-05-13                  8.47%                 $23.81                1
25    Meta-Llama-3.3-70B-Instruct-Turbo  7.17%                 $0.30                 1
26    gpt-4o-mini-2024-07-18             6.51%                 $1.07                 1
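
Entries with more than one run show a value followed by offsets in parentheses, such as 16.29% (-1.30/+1.30). The following is a minimal sketch of how such a summary could be computed, assuming the headline value is the mean across runs and the offsets are the distances from that mean to the lowest and highest run. The function name `summarize_runs` and the solved counts are hypothetical and chosen only for illustration; this is not the leaderboard's own code.

```python
from statistics import mean

TOTAL_TASKS = 307  # benchmark size stated on this page


def summarize_runs(solved_counts, total=TOTAL_TASKS):
    """Return (mean accuracy %, distance down to min, distance up to max)."""
    accuracies = [100.0 * solved / total for solved in solved_counts]
    avg = mean(accuracies)
    return avg, avg - min(accuracies), max(accuracies) - avg


# Hypothetical solved counts for two runs of one agent (illustrative only).
avg, below, above = summarize_runs([46, 54])
print(f"{avg:.2f}% (-{below:.2f}/+{above:.2f})")
# prints: 16.29% (-1.30/+1.30)
```

With a single run the minimum and maximum coincide with the mean, so both offsets are zero, which would be consistent with single-run entries showing no parenthesized range.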

Accuracy vs. Cost Frontier for USACO

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
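
A Pareto frontier of this kind can be sketched in a few lines. The example below assumes the frontier is the set of non-dominated points, meaning no other agent is both at most as expensive and strictly more accurate; the page's plot may use a different construction (for example a convex hull). The function name `pareto_frontier` is hypothetical, and the data are a handful of (cost, accuracy) pairs taken from the leaderboard.

```python
def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by a cheaper-or-equal,
    more accurate point, ordered by increasing cost."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; among equal costs, highest accuracy first.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# (cost in USD, accuracy in %) for a few leaderboard rows.
rows = [
    (19.68, 38.76),
    (9.58, 36.16),
    (38.94, 28.99),
    (367.27, 27.04),
    (1.96, 15.42),
    (0.25, 8.47),
]
print(pareto_frontier(rows))
# [(0.25, 8.47), (1.96, 15.42), (9.58, 36.16), (19.68, 38.76)]
```

In this subset the $38.94 and $367.27 entries fall off the frontier because a cheaper entry already reaches higher accuracy.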

Heatmap for USACO

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of reruns of the same agent in which a task was solved. The "any agent" performance indicates how saturated the benchmark is and gives a sense of overall progress.
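
The quantities behind such a heatmap can be illustrated with a small sketch. It assumes each run is recorded as the set of task IDs it solved, and it reads "any agent" as the fraction of tasks solved at least once by at least one agent; both the data layout and that reading are assumptions, and the agent names, task IDs, and function names are hypothetical.

```python
def success_matrix(results, tasks):
    """Map each agent to {task: fraction of its runs that solved the task}."""
    return {
        agent: {t: sum(t in run for run in runs) / len(runs) for t in tasks}
        for agent, runs in results.items()
    }


def any_agent_rate(matrix, tasks):
    """Fraction of tasks solved at least once by at least one agent."""
    solved = [t for t in tasks if any(row[t] > 0 for row in matrix.values())]
    return len(solved) / len(tasks)


# Hypothetical data: each set holds the tasks solved in one run.
tasks = ["task_a", "task_b", "task_c", "task_d"]
results = {
    "agent_x": [{"task_a", "task_b"}, {"task_a"}],  # two reruns
    "agent_y": [{"task_c"}],                        # one run
}

matrix = success_matrix(results, tasks)
print(matrix["agent_x"])
# {'task_a': 1.0, 'task_b': 0.5, 'task_c': 0.0, 'task_d': 0.0}
print(any_agent_rate(matrix, tasks))
# 0.75
```

Each inner dictionary corresponds to one heatmap row, and a task such as `task_d` that no agent solves is what keeps the "any agent" figure below 1.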

Failure Analysis (Experimental)

The failure analysis gives, for a selected agent, a detailed breakdown of failure categories and their descriptions. It helps identify common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Additional Resources

Getting Started

To evaluate your agent on USACO, follow the getting-started guide in the documentation.

Task Details

The task list covers every USACO task, including difficulty levels and categories.