SWE-bench Verified Mini

SWE-bench Verified Mini is a random subset of 50 tasks from the original SWE-bench Verified. It is a lightweight version of SWE-bench Verified and is therefore cheaper to evaluate.

Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2023)
OpenAI Blog: Introducing SWE-bench Verified

Note: This subset of the original SWE-bench Verified contains different tasks than the recently released version with the same name. We are working on reconciling the two versions.
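For illustration, a subset like this can be drawn from the public SWE-bench Verified release on HuggingFace. The sketch below is a minimal example; the seed and the resulting task list are assumptions for demonstration and will not reproduce the exact Mini split used here.

```python
# Sample a 50-task subset from SWE-bench Verified (illustrative only:
# the seed below is hypothetical, not the one used for this Mini split).
import random

from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

rng = random.Random(42)  # hypothetical seed
mini_ids = set(rng.sample([row["instance_id"] for row in verified], k=50))
mini = verified.filter(lambda row: row["instance_id"] in mini_ids)

print(f"{len(mini)} tasks across {len(set(mini['repo']))} repositories")
```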

50 Verified Tasks · 100% Human Validated · 7 Agents Evaluated

Key Features of SWE-bench Verified

Real-World Tasks

All tasks are sourced from actual GitHub issues, representing real software engineering problems.

Human Validation

Every task has been reviewed by software engineers and validated as non-problematic.

Diverse Tasks

Tasks originate from PRs of 12 open-source Python repositories covering various domains.

SWE-Bench Verified Mini Leaderboard

Rank  Agent           Models                      Verified Accuracy  Cost (USD)  Runs  Traces
1     Agentless Lite  o3-mini-2025-01-31          48.00%             $11.34      1     Download
2                     claude-3-5-sonnet-20241022  48.00%             $4.08       1     Download
3                     o1-mini-2024-09-12          34.00%             $32.63      1     Download
4                     gpt-4o-2024-08-06           34.00%             $8.82       1     Download
5                     gpt-4o-mini-2024-07-18      26.00%             $0.77       1     Download
6                     gpt-4o-mini-2024-07-18      20.00%             $0.32       1     Download
7                     gpt-4o-mini-2024-07-18      8.00%              $29.54      1     Download
Note: The cost of Agentless (gpt-4o-mini-2024-07-18) currently does not update when token pricing is changed.

Accuracy vs. Cost Frontier for SWE-Bench Verified Mini

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
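The frontier itself is straightforward to compute: sort runs by cost and keep each one that improves on the best accuracy seen so far. A minimal sketch, using the (cost, accuracy) pairs from the leaderboard above:

```python
# Compute the Pareto frontier over (cost, accuracy) points:
# keep each run that beats the best accuracy among all cheaper runs.
def pareto_frontier(points):
    frontier, best_acc = [], float("-inf")
    for cost, acc in sorted(points):  # ascending cost
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier

runs = [(11.34, 48.0), (4.08, 48.0), (32.63, 34.0), (8.82, 34.0),
        (0.77, 26.0), (0.32, 20.0), (29.54, 8.0)]
print(pareto_frontier(runs))  # [(0.32, 20.0), (0.77, 26.0), (4.08, 48.0)]
```

Note that the $11.34 run is dominated: it matches the 48.00% accuracy of the $4.08 run at nearly three times the cost.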

Heatmap for SWE-Bench Verified Mini

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of reruns of the same agent in which a task was solved. The "any agent" performance indicates how saturated the benchmark is and gives a sense of overall progress.
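Both statistics are simple aggregations. A minimal sketch, assuming results are available as a boolean agents × reruns × tasks array (the data below is synthetic):

```python
import numpy as np

# Synthetic stand-in for real results: 7 agents, 3 reruns, 50 tasks.
rng = np.random.default_rng(0)
results = rng.random((7, 3, 50)) < 0.3

# Heatmap cell: fraction of reruns in which agent a solved task t.
solve_rate = results.mean(axis=1)        # shape (agents, tasks)

# "Any agent": a task counts as solved if any agent solved it in any rerun;
# the mean over tasks indicates how saturated the benchmark is.
any_agent = results.any(axis=(0, 1))     # shape (tasks,)
print(f"any-agent solve rate: {any_agent.mean():.0%}")
```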

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Failure Categories

Distribution of Failures

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.

The following models have configurable input and output prices (USD per 1M tokens):

o1-mini-2024-09-12
o3-mini-2025-01-31
text-embedding-3-large
text-embedding-3-small
claude-3-5-sonnet-20241022
gpt-4o-2024-08-06
gpt-4o-mini-2024-07-18
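For reference, a run's total cost follows directly from these prices: each model's input and output tokens are multiplied by its respective per-1M-token price and summed. A minimal sketch with placeholder prices and token counts (not the live defaults):

```python
# Placeholder prices: model -> (input USD per 1M tok, output USD per 1M tok).
PRICES_PER_1M = {
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
    "o3-mini-2025-01-31": (1.10, 4.40),
}

def run_cost(usage: dict[str, tuple[int, int]]) -> float:
    """usage maps model -> (input_tokens, output_tokens)."""
    total = 0.0
    for model, (tok_in, tok_out) in usage.items():
        price_in, price_out = PRICES_PER_1M[model]
        total += tok_in / 1e6 * price_in + tok_out / 1e6 * price_out
    return total

print(f"${run_cost({'gpt-4o-mini-2024-07-18': (4_000_000, 500_000)}):.2f}")  # $0.90
```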

Additional Resources

Getting Started

Want to evaluate your agent on SWE-bench? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of SWE-bench tasks, including problem descriptions and test cases:

View Tasks