SWE-bench Verified

SWE-bench Verified evaluates AI agents on real-world programming tasks drawn from open-source GitHub repositories, focusing on code generation and bug-fixing capabilities. Each task gives an agent a code repository and an issue description and challenges it to generate a patch that resolves the issue. SWE-bench Verified is a 500-sample subset of the original SWE-bench test set, with every sample validated by software engineers.

Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2023)
OpenAI Blog: Introducing SWE-bench Verified
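
For reference, the verified tasks are distributed as a public dataset. Below is a minimal sketch of inspecting one task instance, assuming the Hugging Face `datasets` library and the published `princeton-nlp/SWE-bench_Verified` field names:

```python
from datasets import load_dataset

# 500 human-validated instances from the original SWE-bench test set.
tasks = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = tasks[0]

# Each instance pairs a repository snapshot with an issue description;
# the agent must produce a patch that resolves the issue.
print(task["repo"])               # source repository, e.g. "owner/name"
print(task["base_commit"])        # commit the agent's patch applies to
print(task["problem_statement"])  # GitHub issue text given to the agent
```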

500 Verified Tasks · 100% Human Validated · 3 Agents Evaluated

Key Features of SWE-bench Verified

Real-World Tasks

All tasks are sourced from actual GitHub issues, representing real software engineering problems.

Expert Validation

Every task has been reviewed by professional software engineers and confirmed to be well-specified and solvable.

Diverse Tasks

Tasks originate from pull requests across 12 open-source Python repositories covering a variety of domains.

SWE-bench Verified Leaderboard

Rank  Agent (Model)                 Accuracy  Cost (USD)  Runs  Traces
1     claude-3-5-sonnet-20241022    38.00%    $67.09      1     Download
2     gpt-4o-2024-08-06             29.80%    $79.84      1     Download
3     o1-mini-2024-09-12            27.20%    $366.81     1     Download

Accuracy vs. Cost Frontier for SWE-bench Verified

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
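
To make the frontier concrete, here is a minimal sketch of how it can be computed from the leaderboard's (cost, accuracy) pairs; the dominance rule (keep a run only if no cheaper run matches its accuracy) is a standard formulation, not this site's exact plotting code:

```python
# (name, cost in USD, accuracy in %) taken from the leaderboard above.
runs = [
    ("claude-3-5-sonnet-20241022", 67.09, 38.0),
    ("gpt-4o-2024-08-06", 79.84, 29.8),
    ("o1-mini-2024-09-12", 366.81, 27.2),
]

def pareto_frontier(points):
    """Keep each run only if no cheaper run achieves higher accuracy."""
    frontier, best_acc = [], float("-inf")
    for name, cost, acc in sorted(points, key=lambda p: p[1]):  # by cost
        if acc > best_acc:
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier

print(pareto_frontier(runs))
# -> only claude-3-5-sonnet-20241022 remains: it is both the cheapest
#    and the most accurate run on the current board.
```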

Heatmap for SWE-bench Verified

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of reruns of a given agent in which each task was solved. The "any agent" performance (tasks solved by at least one agent) indicates how saturated the benchmark is and gives a sense of overall progress.
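
Both statistics are simple to state in code. A minimal sketch, assuming per-run results are stored as sets of solved task ids (the agent and task names here are hypothetical):

```python
# results[agent] = list of reruns, each rerun a set of solved task ids.
results = {
    "agent_a": [{"t1", "t2"}, {"t1"}],  # two reruns
    "agent_b": [{"t2", "t3"}],          # one run
}
all_tasks = {"t1", "t2", "t3", "t4"}

def solve_fraction(agent, task):
    """Heatmap cell: fraction of an agent's reruns that solved the task."""
    runs = results[agent]
    return sum(task in solved for solved in runs) / len(runs)

# "Any agent": a task counts as solved if any run of any agent solved it;
# the ratio tracks how close the benchmark is to saturation.
solved_by_any = set().union(*(s for runs in results.values() for s in runs))

print(solve_fraction("agent_a", "t1"))                  # 1.0
print(len(solved_by_any & all_tasks) / len(all_tasks))  # 0.75
```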

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps identify common failure patterns and areas for improvement. Failure reports are usually available for the top two agents.

Failure Categories

Distribution of Failures
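
As a rough illustration of what the distribution panel aggregates, here is a minimal sketch assuming a failure report assigns one category label per unresolved task (the labels below are hypothetical, not the actual categories):

```python
from collections import Counter

# Hypothetical category labels, one per unresolved task.
failure_labels = [
    "incorrect patch location", "test setup error",
    "incorrect patch location", "incomplete fix",
]

distribution = Counter(failure_labels)
for category, count in distribution.most_common():
    print(f"{category}: {count / len(failure_labels):.0%}")
# incorrect patch location: 50% ... and so on.
```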

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.

Input and output token prices, in $ per 1M tokens, can be set independently for each active model: o1-mini-2024-09-12, claude-3-5-sonnet-20241022, and gpt-4o-2024-08-06.
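
The underlying cost formula is linear in token counts. A minimal sketch, assuming per-run input and output token totals are known (the token counts and prices below are placeholders, not the leaderboard's actual values):

```python
def run_cost(input_tokens, output_tokens, input_price, output_price):
    """Total cost in USD; prices are in $ per 1M tokens, as above."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# e.g. 40M input tokens and 5M output tokens at $3 / $15 per 1M tokens:
print(run_cost(40_000_000, 5_000_000, 3.00, 15.00))  # 195.0
```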

Additional Resources

Getting Started

Want to evaluate your agent on SWE-bench? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of SWE-bench tasks, including problem descriptions and test cases:

View Tasks