SWE-bench Verified
SWE-bench Verified evaluates AI agents on real-world programming tasks drawn from open-source GitHub repositories, focusing on code generation and bug fixing. Each task gives the agent a code repository and an issue description and challenges it to generate a patch that resolves the issue. SWE-bench Verified is a human-validated subset of the original SWE-bench test set, consisting of 500 samples reviewed by software engineers.
Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2023)
OpenAI Blog: Introducing SWE-bench Verified
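For a concrete sense of what each task contains, the dataset is published on Hugging Face; the sketch below loads it with the `datasets` library and prints a few fields. Field names follow the published dataset and may differ slightly between versions.

```python
# Minimal sketch: inspect SWE-bench Verified task instances.
# Assumes the `datasets` library and the public Hugging Face dataset
# "princeton-nlp/SWE-bench_Verified"; field names may vary by dataset version.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(f"{len(verified)} task instances")  # expected: 500

example = verified[0]
print(example["instance_id"])              # unique task identifier
print(example["repo"])                     # source GitHub repository
print(example["base_commit"])              # commit the agent's patch is applied to
print(example["problem_statement"][:300])  # the GitHub issue text given to the agent
```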
Key Features of SWE-bench Verified
Real-World Tasks
All tasks are sourced from actual GitHub issues, representing real software engineering problems.
Expert Validation
Every task has been reviewed by software engineers and confirmed to be non-problematic.
Diverse Tasks
Tasks originate from pull requests across 12 open-source Python repositories covering a variety of domains.
SWE-bench Verified Leaderboard
Rank | Agent / Models | Verified | Accuracy | Cost (USD) | Runs | Traces
---|---|---|---|---|---|---
1 | claude-3-5-sonnet-20241022 | ✓ | 38.00% | $67.09 | 1 | Download
2 | gpt-4o-2024-08-06 | ✓ | 29.80% | $79.84 | 1 | Download
3 | o1-mini-2024-09-12 | ✓ | 27.20% | $366.81 | 1 | Download
Column notes:
Verified: results have been reproduced by the HAL team.
Accuracy: confidence intervals show the min-max values across runs for agents with multiple runs available.
Cost (USD): total API cost for running the agent on all tasks; confidence intervals show the min-max values across runs for agents with multiple runs available.
Runs: number of runs submitted to the leaderboard for this agent; to submit multiple evaluations, rerun the same agent under the same agent name.
Accuracy vs. Cost Frontier for SWE-bench Verified
This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
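As an illustration of how such a frontier can be derived, the sketch below keeps only the points that no cheaper, more accurate agent dominates. The helper function is illustrative rather than part of HAL's codebase; the numbers are the leaderboard rows above.

```python
# Minimal sketch: compute the accuracy-vs-cost Pareto frontier for a set of
# agent results. Tuples are (name, total cost in USD, accuracy as a fraction).

def pareto_frontier(results):
    """Return the points not dominated by any cheaper-and-more-accurate point."""
    frontier = []
    for name, cost, acc in sorted(results, key=lambda r: (r[1], -r[2])):
        # Keep a point only if it improves on the best accuracy seen at lower cost.
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

results = [
    ("claude-3-5-sonnet-20241022", 67.09, 0.380),
    ("gpt-4o-2024-08-06", 79.84, 0.298),
    ("o1-mini-2024-09-12", 366.81, 0.272),
]
for name, cost, acc in pareto_frontier(results):
    print(f"{name}: {acc:.1%} at ${cost:.2f}")
```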
Heatmap for SWE-bench Verified
The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of reruns of the same agent that solved each task. The "any agent" performance indicates how close the benchmark is to saturation and gives a sense of overall progress.
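A minimal sketch of the two quantities behind the heatmap, assuming a simple results structure (per-agent lists of runs, each mapping task ids to solved/unsolved); none of the names below come from HAL's code and only illustrate the calculation.

```python
# Assumed input format: for each agent, a list of runs, where each run maps a
# task id to True/False (solved or not).

def per_task_solve_fraction(runs):
    """Fraction of reruns of one agent that solved each task (the color scale)."""
    tasks = runs[0].keys()
    return {t: sum(run[t] for run in runs) / len(runs) for t in tasks}

def any_agent_rate(results_by_agent):
    """Share of tasks solved by at least one run of at least one agent."""
    tasks = next(iter(results_by_agent.values()))[0].keys()
    solved = sum(
        any(run[t] for runs in results_by_agent.values() for run in runs)
        for t in tasks
    )
    return solved / len(tasks)

results_by_agent = {
    "agent_a": [{"task-1": True, "task-2": False}, {"task-1": True, "task-2": True}],
    "agent_b": [{"task-1": False, "task-2": False}],
}
print(per_task_solve_fraction(results_by_agent["agent_a"]))  # {'task-1': 1.0, 'task-2': 0.5}
print(any_agent_rate(results_by_agent))                      # 1.0 (every task solved at least once)
```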
Failure Analysis (Experimental)
Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps surface common failure patterns and areas for improvement. Failure reports are usually available for the top two agents.
Failure Categories
Distribution of Failures
Token Pricing Configuration
Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.
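As a rough illustration of the calculation being adjusted, a cost figure is just token counts multiplied by per-token prices and summed over tasks. The sketch below uses placeholder prices and token counts, not the leaderboard's actual values.

```python
# Minimal sketch: recompute an agent's cost from token usage and
# per-million-token prices. Prices and token counts are illustrative only.

PRICES_PER_M_TOKENS = {  # USD per 1M tokens (placeholder values)
    "gpt-4o-2024-08-06": {"input": 2.50, "output": 10.00},
}

def task_cost(model, input_tokens, output_tokens):
    p = PRICES_PER_M_TOKENS[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: one task that consumed 120k input tokens and 8k output tokens.
print(f"${task_cost('gpt-4o-2024-08-06', 120_000, 8_000):.4f}")  # $0.3800

# A leaderboard total is this sum over every task in every run.
```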
Additional Resources
Getting Started
Want to evaluate your agent on SWE-bench? Follow our comprehensive guide to get started:
View Documentation
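For orientation, the SWE-bench evaluation harness consumes predictions as one JSON record per task containing the generated patch; the sketch below writes such a file. Field names follow the format documented in the SWE-bench repository, so double-check them against the harness version you use.

```python
# Minimal sketch: write agent predictions in the JSONL format the SWE-bench
# evaluation harness expects. The instance id and patch below are illustrative.
import json

predictions = [
    {
        "instance_id": "astropy__astropy-12907",  # id of the task being answered (illustrative)
        "model_name_or_path": "my-agent",         # label for your agent or model
        "model_patch": "diff --git a/...\n",      # unified diff intended to resolve the issue
    },
]

with open("predictions.jsonl", "w") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")
```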
Task Details
Browse the complete list of SWE-bench tasks, including problem descriptions and test cases:
View Tasks