SWE-Agent
Agent performance overview across all HAL benchmarks
Benchmarks: 1
Models Used: 13
Pareto Optimal Runs: 4
Models Used
Claude Opus 4.1 (August 2025)
Claude-3.7 Sonnet High (February 2025)
Claude Opus 4.1 High (August 2025)
o4-mini Low (April 2025)
o4-mini High (April 2025)
Claude Opus 4 (May 2025)
Claude-3.7 Sonnet (February 2025)
GPT-5 Medium (August 2025)
o3 Medium (April 2025)
GPT-4.1 (April 2025)
Gemini 2.0 Flash
DeepSeek V3
DeepSeek R1
Benchmark Performance
The "On the Pareto Frontier?" column indicates whether this agent, run with the listed model, achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Runs on the Pareto frontier represent the most cost-efficient results currently recorded at their accuracy level.
Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
---|---|---|---|---|
Swebench Verified Mini | Claude Opus 4.1 (August 2025) | 54.00% | $1789.67 | No |
Swebench Verified Mini | Claude-3.7 Sonnet High (February 2025) | 54.00% | $388.88 | No |
Swebench Verified Mini | Claude Opus 4.1 High (August 2025) | 54.00% | $1599.90 | No |
Swebench Verified Mini | o4-mini Low (April 2025) | 54.00% | $259.20 | Yes |
Swebench Verified Mini | o4-mini High (April 2025) | 50.00% | $248.46 | No |
Swebench Verified Mini | Claude Opus 4 (May 2025) | 50.00% | $1330.90 | No |
Swebench Verified Mini | Claude-3.7 Sonnet (February 2025) | 50.00% | $402.69 | No |
Swebench Verified Mini | GPT-5 Medium (August 2025) | 46.00% | $162.93 | Yes |
Swebench Verified Mini | o3 Medium (April 2025) | 46.00% | $483.43 | No |
Swebench Verified Mini | GPT-4.1 (April 2025) | 44.00% | $393.65 | No |
Swebench Verified Mini | Gemini 2.0 Flash | 24.00% | $4.72 | No |
Swebench Verified Mini | DeepSeek V3 | 24.00% | $2.10 | Yes |
Swebench Verified Mini | DeepSeek R1 | 0.00% | $0.41 | Yes |
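
The page does not state how the frontier flags are computed, so the following is only a sketch under an assumption: that a run counts as Pareto-optimal when it lies on the upper convex hull of the (cost, accuracy) points, after discarding runs that another run beats on both axes. The assumption is motivated by the table itself: under plain dominance, o4-mini High (50.00% at $248.46) would also qualify, yet it is marked "No". The names `Run`, `non_dominated`, and `upper_hull` are hypothetical helpers, not part of any HAL tooling; the figures are copied from the table above.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent
    cost: float      # US dollars


# Rows copied from the benchmark table above.
RUNS = [
    Run("Claude Opus 4.1 (August 2025)", 54.00, 1789.67),
    Run("Claude-3.7 Sonnet High (February 2025)", 54.00, 388.88),
    Run("Claude Opus 4.1 High (August 2025)", 54.00, 1599.90),
    Run("o4-mini Low (April 2025)", 54.00, 259.20),
    Run("o4-mini High (April 2025)", 50.00, 248.46),
    Run("Claude Opus 4 (May 2025)", 50.00, 1330.90),
    Run("Claude-3.7 Sonnet (February 2025)", 50.00, 402.69),
    Run("GPT-5 Medium (August 2025)", 46.00, 162.93),
    Run("o3 Medium (April 2025)", 46.00, 483.43),
    Run("GPT-4.1 (April 2025)", 44.00, 393.65),
    Run("Gemini 2.0 Flash", 24.00, 4.72),
    Run("DeepSeek V3", 24.00, 2.10),
    Run("DeepSeek R1", 0.00, 0.41),
]


def non_dominated(runs):
    """Return runs, cheapest first, that beat the accuracy of every cheaper run."""
    kept = []
    best_accuracy = float("-inf")
    # Cheapest first; among equal costs, the most accurate run comes first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        if run.accuracy > best_accuracy:
            kept.append(run)
            best_accuracy = run.accuracy
    return kept


def _cross(a, b, c):
    """Cross product of (b - a) and (c - a) in the (cost, accuracy) plane."""
    return ((b.cost - a.cost) * (c.accuracy - a.accuracy)
            - (b.accuracy - a.accuracy) * (c.cost - a.cost))


def upper_hull(runs_sorted_by_cost):
    """Upper convex hull via the monotone-chain scan; input must be sorted by cost."""
    hull = []
    for run in runs_sorted_by_cost:
        # Drop the previous point while it lies on or below the line
        # from the point before it to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], run) >= 0:
            hull.pop()
        hull.append(run)
    return hull


dominance_frontier = non_dominated(RUNS)
hull_frontier = upper_hull(dominance_frontier)

for run in hull_frontier:
    print(f"{run.model}: {run.accuracy:.2f}% at ${run.cost:.2f}")
```

Under this assumption the hull contains DeepSeek R1, DeepSeek V3, GPT-5 Medium, and o4-mini Low, matching the four rows marked "Yes". The dominance-only set additionally contains o4-mini High, which sits below the line joining GPT-5 Medium and o4-mini Low.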