SWE-Agent

Agent performance overview across all HAL benchmarks

Benchmarks: 1
Models Used: 13
Pareto Optimal Runs: 4

Benchmark Performance

The "On the Pareto Frontier?" column indicates whether this agent achieved a Pareto-optimal trade-off between accuracy and cost on that benchmark. Agents on the Pareto frontier represent the current state-of-the-art efficiency for their performance level.

| Benchmark | Model | Accuracy | Cost | On the Pareto Frontier? |
| --- | --- | --- | --- | --- |
| Swebench Verified Mini | Claude Opus 4.1 (August 2025) | 54.00% | $1789.67 | No |
| Swebench Verified Mini | Claude-3.7 Sonnet High (February 2025) | 54.00% | $388.88 | No |
| Swebench Verified Mini | Claude Opus 4.1 High (August 2025) | 54.00% | $1599.90 | No |
| Swebench Verified Mini | o4-mini Low (April 2025) | 54.00% | $259.20 | Yes |
| Swebench Verified Mini | o4-mini High (April 2025) | 50.00% | $248.46 | No |
| Swebench Verified Mini | Claude Opus 4 (May 2025) | 50.00% | $1330.90 | No |
| Swebench Verified Mini | Claude-3.7 Sonnet (February 2025) | 50.00% | $402.69 | No |
| Swebench Verified Mini | GPT-5 Medium (August 2025) | 46.00% | $162.93 | Yes |
| Swebench Verified Mini | o3 Medium (April 2025) | 46.00% | $483.43 | No |
| Swebench Verified Mini | GPT-4.1 (April 2025) | 44.00% | $393.65 | No |
| Swebench Verified Mini | Gemini 2.0 Flash | 24.00% | $4.72 | No |
| Swebench Verified Mini | DeepSeek V3 | 24.00% | $2.10 | Yes |
| Swebench Verified Mini | DeepSeek R1 | 0.00% | $0.41 | Yes |
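
To make the accuracy/cost trade-off concrete, here is a minimal Python sketch that recomputes a frontier from the 13 runs listed above. It rests on several assumptions: only the runs in this table are considered (other agents on the benchmark are ignored), the `Run` type and both function names are hypothetical, and the two rules shown (strict non-dominance and an upper convex hull) are candidate definitions rather than a confirmed description of how the "On the Pareto Frontier?" column is computed.

```python
from typing import NamedTuple


class Run(NamedTuple):
    model: str
    accuracy: float  # percent
    cost: float      # USD


# Values copied from the table above (release dates omitted from the names).
RUNS = [
    Run("Claude Opus 4.1", 54.0, 1789.67),
    Run("Claude-3.7 Sonnet High", 54.0, 388.88),
    Run("Claude Opus 4.1 High", 54.0, 1599.90),
    Run("o4-mini Low", 54.0, 259.20),
    Run("o4-mini High", 50.0, 248.46),
    Run("Claude Opus 4", 50.0, 1330.90),
    Run("Claude-3.7 Sonnet", 50.0, 402.69),
    Run("GPT-5 Medium", 46.0, 162.93),
    Run("o3 Medium", 46.0, 483.43),
    Run("GPT-4.1", 44.0, 393.65),
    Run("Gemini 2.0 Flash", 24.0, 4.72),
    Run("DeepSeek V3", 24.0, 2.10),
    Run("DeepSeek R1", 0.0, 0.41),
]


def non_dominated(runs):
    """Keep runs that no other run beats or ties on both cost and accuracy."""
    keep = []
    for r in runs:
        dominated = any(
            o.cost <= r.cost
            and o.accuracy >= r.accuracy
            and (o.cost < r.cost or o.accuracy > r.accuracy)
            for o in runs
        )
        if not dominated:
            keep.append(r)
    return sorted(keep, key=lambda r: r.cost)


def upper_hull(runs):
    """Upper convex hull of the non-dominated runs in (cost, accuracy) space."""
    hull = []
    for p in non_dominated(runs):
        # Pop the last point while it lies on or below the segment from
        # hull[-2] to p (cross product >= 0 means no clockwise turn).
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (b.cost - a.cost) * (p.accuracy - a.accuracy) - (
                b.accuracy - a.accuracy
            ) * (p.cost - a.cost)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


strict_names = [r.model for r in non_dominated(RUNS)]
hull_names = [r.model for r in upper_hull(RUNS)]
print("Strict non-dominated:", strict_names)
print("Upper convex hull:   ", hull_names)
```

On this data, the strict rule keeps five runs, including o4-mini High (50.00% at $248.46), which the table marks No. The upper convex hull drops that run because it falls below the line between GPT-5 Medium and o4-mini Low, leaving the four runs marked Yes. That agreement suggests, but does not confirm, a hull-style definition.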