Cybench

Cybench is a benchmark for evaluating the cybersecurity capabilities and risks of language models. It includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.

Official Cybench Website
Paper: Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models (Zhang et al., 2024)

40 CTF Tasks
4 CTF Competitions
9 Agents Evaluated

Key Features of Cybench

Professional CTF Tasks

Tasks sourced from real CTF competitions, representing actual cybersecurity challenges.

Diverse Categories

Tasks span various cybersecurity domains including web exploitation, cryptography, and reverse engineering.

Gradated Evaluation

Subtasks break down complex challenges into intermediary steps for more nuanced assessment.
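As a rough illustration of how subtasks allow partial credit, the sketch below scores a single task as the fraction of its subtasks an agent completed. The function name, the input format, and the scoring rule are assumptions made for illustration; they are not taken from the Cybench codebase, which may score subtasks differently.

```python
from typing import Sequence


def subtask_score(subtask_results: Sequence[bool]) -> float:
    """Return the fraction of subtasks solved for one task.

    Hypothetical helper: `subtask_results` holds one boolean per
    subtask, in order, where True means the agent answered it correctly.
    """
    if not subtask_results:
        raise ValueError("a task needs at least one subtask")
    return sum(subtask_results) / len(subtask_results)


# An agent that solves the first two of four subtasks but never
# recovers the flag still earns partial credit under this rule.
example_score = subtask_score([True, True, False, False])
```

Under a pass/fail rule the same attempt would count as a plain failure, which is the loss of detail that subtasks are meant to avoid.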

Cybench Leaderboard

Rank  Models                              Accuracy              Cost (USD)             Runs
1     o3-mini-2025-01-14 med.             25.00%                $5.88                  1
2     o1-preview-2024-09-12               20.00%                $117.89                1
3     claude-3-5-sonnet-20241022          20.00%                $12.90                 1
4     gpt-4.5-preview-2025-02-27          17.50%                $265.89                1
5     claude-3-7-sonnet-20250219          17.50% (-0.00/+0.00)  $16.95 (-0.92/+0.92)   2
6     o1-mini-2024-09-12                  12.50%                $10.44                 1
7     gpt-4o-2024-11-20                   12.50%                $7.97                  1
8     gpt-4o-mini-2024-07-18              7.50% (-0.00/+0.00)   $0.48 (-0.02/+0.02)    2
9     Meta-Llama-3.1-405B-Instruct-Turbo  0.00%                 $2.93                  1
* Note: The exact token pricing for o3-mini is not yet available, so we set it to the price of o1. The lower bound shows the cost assuming o3-mini is priced the same as o1-mini.

Accuracy vs. Cost Frontier for Cybench

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
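A minimal sketch of one common way to compute such a frontier, using the headline accuracy and cost values from the leaderboard on this page. It treats an entry as being on the frontier when no other entry is both at least as cheap and at least as accurate. The `Entry` type and `pareto_frontier` function are hypothetical names, and the plot itself may be built with a different construction.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    model: str
    accuracy: float  # percent of tasks solved
    cost: float      # total cost in USD


# Headline values copied from the leaderboard on this page.
ENTRIES = [
    Entry("o3-mini-2025-01-14 med.", 25.00, 5.88),
    Entry("o1-preview-2024-09-12", 20.00, 117.89),
    Entry("claude-3-5-sonnet-20241022", 20.00, 12.90),
    Entry("gpt-4.5-preview-2025-02-27", 17.50, 265.89),
    Entry("claude-3-7-sonnet-20250219", 17.50, 16.95),
    Entry("o1-mini-2024-09-12", 12.50, 10.44),
    Entry("gpt-4o-2024-11-20", 12.50, 7.97),
    Entry("gpt-4o-mini-2024-07-18", 7.50, 0.48),
    Entry("Meta-Llama-3.1-405B-Instruct-Turbo", 0.00, 2.93),
]


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Return the non-dominated entries, ordered by increasing cost."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, the more
    # accurate entry comes first so the less accurate one is skipped.
    for entry in sorted(entries, key=lambda e: (e.cost, -e.accuracy)):
        if entry.accuracy > best_accuracy:
            frontier.append(entry)
            best_accuracy = entry.accuracy
    return frontier


frontier_models = [entry.model for entry in pareto_frontier(ENTRIES)]
```

With these values the frontier has two points, gpt-4o-mini-2024-07-18 and o3-mini-2025-01-14 med., because every other entry costs more than o3-mini while scoring lower.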

Heatmap for Cybench

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
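The quantities behind the heatmap can be sketched as follows. The agent names, task names, and run results are invented for illustration, and the sketch assumes that "any agent" means a task was solved by at least one agent in at least one run; the page does not spell out the exact definition.

```python
from typing import Dict, List

# Hypothetical results: agent -> list of runs, each run maps task -> solved.
RUNS: Dict[str, List[Dict[str, bool]]] = {
    "agent_a": [
        {"task_1": True, "task_2": False, "task_3": False},
        {"task_1": True, "task_2": True, "task_3": False},
    ],
    "agent_b": [
        {"task_1": False, "task_2": False, "task_3": False},
    ],
}


def solve_fractions(runs_by_agent):
    """Per agent, the fraction of its runs in which each task was solved."""
    fractions = {}
    for agent, runs in runs_by_agent.items():
        tasks = sorted({task for run in runs for task in run})
        fractions[agent] = {
            task: sum(run.get(task, False) for run in runs) / len(runs)
            for task in tasks
        }
    return fractions


def any_agent_solved(fractions):
    """Per task, whether at least one agent solved it in at least one run."""
    tasks = sorted({task for row in fractions.values() for task in row})
    return {
        task: any(row.get(task, 0.0) > 0 for row in fractions.values())
        for task in tasks
    }


fractions = solve_fractions(RUNS)
solved_by_any = any_agent_solved(fractions)
# Share of tasks solved by some agent: a simple saturation estimate.
saturation = sum(solved_by_any.values()) / len(solved_by_any)
```

In this toy data, agent_a solves task_2 in one of two runs, so that cell would be shaded at 0.5, and two of the three tasks are solved by some agent.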

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Failure Categories

Distribution of Failures

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.

Active models, each with two price fields in USD per 1M tokens:

together/meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
anthropic/claude-3-5-sonnet-20241022
anthropic/claude-3-7-sonnet-20250219
openai/gpt-4.5-preview-2025-02-27
openai/gpt-4o-2024-11-20
openai/gpt-4o-mini-2024-07-18
openai/o1-mini-2024-09-12
openai/o1-preview-2024-09-12
openai/o3-mini-2025-01-14
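A minimal sketch of how per-token prices could translate into a run's total cost, assuming the two price fields per model are an input-token price and an output-token price. The function name, the token counts, and the prices below are placeholders for illustration, not real usage figures or actual provider pricing.

```python
def run_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the cost in USD of one run given prices per 1M tokens."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder numbers: 2M input tokens at $3.00 per 1M and
# 500k output tokens at $15.00 per 1M.
example_cost = run_cost(2_000_000, 500_000, 3.0, 15.0)
```

Here the input side contributes $6.00 and the output side $7.50, for $13.50 in total. Changing either price rescales only its own term, which is why adjusting prices moves the cost figures while leaving accuracy unchanged.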

Additional Resources

Getting Started

Want to evaluate your agent on Cybench? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of Cybench tasks and their requirements:

View Tasks