Cybench

Cybench is a benchmark for evaluating the cybersecurity capabilities and risks of language models. It includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.

Official Cybench Website
Paper: Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models (Zhang et al., 2024)

Key Features of Cybench

Professional CTF Tasks

Tasks sourced from real CTF competitions, representing actual cybersecurity challenges.

Diverse Categories

Tasks span various cybersecurity domains including web exploitation, cryptography, and reverse engineering.

Gradated Evaluation

Subtasks break complex challenges down into intermediate steps, so partial progress can be assessed even when the final flag is not captured.
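As a rough illustration of how subtask-based scoring could work, the sketch below computes the fraction of subtasks an agent answered correctly for one task. The `SubtaskResult` type, the function name, and the example questions are hypothetical and are not taken from the Cybench codebase.

```python
from dataclasses import dataclass


@dataclass
class SubtaskResult:
    """One intermediate step of a CTF task (hypothetical structure)."""
    question: str
    solved: bool


def subtask_score(results: list[SubtaskResult]) -> float:
    """Return the fraction of subtasks solved, between 0.0 and 1.0."""
    if not results:
        raise ValueError("a task needs at least one subtask")
    return sum(r.solved for r in results) / len(results)


# Hypothetical subtasks for a single task; the last one is the flag itself.
example_results = [
    SubtaskResult("Which file contains the vulnerable code?", True),
    SubtaskResult("Which cipher does the service use?", True),
    SubtaskResult("What is the recovered key?", True),
    SubtaskResult("What is the flag?", False),
]

print(f"Subtask score: {subtask_score(example_results):.2f}")
```

Under this assumed scoring, an agent that fails to capture the flag still receives credit for the intermediate steps it completed.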

Cybench Leaderboard

Column notes:
Verified: results have been reproduced by the HAL team.
Accuracy: confidence intervals show the min-max values across runs for agents with multiple runs.
Cost (USD): total API cost for running the agent on all tasks. Confidence intervals show the min-max values across runs for agents with multiple runs.
Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.

Rank	Agent	Models	Verified	Accuracy	Cost (USD)	Runs	Traces
1	Inspect ReAct Agent	o3-mini-2025-01-14 med.	✓	25.00%	$5.88	1	Download
2	Inspect ReAct Agent	o1-preview-2024-09-12	✓	20.00%	$117.89	1	Download
3	Inspect ReAct Agent	claude-3-5-sonnet-20241022	✓	20.00%	$12.90	1	Download
4	Inspect ReAct Agent	gpt-4.5-preview-2025-02-27	✓	17.50%	$265.89	1	Download
5	Inspect ReAct Agent	claude-3-7-sonnet-20250219	✓	17.50% (-0.00/+0.00)	$16.95 (-0.92/+0.92)	2	Download
6	Inspect ReAct Agent	o1-mini-2024-09-12	✓	12.50%	$10.44	1	Download
7	Inspect ReAct Agent	gpt-4o-2024-11-20	✓	12.50%	$7.97	1	Download
8	Inspect ReAct Agent	gpt-4o-mini-2024-07-18	✓	7.50% (-0.00/+0.00)	$0.48 (-0.02/+0.02)	2	Download
9	Inspect ReAct Agent	Meta-Llama-3.1-405B-Instruct-Turbo	✓	0.00%	$2.93	1	Download

* Note: The exact token pricing for o3-mini is not available yet, so its cost is computed at o1 prices. The lower bound shows the cost assuming o3-mini is priced the same as o1-mini.
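The min-max intervals in the table can be reproduced with a few lines of code. The sketch below assumes the reported value is the mean across runs and that the offsets are the distances to the lowest and highest run; the leaderboard's own implementation may differ. The two per-run costs are hypothetical values chosen to match the claude-3-7-sonnet row.

```python
def summarize_runs(values: list[float]) -> tuple[float, float, float]:
    """Return (mean, distance down to the min, distance up to the max)."""
    if not values:
        raise ValueError("need at least one run")
    center = sum(values) / len(values)
    return center, center - min(values), max(values) - center


def format_cost(values: list[float]) -> str:
    """Format costs the way the table does, with offsets only for multiple runs."""
    center, below, above = summarize_runs(values)
    if len(values) == 1:
        return f"${center:.2f}"
    return f"${center:.2f} (-{below:.2f}/+{above:.2f})"


# Hypothetical per-run costs for an agent with two runs.
hypothetical_costs = [16.03, 17.87]
print(format_cost(hypothetical_costs))  # $16.95 (-0.92/+0.92)
```

With a single run there is no spread to report, which is why most rows show a bare number.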

Accuracy vs. Cost Frontier for Cybench

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
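One common way to compute such a frontier is to sort agents by cost and keep each one that is more accurate than every cheaper agent. The sketch below applies that definition to the leaderboard rows above. It is an assumption about the method, since the page does not say how the dashed line is constructed (a convex-hull variant, for example, could select different points).

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of agents not dominated on (lower cost, higher accuracy).

    Each point is (name, cost_usd, accuracy_percent).
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; on cost ties, consider the more accurate agent first.
    for name, _cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# (model, cost in USD, accuracy in %) copied from the leaderboard table.
leaderboard = [
    ("o3-mini-2025-01-14 med.", 5.88, 25.00),
    ("o1-preview-2024-09-12", 117.89, 20.00),
    ("claude-3-5-sonnet-20241022", 12.90, 20.00),
    ("gpt-4.5-preview-2025-02-27", 265.89, 17.50),
    ("claude-3-7-sonnet-20250219", 16.95, 17.50),
    ("o1-mini-2024-09-12", 10.44, 12.50),
    ("gpt-4o-2024-11-20", 7.97, 12.50),
    ("gpt-4o-mini-2024-07-18", 0.48, 7.50),
    ("Meta-Llama-3.1-405B-Instruct-Turbo", 2.93, 0.00),
]

print(pareto_frontier(leaderboard))
```

Under this definition only gpt-4o-mini and o3-mini are on the frontier: every other agent costs more than o3-mini without being more accurate.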

Heatmap for Cybench

The heatmap visualizes success rates across tasks and agents. Colorscale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
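The quantities behind the heatmap can be sketched as follows: a per-agent solve fraction for each task, and an "any agent" row marking tasks that at least one agent solved at least once. The agent names, task names, and run results below are hypothetical, and the function names are illustrative rather than taken from the HAL codebase.

```python
def solve_fractions(
    runs: dict[str, list[set[str]]], tasks: list[str]
) -> dict[str, dict[str, float]]:
    """For each agent, return the fraction of its runs that solved each task.

    `runs` maps an agent name to a list of runs; each run is the set of solved tasks.
    """
    return {
        agent: {
            task: sum(task in solved for solved in agent_runs) / len(agent_runs)
            for task in tasks
        }
        for agent, agent_runs in runs.items()
    }


def any_agent_solved(
    fractions: dict[str, dict[str, float]], tasks: list[str]
) -> dict[str, bool]:
    """Mark each task that at least one agent solved in at least one run."""
    return {
        task: any(per_task[task] > 0 for per_task in fractions.values())
        for task in tasks
    }


# Hypothetical data: agent_x has two runs, agent_y has one.
tasks = ["task_a", "task_b", "task_c"]
runs = {
    "agent_x": [{"task_a"}, {"task_a", "task_b"}],
    "agent_y": [set()],
}

fractions = solve_fractions(runs, tasks)
solved_by_any = any_agent_solved(fractions, tasks)
saturation = sum(solved_by_any.values()) / len(tasks)
print(fractions["agent_x"], solved_by_any, f"saturation={saturation:.2f}")
```

The share of tasks marked as solved by any agent is one simple way to read benchmark saturation from the heatmap.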

Failure Analysis (Experimental)

The failure analysis gives a per-agent breakdown of failure categories and their descriptions, which helps identify common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Additional Resources

Getting Started

Want to evaluate your agent on Cybench? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of Cybench tasks and their requirements:

View Tasks