AgentHarm Benchmark
AgentHarm evaluates AI agents on their ability to recognize and refuse potentially harmful requests while still completing benign ones. It requires agents to execute multi-step tasks using external tools. The benchmark includes pairs of tasks across different harm categories (such as fraud, cybercrime, and harassment): each pair contains a harmful task and a benign counterpart.
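The paired-task structure described above can be sketched as a simple record; the field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TaskPair:
    """Hypothetical sketch of an AgentHarm task pair (not the real schema)."""
    category: str          # e.g. "fraud", "cybercrime", "harassment"
    harmful_prompt: str    # request the agent should refuse
    benign_prompt: str     # matched counterpart the agent should complete
    tools: list[str]       # external tools available for the multi-step task

pair = TaskPair(
    category="fraud",
    harmful_prompt="...",
    benign_prompt="...",
    tools=["send_email", "web_search"],
)
```

The point of the pairing is that the same tool set and task structure appear in both versions, so refusal behavior can be separated from raw task ability.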
Paper: AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Andriushchenko et al., 2024)
Key Features of AgentHarm
Safety Assessment
Evaluates agents' ability to identify and refuse potentially harmful actions in various scenarios.
Refusal Analysis
Measures both appropriate refusal of harmful requests and avoidance of false positives.
Various Harm Categories
Tests across diverse harm categories, including fraud, copyright infringement, drugs, and disinformation.
Leaderboard (Harmful Tasks)
| Rank | Agent | Verified | Harm Score | Refusals | Non-Refusal Harm Score | Cost (USD) | Runs | Traces |
|---|---|---|---|---|---|---|---|---|
| 1 | Meta-Llama-3.1-405B-Instruct-Turbo | ✓ | 7.48% | 80.11% | 37.63% | $1.03 | 1 | Download |
| 2 | claude-3-5-sonnet-20241022 | ✓ | 11.64% | 88.07% | 92.02% | $1.72 | 1 | Download |
| 3 | o1-2024-12-17 med. | ✓ | 13.91% | 75.57% | 54.93% | $17.42 | 1 | Download |
| 4 | o1-mini-2024-09-12 | ✓ | 14.96% | 72.16% | 49.85% | $2.30 | 1 | Download |
| 5 | gpt-4o-2024-11-20 | ✓ | 46.62% | 47.16% | 82.63% | $1.39 | 1 | Download |
| 6 | gpt-4o-mini-2024-07-18 | ✓ | 61.01% | 16.48% | 71.00% | $0.11 | 1 | Download |

Column notes:
- Verified: results have been reproduced by the HAL team.
- Harm Score and Cost (USD): confidence intervals show the min-max values across runs for agents with multiple runs. Cost is the total API cost for running the agent on all tasks.
- Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.
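One plausible reading of the Non-Refusal Harm Score column (an assumption, not confirmed by this page) is the average harm score over only the tasks the agent did not refuse. A minimal sketch under that assumption:

```python
def non_refusal_harm_score(task_results):
    """Average harm score over tasks the agent did not refuse.

    task_results: list of (harm_score, refused) pairs, where harm_score
    is in [0, 1] and refused is a bool. This mirrors one plausible
    reading of the leaderboard's Non-Refusal Harm Score column.
    """
    attempted = [score for score, refused in task_results if not refused]
    if not attempted:
        return 0.0  # the agent refused everything
    return sum(attempted) / len(attempted)

# Example: 4 tasks, 2 refused; harm is averaged over the 2 attempted ones.
results = [(0.8, False), (0.0, True), (0.2, False), (0.0, True)]
print(non_refusal_harm_score(results))  # → 0.5
```

This explains why a model can rank well on overall Harm Score while showing a high Non-Refusal Harm Score: frequent refusals shrink the denominator to only the tasks it chose to attempt.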
Leaderboard (Benign)
| Rank | Agent | Verified | Accuracy | Refusals | Non-Refusal Accuracy | Cost (USD) | Runs | Traces |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4o-2024-11-20 | ✓ | 82.71% | 0.00% | 82.71% | $2.27 | 1 | Download |
| 2 | o1-2024-12-17 med. | ✓ | 77.97% | 3.98% | 81.20% | $70.10 | 1 | Download |
| 3 | claude-3-5-sonnet-20241022 | ✓ | 75.39% | 7.39% | 80.48% | $6.59 | 1 | Download |
| 4 | gpt-4o-mini-2024-07-18 | ✓ | 74.08% | 0.00% | 74.08% | $0.14 | 1 | Download |
| 5 | o1-mini-2024-09-12 | ✓ | 59.57% | 5.11% | 62.54% | $7.82 | 1 | Download |
| 6 | Meta-Llama-3.1-405B-Instruct-Turbo | ✓ | 27.02% | 13.07% | 31.09% | $2.46 | 1 | Download |

Column notes:
- Verified: results have been reproduced by the HAL team.
- Accuracy and Cost (USD): confidence intervals show the min-max values across runs for agents with multiple runs. Cost is the total API cost for running the agent on all tasks.
- Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.
Refusal Heatmap (Harmful Tasks)
Unlike other benchmarks that show success rates, this heatmap visualizes refusal rates across different task categories and difficulty levels, helping identify where agents appropriately refuse potentially harmful tasks.
Token Pricing Configuration
Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.
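Recomputing a run's total under adjusted prices follows the usual per-token formula; a minimal sketch (the prices and token counts below are made-up examples, not HAL's actual rates):

```python
def run_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Total API cost in USD, given per-million-token input/output prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Example: 400k input / 50k output tokens at $2.50 / $10.00 per 1M tokens.
cost = run_cost(400_000, 50_000, 2.50, 10.00)
print(f"${cost:.2f}")  # → $1.50
```

Since the leaderboard stores token counts per run, changing the price inputs only rescales the Cost (USD) column; the harm and refusal metrics are unaffected.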
Additional Resources
Getting Started
Want to evaluate your agent on AgentHarm? Follow our comprehensive guide to get started:
View Documentation

Task Details
Browse the complete list of AgentHarm tasks, including safety scenarios and evaluation criteria:
View Tasks