AgentHarm Benchmark

AgentHarm evaluates AI agents on their ability to recognize and refuse potentially harmful requests while still completing legitimate ones, focusing on safety and responsible behavior across a range of scenarios. Each task requires the agent to carry out a multi-step workflow using external tools. The benchmark pairs tasks across different harm categories (such as fraud, cybercrime, and harassment): each pair contains a harmful version and a benign counterpart.

Paper: AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Andriushchenko et al., 2024)

44 Behaviors · 8 Categories · 6 Agents Evaluated

Key Features of AgentHarm

Safety Assessment

Evaluates agents' ability to identify and prevent potential harmful actions in various scenarios.

Refusal Analysis

Measures both appropriate refusal of harmful requests and avoidance of false positives.

Various Harm Categories

Tests across diverse harm categories including fraud, copyright, drugs, and disinformation among others.

Leaderboard (Harmful Tasks)

Results are evaluated on the public test set with attack technique set to 'No attack'.
Last updated: 2025-04-27 09:05 UTC
| Rank | Agent Models | Harm Score | Refusals | Non-Refusal Harm Score | Cost (USD) | Runs |
|------|--------------|-----------:|---------:|-----------------------:|-----------:|-----:|
| 1 | Meta-Llama-3.1-405B-Instruct-Turbo | 7.48% | 80.11% | 37.63% | $1.03 | 1 |
| 2 | claude-3-5-sonnet-20241022 | 11.64% | 88.07% | 92.02% | $1.72 | 1 |
| 3 | o1-2024-12-17 (med.) | 13.91% | 75.57% | 54.93% | $17.42 | 1 |
| 4 | o1-mini-2024-09-12 | 14.96% | 72.16% | 49.85% | $2.30 | 1 |
| 5 | gpt-4o-2024-11-20 | 46.62% | 47.16% | 82.63% | $1.39 | 1 |
| 6 | gpt-4o-mini-2024-07-18 | 61.01% | 16.48% | 71.00% | $0.11 | 1 |
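The three metrics above appear to be related: harm score averages harmfulness over all tasks (refused tasks counting as zero harm), while the non-refusal harm score averages only over tasks the agent did not refuse. Row 1 is approximately consistent with this reading (37.63% × (1 − 80.11%) ≈ 7.49%). A minimal sketch under that assumption, with hypothetical per-task records (`summarize` and its input format are not the benchmark's actual code):

```python
def summarize(results):
    """Hypothetical metric computation, assuming refused tasks score 0 harm.

    results: list of (refused: bool, harm: float in [0, 1]) per task.
    Returns (harm_score, refusal_rate, non_refusal_harm_score).
    """
    n = len(results)
    refusal_rate = sum(refused for refused, _ in results) / n
    # Harm over all tasks; a refusal contributes 0.
    harm_score = sum(0.0 if refused else harm for refused, harm in results) / n
    # Harm averaged only over tasks the agent actually attempted.
    attempted = [harm for refused, harm in results if not refused]
    non_refusal_harm = sum(attempted) / len(attempted) if attempted else 0.0
    return harm_score, refusal_rate, non_refusal_harm
```

Under this definition, harm score ≈ non-refusal harm score × (1 − refusal rate), which holds only approximately in the table, so the actual scoring likely includes additional details (e.g. partial-credit grading).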

Leaderboard (Benign)

Results are evaluated on the public test set with benign behaviors.
Last updated: 2025-04-27 09:05 UTC
| Rank | Agent Models | Accuracy | Refusals | Non-Refusal Accuracy | Cost (USD) | Runs |
|------|--------------|---------:|---------:|---------------------:|-----------:|-----:|
| 1 | gpt-4o-2024-11-20 | 82.71% | 0.00% | 82.71% | $2.27 | 1 |
| 2 | o1-2024-12-17 (med.) | 77.97% | 3.98% | 81.20% | $70.10 | 1 |
| 3 | claude-3-5-sonnet-20241022 | 75.39% | 7.39% | 80.48% | $6.59 | 1 |
| 4 | gpt-4o-mini-2024-07-18 | 74.08% | 0.00% | 74.08% | $0.14 | 1 |
| 5 | o1-mini-2024-09-12 | 59.57% | 5.11% | 62.54% | $7.82 | 1 |
| 6 | Meta-Llama-3.1-405B-Instruct-Turbo | 27.02% | 13.07% | 31.09% | $2.46 | 1 |

Refusal Heatmap (Harmful Tasks)

Unlike other benchmarks that show success rates, this heatmap visualizes refusal rates across different task categories and difficulty levels, helping identify where agents appropriately refuse potentially harmful tasks.
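Such a heatmap can be derived by grouping per-task refusal flags into (category, difficulty) cells and averaging. A minimal sketch with hypothetical records (the field names and sample data are illustrative, not the benchmark's actual schema):

```python
from collections import defaultdict

# Hypothetical per-task records: (category, difficulty, refused).
records = [
    ("Fraud", "easy", True),
    ("Fraud", "hard", False),
    ("Cybercrime", "easy", True),
    ("Cybercrime", "hard", True),
]

# cell -> [refused count, total count]
totals = defaultdict(lambda: [0, 0])
for category, difficulty, refused in records:
    cell = (category, difficulty)
    totals[cell][0] += int(refused)
    totals[cell][1] += 1

# Refusal rate per heatmap cell.
heatmap = {cell: refused / total for cell, (refused, total) in totals.items()}
```

Each cell value is the fraction of tasks in that category/difficulty bucket the agent refused, which is what the heatmap colors encode.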

Token Pricing Configuration

Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.
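The cost recomputation presumably follows the standard per-token pricing formula: input and output tokens are billed separately at their respective rates, quoted per 1M tokens. A sketch of that assumed calculation (`run_cost` is a hypothetical helper, not part of the site's code):

```python
def run_cost(input_tokens, output_tokens,
             input_price_per_1m, output_price_per_1m):
    """Assumed cost model: tokens times price, with prices in $ per 1M tokens."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000
```

For example, a run with 500k input tokens at $3/1M and 100k output tokens at $15/1M would cost $3.00 under this model.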

Configurable models (each with separate input and output prices in $ per 1M tokens):

- anthropic/claude-3-5-sonnet-20241022
- openai/o1-2024-12-17
- openai/gpt-4o-2024-11-20
- together/meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
- openai/gpt-4o-mini-2024-07-18
- openai/o1-mini-2024-09-12

Additional Resources

Getting Started

Want to evaluate your agent on AgentHarm? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of AgentHarm tasks, including safety scenarios and evaluation criteria:

View Tasks