AgentHarm Benchmark
AgentHarm evaluates AI agents on their ability to recognize and refuse potentially harmful requests while still completing benign ones. It requires agents to execute multi-step tasks using external tools. The benchmark includes pairs of tasks across different harm categories (such as fraud, cybercrime, and harassment): each pair contains a harmful task and a benign counterpart.
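The paired-task structure described above can be sketched as a simple record; the field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TaskPair:
    """Hypothetical sketch of an AgentHarm task pair (not the real schema)."""
    category: str          # e.g. "fraud", "cybercrime", "harassment"
    harmful_prompt: str    # request the agent should refuse
    benign_prompt: str     # matched counterpart the agent should complete
    tools: list[str]       # external tools available for the multi-step task

pair = TaskPair(
    category="fraud",
    harmful_prompt="...",
    benign_prompt="...",
    tools=["send_email", "web_search"],
)
```

The point of the pairing is that the same tool set and task structure appear in both versions, so refusal behavior can be separated from raw task ability.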
Paper: AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Andriushchenko et al., 2024)
Key Features of AgentHarm
Safety Assessment
Evaluates agents' ability to identify and refuse potentially harmful actions in various scenarios.
Refusal Analysis
Measures both appropriate refusal of harmful requests and avoidance of false positives.
Various Harm Categories
Tests across diverse harm categories, including fraud, copyright infringement, drugs, and disinformation.
Leaderboard (Harmful Tasks)
| Rank | Agent | Verified | Harm Score | Refusals | Non-Refusal Harm Score | Cost (USD) | Runs | Traces |
|---|---|---|---|---|---|---|---|---|
| 1 | Meta-Llama-3.1-405B-Instruct-Turbo | ✓ | 7.48% | 80.11% | 37.63% | $1.03 | 1 | Download |
| 2 | claude-3-5-sonnet-20241022 | ✓ | 11.64% | 88.07% | 92.02% | $1.72 | 1 | Download |
| 3 | o1-2024-12-17 med. | ✓ | 13.91% | 75.57% | 54.93% | $17.42 | 1 | Download |
| 4 | o1-mini-2024-09-12 | ✓ | 14.96% | 72.16% | 49.85% | $2.30 | 1 | Download |
| 5 | gpt-4o-2024-11-20 | ✓ | 46.62% | 47.16% | 82.63% | $1.39 | 1 | Download |
| 6 | gpt-4o-mini-2024-07-18 | ✓ | 61.01% | 16.48% | 71.00% | $0.11 | 1 | Download |

Column notes:
- Verified: results have been reproduced by the HAL team.
- Harm Score and Cost (USD): confidence intervals show the min-max values across runs for agents with multiple runs. Cost is the total API cost for running the agent on all tasks.
- Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.
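One plausible reading of the Non-Refusal Harm Score column (an assumption, not confirmed by this page) is the average harm score over only the tasks the agent did not refuse. A minimal sketch under that assumption:

```python
def non_refusal_harm_score(task_results):
    """Average harm score over tasks the agent did not refuse.

    task_results: list of (harm_score, refused) pairs, where harm_score
    is in [0, 1] and refused is a bool. This mirrors one plausible
    reading of the leaderboard's Non-Refusal Harm Score column.
    """
    attempted = [score for score, refused in task_results if not refused]
    if not attempted:
        return 0.0  # the agent refused everything
    return sum(attempted) / len(attempted)

# Example: 4 tasks, 2 refused; harm is averaged over the 2 attempted ones.
results = [(0.8, False), (0.0, True), (0.2, False), (0.0, True)]
print(non_refusal_harm_score(results))  # → 0.5
```

This explains why a model can rank well on overall Harm Score while showing a high Non-Refusal Harm Score: frequent refusals shrink the denominator to only the tasks it chose to attempt.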
Leaderboard (Benign)
| Rank | Agent | Verified | Accuracy | Refusals | Non-Refusal Accuracy | Cost (USD) | Runs | Traces |
|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4o-2024-11-20 | ✓ | 82.71% | 0.00% | 82.71% | $2.27 | 1 | Download |
| 2 | o1-2024-12-17 med. | ✓ | 77.97% | 3.98% | 81.20% | $70.10 | 1 | Download |
| 3 | claude-3-5-sonnet-20241022 | ✓ | 75.39% | 7.39% | 80.48% | $6.59 | 1 | Download |
| 4 | gpt-4o-mini-2024-07-18 | ✓ | 74.08% | 0.00% | 74.08% | $0.14 | 1 | Download |
| 5 | o1-mini-2024-09-12 | ✓ | 59.57% | 5.11% | 62.54% | $7.82 | 1 | Download |
| 6 | Meta-Llama-3.1-405B-Instruct-Turbo | ✓ | 27.02% | 13.07% | 31.09% | $2.46 | 1 | Download |

Column notes:
- Verified: results have been reproduced by the HAL team.
- Accuracy and Cost (USD): confidence intervals show the min-max values across runs for agents with multiple runs. Cost is the total API cost for running the agent on all tasks.
- Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.
Refusal Heatmap (Harmful Tasks)
Unlike other benchmarks that show success rates, this heatmap visualizes refusal rates across different task categories and difficulty levels, helping identify where agents appropriately refuse potentially harmful tasks.
Token Pricing Configuration
Adjust token prices to see how they affect the total cost calculations in the leaderboard and plots.
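Recomputing a run's total under adjusted prices follows the usual per-token formula; a minimal sketch (the prices and token counts below are made-up examples, not HAL's actual rates):

```python
def run_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Total API cost in USD, given per-million-token input/output prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Example: 400k input / 50k output tokens at $2.50 / $10.00 per 1M tokens.
cost = run_cost(400_000, 50_000, 2.50, 10.00)
print(f"${cost:.2f}")  # → $1.50
```

Since the leaderboard stores token counts per run, changing the price inputs only rescales the Cost (USD) column; the harm and refusal metrics are unaffected.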
Additional Resources
Getting Started
Want to evaluate your agent on AgentHarm? Follow our comprehensive guide to get started:
View Documentation

Task Details
Browse the complete list of AgentHarm tasks, including safety scenarios and evaluation criteria:
View Tasks