AppWorld Challenge

AppWorld is a benchmark of complex, day-to-day tasks for autonomous agents that require interactive coding and API calls. Tasks are built on top of the AppWorld environment, which consists of 9 day-to-day apps, operable via 457 APIs and populated with the digital activities of ~100 people living in a simulated world. There are two test sets, Normal and Challenge, which differ in difficulty.

Paper: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (Trivedi et al., 2024)

417 Tasks


Key Features of AppWorld

Complex Tasks

Challenges range from simple utilities to full-stack applications with complex business logic.

Multi-step Planning

Tests agents' ability to break down complex tasks and execute them in a logical sequence.

Best Practices

Evaluates adherence to software development best practices, code quality, and documentation.

AppWorld Challenge Leaderboard

Column notes:
Verified: results have been reproduced by the HAL team.
Accuracy: Task Goal Completion. Confidence intervals show the min-max values across runs for agents where multiple runs are available.
Scenario Goal Completion: the percentage of task scenarios for which the agent passed all evaluation tests for all tasks from that scenario.
Cost (USD): total API cost for running the agent on all tasks. Confidence intervals show the min-max values across runs for agents where multiple runs are available.
Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.

Rank	Agent	Models	Verified	Accuracy	Scenario Goal Completion	Cost (USD)	Runs	Traces
1	ReAct	gpt-4o-2024-05-13	✓	30.20%	13.00%	$451.24	1	Download
2	PlanExec	gpt-4o-2024-05-13	✓	19.70%	7.90%	$766.62	1	Download
3	FullCodeRefl	gpt-4o-2024-05-13	✓	19.20%	12.20%	$72.28	1	Download
4	IPFunCall	gpt-4o-2024-05-13	✓	18.00%	10.10%	$131.83	1	Download
5	ReAct	gpt-4-turbo-2024-04-09	✓	17.50%	5.80%	$887.70	1	Download
6	IPFunCall	gpt-4-turbo-2024-04-09	✓	14.60%	9.30%	$211.99	1	Download
7	FullCodeRefl	gpt-4-turbo-2024-04-09	✓	12.50%	7.20%	$148.94	1	Download
8	PlanExec	gpt-4-turbo-2024-04-09	✓	11.00%	3.60%	$1057.91	1	Download
9	FullCodeRefl	meta-llama/Llama-3-70b-chat-hf	✓	7.00%	4.30%	$36.52	1	Download
10	FullCodeRefl	deepseek-ai/deepseek-coder-33b-instruct	✓	5.80%	2.90%	$49.63	1	Download
11	ReAct	meta-llama/Llama-3-70b-chat-hf	✓	3.40%	0.00%	$69.46	1	Download
12	ReAct	deepseek-ai/deepseek-coder-33b-instruct	✓	2.90%	0.70%	$186.81	1	Download
13	PlanExec	meta-llama/Llama-3-70b-chat-hf	✓	2.40%	0.70%	$149.88	1	Download
14	PlanExec	deepseek-ai/deepseek-coder-33b-instruct	✓	0.70%	0.00%	$241.29	1	Download
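
The relationship between Accuracy (Task Goal Completion) and Scenario Goal Completion can be illustrated with a small example. This is a minimal sketch, assuming each task belongs to exactly one scenario and has a single pass/fail outcome; the `(scenario_id, task_id, passed)` record format and the example data are hypothetical and are not taken from the benchmark or its evaluation harness.

```python
from collections import defaultdict


def completion_metrics(results):
    """Compute (task goal completion, scenario goal completion).

    results: list of (scenario_id, task_id, passed) tuples.
    This record format is a hypothetical one chosen for illustration.
    """
    if not results:
        raise ValueError("results must not be empty")

    # Task Goal Completion: share of individual tasks that passed.
    tgc = sum(1 for _, _, passed in results if passed) / len(results)

    # Scenario Goal Completion: share of scenarios in which every task passed.
    by_scenario = defaultdict(list)
    for scenario_id, _, passed in results:
        by_scenario[scenario_id].append(passed)
    solved = sum(1 for outcomes in by_scenario.values() if all(outcomes))
    sgc = solved / len(by_scenario)

    return tgc, sgc


# Hypothetical outcomes: two scenarios with three tasks each.
example = [
    ("s1", "s1_1", True), ("s1", "s1_2", True), ("s1", "s1_3", True),
    ("s2", "s2_1", True), ("s2", "s2_2", False), ("s2", "s2_3", True),
]
tgc, sgc = completion_metrics(example)
print(f"TGC={tgc:.1%} SGC={sgc:.1%}")  # TGC=83.3% SGC=50.0%
```

A single failed task removes its whole scenario from the scenario-level count, so when every scenario has the same number of tasks the scenario score cannot exceed the task score. That is consistent with the table, where Scenario Goal Completion is never higher than Accuracy.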

Accuracy vs. Cost Frontier for AppWorld Challenge

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
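
One common way to compute such a frontier is sketched below, using the accuracy and cost figures from the leaderboard table. The page's own plotting code is not shown, so the dominance rule here (a run is on the frontier if no run with equal or lower cost has equal or higher accuracy) is an assumption, and `pareto_frontier` is a hypothetical helper name.

```python
# (agent, model, accuracy in %, cost in USD), copied from the leaderboard table.
RUNS = [
    ("ReAct", "gpt-4o-2024-05-13", 30.2, 451.24),
    ("PlanExec", "gpt-4o-2024-05-13", 19.7, 766.62),
    ("FullCodeRefl", "gpt-4o-2024-05-13", 19.2, 72.28),
    ("IPFunCall", "gpt-4o-2024-05-13", 18.0, 131.83),
    ("ReAct", "gpt-4-turbo-2024-04-09", 17.5, 887.70),
    ("IPFunCall", "gpt-4-turbo-2024-04-09", 14.6, 211.99),
    ("FullCodeRefl", "gpt-4-turbo-2024-04-09", 12.5, 148.94),
    ("PlanExec", "gpt-4-turbo-2024-04-09", 11.0, 1057.91),
    ("FullCodeRefl", "meta-llama/Llama-3-70b-chat-hf", 7.0, 36.52),
    ("FullCodeRefl", "deepseek-ai/deepseek-coder-33b-instruct", 5.8, 49.63),
    ("ReAct", "meta-llama/Llama-3-70b-chat-hf", 3.4, 69.46),
    ("ReAct", "deepseek-ai/deepseek-coder-33b-instruct", 2.9, 186.81),
    ("PlanExec", "meta-llama/Llama-3-70b-chat-hf", 2.4, 149.88),
    ("PlanExec", "deepseek-ai/deepseek-coder-33b-instruct", 0.7, 241.29),
]


def pareto_frontier(runs):
    """Return the runs that no cheaper-or-equal run matches or beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, higher accuracy first.
    for run in sorted(runs, key=lambda r: (r[3], -r[2])):
        if run[2] > best_accuracy:
            frontier.append(run)
            best_accuracy = run[2]
    return frontier


frontier = pareto_frontier(RUNS)
for agent, model, accuracy, cost in frontier:
    print(f"{agent} ({model}): {accuracy:.1f}% at ${cost:.2f}")
# FullCodeRefl (meta-llama/Llama-3-70b-chat-hf): 7.0% at $36.52
# FullCodeRefl (gpt-4o-2024-05-13): 19.2% at $72.28
# ReAct (gpt-4o-2024-05-13): 30.2% at $451.24
```

Under this rule, every other entry in the table is dominated: some agent reaches at least the same accuracy for the same or lower cost.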

Heatmap for AppWorld Challenge

The heatmap visualizes success rates across tasks and agents. Colorscale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
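
The quantities behind such a heatmap can be sketched as follows. This is a minimal illustration, not the site's implementation: the input format (one `{task_id: solved}` dictionary per run), the agent and task names, and the reading of "any agent" as "solved at least once by at least one agent" are all assumptions.

```python
def solve_rates(runs_by_agent):
    """Map agent -> {task_id: fraction of that agent's runs that solved the task}."""
    rates = {}
    for agent, runs in runs_by_agent.items():
        task_ids = runs[0].keys()
        rates[agent] = {
            task: sum(run[task] for run in runs) / len(runs) for task in task_ids
        }
    return rates


def any_agent_row(rates):
    """Map task_id -> True if any agent solved the task in at least one run."""
    task_ids = next(iter(rates.values())).keys()
    return {
        task: any(agent_rates[task] > 0 for agent_rates in rates.values())
        for task in task_ids
    }


# Hypothetical data: agent_a has two runs, agent_b has one.
runs_by_agent = {
    "agent_a": [
        {"t1": True, "t2": False, "t3": False},
        {"t1": True, "t2": True, "t3": False},
    ],
    "agent_b": [
        {"t1": False, "t2": False, "t3": False},
    ],
}

rates = solve_rates(runs_by_agent)
any_agent = any_agent_row(rates)
saturation = sum(any_agent.values()) / len(any_agent)

print(rates["agent_a"])  # {'t1': 1.0, 't2': 0.5, 't3': 0.0}
print(any_agent)         # {'t1': True, 't2': True, 't3': False}
```

With a single run per agent, as in the leaderboard above, each agent's cell is either 0 or 1; fractional values only appear once an agent has been rerun.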

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.


Additional Resources

Getting Started

Want to evaluate your agent on AppWorld Challenge? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of AppWorld Challenge tasks and their requirements:

View Tasks