AppWorld Normal

AppWorld is a set of complex day-to-day autonomous agent tasks that require interactive coding and API calls. Tasks are built on top of the AppWorld environment, which consists of 9 day-to-day apps, operable via 457 APIs, and is populated with the digital activities of ~100 people living in a simulated world. There are two test sets, Normal and Challenge, which differ in difficulty.

Paper: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (Trivedi et al., 2024)

Tasks: 168

Agents Evaluated

Key Features of AppWorld

Interactive Environment

Agents interact with a controlled world of apps and people that simulates real-world, day-to-day scenarios.

Two Task Categories

The Normal and Challenge test sets differ in difficulty: Normal tasks represent typical scenarios, while Challenge tasks are harder.

Comprehensive Testing

Each task evaluates multiple aspects: code correctness, contextual understanding, and interaction capabilities.

AppWorld Normal Leaderboard

Column notes:
Verified: results have been reproduced by the HAL team.
Accuracy: refers to Task Goal Completion. Confidence intervals show the min-max values across runs for those agents where multiple runs are available.
Scenario Goal Completion: the percentage of task scenarios for which the agent passed all evaluation tests for all tasks from that scenario.
Cost (USD): total API cost for running the agent on all tasks. Confidence intervals show the min-max values across runs for those agents where multiple runs are available.
Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.

Rank	Agent	Models	Verified	Accuracy	Scenario Goal Completion	Cost (USD)	Runs	Traces
1	ReAct	gpt-4o-2024-05-13	✓	48.80%	32.10%	$121.69	1	Download
2	PlanExec	gpt-4o-2024-05-13	✓	44.60%	23.20%	$225.06	1	Download
3	FullCodeRefl	gpt-4o-2024-05-13	✓	33.90%	26.80%	$19.78	1	Download
4	PlanExec	gpt-4-turbo-2024-04-09	✓	32.70%	16.10%	$291.37	1	Download
5	IPFunCall	gpt-4o-2024-05-13	✓	32.10%	16.10%	$51.28	1	Download
6	IPFunCall	gpt-4-turbo-2024-04-09	✓	30.40%	21.40%	$75.27	1	Download
7	ReAct	gpt-4-turbo-2024-04-09	✓	26.80%	12.50%	$242.14	1	Download
8	FullCodeRefl	gpt-4-turbo-2024-04-09	✓	25.60%	19.60%	$35.41	1	Download
9	FullCodeRefl	meta-llama/Llama-3-70b-chat-hf	✓	24.40%	17.90%	$12.47	1	Download
10	ReAct	meta-llama/Llama-3-70b-chat-hf	✓	20.80%	8.90%	$18.23	1	Download
11	FullCodeRefl	deepseek-ai/deepseek-coder-33b-instruct	✓	13.10%	8.90%	$14.34	1	Download
12	PlanExec	meta-llama/Llama-3-70b-chat-hf	✓	8.90%	1.80%	$38.42	1	Download
13	ReAct	deepseek-ai/deepseek-coder-33b-instruct	✓	7.10%	1.80%	$62.95	1	Download
14	PlanExec	deepseek-ai/deepseek-coder-33b-instruct	✓	1.80%	0.00%	$122.15	1	Download
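
The two completion metrics in the table differ only in how results are grouped: Accuracy (Task Goal Completion) counts individual tasks, while Scenario Goal Completion counts a scenario only if every task in it passed. The sketch below illustrates that difference. It is not the benchmark's evaluation code; the function name, the input layout, and the example task and scenario IDs are all hypothetical.

```python
from collections import defaultdict


def goal_completion(results):
    """Compute (task-level %, scenario-level %) from per-task results.

    `results` maps a task id to (scenario_id, passed_all_tests).
    This input layout is an assumption made for illustration.
    """
    if not results:
        raise ValueError("results must not be empty")

    # Group the pass/fail flags of each task under its scenario.
    by_scenario = defaultdict(list)
    for scenario_id, passed in results.values():
        by_scenario[scenario_id].append(passed)

    # Task-level: share of tasks that passed all their tests.
    tgc = 100.0 * sum(passed for _, passed in results.values()) / len(results)
    # Scenario-level: share of scenarios in which every task passed.
    sgc = 100.0 * sum(all(flags) for flags in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


# Hypothetical results: two scenarios with three tasks each.
example = {
    "s1_t1": ("s1", True),
    "s1_t2": ("s1", True),
    "s1_t3": ("s1", False),  # one failure sinks scenario s1
    "s2_t1": ("s2", True),
    "s2_t2": ("s2", True),
    "s2_t3": ("s2", True),
}
tgc, sgc = goal_completion(example)
print(f"Task Goal Completion: {tgc:.1f}%  Scenario Goal Completion: {sgc:.1f}%")
```

In this example five of six tasks pass, but only one of two scenarios is fully solved, which is why the scenario-level figure sits at or below the task-level figure for every agent in the table.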

Accuracy vs. Cost Frontier for AppWorld Normal

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
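
As a rough illustration of the trade-off, the sketch below finds the non-dominated agents from the Accuracy and Cost (USD) columns of the leaderboard table. It assumes the simplest definition of a Pareto frontier (no other entry is both cheaper and more accurate); the frontier drawn on the page may be constructed differently, and the function name is hypothetical.

```python
def pareto_frontier(points):
    """Return the (name, cost, accuracy) entries that no cheaper entry beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; keep an entry only if it
    # improves on the best accuracy seen at lower cost.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# (agent and model, cost in USD, accuracy in %) copied from the leaderboard table.
leaderboard = [
    ("ReAct / gpt-4o-2024-05-13", 121.69, 48.8),
    ("PlanExec / gpt-4o-2024-05-13", 225.06, 44.6),
    ("FullCodeRefl / gpt-4o-2024-05-13", 19.78, 33.9),
    ("PlanExec / gpt-4-turbo-2024-04-09", 291.37, 32.7),
    ("IPFunCall / gpt-4o-2024-05-13", 51.28, 32.1),
    ("IPFunCall / gpt-4-turbo-2024-04-09", 75.27, 30.4),
    ("ReAct / gpt-4-turbo-2024-04-09", 242.14, 26.8),
    ("FullCodeRefl / gpt-4-turbo-2024-04-09", 35.41, 25.6),
    ("FullCodeRefl / Llama-3-70b-chat-hf", 12.47, 24.4),
    ("ReAct / Llama-3-70b-chat-hf", 18.23, 20.8),
    ("FullCodeRefl / deepseek-coder-33b-instruct", 14.34, 13.1),
    ("PlanExec / Llama-3-70b-chat-hf", 38.42, 8.9),
    ("ReAct / deepseek-coder-33b-instruct", 62.95, 7.1),
    ("PlanExec / deepseek-coder-33b-instruct", 122.15, 1.8),
]

frontier = pareto_frontier(leaderboard)
for name, cost, accuracy in frontier:
    print(f"{name}: ${cost:.2f}, {accuracy:.1f}%")
```

Under this definition, three entries remain: FullCodeRefl with Llama-3-70b-chat-hf, FullCodeRefl with gpt-4o-2024-05-13, and ReAct with gpt-4o-2024-05-13.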

Heatmap for AppWorld Normal

The heatmap visualizes success rates across tasks and agents. Colorscale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
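
One way to read the "any agent" row is as the best per-task success rate across all agents. The sketch below works through that reading on made-up data. The agent names, the success rates, and the exact aggregation (a per-task maximum) are assumptions for illustration, not the page's documented method.

```python
def any_agent_row(success_rates):
    """For each task, take the best success rate achieved by any agent.

    `success_rates` maps an agent name to a list of per-task success
    fractions (one entry per task, same task order for every agent).
    """
    rows = list(success_rates.values())
    # zip(*rows) regroups the values column by column, i.e. task by task.
    return [max(task_rates) for task_rates in zip(*rows)]


# Hypothetical heatmap data: 2 agents, 4 tasks, fraction of reruns solved.
success_rates = {
    "agent_a": [1.0, 0.0, 0.5, 0.0],
    "agent_b": [0.0, 0.0, 1.0, 0.0],
}

any_row = any_agent_row(success_rates)
# Share of tasks solved at least once by at least one agent.
solved_fraction = sum(rate > 0 for rate in any_row) / len(any_row)
print(any_row, solved_fraction)
```

Here half of the tasks are solved by at least one agent; a value approaching 1.0 would suggest the benchmark is close to saturated.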

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Failure Categories

Distribution of Failures

Additional Resources

Getting Started

Want to evaluate your agent on AppWorld? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of AppWorld tasks, including both the Normal and Challenge test sets:

View Tasks