AppWorld Challenge

AppWorld is a benchmark of complex, day-to-day autonomous agent tasks that require interactive coding and API calls. Tasks are built on the AppWorld environment, which consists of 9 day-to-day apps operable via 457 APIs and populated with the digital activities of ~100 people living in a simulated world. There are two test sets, Normal and Challenge, which differ in difficulty.

Paper: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (Trivedi et al., 2024)

Tasks: 417
Agents Evaluated: 14

Key Features of AppWorld

Complex Tasks

Challenges range from simple utilities to full-stack applications with complex business logic.

Multi-step Planning

Tests agents' ability to break down complex tasks and execute them in a logical sequence.

Best Practices

Evaluates adherence to software development best practices, code quality, and documentation.

AppWorld Challenge Leaderboard

Rank  Model                                    Accuracy  Scenario Goal Completion  Cost (USD)  Runs
1     gpt-4o-2024-05-13                        30.20%    13.00%                    $451.24     1
2     gpt-4o-2024-05-13                        19.70%    7.90%                     $766.62     1
3     gpt-4o-2024-05-13                        19.20%    12.20%                    $72.28      1
4     gpt-4o-2024-05-13                        18.00%    10.10%                    $131.83     1
5     gpt-4-turbo-2024-04-09                   17.50%    5.80%                     $887.70     1
6     gpt-4-turbo-2024-04-09                   14.60%    9.30%                     $211.99     1
7     gpt-4-turbo-2024-04-09                   12.50%    7.20%                     $148.94     1
8     gpt-4-turbo-2024-04-09                   11.00%    3.60%                     $1057.91    1
9     meta-llama/Llama-3-70b-chat-hf           7.00%     4.30%                     $36.52      1
10    deepseek-ai/deepseek-coder-33b-instruct  5.80%     2.90%                     $49.63      1
11    meta-llama/Llama-3-70b-chat-hf           3.40%     0.00%                     $69.46      1
12    deepseek-ai/deepseek-coder-33b-instruct  2.90%     0.70%                     $186.81     1
13    meta-llama/Llama-3-70b-chat-hf           2.40%     0.70%                     $149.88     1
14    deepseek-ai/deepseek-coder-33b-instruct  0.70%     0.00%                     $241.29     1

Accuracy vs. Cost Frontier for AppWorld Challenge

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
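To make the frontier idea concrete, here is a minimal sketch that recomputes a Pareto frontier from the accuracy and cost figures in the leaderboard above. The dominance rule used here (a row is on the frontier if no cheaper row is at least as accurate) is an assumption; the page does not state how its dashed line is computed, and the function name `pareto_frontier` is illustrative only.

```python
# (rank, accuracy in %, cost in USD), copied from the leaderboard above.
ROWS = [
    (1, 30.20, 451.24), (2, 19.70, 766.62), (3, 19.20, 72.28),
    (4, 18.00, 131.83), (5, 17.50, 887.70), (6, 14.60, 211.99),
    (7, 12.50, 148.94), (8, 11.00, 1057.91), (9, 7.00, 36.52),
    (10, 5.80, 49.63), (11, 3.40, 69.46), (12, 2.90, 186.81),
    (13, 2.40, 149.88), (14, 0.70, 241.29),
]


def pareto_frontier(rows):
    """Return rows that no cheaper row matches or beats on accuracy.

    Assumed dominance rule: walk the rows from cheapest to most
    expensive and keep a row only if it is strictly more accurate
    than every row seen so far.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; on equal cost, consider the more accurate row first.
    for rank, accuracy, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if accuracy > best_accuracy:
            frontier.append((rank, accuracy, cost))
            best_accuracy = accuracy
    return frontier


frontier = pareto_frontier(ROWS)
for rank, accuracy, cost in frontier:
    print(f"rank {rank}: {accuracy:.2f}% at ${cost:.2f}")
```

Under this rule, only the entries ranked 9, 3, and 1 lie on the frontier: every other entry costs more than some entry that is at least as accurate.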

Heatmap for AppWorld Challenge

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of runs in which a task was solved across reruns of the same agent. The "any agent" performance indicates how saturated the benchmark is and gives a sense of overall progress.
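The quantities behind such a heatmap can be sketched as follows. The run records, agent names, and task IDs are hypothetical, and treating "any agent" as the best per-task success rate over all agents is an assumption rather than the page's documented definition.

```python
from collections import defaultdict

# Hypothetical run records: (agent, task_id, solved). Not real benchmark data.
RUNS = [
    ("agent_a", "task_1", True),
    ("agent_a", "task_1", False),
    ("agent_a", "task_2", False),
    ("agent_a", "task_3", False),
    ("agent_b", "task_1", True),
    ("agent_b", "task_2", False),
    ("agent_b", "task_3", True),
]


def success_rates(runs):
    """Fraction of runs in which each (agent, task) pair was solved."""
    counts = defaultdict(lambda: [0, 0])  # (agent, task) -> [solved, total]
    for agent, task, solved in runs:
        counts[(agent, task)][0] += int(solved)
        counts[(agent, task)][1] += 1
    return {key: solved / total for key, (solved, total) in counts.items()}


def any_agent(rates):
    """Best success rate per task over all agents (assumed 'any agent' row)."""
    best = defaultdict(float)
    for (_agent, task), rate in rates.items():
        best[task] = max(best[task], rate)
    return dict(best)


rates = success_rates(RUNS)
best_per_task = any_agent(rates)
# Share of tasks solved at least once by some agent: a rough saturation measure.
saturation = sum(rate > 0 for rate in best_per_task.values()) / len(best_per_task)
print(rates)
print(best_per_task)
print(f"saturation: {saturation:.2f}")
```

Each entry of `rates` corresponds to one heatmap cell, and `best_per_task` to the "any agent" row. A task whose best rate is 0 has not been solved by any agent in any run.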

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Failure Categories

Distribution of Failures

Additional Resources

Getting Started

Want to evaluate your agent on AppWorld Challenge? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of AppWorld Challenge tasks and their requirements:

View Tasks