AppWorld Normal

AppWorld is a benchmark of complex, day-to-day tasks for autonomous agents that require interactive coding and API calls. Tasks are built on top of the AppWorld environment, which consists of 9 day-to-day apps, operable via 457 APIs and populated with the digital activities of ~100 people living in a simulated world. There are two test sets, Normal and Challenge, which differ in difficulty.

Paper: AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (Trivedi et al., 2024)

Tasks: 168
Agents Evaluated: 14

Key Features of AppWorld

Interactive Environment

Agents interact with a controlled world of apps and people, simulating real-world development scenarios.

Two Task Categories

Normal tasks represent typical scenarios, while the more difficult Challenge tasks push the boundaries of current agent capabilities.

Comprehensive Testing

Each task evaluates multiple aspects: code correctness, contextual understanding, and interaction capabilities.

AppWorld Normal Leaderboard

Rank  Model                                    Accuracy  Scenario Goal Completion  Cost (USD)  Runs
1     gpt-4o-2024-05-13                        48.80%    32.10%                    $121.69     1
2     gpt-4o-2024-05-13                        44.60%    23.20%                    $225.06     1
3     gpt-4o-2024-05-13                        33.90%    26.80%                    $19.78      1
4     gpt-4-turbo-2024-04-09                   32.70%    16.10%                    $291.37     1
5     gpt-4o-2024-05-13                        32.10%    16.10%                    $51.28      1
6     gpt-4-turbo-2024-04-09                   30.40%    21.40%                    $75.27      1
7     gpt-4-turbo-2024-04-09                   26.80%    12.50%                    $242.14     1
8     gpt-4-turbo-2024-04-09                   25.60%    19.60%                    $35.41      1
9     meta-llama/Llama-3-70b-chat-hf           24.40%    17.90%                    $12.47      1
10    meta-llama/Llama-3-70b-chat-hf           20.80%    8.90%                     $18.23      1
11    deepseek-ai/deepseek-coder-33b-instruct  13.10%    8.90%                     $14.34      1
12    meta-llama/Llama-3-70b-chat-hf           8.90%     1.80%                     $38.42      1
13    deepseek-ai/deepseek-coder-33b-instruct  7.10%     1.80%                     $62.95      1
14    deepseek-ai/deepseek-coder-33b-instruct  1.80%     0.00%                     $122.15     1

Accuracy vs. Cost Frontier for AppWorld Normal

This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
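To make the idea of the frontier concrete, here is a minimal Python sketch that picks out the non-dominated entries from the accuracy and cost figures listed in the leaderboard. It assumes the simplest definition of a Pareto frontier (an entry is on the frontier if no other entry is both cheaper and more accurate); the plot on this page may use a different construction, such as a convex hull. The names `Entry` and `pareto_frontier` are illustrative and not part of any AppWorld tooling.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int        # leaderboard rank
    accuracy: float  # accuracy in percent
    cost: float      # total cost in USD


# (rank, accuracy %, cost USD) copied from the leaderboard on this page.
ENTRIES = [
    Entry(1, 48.8, 121.69), Entry(2, 44.6, 225.06), Entry(3, 33.9, 19.78),
    Entry(4, 32.7, 291.37), Entry(5, 32.1, 51.28), Entry(6, 30.4, 75.27),
    Entry(7, 26.8, 242.14), Entry(8, 25.6, 35.41), Entry(9, 24.4, 12.47),
    Entry(10, 20.8, 18.23), Entry(11, 13.1, 14.34), Entry(12, 8.9, 38.42),
    Entry(13, 7.1, 62.95), Entry(14, 1.8, 122.15),
]


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Return entries not dominated by a cheaper, more accurate entry.

    Assumed definition: sweep from cheapest to most expensive and keep an
    entry only if it improves on the best accuracy seen so far.
    """
    frontier: list[Entry] = []
    best_accuracy = float("-inf")
    for entry in sorted(entries, key=lambda e: (e.cost, -e.accuracy)):
        if entry.accuracy > best_accuracy:
            frontier.append(entry)
            best_accuracy = entry.accuracy
    return frontier


if __name__ == "__main__":
    for entry in pareto_frontier(ENTRIES):
        print(f"rank {entry.rank}: {entry.accuracy:.1f}% at ${entry.cost:.2f}")
```

Under this definition, the entries ranked 9, 3, and 1 form the frontier: each one is more accurate than every cheaper entry.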

Heatmap for AppWorld Normal

The heatmap visualizes success rates across tasks and agents. The color scale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates how saturated the benchmark is and gives a sense of overall progress.
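The aggregation behind such a heatmap can be sketched in a few lines of Python. The example below uses made-up agents, tasks, and outcomes purely for illustration, and it assumes that the "any agent" row takes the best solve fraction achieved by any agent on each task; the exact rule used for the heatmap on this page is not stated here. The function names are hypothetical.

```python
# Hypothetical input: for each agent, a list of runs; each run maps a task
# ID to whether the task was solved in that run. Not real AppWorld results.
RUNS = {
    "agent_a": [
        {"task_1": True, "task_2": False, "task_3": False},
        {"task_1": True, "task_2": True, "task_3": False},
    ],
    "agent_b": [
        {"task_1": False, "task_2": True, "task_3": False},
    ],
}


def solve_fractions(runs: dict[str, list[dict[str, bool]]]) -> dict[str, dict[str, float]]:
    """Per agent and task, the fraction of that agent's runs that solved the task.

    Assumes every agent has at least one run; a task missing from a run
    counts as unsolved.
    """
    fractions: dict[str, dict[str, float]] = {}
    for agent, agent_runs in runs.items():
        tasks = sorted({task for run in agent_runs for task in run})
        fractions[agent] = {
            task: sum(run.get(task, False) for run in agent_runs) / len(agent_runs)
            for task in tasks
        }
    return fractions


def any_agent_row(fractions: dict[str, dict[str, float]]) -> dict[str, float]:
    """Assumed 'any agent' rule: best solve fraction across agents per task."""
    tasks = sorted({task for row in fractions.values() for task in row})
    return {task: max(row.get(task, 0.0) for row in fractions.values()) for task in tasks}


if __name__ == "__main__":
    per_agent = solve_fractions(RUNS)
    combined = any_agent_row(per_agent)
    solved_by_someone = sum(1 for value in combined.values() if value > 0)
    print(per_agent)
    print(combined)
    print(f"{solved_by_someone} of {len(combined)} tasks solved by at least one agent")
```

A task whose "any agent" value stays at zero has not been solved by any evaluated agent in any run, which is what makes this row a rough indicator of how much headroom the benchmark still has.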

Failure Analysis (Experimental)

Select an agent to see a detailed breakdown of its failure categories and their descriptions. This analysis helps identify common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.

Failure Categories

Distribution of Failures

Additional Resources

Getting Started

Want to evaluate your agent on AppWorld? Follow our comprehensive guide to get started:

View Documentation

Task Details

Browse the complete list of AppWorld tasks, including both the Normal and Challenge categories:

View Tasks