AppWorld Normal
AppWorld is a set of complex day-to-day autonomous agent tasks that require interactive coding and API calls. Tasks are built on top of the AppWorld environment, which consists of 9 day-to-day apps, operable via 457 APIs and populated with the digital activities of ~100 people living in a simulated world. There are two test sets, Normal and Challenge, which differ in difficulty.
Key Features of AppWorld
Interactive Environment
Agents interact with a controlled world of apps and people, simulating real-world development scenarios.
Two Task Categories
Normal tasks represent typical development scenarios, while Challenge tasks push the boundaries of AI capabilities.
Comprehensive Testing
Each task evaluates multiple aspects: code correctness, contextual understanding, and interaction capabilities.
AppWorld Normal Leaderboard
Column definitions:
Verified: Results have been reproduced by the HAL team.
Accuracy: Task Goal Completion. Confidence intervals show the min-max values across runs for agents with multiple runs.
Scenario Goal Completion: The percentage of task scenarios for which the agent passed all evaluation tests for all tasks from that scenario.
Cost (USD): Total API cost for running the agent on all tasks. Confidence intervals show the min-max values across runs for agents with multiple runs.
Runs: The number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.

Rank | Agent | Models | Verified | Accuracy | Scenario Goal Completion | Cost (USD) | Runs | Traces |
---|---|---|---|---|---|---|---|---|
1 | | gpt-4o-2024-05-13 | ✓ | 48.80% | 32.10% | $121.69 | 1 | Download |
2 | | gpt-4o-2024-05-13 | ✓ | 44.60% | 23.20% | $225.06 | 1 | Download |
3 | | gpt-4o-2024-05-13 | ✓ | 33.90% | 26.80% | $19.78 | 1 | Download |
4 | | gpt-4-turbo-2024-04-09 | ✓ | 32.70% | 16.10% | $291.37 | 1 | Download |
5 | | gpt-4o-2024-05-13 | ✓ | 32.10% | 16.10% | $51.28 | 1 | Download |
6 | | gpt-4-turbo-2024-04-09 | ✓ | 30.40% | 21.40% | $75.27 | 1 | Download |
7 | | gpt-4-turbo-2024-04-09 | ✓ | 26.80% | 12.50% | $242.14 | 1 | Download |
8 | | gpt-4-turbo-2024-04-09 | ✓ | 25.60% | 19.60% | $35.41 | 1 | Download |
9 | | meta-llama/Llama-3-70b-chat-hf | ✓ | 24.40% | 17.90% | $12.47 | 1 | Download |
10 | | meta-llama/Llama-3-70b-chat-hf | ✓ | 20.80% | 8.90% | $18.23 | 1 | Download |
11 | | deepseek-ai/deepseek-coder-33b-instruct | ✓ | 13.10% | 8.90% | $14.34 | 1 | Download |
12 | | meta-llama/Llama-3-70b-chat-hf | ✓ | 8.90% | 1.80% | $38.42 | 1 | Download |
13 | | deepseek-ai/deepseek-coder-33b-instruct | ✓ | 7.10% | 1.80% | $62.95 | 1 | Download |
14 | | deepseek-ai/deepseek-coder-33b-instruct | ✓ | 1.80% | 0.00% | $122.15 | 1 | Download |
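The Accuracy (Task Goal Completion) and Scenario Goal Completion columns measure success at two granularities: per task and per scenario. The following is a minimal sketch of how the two percentages could be computed from per-task pass/fail results. The record layout (`scenario_id`, `passed`), the function names, and the sample data are assumptions for illustration, not the HAL or AppWorld implementation.

```python
from collections import defaultdict


def task_goal_completion(results):
    """Percentage of tasks whose evaluation tests all passed."""
    if not results:
        return 0.0
    return 100.0 * sum(r["passed"] for r in results) / len(results)


def scenario_goal_completion(results):
    """Percentage of scenarios in which every task passed."""
    by_scenario = defaultdict(list)
    for r in results:
        by_scenario[r["scenario_id"]].append(r["passed"])
    if not by_scenario:
        return 0.0
    solved = sum(all(flags) for flags in by_scenario.values())
    return 100.0 * solved / len(by_scenario)


# Hypothetical results: two scenarios with three tasks each.
results = [
    {"scenario_id": "s1", "passed": True},
    {"scenario_id": "s1", "passed": True},
    {"scenario_id": "s1", "passed": True},
    {"scenario_id": "s2", "passed": True},
    {"scenario_id": "s2", "passed": False},
    {"scenario_id": "s2", "passed": True},
]

tgc = task_goal_completion(results)      # 5 of 6 tasks passed
sgc = scenario_goal_completion(results)  # 1 of 2 scenarios fully passed
print(f"TGC = {tgc:.1f}%, SGC = {sgc:.1f}%")
```

Because a single failed task fails its whole scenario, Scenario Goal Completion can never exceed Task Goal Completion when every scenario has the same number of tasks, which matches the pattern in the leaderboard rows.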
Accuracy vs. Cost Frontier for AppWorld Normal
This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
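The page does not state how the dashed line is computed. As a rough sketch, assuming the standard non-dominated definition (a run is on the frontier if no other run is both cheaper and more accurate), the frontier can be recovered from the cost and accuracy values in the leaderboard. The function name `pareto_frontier` is hypothetical.

```python
def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by any other point."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost ascending; break cost ties by higher accuracy first.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        # A point is kept only if it beats every cheaper point on accuracy.
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# (cost in USD, accuracy in %) taken from the leaderboard rows.
leaderboard = [
    (121.69, 48.80), (225.06, 44.60), (19.78, 33.90), (291.37, 32.70),
    (51.28, 32.10), (75.27, 30.40), (242.14, 26.80), (35.41, 25.60),
    (12.47, 24.40), (18.23, 20.80), (14.34, 13.10), (38.42, 8.90),
    (62.95, 7.10), (122.15, 1.80),
]

frontier = pareto_frontier(leaderboard)
for cost, accuracy in frontier:
    print(f"${cost:.2f} -> {accuracy:.2f}%")
```

Under this assumption, three of the fourteen runs lie on the frontier: the $12.47, $19.78, and $121.69 runs.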
Heatmap for AppWorld Normal
The heatmap visualizes success rates across tasks and agents. Colorscale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
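The quantities behind such a heatmap can be sketched as follows. This is an illustration only: the input layout (one `task_id -> bool` mapping per rerun), the function names, and the reading of "any agent" as "solved by at least one agent in at least one run" are assumptions, not the HAL implementation.

```python
def solve_fractions(runs):
    """Fraction of reruns in which each task was solved.

    runs: list of dicts mapping task_id -> bool, one dict per rerun.
    """
    tasks = sorted({task for run in runs for task in run})
    return {
        task: sum(run.get(task, False) for run in runs) / len(runs)
        for task in tasks
    }


def any_agent(fractions_by_agent):
    """Mark a task as solved if any agent solved it in at least one run."""
    tasks = sorted({t for f in fractions_by_agent.values() for t in f})
    return {
        task: any(f.get(task, 0.0) > 0 for f in fractions_by_agent.values())
        for task in tasks
    }


# Hypothetical data: agent_a was run twice, agent_b once.
fractions_by_agent = {
    "agent_a": solve_fractions([
        {"t1": True, "t2": False, "t3": False},
        {"t1": True, "t2": True, "t3": False},
    ]),
    "agent_b": solve_fractions([
        {"t1": False, "t2": False, "t3": False},
    ]),
}

solved_by_any = any_agent(fractions_by_agent)
saturation = sum(solved_by_any.values()) / len(solved_by_any)
print(fractions_by_agent["agent_a"], solved_by_any, saturation)
```

A saturation value near 1.0 would mean almost every task is solved by some agent, leaving the benchmark little room to separate stronger agents.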
Failure Analysis (Experimental)
This section gives a detailed breakdown of failure categories and their descriptions for a selected agent. The analysis helps identify common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.
Failure Categories
Distribution of Failures
Additional Resources
Getting Started
Want to evaluate your agent on AppWorld? Follow our comprehensive guide to get started:
View Documentation
Task Details
Browse the complete list of AppWorld tasks, including both normal and challenge categories:
View Tasks