AppWorld Challenge
AppWorld is a set of complex day-to-day autonomous agent tasks that require interactive coding and API calls. Tasks are built on top of the AppWorld environment, which consists of 9 day-to-day apps, operable via 457 APIs and populated with the digital activities of ~100 people living in a simulated world. There are two test sets, Normal and Challenge, which differ in difficulty.
Key Features of AppWorld
Complex Tasks
Challenges range from simple utilities to full-stack applications with complex business logic.
Multi-step Planning
Tests agents' ability to break down complex tasks and execute them in a logical sequence.
Best Practices
Evaluates adherence to software development best practices, code quality, and documentation.
AppWorld Challenge Leaderboard
Column definitions:
Verified: results have been reproduced by the HAL team.
Accuracy: refers to Task Goal Completion. Confidence intervals show the min-max values across runs for those agents where multiple runs are available.
Scenario Goal Completion: the percentage of task scenarios for which the agent passed all evaluation tests for all tasks from that scenario.
Cost (USD): total API cost for running the agent on all tasks. Confidence intervals show the min-max values across runs for those agents where multiple runs are available.
Runs: the number of runs for this agent submitted to the leaderboard. To submit multiple evaluations, rerun the same agent and set the same agent name.

Rank | Agent | Models | Verified | Accuracy | Scenario Goal Completion | Cost (USD) | Runs | Traces |
---|---|---|---|---|---|---|---|---|
1 | | gpt-4o-2024-05-13 | ✓ | 30.20% | 13.00% | $451.24 | 1 | Download |
2 | | gpt-4o-2024-05-13 | ✓ | 19.70% | 7.90% | $766.62 | 1 | Download |
3 | | gpt-4o-2024-05-13 | ✓ | 19.20% | 12.20% | $72.28 | 1 | Download |
4 | | gpt-4o-2024-05-13 | ✓ | 18.00% | 10.10% | $131.83 | 1 | Download |
5 | | gpt-4-turbo-2024-04-09 | ✓ | 17.50% | 5.80% | $887.70 | 1 | Download |
6 | | gpt-4-turbo-2024-04-09 | ✓ | 14.60% | 9.30% | $211.99 | 1 | Download |
7 | | gpt-4-turbo-2024-04-09 | ✓ | 12.50% | 7.20% | $148.94 | 1 | Download |
8 | | gpt-4-turbo-2024-04-09 | ✓ | 11.00% | 3.60% | $1057.91 | 1 | Download |
9 | | meta-llama/Llama-3-70b-chat-hf | ✓ | 7.00% | 4.30% | $36.52 | 1 | Download |
10 | | deepseek-ai/deepseek-coder-33b-instruct | ✓ | 5.80% | 2.90% | $49.63 | 1 | Download |
11 | | meta-llama/Llama-3-70b-chat-hf | ✓ | 3.40% | 0.00% | $69.46 | 1 | Download |
12 | | deepseek-ai/deepseek-coder-33b-instruct | ✓ | 2.90% | 0.70% | $186.81 | 1 | Download |
13 | | meta-llama/Llama-3-70b-chat-hf | ✓ | 2.40% | 0.70% | $149.88 | 1 | Download |
14 | | deepseek-ai/deepseek-coder-33b-instruct | ✓ | 0.70% | 0.00% | $241.29 | 1 | Download |
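The two completion metrics in the leaderboard can differ substantially, because a scenario only counts when every one of its tasks passes. The sketch below illustrates that relationship on made-up data. It assumes that Task Goal Completion is the percentage of individual tasks whose evaluation tests all pass; the function name `goal_completion`, the tuple layout, and the scenario and task identifiers are hypothetical and are not taken from the AppWorld or HAL code.

```python
from collections import defaultdict


def goal_completion(task_results):
    """Compute (task_goal_completion, scenario_goal_completion) in percent.

    task_results: list of (scenario_id, task_id, passed_all_tests) tuples.
    This layout is an assumption made for illustration only.
    """
    if not task_results:
        raise ValueError("task_results must not be empty")

    # Group the per-task pass/fail flags by the scenario they belong to.
    by_scenario = defaultdict(list)
    for scenario_id, _task_id, passed in task_results:
        by_scenario[scenario_id].append(passed)

    # Task level: share of tasks that passed all of their tests.
    tgc = 100.0 * sum(passed for _, _, passed in task_results) / len(task_results)
    # Scenario level: share of scenarios in which every task passed.
    sgc = 100.0 * sum(all(flags) for flags in by_scenario.values()) / len(by_scenario)
    return tgc, sgc


# Hypothetical results: two scenarios with three tasks each.
results = [
    ("s1", "s1_1", True), ("s1", "s1_2", True), ("s1", "s1_3", True),
    ("s2", "s2_1", True), ("s2", "s2_2", False), ("s2", "s2_3", True),
]
tgc, sgc = goal_completion(results)
print(f"TGC = {tgc:.1f}%, SGC = {sgc:.1f}%")
```

In this toy example a single failed task leaves the task-level score high while halving the scenario-level score, which is consistent with Scenario Goal Completion never exceeding Accuracy in the table.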
Accuracy vs. Cost Frontier for AppWorld Challenge
This plot shows the relationship between an agent's performance and its token cost. The Pareto frontier (dashed line) represents the current state-of-the-art trade-off. The error bars indicate min-max values across runs.
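As a rough illustration of what a frontier like this contains, the sketch below computes the set of non-dominated entries from the Cost (USD) and Accuracy columns of the leaderboard table. It is one common way to compute a Pareto frontier, not necessarily the method used to draw the plot; the function name `pareto_frontier` is hypothetical, and entries are labelled by leaderboard rank.

```python
def pareto_frontier(points):
    """Return the entries that no cheaper-or-equal entry beats on accuracy.

    points: list of (label, cost_usd, accuracy_pct) tuples.
    The result is sorted by increasing cost.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost; on equal cost, consider the more accurate entry first.
    for label, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append((label, cost, accuracy))
            best_accuracy = accuracy
    return frontier


# (rank label, cost in USD, accuracy in %) copied from the leaderboard table.
leaderboard = [
    ("rank 1", 451.24, 30.20), ("rank 2", 766.62, 19.70),
    ("rank 3", 72.28, 19.20), ("rank 4", 131.83, 18.00),
    ("rank 5", 887.70, 17.50), ("rank 6", 211.99, 14.60),
    ("rank 7", 148.94, 12.50), ("rank 8", 1057.91, 11.00),
    ("rank 9", 36.52, 7.00), ("rank 10", 49.63, 5.80),
    ("rank 11", 69.46, 3.40), ("rank 12", 186.81, 2.90),
    ("rank 13", 149.88, 2.40), ("rank 14", 241.29, 0.70),
]
frontier = pareto_frontier(leaderboard)
for label, cost, accuracy in frontier:
    print(f"{label}: ${cost:.2f}, {accuracy:.2f}%")
```

Under this definition, an entry is on the frontier only if every cheaper entry is less accurate, so most of the table falls below the dashed line.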
Heatmap for AppWorld Challenge
The heatmap visualizes success rates across tasks and agents. Colorscale shows the fraction of times a task was solved across reruns of the same agent. The "any agent" performance indicates the level of saturation of the benchmark and gives a sense of overall progress.
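The quantities behind such a heatmap can be sketched as follows. The example computes, for each agent, the fraction of reruns in which each task was solved, and then an "any agent" row. It assumes that "any agent" means a task was solved by at least one agent in at least one run; that reading, the function names, and the agent and task identifiers are all hypothetical.

```python
def solve_fractions(runs_by_agent):
    """Map each agent to {task_id: fraction of that agent's runs solving it}.

    runs_by_agent: {agent_name: [ {task_id: solved_bool, ...}, ... ]}
    Every run of every agent is assumed to cover the same task ids.
    """
    fractions = {}
    for agent, runs in runs_by_agent.items():
        task_ids = runs[0].keys()
        fractions[agent] = {
            task: sum(run[task] for run in runs) / len(runs) for task in task_ids
        }
    return fractions


def any_agent_row(fractions):
    """Mark a task as solved if any agent solved it in any run (assumed meaning)."""
    task_ids = next(iter(fractions.values())).keys()
    return {
        task: any(per_task[task] > 0 for per_task in fractions.values())
        for task in task_ids
    }


# Hypothetical data: agent_a was run twice, agent_b once.
runs = {
    "agent_a": [
        {"t1": True, "t2": False, "t3": False},
        {"t1": True, "t2": True, "t3": False},
    ],
    "agent_b": [{"t1": False, "t2": False, "t3": False}],
}
fractions = solve_fractions(runs)
solved_by_any = any_agent_row(fractions)
saturation = sum(solved_by_any.values()) / len(solved_by_any)
print(fractions["agent_a"], solved_by_any, round(saturation, 3))
```

The share of tasks marked as solved in the "any agent" row gives a single number for how close the benchmark is to saturation.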
Failure Analysis (Experimental)
Select an agent to see a detailed breakdown of failure categories and their descriptions. This analysis helps understand common failure patterns and areas for improvement. Failure reports are usually available for the top 2 agents.
Additional Resources
Getting Started
Want to evaluate your agent on AppWorld Challenge? Follow our comprehensive guide to get started:
View Documentation
Task Details
Browse the complete list of AppWorld Challenge tasks and their requirements:
View Tasks