Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask -- what if we were to take tasks traditionally tackled by browsing agents, and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a hybrid agent that can interact with online data through both web browsing and APIs. In experiments on WebArena , a widely-used and realistic benchmark for web navigation tasks, we find that API-based agents outperform web browsing agents, as depicted in Figure 1. Hybrid Agents out-perform both others nearly uniformly across tasks, resulting in a more than 24.0% absolute improvement over web browsing alone, achieving a success rate of 38.9%, the SOTA performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.

Background: Web Browsing

The Web Browsing Task

we focus on WebArena tasks, which simulate real-world scenarios to evaluate an agent's ability to complete diverse web-based activities. Tasks in WebArena include interacting with platforms like Gitlab (to manage projects and repositories), Reddit (to browse and post content), e-commerce websites (for shopping), and mapping services (for trip planning).

example — **Figure 2:** The API-based agent can often solve problems in many fewer function calls than traditional browsing agents. In this task, web browsing failed to solve the intent "find the number of commits the user `SaptakS` made to the repo `a11yproject`" after 15 steps, while our API-based agent successfully completed the task with only three lines of code.

Web Browsing Agent

A baseline web browsing agent leverages the accessibility tree of web pages, which organizes interactive elements like buttons, input fields, and links in a hierarchical structure. This structure makes it easier for agents to navigate the web by simulating human-like browsing behaviors such as clicking, filling out forms, and moving between pages. The agent maintains a comprehensive history of all its previous actions, allowing it to inform its future decisions based on past interactions. However, due to the complexity of some web elements and their dynamic nature, the browsing agent struggles with tasks requiring numerous or intricate interactions. For example, in the task in Figure 2, the agent needs to determine the number of commits made by a specific user to a project. A traditional browsing-based approach involves logging in, navigating to the correct project, and attempting to scroll through and find the user's commits, and thus the task becomes too complex and fails after the agent's 15-step limit.

From Web Browsing to API Calling

In contrast to web browsing, API calling offer a direct interface for machines to communicate with web services, reducing operational complexity.

APIs and API Documentation

Websites with API support offer pre-defined endpoints for efficient task execution using standardized protocols like REST. These APIs enable interactions via HTTP requests (e.g., GET, POST, PUT) and return structured data such as JSON objects. API documentation is typically provided in formats like README, OpenAPI YAML, or plain text, offering guidance on using the APIs. For instance, Figure 3 shows the official README documentation of a Gitlab API GET /api/{id}/commits. It documents the functionality, input arguments, and output types of the API. For example, one could use the Python requests library, by calling requests.get("gitlab.com/api/a11yproject/commits"), to retrieve all commits of the repository a11yproject. This would return a JSON list containing all the commits to this repo, as shown in Figure 3.

Transitioning from Web Browsing to API Calling

Approaches utilizing APIs is avoid complexity in web browsing, and can minimize the steps involved in complex workflows. For example, as discussed in Figure 2, to accomplish the task of finding the number of commits by SaptakS in a11yproject, a basic method for implementing API calling, such as CodeAct , could directly call the repository's commit history via a simple API request GET /api/a11yproject/commits and then filter the results to find the commits made by SaptakS. The task could completed in just three lines of Python code, showcasing the clear efficiency and precision that methods based on API calling could offer compared to web browsing.

Obtaining APIs for Agent Use

The process of acquiring APIs typically involves looking up official API documentation on a website. To maintain a structured and readable format of API documentations, we could utilize either OpenAPI YAML-style and README-style documentations. Some websites might offer official YAML oor README documentations of APIs, and in such cases, in the prompt we provide API documentations sourced directly from the public API documentation provided for the website. In other cases, where no official YAML or README documentations are provided, we could leverage LLMs like GPT-4o to generate these YAML or README files. By prompting GPT-4o with the relevant implementation details of the APIs (for example, the implementation of the APIs), we could generate comprehensive documentation, including input parameters, expected outputs, and example API calls, as depicted in Figure 3.

One-Stage Documentation for Small API Sets: For websites with a smaller number of API endpoints, we directly incorporate the full documentation into the prompt provided to the agent. Specifically, we use a threshold of 100 APIs, but this could be adjusted depending on the supported language model context size.

Two-Stage Documentation Retrieval for Large API Sets: For websites with a larger number of endpoints, providing the full documentation directly within the prompt was impractical due to the size limitations of agent inputs. To address this, we employed a two-stage documentation retrieval process, which allowed access to only the relevant information as needed, keeping the initial prompt concise. In the first stage, the user prompt could provide a description of the task, with a list of all available API endpoints along with a very brief description of each API. This initial summary helps facilitating understanding the scope of the available APIs while staying within the prompt size constraints. In the second stage, if the model determines that it needs detailed information about a specific API endpoint or some API endpoints, it can use a tool called get_api_documentation. This tool searches a dictionary that maps an API to its documentation, enabling the model to retrieve the full README or YAML documentation for any given endpoint by calling get_api_documentation with the endpoint's identifier. This might include the input parameters, output formats, and examples of how to interact with the endpoint. For example, to retrieve the documentation for the endpoint GET /api/{id}/commits, the agent would call get_api_documentation("GET /api/{id}/commits"), and an example returned API documentation is the documentation in Figure 3.

Experimental Setup

Websites	Gitlab	Map	Shopping	Admin	Reddit
Number of Endpoints	988	53	556	556	31
Doc Quality	Good	Good	Fair	Fair	Poor

Table 1: Number of endpoints, quality of API, and documentation quality for WebArena websites.

Here, we provide a detailed analysis of the API support for various websites used in the WebArena tasks, categorized into three levels: good, fair, and poor. The availability, functionality, and documentation of APIs, as described in Table 1, play a crucial role in the efficiency and flexibility of our agents.

Good API support: Gitlab and Map demonstrates good API support, since they provide significant coverage for the tasks in WebArena and maintain good API documentation.

Fair API support: The Shopping and Shopping Admin websites provide a reasonable level of support for common shopping tasks such as managing products, categories, and customer accounts, while some features are absent, such as the ability to add items to a wish list. The API documentation is fairly detailed and covers most core functionalities.

Poort API support: The Reddit tasks in WebArena are based on a self-hosted limited clone of the Reddit website, with limited functionalities as compared to the official site. It offers no API documentation, making it the least API-friendly website in the benchmark.

Evaluation Framework

We employed OpenHands as our primary evaluation framework for developing and testing our agents, as it is an open-source platform designed for creating AI agents that interact with both software and web environments . OpenHands supports a variety of agent interfaces, allowing agents to execute actions while maintaining a detailed record of past steps. For coding tasks, it incorporates CodeAct , providing a sandboxed bash environment and Jupyter IPython capabilities. Additionally, OpenHands includes a web navigation agent using BrowserGym, which operates within a Chromium web browser powered by Playwright . However, a limitation is that existing agents within OpenHands are specialized either in web browsing or coding/API-based interactions, but not both. To address this, we developed two new agents: an API-based agent that leverages the CodeAct framework and specialized instructions for API usage, and a hybrid agent that integrates web browsing functionalities with API calls. The hybrid agent dynamically selects between browsing actions and API calls based on task requirements, leveraging the current browser state, prior actions, and action outcomes. All agents use GPT-4o as the base large language model.

Results

Agents	Gitlab	Map	Shopping	Admin	Reddit	Multi	AVG.
WebArena Base	15.0	15.6	13.9	10.4	6.6	8.3	12.3
AutoEval	25.0	27.5	39.6	20.9	20.8	16.7	26.9
AWM	35.0	42.2	32.1	29.1	54.7	18.8	35.5
SteP	32.2	31.2	50.8	23.6	57.5	10.4	36.5
Browsing Agent	12.8	20.2	10.2	22.0	10.4	10.4	14.8
API-Based Agent	43.9	45.4	25.1	20.3	18.9	8.3	29.2
Hybrid Agent	44.4	45.9	25.7	41.2	51.9	16.7	38.9

Table 2: Performance of Agents across WebArena Websites. Note that SteP uses prompts inspired specifically by WebArena test set tasks, while other methods are task-agnostic. We achieve the highest performance among the task-agnostic agents.

The main results, shown in Table 2, highlight the performance of agents across websites in the WebArena benchmark. The API-Based Agent performed consistently well, achieving higher scores than the Browsing Agent due to its optimized API-calling capabilities, allowing it to efficiently complete tasks with minimal web browsing. The Browsing Agent, focused solely on navigating web interfaces, struggled in most domains. The Hybrid Agent, which integrates both API calling and web browsing, outperformed the API-Based and the Browsing agents in all categories. Its ability to switch dynamically between API and web interactions provided flexibility, enabling higher performance even in cases where APIs were incomplete or unavailable. By combining both approaches, the hybrid agent ensures success in situations where API-only methods fall short, using web browsing as a necessary backup. Therefore, the hybrid agent achieves sota performance among agents that are task-agnostic.

Does API Quality Matter?

The short answer is yes, API quality does significantly impact the performance of the API-based agent. High-quality APIs, with comprehensive, well-documented endpoints, enable efficient and accurate task execution. This allows the API-based agent to rely on fewer web browsing actions, while hybrid agents can switch more flexibly between methods. Websites like Gitlab and Map, which offer robust APIs, result in the highest task completion accuracy. Conversely, low-quality APIs hinder agent performance, forcing hybrid agents to rely more on web browsing. For instance, on Reddit, the initial API-based agent struggled due to limited API availability. After manually introducing additional APIs, performance improved from 9.43% to 18.9%, highlighting the importance of thorough API support.

Error Analysis

We randomly sampled 100 tasks from WebArena and performed error analysis on the API-Based Agent. Figure 4 shows the distribution of error categories among these 100 tasks. We found that 33% of the tasks are correctly performed with only API calling, 50% are unsolvable with solely APIs, 6% are incorrect due to incorrect task understanding, and 11% are incorrect due to error in calling APIs such as mal-formatting and wrong input. In other words, among the 50 API solvable tasks, 66% are performed correctly by the API-Based Agent. This showcases the strong capability of the API-Based Agent when given sufficient APIs to solve the task. Additionally, the average API calls required to solve API solvable tasks are 2.1 API calls, demonstrating how API calling could reduce operational complexity for web tasks. Although the API-Based Agent took an average of 7.8 steps to complete WebArena tasks, most of the steps were taken to retrieve API documentation, resolve errors from it's previous generations, or verify it's outputs.

case1 — **Figure 5:** Case 1: the hybrid agent succeeds while the browsing agent and API-based agent both fail

Case Study 1

One example where the hybrid agent succeeded, while both the API-based and browsing agents failed, involved a task from the Shopping Admin domain. The query was to "delete all negative reviews for Sybil running short," a product listed in the shopping admin interface. In this instance, the API-based agent failed because no relevant API endpoints were available for retrieving or deleting reviews. Similarly, the browsing agent failed, as completing this task purely through web navigation required too many steps, as depicted in Figure 5. This complexity made the task challenging for an agent relying solely on web interactions. However, the hybrid agent successfully completed the task by leveraging both API and browsing functionalities. An example trace of the hybrid agent shown in Figure 5. This case highlights the hybrid agent's ability to efficiently combine API calls with web interactions, allowing it to tackle complex multi-step tasks that would be difficult or impossible for solely browsing or solely API-based agents.

case2 — **Figure 6:** Case 2: the API-based agent succeeds while the browsing agent and the hybrid agent fails.

Case Study 2

Conversely, there are instances where the API-based agent outperforms the hybrid agent. One such case occurred in the GitLab website, as shown in Figure 6, where the task was to "tell me the email address of the contributor who has the most commits to ai". The API-based agent successfully completed this task by utilizing the GET /api/{id}/contributors API endpoint to retrieve the contributor with the highest number of commits and their associated email address. On the other hand, the hybrid agent attempted to solve the task through browsing but encountered significant challenges. Accessing this information through web browsing required navigating GitLab's interface, locating the correct repository and branch, and identifying the top contributor manually, a task that might be too difficult to perform through web navigation alone. As a result, both the browsing agent and the hybrid agent failed to complete the task. This case demonstrates an example where API access provides a more straightforward solution than browsing in contexts requiring structured data retrieval.

Beyond Browsing:

API-Based Web Agents