Lokesh Kannan
·Co-Founder @ AI Brains

AI Agents with Browser [WebArena]

Most of us have seen OpenAI's Operator demo, and some of us have even tried to understand how it works. Any material you read about AI agents will invariably point to their performance in environments like OSWorld, WebArena, or WebVoyager.

Pic source : OpenAI's website, Computer-Using Agent.

I was curious about these environments, their purpose and how they function, and found the whole concept, design, and workings of these systems fascinating. They give a window into the different methods used today to evaluate agent performance. In this three-part series, I explore the inner workings of these environments, starting with WebArena.

WebArena 🌐

Agents are usually built and tested in synthetic environments, i.e., made-up environments. Agents trained in synthetic environments are disconnected from real-world situations: the simple nature of these environments lacks task diversity, making the agents unreliable at performing real tasks.

Creating the Environment :

WebArena is an environment built for language-guided agents to perform tasks in a highly realistic and reproducible way. WebArena differs from synthetic environments because it is a faithful replica of fully functional websites drawn from four common domains with massive daily human interaction: e-commerce (e.g., Amazon, eBay), social-media forum discussions (e.g., Reddit, StackExchange), collaborative software development (e.g., GitLab), and content management systems for digital content (e.g., WordPress CMS).

WebArena is also equipped with "external tools" such as a scratchpad, a calculator, maps, and access to user manuals and knowledge bases (like a wiki) to encourage human-like task solving and research. WebArena benchmarks the functional correctness of an agent in performing a specific task.

At a very high level, a model or LLM navigating WebArena experiences four components: a state space S, an action space A, an observation space O, and a transition function T.

WebArena E = ⟨S, A, O, T⟩, where T : S × A → S
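To make the tuple concrete, here is a minimal, hypothetical sketch of an environment loop built around S, A, O, and T. The class and method names are illustrative and do not reflect the actual WebArena codebase.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    url: str                 # URL of the focused tab
    accessibility_tree: str  # text rendering the agent reads

class WebEnvironment:
    """Toy environment: E = <S, A, O, T> (names are assumptions)."""

    def __init__(self, start_url: str):
        # An element of the state space S: current page plus history
        self.state = {"url": start_url, "history": []}

    def observe(self) -> Observation:
        # O: the agent never sees the full state, only the focused page
        return Observation(self.state["url"], f"page at {self.state['url']}")

    def step(self, action: str) -> Observation:
        # T: S x A -> S, a deterministic transition function
        if action.startswith("goto "):
            self.state["history"].append(self.state["url"])
            self.state["url"] = action.split(" ", 1)[1]
        elif action == "go_back" and self.state["history"]:
            self.state["url"] = self.state["history"].pop()
        return self.observe()

env = WebEnvironment("http://gitlab.example/")
obs = env.step("goto http://gitlab.example/issues")
print(obs.url)  # http://gitlab.example/issues
```

The agent only ever receives observations, never the raw state, which is what makes the setup partially observable.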

State Space :

The state space is a representation of all possible states the agent can experience while solving a problem. In this context, the websites hosted in WebArena, with all their webpages and possible interactions, become the different states. For a better understanding of how much data each website replica holds, refer to the implementation methodology below. I've highlighted the important information if you are the curious type (skip the image below if not required).

Website implementation methodology. Source : WebArena paper

Observation Space : The observation space is what the agent sees when looking at the current browser session. The agent here is the LLM, with access to tools and external data, executing a command on the webpage. A typical browser session has multiple open tabs, each containing a URL, plus the contents of the focused webpage. When the agent looks at a webpage, the environment renders the focused page in three different formats for the agent's understanding.

  1. A screenshot of the website in a pixel (RGB) representation.

  2. An HTML DOM tree.

  3. A DOM accessibility tree.

Example of how the observation space looks to the LLM inside WebArena.

The tabs are trimmed to fit within the viewport, i.e., the visible area of the webpage, so that they fit in the LLM's context window [what the LLM can see and process in a single go].
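The three formats above can be sketched as a single rendering function. This is an illustrative toy, not WebArena's API; the field names and placeholder outputs are assumptions.

```python
# Toy renderer for the three observation formats of the focused tab.
def render_observation(page_title: str, mode: str):
    if mode == "image":
        # RGB screenshot: a height x width x 3 pixel array (values 0-255);
        # here a 1x1 white placeholder stands in for a real screenshot.
        return [[[255, 255, 255]]]
    if mode == "html":
        # Raw DOM serialized as HTML text
        return f"<html><body><h1>{page_title}</h1></body></html>"
    if mode == "accessibility":
        # Accessibility tree: a compact, role-based text view the LLM reads
        return f"RootWebArea '{page_title}'\n\theading '{page_title}'"
    raise ValueError(f"unknown mode: {mode}")

print(render_observation("Projects · GitLab", "accessibility"))
```

The accessibility tree is the most token-efficient of the three, which is why text-only LLMs typically consume that view rather than raw HTML or pixels.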

Action Space : The action space contains all the "actions" or steps the agent can take once it reaches a decision through internal reasoning. (We covered how an agent works and makes decisions in our previous newsletter.) The action space is designed to emulate the keyboard and mouse actions available on a webpage. The actions are grouped into three categories, as shown in the image below: element operations, tab-related actions, and URL navigation actions.

The action space is classified into element operations, tab-related actions, and URL navigation actions.

WebArena uses these actions to interact with different elements of a webpage, such as buttons and links, with ease. Any element can be selected either by its on-screen coordinates (x, y) or by a unique ID assigned as a prefix. The IDs are generated when the DOM or accessibility tree is built, and using IDs makes it easier to reference any element and issue an action against it.

In the example below, a prefix ID is added to the different elements of a GitLab page in WebArena. The search box [which is an element] has the prefix number [2430]. The LLM can issue an action against that prefix number to use the search box. This method also reduces errors when working across multiple sites that share element names; for example, "Add to Cart", "Buy", or "Wishlist" are common elements on all shopping websites.

Each element is prepended with an ID in the DOM file to make actions easier. Source : Ref 1
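The ID-based addressing above can be sketched in a few lines. The tree text and the `click [id]` action grammar mirror the style shown in the example, but the parser and the `execute` helper are invented for illustration.

```python
import re

# A toy accessibility tree with prefix IDs, in the spirit of the example
TREE = """[2428] RootWebArea 'Projects · GitLab'
[2430] searchbox 'Search GitLab'
[2435] link 'New project'"""

def parse_elements(tree: str) -> dict[int, str]:
    """Map each prefix ID to its element description."""
    elements = {}
    for line in tree.splitlines():
        m = re.match(r"\[(\d+)\]\s+(.*)", line.strip())
        if m:
            elements[int(m.group(1))] = m.group(2)
    return elements

def execute(action: str, elements: dict[int, str]) -> str:
    """Resolve an LLM-issued action like 'click [2430]' against the tree."""
    m = re.match(r"click \[(\d+)\]", action)
    if not m:
        return "unrecognized action"
    elem_id = int(m.group(1))
    return f"clicked {elements.get(elem_id, 'unknown element')}"

elements = parse_elements(TREE)
print(execute("click [2430]", elements))  # clicked searchbox 'Search GitLab'
```

Addressing elements by ID rather than (x, y) coordinates keeps actions robust to layout changes and to different sites sharing identically named elements.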

Transition Space : This is the rule that tells us how the current state changes based on the action the agent takes: given the current state and an action, it determines what the new state will be.

Choosing the Tasks to execute :

WebArena was created to observe how an agent executes a task, so defining and choosing the tasks is very important. Each task must have a path or metric we can later use to trace back the agent's efficiency in performing it.

The WebArena team recruited graduate students in research to spend time exploring the website replicas and familiarise themselves with their content and functionality. They were asked to come up with real-life situations and interactions people perform on these websites regularly, and to identify the real intent behind each task. The intents thus outlined must meet the following criteria.

  1. The intent of the task should be abstract and high-level, i.e., it should require multiple actions.

  2. The intent should be creative.

  3. The intent should be formulated as a template, with the replaceable elements turned into variables, so that the template works for different sites and conditions.

Note : It's interesting that the researchers used ChatGPT to find tasks and actions from an image.

Intent Analysis : In total, the research team curated 241 templates and 812 actual tasks. The intents thus created are classified into three types,

Any task given in WebArena should have an intent. The intent should be abstract and high-level, creative, and reusable as a template in different contexts.


List of tasks on which the LLM is evaluated. Source :


Evaluation Criteria :

A grading system is introduced to check whether the output given by the agent is accurate. Different evaluation methods are used for tasks with different intents.

  1. Information-seeking tasks return a text response as the answer. The text responses are matched against the answer key for accuracy.

  • Exact Match : The agent is expected to give a 100% identical answer.

  • Must Include : This is like going through a checklist; the agent must hit the right key points.

  • Fuzzy Match : GPT-4 is used to check whether the given answer is semantically correct.
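The three checks for information-seeking tasks can be sketched as small evaluator functions. `exact_match` and `must_include` follow the descriptions above; `fuzzy_match` is stubbed here, since the real system delegates that judgment to GPT-4.

```python
def exact_match(answer: str, reference: str) -> bool:
    """Answer must be identical to the reference (ignoring case/whitespace)."""
    return answer.strip().lower() == reference.strip().lower()

def must_include(answer: str, required: list[str]) -> bool:
    """Checklist-style: every key point must appear somewhere in the answer."""
    return all(key.lower() in answer.lower() for key in required)

def fuzzy_match(answer: str, reference: str) -> bool:
    # Placeholder: the actual check asks an LLM (GPT-4) whether the
    # answer is semantically equivalent to the reference.
    raise NotImplementedError("delegated to an LLM judge")

print(exact_match("42 USD", " 42 usd "))                        # True
print(must_include("Shipped May 3 via UPS", ["UPS", "May 3"]))  # True
```

Note the trade-off: exact match is strict but cheap, must-include tolerates rephrasing, and fuzzy match handles arbitrary paraphrases at the cost of an extra model call.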

2. Evaluating Site Navigation and Content & Config Tasks : The tasks in these categories require accessing web pages that meet certain conditions or performing operations that modify the underlying data storage of the respective websites.

Before the agent starts executing a task, it evaluates the input, uses the different tools at its disposal, reads the current browser page, and starts "thinking" (also called "internal reasoning" or "machine monologue"), made possible through cognitive architectures. More about that later. The internal reasoning thus created is matched against the execution trajectory to check whether the path taken by the agent is intact.

Tools used :

  • JavaScript code to scrape webpage content so the evaluator can read the screen.

  • Database checks, like peeking into the website's "brain" to see if the data was saved correctly.
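The database-check idea can be sketched as follows, for a hypothetical content & config task such as "change the shop's contact email". Instead of matching text, the evaluator inspects the site's underlying data store after the agent finishes. The table layout and key names here are invented for illustration.

```python
import sqlite3

def check_config_task(conn: sqlite3.Connection, expected_email: str) -> bool:
    """Peek into the website's 'brain': read the setting the agent
    was supposed to modify and compare it to the expected value."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = 'contact_email'"
    ).fetchone()
    return row is not None and row[0] == expected_email

# Demo: a tiny in-memory settings table standing in for the site's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES ('contact_email', 'help@shop.test')")
print(check_config_task(conn, "help@shop.test"))  # True
```

Checking the data store directly makes the evaluation robust: it passes regardless of which sequence of clicks the agent used to reach the correct final state.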

3. Unachievable tasks :

Unlike LLMs, which are notorious for hallucinating and making up answers, agents are domain-bound and have access to selective knowledge. An agent may face constraints such as inadequate evidence, insufficient user permissions, or the absence of the necessary functionality on the website, and humans may ask for tasks that are not possible to complete. These impractical situations are labelled as unachievable, and the agent is expected to return "N/A". This ensures the agent's adherence to factual accuracy.

The Answer Key:

  • Step 1 : For every task, an answer key is prepared to ascertain the functional correctness, i.e. the accuracy, of the agent.

For information-seeking tasks, the answers are usually a short text or one-line description.

For site navigation and configuration tasks, an execution trajectory, a map of the navigation needed to perform the task, is created; think of it as a recipe with instructions.

  • Step 2 : Two people review each answer based on the task type. If they disagree, a third person decides (like judges in a cooking competition).

  • Step 3 : They test the tasks themselves to make sure they’re fair and doable.

Human Performance :

Five computer science graduates were given 170 task templates from the sample to execute, and they successfully completed 74.24% of the tasks.

Overall Human performance across different types of tasks (Info seeking and Action/ config based)

Top reasons why humans failed at executing a specific task :

  • 50% of the failures were due to misinterpreting the intent

  • Incomplete answers

  • Incomplete executions

Results :

Even the best model they used, GPT-4, succeeded in completing only 14.41% of the tasks assigned to it. The major reasons for execution failure are attributed to:

  1. The models stopping much earlier, concluding that a task is unachievable way too soon. For example, GPT-4 flagged 54.9% of feasible tasks as impossible. Sometimes this "unachievable" (UA) judgment is picked up from the instruction itself. To investigate this, the researchers tweaked the prompt into a step-by-step function (a chain-of-thought approach) and experimented with the "stop if stuck" instruction (the UA hint).

The graph above has two sections. It shows how model performance changes with the CoT and UA-hint tweaks.

With the UA hint removed, the success rate (SR) of GPT-4 increased from 11.70% to 14.41%. Despite an overall decline in identifying unachievable tasks, GPT-4 retained the capacity to recognise 44.44% of such tasks, generating reasons for non-achievability even without explicit instructions.

  2. Tasks created from the same template usually follow the same reasoning process. Out of 61 templates, GPT-4 completed only 4 with 100% accuracy. Agents fail when there are minor variations within the same template.

  3. Observation bias : Sometimes the agent presents the first plausible answer it encounters without verifying its relevance. For instance, the homepage of the e-commerce CMS displays the best-selling items based on recent purchases, while historical best-seller data is typically accessed via a separate report.

You can see how different models perform in WebArena on their leaderboard here.

Leaderboard snapshot, Feb 10, 2025.

As of this writing (Feb 10, 2025), OpenAI's Operator is the highest-performing agent on the WebArena leaderboard.

At a high level, we got a glimpse of how WebArena works. It's crucial to understand how this next generation of agents is trained to use our browsers. While WebArena boasts dynamic interaction, a realistic environment with diverse human tasks, and measurement of functional correctness, it is still largely a static environment. You are probably wondering how an agent would perform on the real, live internet. That's where WebVoyager steps in...

Reference : This entire article is an attempt to simplify the WebArena research paper. Most images in the article are screenshots from that study.