https://arxiv.org/pdf/2411.06559
Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform substantially compared to humans.
While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase.
In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments.
Our method, WEBDREAMER, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities.
Specifically, WEBDREAMER uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step.
Empirical results on two representative web agent benchmarks with online interaction—VisualWebArena and Mind2Web-live—demonstrate that WEBDREAMER achieves substantial improvements over reactive baselines.
By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction.
More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
4 WEBDREAMER: MODEL-BASED PLANNING FOR WEB AGENTS
In this paper, we propose WEBDREAMER, a pioneering approach leveraging LLMs as world models to enable efficient planning in complex digital environments.
Our approach is motivated by the observation that web interfaces, despite their complexity, are designed to be predictable for human users.
When browsing websites, humans can effectively anticipate action outcomes based on visual cues and common design patterns—clicking a "Submit" button leads to form submission, selecting a product image navigates to its detail page.
Given that LLMs are trained on vast amounts of web-related data, we hypothesize that they have acquired sufficient knowledge to simulate the consequences of user actions, potentially serving as effective world models for planning.
4.1 CORE DESIGN
WEBDREAMER follows the planning through simulation paradigm introduced in Section 3.2.
Figure 2 illustrates this process with three candidate actions, where WEBDREAMER simulates two-step trajectories for each action, selects the trajectory with the highest score, and executes its corresponding initial action.
At its core, WEBDREAMER leverages an LLM to implement both the simulation function sim and the scoring function score.
Implementation for sim: Our implementation of sim consists of two modules: one predicts state changes after action execution, approximating T, while the other imagines a possible action based on the predicted state.
Together, these two modules generate trajectories of length H, where H is a configurable horizon parameter (i.e., the simulation depth).
Specifically, to represent the state changes, we prompt the LLM to generate a concise natural language description focusing only on the effects of the action.
For example, in Figure 2, when prompted to predict the effect of executing the action Click "Electronics", the LLM outputs a short description of the resulting state change.
Based on this predicted state, the LLM then imagines the next action (i.e., Click "Computers & Accessories"), which leads to another state change prediction.
This process generates a trajectory of horizon H = 2.
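The two-module rollout described above can be sketched as follows. This is a minimal illustration, not the paper's released code: the `llm` completion helper and the prompt strings are hypothetical stand-ins for the actual system prompts in Appendix A.

```python
# Hypothetical sketch of the two-module sim function. `llm` is an assumed
# `prompt -> completion` helper, not part of WEBDREAMER's actual code.

def simulate(state_desc: str, action: str, llm, horizon: int = 2) -> list[tuple[str, str]]:
    """Roll out an imagined trajectory of (action, predicted state change) pairs."""
    trajectory = []
    current = state_desc
    for _ in range(horizon):
        # Module 1: predict the state change caused by the action (approximates T).
        change = llm(f"State: {current}\nAction: {action}\nPredict the state changes:")
        trajectory.append((action, change))
        current = f"{current}\n{change}"
        # Module 2: imagine the next plausible action from the predicted state.
        action = llm(f"State: {current}\nPropose the next action:")
    return trajectory
```

With H = 2, this yields the two-step trajectories shown in Figure 2, alternating between state-change prediction and action imagination.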
Implementation for score: After collecting a trajectory τi simulated from each candidate action ai using sim, we further use the LLM as a scoring function for each simulation.
Following Koh et al. (2024b), we prompt the LLM to evaluate each simulated trajectory with a three-scale response—complete (1.0), on track (0.5), or incorrect (0)—indicating its progress toward task completion.
The final score is computed by averaging multiple samples of these evaluations.
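The scoring step can be sketched in a few lines. The `judge` callable below is an assumed stand-in for a prompted LLM returning one of the three-scale labels; it is not part of the paper's code.

```python
# Sketch of the score function: map the LLM's three-scale judgment to a
# numeric value and average over several sampled evaluations.
# `judge` is a hypothetical stand-in for the prompted LLM.

SCALE = {"complete": 1.0, "on track": 0.5, "incorrect": 0.0}

def score_trajectory(trajectory_desc: str, judge, n_samples: int = 3) -> float:
    """Average n_samples three-scale evaluations of one simulated trajectory."""
    samples = [SCALE[judge(trajectory_desc)] for _ in range(n_samples)]
    return sum(samples) / len(samples)
```

Averaging several samples smooths out the variance of individual LLM judgments before the trajectories are compared.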
In addition to sim and score, a prerequisite to planning is candidate action generation.
We employ a two-stage approach: first sampling the top-k actions following Koh et al. (2024b), then using an LLM to self-refine the candidate set by filtering out actions that are unnecessary to simulate.
This self-refinement step is motivated by our observation that, at different steps, the same k can introduce varying degrees of irrelevant actions: some steps naturally admit fewer plausible actions than others.
We show the pseudo code of WEBDREAMER's overall design in Algorithm 1.
The termination check verifies whether the model outputs a stop action, reaches the maximum number of steps, or repeats an action more than 3 times, also following the implementation of Koh et al. (2024b).
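The overall planning loop can be condensed into one step function. This is a hypothetical sketch of the control flow only: the `propose`, `refine`, `sim`, and `score` callables are assumed interfaces to the four stages, not names from the paper's implementation.

```python
# Condensed sketch of one WEBDREAMER planning step.
# propose, refine, sim, score are hypothetical stand-ins for the four
# MPC stages (action proposal, self-refinement, world model, reward model).

def webdreamer_step(state, propose, refine, sim, score):
    """Propose top-k actions, refine, simulate each, and pick the best."""
    candidates = refine(state, propose(state))      # top-k proposals, then self-refinement
    best_action, best_score = None, float("-inf")
    for action in candidates:
        s = score(sim(state, action))               # score the imagined rollout
        if s > best_score:
            best_score, best_action = s, action
    return best_action                              # only the initial action is executed
```

The agent repeats this step, executing only the chosen initial action on the live site, until the termination check fires.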
All system prompts used in WEBDREAMER can be found in Appendix A.
A PROMPTS FOR FOUR STAGES IN MPC-BASED PLANNING
A.1 ACTION PROPOSAL
-------------------------------------------------
You are an autonomous intelligent agent tasked with navigating a web browser.
You will be given web-based tasks.
These tasks will be accomplished through the use of specific actions you can issue.
Here's the information you'll have:
{Web Information}
The user's objective: {Task Objective} This is the task you're trying to complete.
The current web page screenshot: {Web Page Screenshot Image} This is a screenshot of the webpage, with each interactable element assigned a unique numerical id.
Each bounding box and its respective id shares the same color.
The observation, which lists the IDs of all interactable elements on the current web page with their text content if any, in the format [id][tagType][text content].
tagType is the type of the element, such as button, link, or textbox.
text content is the text content of the element.
For example, [1234][button]['Add to Cart'] means that there is a button with id 1234 and text content 'Add to Cart' on the current web page.
[][StaticText][text] means that the element is of some text that is not interactable.
The current web page's URL: {Web URL} This is the page you're currently navigating.
The open tabs: {Previous Tabs} These are the tabs you have open.
The previous action: {Previous Action} This is the action you just performed.
It may be helpful to track your progress.
The actions you can perform fall into several categories:
Page Operation Actions:
- click [id]: This action clicks on an element with a specific id on the webpage.
- type [id] [content]: Use this to type the content into the field with id.
- By default, the Enter key is pressed after typing unless press_enter_after is set to 0, i.e., type [id] [content] [0].
- hover [id]: Hover over an element with id.
- press [key comb]: Simulates the pressing of a key combination on the keyboard (e.g., Ctrl+V)
- scroll [down] or scroll [up]: Scroll the page up or down.
Tab Management Actions:
- new tab: Open a new, empty browser tab.
- tab focus [tab index]: Switch the browser's focus to a specific tab using its index.
- close tab: Close the currently active tab.
URL Navigation Actions:
- goto [url]: Navigate to a specific URL.
- go back: Navigate to the previously viewed page.
- go forward: Navigate to the next page (if a previous go back action was performed).
Completion Action:
- stop [answer]: Issue this action when you believe the task is complete.
If the objective is to find a text-based answer, provide the answer in the bracket.
Homepage: If you want to visit other websites, check out the homepage at http://homepage.com.
It has a list of websites you can visit.
http://homepage.com/password.html lists all the account name and password for the websites.
You can use them to log in to the websites.
To be successful, it is very important to follow the following rules:
1. You should only issue an action that is valid given the current observation
2. You should only issue one action at a time.
3. You should follow the examples to reason step by step and then issue the next action.
4. Generate the action in the correct format.
Start with a "In summary, the next action I will perform is" phrase, followed by action.
For example, In summary, the next action I will perform is click [1234].
5. Issue stop action when you think you have achieved the objective.
Don't generate anything after stop.
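The A.1 prompt constrains the agent's final line to the fixed phrase "In summary, the next action I will perform is ...", which makes the chosen action easy to extract. A minimal parser for this format might look as follows (an illustrative sketch, not the paper's code):

```python
# Sketch of parsing the action-proposal output, which the A.1 prompt
# requires to end with a fixed phrase followed by the action string.
import re

def parse_action(response: str) -> str:
    """Extract the action after the phrase mandated by the A.1 prompt."""
    m = re.search(r"In summary, the next action I will perform is\s*(.+)", response)
    if m is None:
        raise ValueError("response does not follow the required format")
    return m.group(1).strip().rstrip(".")
```

Anchoring on a fixed phrase like this is a common way to make free-form LLM reasoning machine-parseable without constrained decoding.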
A.2 SELF-REFINEMENT
-------------------------------------------------
You are assisting a web navigation agent to help a human user navigate a website to complete a task.
Given the user's intent, the action history, and the current state of the webpage, the agent has proposed a set of candidate actions to take at the current step.
Your role is not to determine a best action for the agent at this step, but to filter out the actions that are very likely not relevant or helpful for the agent to accomplish the task.
Please select all actions that you think that could possibly lead the agent to accomplish the task.
It's important to note that to accomplish a task, the agent will execute a sequence of actions.
So the action to take at this step does not have to immediately lead to the completion of the task.
You should select any action that could be relevant for the agent to take in the current state of the webpage.
Try to be as thoughtful and comprehensive as you can!
Don't miss any possible action.
If there is one action that is clearly the best and all other actions are clearly not relevant, you may select only that one action.
Please do this sparingly, since some actions may be helpful over a longer horizon.
An action should be included as long as it could be relevant to the task, even if it may not be the most direct action to take at this step!
Some relevant actions might seem indirect at the first glance, but could be helpful in a longer horizon.
Please also include those actions.
Please at least select one action.
*IMPORTANT* Format your response into two lines as shown below:
Thoughts: You must explicitly evaluate each action one by one and imagine whether it could be relevant to the task following the format: action:... rationale:...
Selected actions: id0;id1;id2;... (please return the index of the action in the candidate actions list, starting from 0.
Don't output the action description itself.
Separate the indices with semicolons.
Do not add spaces or any other characters after the semicolons.)
Action History: {last actions str}
Current URL: {current url}
The images corresponding to the user intent are shown in the FIRST {len(intent images)} images (before the User Intent).
The last {len(screenshots)} snapshots of the agent's trajectory are shown in the LAST {len(screenshots)} images.
The LAST IMAGE represents the current state of the webpage.
Proposed Action: {action descriptions}
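The self-refinement response carries the surviving candidates as semicolon-separated indices on the "Selected actions:" line. A small parser for this format could look as follows (an illustrative sketch with assumed helper names, not the paper's code):

```python
# Sketch of parsing the A.2 self-refinement output, whose second line
# has the form "Selected actions: 0;2;3" (indices into the candidate list).

def parse_selected(response: str, candidates: list) -> list:
    """Return the subset of candidate actions the refinement step kept."""
    for line in response.splitlines():
        if line.startswith("Selected actions:"):
            idxs = [int(tok) for tok in line.split(":", 1)[1].split(";") if tok.strip()]
            return [candidates[i] for i in idxs]
    raise ValueError("no 'Selected actions:' line found")
```

Only the actions surviving this filter are passed to the world model for simulation, which keeps the per-step simulation budget proportional to the number of genuinely plausible actions.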
A.3 WORLD MODEL
-------------------------------------------------
You are an agent that predicts the effect of an action on a webpage.
You will be given a screenshot of a webpage, a sequence of actions and state changes applied to the initial screenshot, and an operation to perform on the webpage.
You are required to predict the new changes that will occur on the webpage after the operation is performed, such as the appearance of new elements, the disappearance of existing elements, or changes in the content of existing elements.
The operation type and the element to operate will be provided in the prompt.
Directly output State changes:... and don’t output anything else.
Try to be as comprehensive and detailed as possible.
Based on the initial screenshot and the changes to the webpage, please predict the changes after action:
A.4 REWARD MODEL
------------------------------------------------
You are an expert in evaluating the performance of a web navigation agent.
The agent is designed to help a human user navigate a website to complete a task.
Given the user’s intent, the agent’s action history, the current state of the webpage, your goal is to decide **whether the simulated steps by the agent indicate a successful execution of the user intent**.
In particular, decide whether the predicted state (i.e., the current state represented by the last image plus all the predicted changes so far) corresponds to a successful final state.
If it is a failure but it looks like the simulated steps are on the right track towards success, you should also output as such.
Note that, in the simulated steps, all the state changes are predicted by the agent's world model, and they may not actually be faithful to the real website interactions (e.g., some proposed actions may not be available on a realistic website).
You should also account for this in your evaluation (e.g., if the predicted state changes are not reasonable then it’s probably a failure).
*IMPORTANT* Format your response into the lines shown below:
Thoughts:
Status: "success" or "failure"
On the right track to success: "yes" or "no"
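The reward-model response above maps directly onto the three-scale scores used in Section 4.1 (complete = 1.0, on track = 0.5, incorrect = 0). A small converter for this format might look as follows (an illustrative sketch, not the paper's code):

```python
# Sketch of converting the A.4 reward-model response into the three-scale
# numeric score: success -> 1.0, on-track failure -> 0.5, otherwise 0.

def reward_from_response(response: str) -> float:
    """Parse the Status / On-the-right-track lines into a numeric score."""
    status, on_track = "", ""
    for line in response.lower().splitlines():
        if line.startswith("status:"):
            status = line.split(":", 1)[1].strip().strip('"')
        elif line.startswith("on the right track to success:"):
            on_track = line.split(":", 1)[1].strip().strip('"')
    if status == "success":
        return 1.0
    if on_track == "yes":
        return 0.5
    return 0.0
```

Several such scores, sampled independently, are averaged to produce the final value used to rank the simulated trajectories.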