How to Create an LLM-guided Web Testing Agent with LangGraph and Selenium

This tutorial will explain how I created the “InteractionAgent” module of Eyegee-Exec and how you can use it to create your own LLM-guided Web Testing Agent. This agent is capable of autonomously generating test scenarios for web applications and executing them using Selenium.

Demo: The agent autonomously testing a registration form.

Why LLM-guided Web Agents?

An LLM-guided Web Testing Agent can be a powerful tool for automating the testing of web applications. In the case of Eyegee-Exec, I use this type of agent to autonomously spider a website, map its interactions, and generate test scenarios for each interaction. This allows me to quickly test the functionality of a web application, identify outgoing API calls, and get a sense of the application’s structure.

I believe that in the future, agents like this will be a standard tool for web developers and testers. They will be able to automate part of the testing process, providing a foundation for more advanced testing that requires human intuition and creativity. This tool is an attempt at, or “proof of concept” of, that idea.

I hope this tutorial will be useful to you or just give you a sense of what is possible with LangGraph and Selenium 😅.

Architecture Overview

I have tried a few different architectures for this agent. The one I found most reliable and performant is shown below.

Overview of the agent’s architecture.

The High-High Level Planner accepts the interaction to test (as a string) and the URL, generating a list of concise approaches that represent different test scenarios. These approaches are abstract guides for the agent; for example, “Test the search field with a valid query” or “Test the search field with special characters”.

For each approach, the High Level Planner creates a detailed list of actions specifying the exact steps to execute. For instance, for “Test the search field with a valid query”, the actions might be “Open the search field”, “Type ‘test’ in the search field”, and “Click the search button”.

The Executor takes these actions and performs them using Selenium. Based on the prebuilt LangGraph ReAct Agent, it can execute tool calls and adjust input parameters based on results. The tools are essentially Selenium commands, and the input parameters are the elements to interact with.

The High Level Replanner analyzes the initial plan and the Executor’s observed behavior, adjusting the plan if errors or unforeseen behaviors occur. For example, if a form introduces a second step with new inputs not initially considered, the Replanner updates the plan to include them. If satisfied with the execution, it proceeds to the final component.

The Reporter generates a report of the executed test scenarios, detailing the results of each action and any errors encountered during execution.

Of course, having this many components is not necessary for a simple agent. However, in my experience, these abstraction layers make the agent easier to maintain and less prone to hallucinations.

Setup

I recommend using the code in the Repo to follow along with this tutorial. Since this system is quite complex, I will not be able to cover every detail here. However, I will try to explain the most important parts.

For this tutorial, you will need to have Python installed on your machine. You will also need to install the following libraries:

selenium
langchain
langchain-openai
langchain-anthropic
langgraph
pydantic
python-dotenv
bs4
rich
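
For reference, assuming you manage dependencies with pip, all of them can be installed with a single command:

pip install selenium langchain langchain-openai langchain-anthropic langgraph pydantic python-dotenv bs4 rich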

Keep in mind that you will need to have a WebDriver installed for your browser of choice. For this tutorial, I will be using the Chrome WebDriver.

You will also need to create a .env file in the root of your project with either an ANTHROPIC_API_KEY or OPENAI_API_KEY variable.

ANTHROPIC_API_KEY=your_api_key
# or
OPENAI_API_KEY=your_api_key

Next, the config.py file contains the main configuration for the agent. Here you can set which model and which browser to use.

class Config:
    def __init__(self, website: str) -> None:
        load_dotenv()

        ####### Selenium #######
        self.chromedriver_path = "/usr/bin/chromedriver"
        self.service = ChromeService(executable_path=self.chromedriver_path)
        self.options = ChromeOptions()
        self.options.add_argument("--lang=en")
        self.options.add_experimental_option("perfLoggingPrefs", {"enableNetwork": True})
        self.options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
        self.driver = webdriver.Chrome(service=self.service, options=self.options)
        self.selenium_rate = 0.5

        ####### Model #######
        # self.model = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
        # self.advanced_model = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
        self.model = ChatAnthropic(model_name="claude-3-5-haiku-latest", temperature=0.2)
        self.advanced_model = ChatAnthropic(
            model_name="claude-3-5-haiku-latest", temperature=0.2
        )
        self.parser = StrOutputParser()

        ####### Target #######
        try:
            parsed_url = urlparse(website)
            self.target = f"{parsed_url.scheme}://{parsed_url.netloc}"
            self.initial_path = parsed_url.path
        except Exception as e:
            logger.error(f"Error parsing target URL: {e}")
            exit(1)

The Selenium options enable the browser’s performance logging, which allows us to capture network requests and responses. The selenium_rate variable is the time in seconds to wait between Selenium commands.
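
To illustrate what these logging options enable, the captured network traffic can later be read back from the driver roughly like this. This is a minimal sketch, not the exact helper used in Eyegee-Exec:

import json

def get_outgoing_requests(driver):
    """Read Chrome performance logs and keep the outgoing request events."""
    requests = []
    for entry in driver.get_log("performance"):
        message = json.loads(entry["message"])["message"]
        if message["method"] == "Network.requestWillBeSent":
            request = message["params"]["request"]
            requests.append({"url": request["url"], "method": request["method"]})
    return requests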

The model and advanced_model variables are the models used by the agent. In this case, I am using the claude-3-5-haiku-latest model from Anthropic. I added the advanced_model variable to allow using a stronger model for report generation.

Implementation

The following functions define the components of the agent. Please take a look at the src/interaction_agent/agent.py file for the full implementation. I will only cover the most important parts here, omitting some details for brevity.

State

The State class is a TypedDict that represents the state of the agent at any given time. Each component of the agent can access the state, and the return value of each component is a dictionary that updates the state. Note that tests is annotated with operator.add, so test results returned by different executor runs are accumulated rather than overwritten.

class State(TypedDict):
    uri: str # The URI of the page to test
    interaction: str # The interaction to test
    page_soup: str # The HTML of the page
    limit: str # The limit of approaches to generate
    approaches: List[str] # The approaches to test (populated by the high-high level planner)
    plans: List[PlanModel] # The plans to execute (populated by the high level planner)
    tests: Annotated[List[TestModel], operator.add] # The observed behavior (populated by the executor)
    report: str # The report of the test (populated by the reporter)

High-High Level Planner

The state is passed to the high_high_level_planner_step function, which generates a list of approaches based on the interaction and the page’s HTML.

def high_high_level_planner_step(state: State):
    approaches = high_high_level_planner.invoke(
        {
            "uri": state["uri"],
            "interaction": state["interaction"],
            "page_soup": state["page_soup"],
            "limit": state["limit"],
        }
    )
    # adjust length if above limit
    if len(approaches.approaches) > int(state["limit"]):
        approaches.approaches = approaches.approaches[: int(state["limit"])]
    return {"approaches": approaches.approaches}

The high-high level planner itself is a simple chain that pipes a prompt into the configured model to generate approaches based on the interaction and the page’s HTML.

high_high_level_planner = high_high_level_planner_prompt | self.cf.model.with_structured_output(HighHighLevelPlan)

Key here is the .with_structured_output(HighHighLevelPlan) method, which allows us to define the output structure of the model.

In this case the HighHighLevelPlan is a Pydantic model that represents the output of the high-high level planner.

class HighHighLevelPlan(BaseModel):
    """High-level plan for testing an interaction feature."""

    approaches: List[str] = Field(
        description="Different approaches to test an interaction feature, should be in sorted order. Pay attention to the specified limit of approaches."
    )

in src.interaction_agent.classes.py
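
For intuition, the chain returns a parsed HighHighLevelPlan object rather than raw text, so downstream code can read .approaches directly. A minimal usage sketch, where the input values are placeholders:

result = high_high_level_planner.invoke(
    {
        "uri": "/search",
        "interaction": "Search field",
        "page_soup": "<html>...</html>",
        "limit": "3",
    }
)
print(result.approaches)  # e.g. ["Test the search field with a valid query", ...]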

High Level Planner

The high_level_plan_step function generates a list of plans based on the approaches generated by the high-high level planner.

def high_level_plan_step(state: State):
    plans = []
    for i, approach in enumerate(state["approaches"]):
        plan = high_level_planner.invoke(
            {
                "uri": state["uri"],
                "interaction": state["interaction"],
                "page_soup": state["page_soup"],
                "approach": approach,
                "interaction_context": format_context(state["interaction_context"]),
            }
        )
        plans.append(plan)
    return {"plans": plans}

Similar to the high-high level planner, the high level planner is a simple chain that pipes a prompt into the model to generate plans based on the approach and the page’s HTML.

Again, the key here is the .with_structured_output(HighLevelPlan) method, which allows us to define the output structure of the model.
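
The PlanModel objects that the Executor consumes are not shown in this tutorial; here is a minimal sketch consistent with how they are used below (field names are inferred from the code, the repo’s definition may differ):

class PlanModel(BaseModel):
    """A concrete plan for a single approach."""

    approach: str = Field(description="The approach this plan belongs to.")
    plan: List[str] = Field(description="The ordered list of actions to execute.")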

Executor

The execute_step function executes the plans generated by the high level planner and stores the observations in the state. This is by far the most complex part of the agent, due to the additional functionality required to parse outgoing requests and so on.

If we only wanted to execute the plans and gather the observations (without the additional functionality), we could use a reduced executor like this:

def execute_step(state: State):
    tests = []
    uri = state["uri"]
    plans = state["plans"]
    # Execute each plan
    for i, plan in enumerate(state["plans"]):
        context = ToolContext(cf=self.cf, initial_uri=uri)  # Create a new context for each test
        tools = self._init_tools(context)
        solver = create_react_agent(
            self.cf.model, tools=tools, prompt=react_agent_prompt, output_parser=JSONAgentOutputParser()
        )
        solver_executor = AgentExecutor(agent=solver, tools=tools)
        soup_before = filter_html(BeautifulSoup(self.cf.driver.page_source, "html.parser"))
        test = TestModel(
            approach=plan.approach, steps=[], soup_before_str=soup_before.prettify(), plan=plan
        )
        self.cf.driver.get(f"{self.cf.target}{uri}")
        time.sleep(self.cf.selenium_rate)
        for j, task in enumerate(plan.plan):
            completed_task = CompletedTask(task=task)
            try:
                solved_state = solver_executor.invoke(
                    {
                        "task": task,
                        "interaction": state["interaction"],
                        "page_soup": state["page_soup"],
                        "approach": plan.approach,
                        "plan_str": "\n".join(plan.plan),
                    }
                )
                completed_task.status = solved_state["output"]["status"]
                completed_task.result = solved_state["output"]["result"]
            except Exception as e:
                completed_task.status = "error"
                completed_task.result = str(e)
            completed_task.tool_history = context.get_tool_history_reset()
            test.steps.append(completed_task)
        # getting page source
        test.soup_after_str = filter_html(BeautifulSoup(self.cf.driver.page_source, "html.parser")).prettify()

        tests.append(test)
    return {
        "tests": state["tests"] + tests,
        "plans": [],
    }

This function uses the create_react_agent function to create a ReAct agent that executes the plans. The AgentExecutor class is used to run the agent and gather the observations, which are then stored in the tests variable.

The ToolContext object stores the state and observations of each tool call. This is necessary so that information saved during a tool call can later be passed to the Replanner.

Each test is a TestModel object that simply holds information about the plan’s execution.

class TestModel(BaseModel):
    """Model for representing a test for a single approach."""

    approach: str = Field(description="The approach for the interaction feature.")
    plan: PlanModel = Field(description="The plan for this approach.")
    steps: List[CompletedTask] = Field(description="The steps executed for this approach.")
    soup_before_str: str = Field(description="The soup before the test.")
    soup_after_str: str = Field(default=None, description="The soup after the test.")
    outgoing_requests_after: List[ApiModel] = Field(default=None, description="The outgoing requests after the test.")
    checked: bool = Field(default=False, description="Whether the test has been checked for this approach.")
    in_report: bool = Field(default=False, description="Whether the test is in the final report.")
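
The CompletedTask objects referenced above record the outcome of a single task. A minimal sketch consistent with how they are used in execute_step (the repo’s definition may carry additional fields):

class CompletedTask(BaseModel):
    """Result of executing a single task from a plan."""

    task: str = Field(description="The task that was executed.")
    status: str = Field(default="pending", description="Outcome status reported by the executor.")
    result: str = Field(default="", description="The executor's result or error message.")
    tool_history: list = Field(default_factory=list, description="Tool calls made while solving the task.")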

After executing the plans, the agent returns the updated state with the observations stored in the tests variable.

High Level Replanner

The high_level_replan_step function analyzes the initial plan and the observed behavior, adjusting the plan if errors or unforeseen behaviors occur. It essentially formats the observations and the plan into a string, which is then passed to an LLM. Depending on the output of the LLM, the plan is adjusted and the Executor is called again, or all plans are considered executed and the agent proceeds to the final component.

def high_level_replan_step(state: State):
    """
    Loops over all tests in current state and decides if a new plan is needed for each test.
    """
    new_plans = []
    tests = state["tests"]
    tests_to_check = [test for test in tests if not test.checked]
    tests_checked = [test for test in tests if test.checked]
    for i, test in enumerate(tests_to_check):
        uri = state["uri"]
        interaction = state["interaction"]
        page_source_diff = unified_diff(
            test.soup_before_str.splitlines(),
            test.soup_after_str.splitlines(),
            lineterm="",
        )
        page_source_diff = "\n".join(list(page_source_diff)).strip()
        input = {
            "uri": uri,
            "interaction": interaction,
            "approach": test.approach,
            "previous_plan": "\n".join([f"- {step}" for step in test.plan.plan]),
            "steps": format_steps(test.steps),
            "outgoing_requests": api_models_to_str(test.outgoing_requests_after),
            "page_source_diff": page_source_diff,
        }
        decision = high_level_replanner.invoke(input=input)
        if isinstance(decision.action, PlanModel):
            test.checked = True
            test.in_report = False
            tests_checked.append(test)
            new_plans.append(PlanModel(approach=test.approach, plan=decision.action.plan))
        elif isinstance(decision.action, Response):
            test.checked = True
            test.in_report = True
            tests_checked.append(test)
    return {"tests": tests_checked, "plans": new_plans}

The high_level_replanner is again a simple chain that pipes a prompt into the model, generating a new plan based on the observations and the initial plan.

high_level_replanner = high_level_replanner_prompt | self.cf.model.with_structured_output(Act)

Key here is the .with_structured_output(Act) method, which allows us to define the output structure of the model.

In this case the Act is a Pydantic model that represents the output of the high level replanner.

class Act(BaseModel):
    """Action to perform."""

    action: Union[Response, PlanModel] = Field(
        description="Action to perform. If you want to respond to user, send a Response text. \n"
        "If you would like to provide a new, updated plan, use PlanModel."
    )

in src.interaction_agent.classes.py

The Union type allows us to define multiple possible outputs for the model, telling the LLM to either return a new plan or a response to the user.
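
The Response side of the union can be as small as a single text field. A plausible minimal sketch (the repo’s definition may differ):

class Response(BaseModel):
    """Final response for an approach, signalling that no further plan is needed."""

    response: str = Field(description="Summary of the outcome of this approach.")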

Reporter

The report_step function is also fairly straightforward. It formats the observations and the plan into a prompt, which is then passed to an LLM to generate a report. Additionally, the LLM decides whether any new interaction_context is needed. This lets the LLM choose which important information to keep for the next interact call and include in its planning process (e.g. valid credentials from a registration-form interaction could be passed to other interactions).

def report_step(state: State):
    interaction = json.dumps(state["interaction"], indent=4)
    uri = state["uri"]
    messages = [
        (
            "system",
            system_reporter_prompt.format(interaction=interaction, uri=uri)
            .replace("{", "{{")
            .replace("}", "}}"),
        )
    ]
    tests_to_report = [test for test in state["tests"] if test.in_report]
    for test in tests_to_report:
        steps = format_steps(test.steps)
        page_source_diff = unified_diff(
            test.soup_before_str.splitlines(),
            test.soup_after_str.splitlines(),
            lineterm="",
        )
        page_source_diff = "\n".join(list(page_source_diff)).strip()
        this_human_reporter_prompt = human_reporter_prompt.format(
            approach=test.approach,
            plan="\n".join(test.plan.plan),
            outgoing_requests=api_models_to_str(test.outgoing_requests_after),
            page_source_diff=page_source_diff,
            steps=steps,
        )
        messages.append(("user", this_human_reporter_prompt.replace("{", "{{").replace("}", "}}")))
    messages.append(("placeholder", "{messages}"))

    reporter_prompt = ChatPromptTemplate.from_messages(messages)
    reporter = reporter_prompt | self.cf.advanced_model.with_structured_output(ReporterOutput)

    out = reporter.invoke(input={})
            return {"report": out.report, "new_interaction_context": out.new_interaction_context}

Putting it all together

The _init_app function initializes the agent, defining the workflow and the components of the agent.

def _init_app(self):
    def should_report(state: State) -> Literal["executer", "reporter"]:
        if "plans" in state and state["plans"] == []:
            return "reporter"
        else:
            return "executer"

    # SNIP: define all the functions mentioned above here

    workflow = StateGraph(State)
    workflow.add_node("high_high_level_planner", high_high_level_planner_step)
    workflow.add_node("high_level_planner", high_level_plan_step)
    workflow.add_node("executer", execute_step)
    workflow.add_node("high_level_replanner", high_level_replan_step)
    workflow.add_node("reporter", report_step)

    workflow.add_edge(START, "high_high_level_planner")
    workflow.add_edge("high_high_level_planner", "high_level_planner")
    workflow.add_edge("high_level_planner", "executer")
    workflow.add_edge("executer", "high_level_replanner")
    workflow.add_conditional_edges("high_level_replanner", should_report)
    workflow.add_edge("reporter", END)
    app = workflow.compile()

    return app
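
As a quick sanity check on the wiring, recent LangGraph versions can render the compiled graph as a Mermaid diagram; this is optional and depends on your langgraph version:

# Optional: print a Mermaid diagram of the compiled workflow
print(app.get_graph().draw_mermaid())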

We can then run the agent by calling the invoke method on the app.

final_state = self.app.invoke(
    input={
        "interaction": interaction,
        "uri": uri,
        "page_soup": soup,
        "limit": limit,
        "interaction_context": interaction_context,
    }
)

The output of the agent is the final state, which contains the report of the executed test scenarios, the observed URIs and API requests, and information to be passed to further interactions.
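
Since invoke returns a plain state dictionary, the interesting pieces can be pulled out directly, for example:

print(final_state["report"])
# context the LLM decided to carry over to the next interaction, if any
print(final_state.get("new_interaction_context"))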

If you use the code in the Repo, you can run the agent by executing the demo.py file.

from config import Config
from src.interaction_agent.agent import InteractionAgent
from src.llm.api_parser import LLM_ApiParser

target = "https://demoqa.com/webtables"

print()
print("Starting interaction agent demo...")
print()
config = Config(target)
agent = InteractionAgent(cf=config, llm_page_request_parser=LLM_ApiParser(config))

agent.interact(uri="/webtables", interaction="Coworker Table", limit="1")

Conclusion

I hope this tutorial has given you a good overview of how to create an LLM-guided Web Testing Agent with LangGraph and Selenium. Such an agent can autonomously generate test scenarios for web applications and execute them using Selenium.

Feel free to check out Eyegee-Exec for a more complete implementation of this agent. Here is a Video Demo of Eyegee-Exec in action. If you have any questions or feedback, feel free to reach out to me 😄.