Does This Look Right To You, AI?

Using LLMs to check the web application against product feature summaries during E2E tests.

How does a human quality assurance engineer (QA) test a web application? If I ask you to look at this TodoMVC web app, how do you know what features it has?

Example TodoMVC app

From your life experience, you might guess that this app lets the user enter new Todo items and shows them in a list somewhere below the input element. This knowledge comes from seeing hundreds of "TodoMVC" apps in your life. But what about this crescent symbol? Does it mean anything?

What does this symbol do, if anything?

Hmm, the only way to know what to expect and how to test this part of the application is to read the product description of the application. Otherwise, you will simply be guessing!

AI testing

Let me make one thing clear right away: giving an AI a URL and asking it to write 5-10 end-to-end tests is a pointless task. I don't think it will work any time soon. But if we give an AI agent specific instructions and supporting information, like the application source code, it can create an accurate test. Let's confirm the user can enter an item:

GitHub Copilot prompt to write a test with the generated result

The result is a passing end-to-end test

GitHub Copilot wrote a passing test
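The passing test from the screenshot looks roughly like this sketch; the selector and the item text are illustrative, since the real ones come from the application source that Copilot read ("What needs to be done?" is the standard TodoMVC placeholder):

// a sketch of the Copilot-generated test; selector and text are illustrative
it('adds a todo item', () => {
  cy.visit('/')
  cy.get('input[placeholder="What needs to be done?"]').type('Buy milk{enter}')
  cy.contains('li', 'Buy milk').should('be.visible')
})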

Notice how I gave the agent the context: the "todo" folder that contains the source code for the application. What happens if the agent does not have the source code OR cannot trace how the front-end generates the DOM from the data? Let's say I remove the placeholder attribute from the HTML input element. Without the "todo" folder, the LLM "guesses" based on its training data that the input element in a typical TodoMVC app has a placeholder attribute!

AI guesses elements to use

Hmm, there is no input element with this placeholder, so the test fails

AI guessed wrong. There is no such input element

Product description

Should our application have an input element like this <input placeholder="Create a new todo..." ...>? Here are two pieces of advice I want to give in this blog post:

  1. the description of what the user should see on the page (in the DOM) should come from the product feature description text document
  2. the AI should check the generated DOM rather than try to understand how the application will produce it

Let's describe our TodoMVC's main feature: adding todos. I will simply add a Markdown "todo/readme.md" file with a description of the application's "add" feature. Then I can reference this file when prompting the AI agent to write the same test.
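Its first version presumably contains just the "add" feature (it matches the first bullet of the README shown later in this post):

todo/readme.md

## Features

- when the user enters a todo in the input field with `data-cy="add-todo"` and presses "Enter",
the new todo item appears in the list of todos. Each todo should have attribute `data-cy="todo"`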

AI prompt that uses product description

The generated test does not look at the application implementation. It simply follows the user-facing properties from the README file.
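In outline, the generated test looks something like this sketch (the actual generated code is in the screenshot below); note how both selectors come straight from the product description:

// a sketch of the README-driven test; "adds 1 todo" is the test name
// referenced later in this post
it('adds 1 todo', () => {
  cy.visit('/')
  cy.get('[data-cy="add-todo"]').type('Buy milk{enter}')
  cy.get('[data-cy="todo"]')
    .should('have.length', 1)
    .and('have.text', 'Buy milk')
})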

AI-generated test following the product description

Since our web application implementation was correct, the test is passing; both the input and the list elements have the expected values.

AI-generated test following the product description is correct

Does this look wrong?

Now, let's leap to new heights.

Imagine that the product owner decides that your TodoMVC web app should be loud: it should show the Todo items in all uppercase letters. The product owner updates the app description:

todo/readme.md
## Features

- when the user enters a todo in the input field with `data-cy="add-todo"` and presses "Enter",
the new todo item appears in the list of todos. Each todo should have attribute `data-cy="todo"`
- no matter how the user enters the input, the list of todos should show
each item using all uppercase letters

Great. Do not write / update the web test yet! Imagine you are a human QA. Given this updated product description, could you tell if the app is correct or not? Does this look wrong to you?

Does the application conform to the product description?

As a human, I can tell that the app is wrong. The entered todo item "Buy milk" is NOT shown in all uppercase. I do not need to look into the application source code to know that. I simply see the result in the rendered DOM / HTML of the app during the part of the end-to-end test that "adds 1 todo".
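For example, the rendered list in the DOM might look like this sketch (the data-cy attribute comes from the README; the surrounding markup is illustrative):

<ul>
  <li data-cy="todo">Buy milk</li>
</ul>

The item text is plainly "Buy milk", not "BUY MILK", so the uppercase rule from the README is violated.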

Items are clearly present in the page DOM

I can do it as a human being. Even in unrelated end-to-end tests, I can look at the result of each step and see that something is wrong: the application does not satisfy the product description!

Can an AI agent detect that something is wrong?

AI-able end-to-end test trace

Rather than start writing MCP servers, I think we should work on making more tools produce machine-readable, or as I call it, "AI-able" output. For example, I have written a simple Cypress logger that produces a summary plus the DOM snapshot after each command. Let's run the test again and look at the produced test JSON log.
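A minimal sketch of such a logger, registered in the Cypress support file (the real logger is not public, so the event handling and the "logs" folder here are my assumptions):

cypress/support/e2e.js

// a sketch: collect every command plus a DOM snapshot into a JSON log
const steps = []

Cypress.on('command:end', (command) => {
  steps.push({
    name: command.get('name'),
    args: (command.get('args') || []).map(String),
    // cy.$$ queries the DOM of the application under test
    html: cy.$$('html').html(),
  })
})

after(() => {
  // save the collected steps for the AI model to inspect later
  cy.writeFile(`logs/${Cypress.spec.name}.json`, { steps }, { log: false })
})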

AI-able e2e test output log JSON

Great. Now comes the beautiful part. I run a script that calls an AI model and feeds it the following prompt:

Given the following application feature descriptions,
confirm the web app shows the expected DOM HTML elements after each test step.

{todo/readme.md} contents

{join all steps}
step: {step.name} {step.args.join(' ')}

The HTML snapshot of the page after the step was executed is:
{step.html}

So we create one long text input with the HTML of the page after each command finishes.
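A minimal sketch of that script, assuming the JSON log format from the logger above and the official OpenAI Node.js client (the file names are made up for illustration):

check-dom.mjs

// a sketch: read the test log, build the prompt, ask the model
import { readFileSync } from 'node:fs'
import OpenAI from 'openai'

const readme = readFileSync('todo/readme.md', 'utf8')
const { steps } = JSON.parse(readFileSync('logs/add-todo.cy.js.json', 'utf8'))

const prompt = [
  'Given the following application feature descriptions,',
  'confirm the web app shows the expected DOM HTML elements after each test step.',
  readme,
  ...steps.map(
    (step) =>
      `step: ${step.name} ${step.args.join(' ')}\n` +
      'The HTML snapshot of the page after the step was executed is:\n' +
      step.html,
  ),
].join('\n\n')

const openai = new OpenAI() // expects OPENAI_API_KEY in the environment
const response = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages: [{ role: 'user', content: prompt }],
})
console.log(response.choices[0].message.content)

Here is what the GPT-4.1 model tells us about our application: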

AI detects a problem in our DOM against product description

Having a DOM snapshot makes it very easy for the LLM to check the web application and even explain the expected result, since it knows the entered Todo item was "Buy milk".

AI explains the problem in the DOM contents

Notice that we haven't even modified the test; we simply "looked" at the web application and checked whether it satisfies the product description invariants. Easy win.

AI-TDD: AI-Driven Test-Driven Development

Let's implement the "ALL CAPS" feature. In the "todo/script.js" file, I will change the DOM text:

todo/script.js
// set todo item for card
item.textContent = todo.item.toUpperCase()

Our E2E test is broken, but that's ok (for now).

The end-to-end test has not been updated yet to account for uppercase list items

But the application IS behaving according to its product specification!

AI has checked the DOM against the product description at each step

Now that the AI has inspected the application and confirmed that it is correct, we can update the test to make the testing much faster and cheaper. We can even ask the AI to update it using the product description, of course!

Use the product description to update the E2E Cypress test

The updated test simply checks the uppercase text
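Conceptually it is a one-line assertion change; a sketch:

// the user still types lowercase text, but the app renders it in caps
cy.get('[data-cy="todo"]').should('have.text', 'BUY MILK')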

AI agent has updated the test

The updated test is passing again

Passing updated test

Having a good, precise product spec is thus the key to coding the app, writing E2E tests, and checking the result at each step. Having a system that never tires of "looking" at the DOM to check whether any product features are broken, even in unrelated tests, could be a pretty sweet deal. All you need to do is ask the AI: "does this look right to you?"

Happy testing.

While I am not making the source code for this blog post public, you can still learn how I use GitHub Copilot and Cursor for writing end-to-end tests. Grab my online courses Write Cypress Tests Using GitHub Copilot and Cursor AI For Cypress Testing.