How does a human quality assurance engineer (QA) test a web application? If I ask you to look at this TodoMVC web app, how do you know what features it has?
From your life experience, you might guess that this app lets the user enter new Todo items and should show them in the list somewhere below the input element. This knowledge comes from seeing hundreds of "TodoMVC" apps in your life. But what about this crescent symbol? Does it mean anything?
Hmm, the only way to know what to expect and how to test this part of the application is to read the product description of the application. Otherwise, you will simply be guessing!
AI testing
Let me make one thing clear right away: giving an AI a URL and asking it to write 5-10 end-to-end tests is a pointless task. I don't think it will work any time soon. But if we give an AI agent specific instructions and supporting information, like the application source code, it can create an accurate test. Let's confirm the user can enter an item:
The result is a passing end-to-end test
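The generated spec itself appears in the post as a screenshot; a minimal sketch of this kind of test, assuming the input carries a `new-todo` class in the source code, looks something like this:

```js
// a sketch of the agent-generated test (selectors are assumed, not taken
// from the post); the agent can pick them because it read the "todo" folder
it('adds 1 todo', () => {
  cy.visit('/')
  cy.get('input.new-todo').type('Buy milk{enter}')
  // the new item should show up in the list below the input
  cy.get('ul li').should('have.length', 1).and('contain', 'Buy milk')
})
```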
Notice how I gave the agent the context: the "todo" folder that contains the source code for the application. What happens if the agent does not have the source code OR cannot trace how the front-end generates the DOM from the data? Let's say I remove the placeholder attribute from the HTML input element. Without the "todo" folder, the LLM "guesses", based on its training data, that the input element in a typical TodoMVC app has a placeholder attribute!
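Without that context, the model produces something like this sketch (the guessed placeholder is the canonical TodoMVC one, not this app's):

```js
// with no source code to read, the model falls back on the markup it has
// memorized from typical TodoMVC apps, including the classic placeholder text
cy.get('input[placeholder="What needs to be done?"]').type('Buy milk{enter}')
```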
Hmm, there is no input element with this placeholder; so the test fails
Product description
Should our application have an input element like this: <input placeholder="Create a new todo..." ...>? Here are two pieces of advice I want to give in this blog post:
- the description of what the user should see on the page (in the DOM) should come from the product feature description text document
- the AI should check the generated DOM rather than try to understand how the application produces it
Let's describe our TodoMVC's main feature: adding todos. I will simply add a Markdown "todo/readme.md" file with a description of the application's "add" feature. Then I can reference this file when prompting the AI agent to write the same test.
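The readme contents appear in the post as a screenshot; a minimal sketch of such a feature description, assuming wording along these lines, could be:

```md
## Features

### Adding todos

- the page shows a single text input with the placeholder "Create a new todo..."
- when the user types some text and presses Enter, the new todo item appears in the list below the input
- the input is cleared after the item is added
```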
The generated test does not look at the application implementation. It simply follows the user-facing properties from the README file.
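For comparison, a sketch of the README-driven test; every value it checks comes from the feature description rather than from the implementation (the actual generated code is again a screenshot in the post):

```js
it('adds 1 todo', () => {
  cy.visit('/')
  // the placeholder text comes from "todo/readme.md", not from the HTML source
  cy.get('input[placeholder="Create a new todo..."]')
    .should('be.visible')
    .type('Buy milk{enter}')
  // the readme says the new item appears in the list below the input
  cy.contains('li', 'Buy milk').should('be.visible')
  // ...and that the input is cleared after adding an item
  cy.get('input[placeholder="Create a new todo..."]').should('have.value', '')
})
```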
Since our web application implementation was correct, the test is passing; both the input and the list elements have the expected values.
Does this look wrong?
Now, let's leap to new heights.
Imagine that the product owner decides that your TodoMVC web app should be loud. It should show the Todo items in all uppercase letters. The product owner updates the app description
```md
## Features

- every todo item is displayed in ALL UPPERCASE letters
```
Great. Do not write / update the web test yet! Imagine you are a human QA. Given this updated product description, could you tell if the app is correct or not? Does this look wrong to you?
As a human, I can tell that the app is wrong. The entered todo item "Buy milk" is NOT shown in all uppercase. I do not need to look into the application source code to know that. I simply see the result in the rendered DOM / HTML of the app during the "adds 1 todo" end-to-end test.
I can do it as a human being. Even in unrelated end-to-end tests, I can look at the result of each step and see that something is wrong: the application does not satisfy the product description!
Can an AI agent detect that something is wrong?
AI-able end-to-end test trace
Rather than start writing MCP servers, I think we should work on making more tools produce machine-readable or, as I call it, "AI-able" output. For example, I have written a simple Cypress logger that produces a summary plus the DOM snapshot after each command. Let's run the test again and look at the produced test JSON log.
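I am not publishing the logger itself, but a minimal sketch of the idea (record every command together with a DOM snapshot and write the trace out as JSON) could look like this; the `command:end` event comes from the Cypress catalog of events, while the file path and the log shape are assumptions:

```js
// cypress/support/e2e.js: a sketch of an "AI-able" command logger,
// not the real implementation
const commands = []

// start every test with an empty trace
beforeEach(() => {
  commands.length = 0
})

// after each Cypress command, record its name, its arguments,
// and a snapshot of the application's DOM at that moment
Cypress.on('command:end', (cmd) => {
  commands.push({
    name: cmd.get('name'),
    args: cmd.get('args'),
    // cy.$$ is jQuery bound to the application under test
    html: cy.$$('body').html(),
  })
})

// write the collected trace as JSON so other tools (and LLMs) can read it
afterEach(function () {
  cy.writeFile(`logs/${Cypress.spec.name}.json`, {
    test: this.currentTest.fullTitle(),
    commands,
  })
})
```

Each spec run leaves behind a JSON file with every command and the DOM after it finished, exactly the kind of trace a language model can read.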
Great. Now comes the beautiful part. I call a script file that calls an AI model and feeds it the following prompt:
```
Given the following application feature descriptions
```
So we create one long text input with the HTML of the page after each command finishes. Here is what the GPT-4.1 model tells us about our application:
Having a DOM snapshot makes it very easy for the LLM to check the web application and even explain the expected result, since it knows the entered Todo item "Buy milk".
Notice that we haven't even modified the test; we simply "looked" at the web application and checked whether it satisfies the product description invariants. An easy win.
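The checking script itself boils down to reading the readme and the trace and sending one prompt; a minimal sketch, assuming the official `openai` npm package and the trace format from the logger sketch above, could be:

```js
// check-features.mjs: a sketch, not the actual script from this post;
// expects OPENAI_API_KEY in the environment
import fs from 'node:fs'
import OpenAI from 'openai'

const features = fs.readFileSync('todo/readme.md', 'utf8')
// the trace file name depends on the spec name (an assumption)
const trace = JSON.parse(fs.readFileSync('logs/todo.cy.js.json', 'utf8'))

// build one long prompt: the feature description plus the DOM after each command
const prompt = [
  'Given the following application feature descriptions:',
  features,
  'and the DOM of the page after each test command below,',
  'tell me if the application satisfies every described feature. Explain any mismatch.',
  ...trace.commands.map((c, i) => `After command ${i + 1} (${c.name}):\n${c.html}`),
].join('\n\n')

const client = new OpenAI()
const response = await client.chat.completions.create({
  model: 'gpt-4.1',
  messages: [{ role: 'user', content: prompt }],
})
console.log(response.choices[0].message.content)
```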
AI-TDD: AI-Driven Test-Driven Development
Let's implement the "ALL CAPS" feature. In "todo/script.js" I will change the line that sets the todo item text for the card.
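The actual diff appears in the post as a screenshot; a minimal sketch of the kind of change, assuming the todo text is written into a card element, could be:

```js
// todo/script.js (a sketch; "card" and "todo.text" are assumed names)
function renderTodo(card, todo) {
  // set todo item for card, now shouting it in ALL UPPERCASE per the new spec
  card.querySelector('.todo-text').textContent = todo.text.toUpperCase()
}
```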
Our E2E test is broken, but that's ok (for now).
But the application IS behaving according to its product specification!
Now that the AI has inspected the application and confirmed that it is correct, we can update the test to make testing much faster and cheaper. We can even ask the AI to update it, using the product description of course!
The updated test simply checks the uppercase text
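A minimal sketch of that change, assuming the same placeholder selector as before:

```js
it('adds 1 todo', () => {
  cy.visit('/')
  // the user still types the item in lowercase...
  cy.get('input[placeholder="Create a new todo..."]').type('Buy milk{enter}')
  // ...but the rendered todo must be ALL UPPERCASE, per the product description
  cy.contains('li', 'BUY MILK').should('be.visible')
})
```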
The updated test is passing again
Having a good, precise product spec is thus the key to coding the app, writing E2E tests, and checking the result at each step. Having a system that never tires of "looking" at the DOM to check if any product features are broken, even in unrelated tests, could be a pretty sweet deal. All you need to do is ask the AI: "Does this look right to you?"
Happy testing.
👎 While I am not making the source code for this blog post public, you can still learn how I use GitHub Copilot and Cursor for writing end-to-end tests. Grab my online courses Write Cypress Tests Using GitHub Copilot and Cursor AI For Cypress Testing.