AI Picks Tests To Run On A Bug

Let AI pick a test tag and run end-to-end tests when the user opens a GitHub bug issue.

In the blog post Test Tag Suggestions Using AI I described a system to pick a testing tag based on a pull request's title and body text. In this blog post, I will make it useful. Whenever a user opens a GitHub issue and labels it a "bug", an automated workflow will pick an appropriate testing tag (or several) and will execute the tagged Cypress end-to-end tests to give more context to the issue.

The example application

I am using a typical TodoMVC application with lots of Cypress end-to-end tests tagged using the @bahmutov/cy-grep plugin. You can list all specs with their tags using the find-cypress-specs utility.
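But first, here is how an individual test gets its tags: the cy-grep plugin reads the `tags` property of the test's configuration object. A minimal sketch (the selectors follow the standard TodoMVC markup):

// cypress/e2e/app-spec.js
// the @bahmutov/cy-grep plugin picks up the "tags" test configuration property
it('adds 4 todos', { tags: ['@smoke', '@add'] }, () => {
  cy.get('.new-todo')
    .type('first todo{enter}')
    .type('second todo{enter}')
    .type('third todo{enter}')
    .type('fourth todo{enter}')
  cy.get('.todo-list li').should('have.length', 4)
})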

$ npx find-cypress-specs --names

cypress/e2e/app-spec.js (15 tests)
└─ TodoMVC - React
  ├─ adds 4 todos [@smoke, @add]
  ├─ When page is initially opened
  │ └─ should focus on the todo input field
  ├─ No Todos
  │ └─ should hide #main and #footer [@misc]
  ├─ New Todo [@add]
  │ ├─ should allow me to add todo items
  │ ├─ adds items
  │ ├─ should clear text input field when an item is added
  │ ├─ should append new items to the bottom of the list
  │ ├─ should trim text input
  │ └─ should show #main and #footer when items added
  ├─ Item
  │ ├─ should allow me to mark items as complete
  │ ├─ should allow me to un-mark items as complete
  │ └─ should allow me to edit an item
  └─ Clear completed button
    ├─ should display the correct text
    ├─ should remove completed items when clicked [@smoke]
    └─ should be hidden when there are no items that are completed

cypress/e2e/completed-spec.js (3 tests)
└─ TodoMVC - React
  └─ Mark all as completed [@complete]
    ├─ should allow me to mark all items as completed
    ├─ should allow me to clear the complete state of all items
    └─ complete all checkbox should update state when items are completed / cleared

cypress/e2e/counter-spec.js (2 tests)
└─ TodoMVC - React
  └─ Counter [@count]
    ├─ should not exist without items
    └─ should display the current number of todo items

cypress/e2e/editing-spec.js (5 tests)
└─ TodoMVC - React
  └─ Editing [@edit]
    ├─ should hide other controls when editing
    ├─ should save edits on blur [@smoke]
    ├─ should trim entered text
    ├─ should remove the item if an empty text string was entered
    └─ should cancel edits on escape

cypress/e2e/persistence-spec.js (1 test)
└─ TodoMVC - React
  └─ Persistence [@persistence]
    └─ should persist its data [@smoke]

cypress/e2e/routing-spec.js (5 tests)
└─ TodoMVC - React
  └─ Routing [@routing]
    ├─ should allow me to display active items
    ├─ should respect the back button [@smoke]
    ├─ should allow me to display completed items
    ├─ should allow me to display all items @smoke
    └─ should highlight the currently applied filter

found 6 specs (31 tests)

We have a few feature testing tags. Let's count the number of tests by tag:

$ npx find-cypress-specs --tags

Tag           Tests
------------  -----
@add              7
@complete         3
@count            2
@edit             5
@misc             1
@persistence      1
@routing          5
@smoke            5

The @smoke tag cuts across all features, following the approach described in How To Tag And Run End-to-End Tests.
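Locally you can run just the tests with a given tag by passing it to the cy-grep plugin, either as a Cypress env value or through an environment variable (the same variable the workflow below sets):

# run only the smoke tests
$ npx cypress run --env grepTags=@smoke

# the equivalent form using an environment variable
$ CYPRESS_grepTags=@smoke npx cypress run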

A problem

Let's say our application has a bug: somehow we introduced a problem into the "toggle all" function logic. No one caught the problem during the code review, and no one bothered to run the end-to-end tests (🙀)

app.TodoModel.prototype.toggleAll = function (checked) {
  // Note: it's usually better to use immutable data structures since they're
  // easier to reason about and React works very well with them. That's why
  // we use map() and filter() everywhere instead of mutating the array or
  // todo items themselves.
  this.todos = this.todos.map(function (todo) {
-   return Utils.extend({}, todo, { completed: checked })
+   // introduce an error on purpose by negating the checked value
+   return Utils.extend({}, todo, { completed: !checked })
  })

  this.inform()
}
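One of the tests tagged @complete exercises exactly this logic. It looks roughly like this (a sketch following the TodoMVC example tests; the selectors are the standard TodoMVC class names):

it('should allow me to mark all items as completed', { tags: '@complete' }, () => {
  // add a couple of todos first
  cy.get('.new-todo').type('one{enter}').type('two{enter}')
  // click the "toggle all" checkbox
  cy.get('.toggle-all').check({ force: true })
  // every item should now be rendered as completed
  cy.get('.todo-list li').each(($el) => {
    cy.wrap($el).should('have.class', 'completed')
  })
})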

Hmm, we have deployed the app with a bug, and soon a user opens a GitHub issue. Knowing the typical user, the level of detail in the GH issue is minimal.

A user opens an issue

Great. The issue has the title "toggle does not work", an empty body, and the label "bug".

The bug workflow

Opening or re-opening an issue labeled "bug" triggers the following GitHub Actions workflow

.github/workflows/bug.yml
# this workflow runs when a user opens a new issue labeled "bug"
name: bug
on:
  issues:
    types: [opened, reopened]

jobs:
  pick-test-tag:
    if: contains(github.event.issue.labels.*.name, 'bug')
    runs-on: ubuntu-24.04
    outputs:
      testTag: ${{ steps.find.outputs.testTag }}
      inputTokens: ${{ steps.find.outputs.inputTokens }}
      outputTokens: ${{ steps.find.outputs.outputTokens }}
      totalTokens: ${{ steps.find.outputs.totalTokens }}
      model: ${{ steps.find.outputs.model }}
    steps:
      - name: Pick the test tag
        id: find
        uses: bahmutov/run-tests-on-a-bug/.github/actions/pick-test-tag@main
        with:
          title: ${{ github.event.issue.title }}
          body: ${{ github.event.issue.body }}
        env:
          OPEN_AI_API_KEY: ${{ secrets.OPEN_AI_API_KEY }}

  run-picked-tests:
    if: contains(github.event.issue.labels.*.name, 'bug')
    needs: pick-test-tag
    runs-on: ubuntu-24.04
    permissions:
      # this job needs to check out the source code
      contents: read
      # give this job permission to comment on the issue
      issues: write
    steps:
      - name: Print issue title and subject
        run: |
          echo "Issue title:"
          echo "${{ github.event.issue.title }}"
          echo "Issue body:"
          echo "${{ github.event.issue.body }}"
          echo "Picked test tag(s)"
          echo "${{ needs.pick-test-tag.outputs.testTag }}"

      - name: Comment on the issue 📝
        # https://github.com/peter-evans/create-or-update-comment
        uses: peter-evans/create-or-update-comment@v4
        id: comment
        with:
          issue-number: ${{ github.event.issue.number }}
          token: ${{ secrets.GITHUB_TOKEN }}
          body: |
            Thanks for reporting this issue! We will look into it as soon as we can.

            In the meantime, we are running tests tagged with `${{ needs.pick-test-tag.outputs.testTag }}` to see if anything is broken.
            The GitHub Actions run url is here: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}.

      - name: Checkout 🛎
        uses: actions/checkout@v5

      - name: Run tagged tests 🧪
        # https://github.com/cypress-io/github-action
        uses: cypress-io/github-action@v6
        with:
          # let's see which specs and tests we will run
          build: npx find-cypress-specs --names --tagged ${{ needs.pick-test-tag.outputs.testTag }}
          start: npm run start:ci
          wait-on: 'http://localhost:8888'
        env:
          CYPRESS_grepTags: ${{ needs.pick-test-tag.outputs.testTag }}
          # put test results into the comment
          # https://github.com/bahmutov/cypress-set-github-status
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          COMMENT_ID: ${{ steps.comment.outputs.comment-id }}

      # after the test run completes store videos and any screenshots
      # https://github.com/actions/upload-artifact
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: cypress-screenshots
          path: cypress/screenshots
          if-no-files-found: ignore
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: cypress-videos
          path: cypress/videos
          if-no-files-found: ignore

Currently, the example application repo is private. I am thinking about how to best open source this work.

The workflow triggers whenever an issue is opened or re-opened:

on:
  issues:
    types: [opened, reopened]
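Each job then runs only if the issue carries the "bug" label, thanks to the `if` expression on the job:

jobs:
  pick-test-tag:
    if: contains(github.event.issue.labels.*.name, 'bug')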

A response comment appears quickly

The bug workflow posts the response comment

There are two jobs in the workflow: "pick-test-tag" followed by "run-picked-tests".

Picking the testing tags

Based on the user's description of the bug (title and body), we want to know if any of the tested features related to the user's report are broken. Because there might be more than a single broken page or user action, we might have a seriously broken app! We want to test everything related to the bug report, and hopefully the test recordings and logs will help us quickly isolate the problem and fix the issue.

To pick the testing tag based on the user's text, I use the following AI script

.github/actions/pick-test-tag/pick.js
// @ts-check

/**
 * These are valid test tags used in our test cases,
 * plus their descriptions
 */
const TEST_TAGS = {
  '@smoke': 'Smoke tests - a small set of tests to check the main features',
  '@misc': 'Miscellaneous unimportant tests',
  '@add': 'Tests related to adding new todo items to the list',
  '@edit': 'Tests related to editing existing todo items in the list',
  '@routing':
    'Tests related to routing between different views and pages in the app',
  '@complete': 'Tests related to completing tasks and checking/unchecking',
  '@count': 'Tests confirming the count of items on the page is correct',
  '@persistence':
    'Tests related to data persistence: saving and loading items in storage',
}

async function ask(instructions, input, core, client) {
  // https://platform.openai.com/docs/models
  // usually gpt-4.1-mini or gpt-4.1
  const model = 'gpt-4.1'
  const response = await client.responses.create({
    model,
    instructions,
    input,
  })

  let pickedTestTag = response.output_text.trim()
  console.error('model %s response:\n%s\n', model, pickedTestTag)
  console.error('response usage:')
  console.error(response.usage)

  // parse the test tags and confidence scores from the model output
  // each line of the answer looks like "@edit (0.9)"
  const pickedTestTags = pickedTestTag
    .split('\n')
    .map((line) => {
      const match = line.trim().match(/^(@[\w-]+)\s*\(([\d.]+)\)$/)
      return match ? { tag: match[1], confidence: Number(match[2]) } : null
    })
    .filter((picked) => picked && picked.tag in TEST_TAGS)

  if (pickedTestTags.length === 0) {
    // if the model did not return any known tags, fall back to the smoke tests
    console.warn(`Could not pick any known tags. Using @smoke instead.`)
    pickedTestTags.push({ tag: '@smoke', confidence: 1 })
  }

  // set actions outputs
  const pickedTags = pickedTestTags.map((tag) => tag.tag).join(',')
  core.setOutput('testTag', pickedTags)
  core.setOutput('inputTokens', response.usage.input_tokens)
  core.setOutput('outputTokens', response.usage.output_tokens)
  core.setOutput('totalTokens', response.usage.total_tokens)
  core.setOutput('model', model)

  console.error('Returning test tag: %s', pickedTags)
  return pickedTags
}

const testTagsText = Object.entries(TEST_TAGS)
  .map(([tag, desc]) => {
    return ` ${tag} ${desc}`
  })
  .join('\n')

const instructions =
  `Given the following end-to-end test tags:
${testTagsText}
` +
  `Determine which test tag is applicable to the following code changes.

Return the list of all applicable test tags, one test tag per line.
In addition to the test tag, print the confidence score for each tag in parentheses, from 0 to 1, where 1 is the highest confidence.
For example:
@edit (0.9)
@persistence (0.8)
@add (0.3)

If no test tag is applicable, return "@smoke (1.0)".
`

const input = process.env['USER_TEXT']
if (!input) {
  throw new Error(
    'USER_TEXT environment variable is required. This should be a string with the issue title and body',
  )
}

let openAiApiKey = process.env['OPEN_AI_API_KEY']
if (!openAiApiKey) {
  throw new Error('OPEN_AI_API_KEY environment variable is required')
}

// output logging into the error stream
const separator = '====='
console.error('Asking OpenAI using the instructions and input below...')
console.error(input)
console.error(separator)
console.error(instructions)
console.error(separator)

/**
 * This exported function can be called by the GitHub Action
 * or from the command line.
 */
module.exports = async ({ core, OpenAI }) => {
  const client = new OpenAI({
    apiKey: openAiApiKey,
  })

  const answer = await ask(instructions, input, core, client)

  // log just the answer - a single test tag or several test tags separated by commas
  console.log(answer)
  return answer
}
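To try the script from the command line, a tiny Node wrapper is enough. A minimal sketch, assuming a local run-pick.js file next to pick.js and a stubbed `core` object (both are mine, not part of the repo):

// run-pick.js (hypothetical local wrapper, not part of the repo)
const OpenAI = require('openai')
const pick = require('./pick.js')

// stub the "core" helper that GitHub Actions normally provides
const core = {
  setOutput: (name, value) => console.error('output %s=%s', name, value),
}

pick({ core, OpenAI }).catch((error) => {
  console.error(error)
  process.exit(1)
})

Run it with the required environment variables, for example `USER_TEXT="toggle does not work" OPEN_AI_API_KEY=$OPEN_AI_API_KEY node run-pick.js`.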

This script can be called from the command line or from a reusable GitHub Action; here is my YML file

.github/actions/pick-test-tag/action.yml
name: Find the test tag
description: Suggests a web test tag based on the issue title and body

inputs:
  title:
    description: 'Issue title'
    type: string
    required: true
  body:
    description: 'Issue body'
    type: string
    required: false
    default: ''

outputs:
  testTag:
    description: 'Recommended test tag'
    value: ${{ steps.find.outputs.testTag }}
  inputTokens:
    description: 'Number of input tokens used'
    value: ${{ steps.find.outputs.inputTokens }}
  outputTokens:
    description: 'Number of output tokens used'
    value: ${{ steps.find.outputs.outputTokens }}
  totalTokens:
    description: 'Total number of tokens used'
    value: ${{ steps.find.outputs.totalTokens }}
  model:
    description: 'Model used for the request'
    value: ${{ steps.find.outputs.model }}

runs:
  using: 'composite'
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: 22

    - name: Install **limited** dependencies 📦
      # only install the packages needed to run the script
      run: npm install openai
      shell: bash

    - name: Determine the test tag 🏷️
      id: find
      # note: this step produces multiple outputs
      # - testTag
      # - inputTokens
      # - outputTokens
      # - totalTokens
      # - model
      # https://github.com/actions/github-script
      uses: actions/github-script@v8
      with:
        script: |
          const OpenAI = require('openai')
          const pick = require('${{ github.action_path }}/pick.js');
          await pick({ core, OpenAI });
      env:
        # hopefully the text does not have double quotes
        USER_TEXT: "${{ inputs.title }}\n\n${{ inputs.body }}"

    - name: Print the determined tag 🏷️
      shell: bash
      run: |
        echo "The recommended test tag is: ${{ steps.find.outputs.testTag }}" >> $GITHUB_STEP_SUMMARY

Great, so what does it find?

The picked testing tag

Based on the user's description of the problem "toggle does not work", the LLM picked the testing tag @complete. Its description "Tests related to completing tasks and checking/unchecking" was the best match for the user's text. Personally, I have found LLMs to be hit or miss at creating new code, but pretty accurate at picking one of a limited number of options. After all, the second "L" in LLM stands for "language", so it had better do this kind of semantic language matching well!

I even believe that small local LLMs can solve this "pick the closest text" problem, but I don't have any proof.

Running the picked tests

Once we have picked a single testing tag @complete with 100% confidence, we execute the tagged tests using the Cypress GitHub Action that I wrote back in the day. Our project uses my plugin cypress-set-github-status to post the individual spec results back into the original comment:

- name: Run tagged tests 🧪
  # https://github.com/cypress-io/github-action
  uses: cypress-io/github-action@v6
  with:
    # let's see which specs and tests we will run
    build: npx find-cypress-specs --names --tagged ${{ needs.pick-test-tag.outputs.testTag }}
    start: npm run start:ci
    wait-on: 'http://localhost:8888'
  env:
    # pass the picked testing tag(s) to the @bahmutov/cy-grep plugin
    CYPRESS_grepTags: ${{ needs.pick-test-tag.outputs.testTag }}
    # put test results into the comment
    # https://github.com/bahmutov/cypress-set-github-status
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    COMMENT_ID: ${{ steps.comment.outputs.comment-id }}

Here is the relevant Cypress config code

cypress.config.js
// other config code
setupNodeEvents(on, config) {
  // if needed, write the test results back into a GitHub comment
  const token = process.env.GITHUB_TOKEN
  const comment = process.env.COMMENT_ID
  if (token && comment) {
    console.log(
      'Will write test results into the comment with id %s',
      comment,
    )
    require('cypress-set-github-status')(on, config, {
      owner: 'bahmutov',
      repo: 'run-tests-on-a-bug',
      token,
      comment,
    })
  }

  // optional: register cy-grep plugin code
  // https://github.com/bahmutov/cy-grep
  require('@bahmutov/cy-grep/src/plugin')(config)

  // make sure to return the config object
  // as it might have been modified by the plugin
  return config
}
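The cy-grep plugin also needs to be registered in the Cypress support file; per the plugin's README that registration looks like this:

// cypress/support/e2e.js
const registerCypressGrep = require('@bahmutov/cy-grep')
registerCypressGrep()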

Once the test results come in, the original issue comment is updated with details: 2 tests failed.

The comment is updated with the failed test titles

If our project were recording test traces on the Cypress Dashboard, the comment would include a link to the run URL. For now, we simply go to the GitHub Actions run URL and download the screenshots or videos of the test run.
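Enabling such a recording would only take turning on the record mode in the same Cypress action step. A sketch, where the record key secret name is my assumption:

- name: Run tagged tests 🧪
  uses: cypress-io/github-action@v6
  with:
    record: true
    start: npm run start:ci
    wait-on: 'http://localhost:8888'
  env:
    CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
    CYPRESS_grepTags: ${{ needs.pick-test-tag.outputs.testTag }}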

GitHub Actions run job summary has the links to the screenshots and videos

Let's download the screenshots. Hmm, the failed test clicked on the "Toggle All" button, yet each item remained incomplete. The test result points us in the right direction; we should be looking at the JavaScript code that is executed in response to the user's click on the "Toggle All" element.

Cypress failed test screenshot

Great. We automatically ran the relevant tests based on the user's input, collecting lots of information that should help us quickly debug the problem and deploy a fix.