Build RAG Using Chroma DB

Find similar Cypress recipes and examples using semantic search and retrieval-augmented generation.

Imagine you ask an AI/LLM how to do a particular task by describing what you want to accomplish in code comments.

test-examples/spec2.cy.js
// this is a Cypress end-to-end test

it('changes the label after the click', () => {
  cy.visit('/')
  // TODO: there is a label element with id "foo"
  // and a button with id "bar"
  // after clicking the button, the text in the label with id "foo" should change
})

You fire off Cursor or GitHub Copilot... and it might give you a good code suggestion, or it might suggest absolute nonsense. Let's take this answer generated by the Claude Sonnet 3.5 model.

Test implementation suggested by Claude Sonnet 3.5

The suggestion stored the old text in an alias for later comparison

it('changes the label after the click', () => {
  cy.visit('/')
  // get the initial label text and store it
  cy.get('#foo')
    .invoke('text')
    .as('initialText')

  // click the button
  cy.get('#bar').click()

  // verify the label text has changed
  cy.get('#foo')
    .invoke('text')
    .then((newText) => {
      cy.get('@initialText').then((initialText) => {
        expect(newText).to.not.equal(initialText)
      })
    })
})

Notice two things. First, the solution adopts the variable names and data from my prompt: it used the #foo and #bar selectors. Good. Second, the solution has a subtle bug in how it uses the aliased value, which might make this test flaky depending on timing; see the Text Changes recipe and the video below for details.
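For comparison, here is a sketch that avoids the flake by letting Cypress retry the assertion until the text changes (the same pattern as the Text Changes recipe):

it('changes the label after the click', () => {
  cy.visit('/')
  cy.get('#foo')
    .invoke('text')
    .then((initialText) => {
      cy.get('#bar').click()
      // the "should" assertion retries until the label text differs
      cy.get('#foo').should('not.have.text', initialText)
    })
})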

Can we improve the answer given by the LLM?

Give good examples

Humans can do a lot by following an example. This is why 11 years ago I suggested putting example comments in your source code. A similar approach works with AI generation: prefix your question (prompt) with good examples showing how to achieve the desired outcome. For example, you can manually paste the above Text Changes recipe and then ask the LLM:

Prefix the question with a good relevant example
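In plain text, the combined prompt looks roughly like this (a sketch; the pasted recipe body is elided):

Here is a well-tested Cypress example:

<paste the Text Changes recipe here>

Following this example, write the test:

// TODO: there is a label element with id "foo"
// and a button with id "bar"
// after clicking the button, the text in the label with id "foo" should change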

Ok, what does the LLM do now? It follows the good example!

The same question answered when given a relevant example

Even older models can generate high-quality code when given good, well-tested, trustworthy examples and information.

Trustworthy information

Unfortunately, the world wild web cannot be trusted to have accurate information. As someone who reads a lot of blog posts, I notice many incorrect examples and solutions that are missing important context, have hidden bugs, etc. LLMs trained on wider and wider swaths of the Internet do not know which information is accurate and which is just some JavaScript snippet posted on a page.

Sources of tested, accurate, up-to-date coding knowledge should be at a premium. I maintain a few such knowledge databases for Cypress end-to-end tests. For example, Cypress Examples has almost 1000 constantly tested Cypress tests covering all cy commands and various testing situations.

My cypress-examples repo has more than 800 passing Cypress tests

Similarly, I have constantly tested example repos for my online courses, such as Cypress Network Testing Examples and Cypress Plugins. Each course has hundreds of lessons, so the total number of high-quality Cypress tests is close to another 1000. How do we use them to answer new prompts?

Retrieval-augmented Generation

Easy. Before generating an answer to our current prompt, we find a similar example using semantic code and text search. Then we include the example in the full prompt we send to the LLM. This is what Retrieval (search for an example) Augmented (include it with your prompt) Generation (adapt the example to your current situation), aka RAG, is.

We could use a regular Algolia / full-text search to find examples matching the current prompt. Or we could use semantic meaning to quickly find similar examples. Here is one RAG implementation that I played with. It uses ChromaDB to store Markdown documents I prepared, and it can quickly find examples close to new code fragments.

Prepare documents

I use Markdown to store Cypress code examples and even to run them as tests. For retrieval, I extract blocks of examples from the Markdown docs using the markdown-search-scraper CLI tool and store them in a ChromaDB instance running locally.
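The scripts below assume a Chroma server is listening on localhost:8000. One way to start it locally, assuming you have the Python chroma CLI or Docker installed:

# start a local Chroma server (data persisted to ./chroma-data)
chroma run --path ./chroma-data
# or run the official Docker image
docker run -p 8000:8000 chromadb/chroma

With the server running, the following script parses the Markdown files and inserts the documents: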

import { parseForAi } from 'markdown-search-scraper/src/parse-for-ai.js'
import { ChromaClient } from 'chromadb'
import { DefaultEmbeddingFunction } from '@chroma-core/default-embed'

const modelName = 'all-MiniLM-L6-v2'
const defaultEF = new DefaultEmbeddingFunction(modelName)

// loadedFiles is an array of { url, filename, markdown } records
// collected elsewhere from my Markdown sources
const parsed = loadedFiles
  .map((record) => {
    const aiRecords = parseForAi(record.markdown)
    aiRecords.forEach((aiRecord, k) => {
      aiRecord.url = record.url
      aiRecord.filename = record.filename
      aiRecord.id = record.filename + '-' + k
    })
    return aiRecords
  })
  .flat()
console.log('%d parsed records', parsed.length)

const client = new ChromaClient({
  ssl: false,
  host: 'localhost',
  port: 8000,
})

const collection = await client.getOrCreateCollection({
  name: 'cypress-tips',
  embeddingFunction: defaultEF,
})

const documents = parsed.map((record) => record.text)
const ids = parsed.map((record) => record.id)
const metadatas = parsed.map((record) => ({
  // keep the original Markdown so we can show it in search results
  markdown: record.content,
}))

await collection.add({
  documents,
  ids,
  metadatas,
})
console.log('Added %d documents', documents.length)
  • parsing Markdown for AI strips the code, but keeps the code comments
  • I let ChromaDB prepare an embedding vector from each Markdown document. It produces a long array of numbers like [0.2, 0.89, ...] from the Markdown text.
  • I store the original Markdown in the DB as metadata. An alternative implementation would store a link to the original Markdown stored in another database
  • it takes a while to insert all 1000 Markdown examples into ChromaDB. To save time, I suggest using text hashes to update only the examples that changed, as sketched below
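Here is a minimal sketch of that incremental update, continuing the indexing script above. The hashes.json cache file and the hashText helper are my own additions, not part of the original script:

import { createHash } from 'node:crypto'
import { readFileSync, writeFileSync, existsSync } from 'node:fs'

const hashText = (text) =>
  createHash('sha256').update(text).digest('hex')

// hashes saved by the previous run (hypothetical cache file)
const cacheFile = './hashes.json'
const previous = existsSync(cacheFile)
  ? JSON.parse(readFileSync(cacheFile, 'utf8'))
  : {}

// keep only the records whose text changed since the last run
const changed = parsed.filter(
  (record) => previous[record.id] !== hashText(record.text),
)
if (changed.length) {
  await collection.upsert({
    ids: changed.map((record) => record.id),
    documents: changed.map((record) => record.text),
    metadatas: changed.map((record) => ({ markdown: record.content })),
  })
}
console.log('Upserted %d changed documents', changed.length)

// remember the current hashes for the next run
writeFileSync(
  cacheFile,
  JSON.stringify(
    Object.fromEntries(
      parsed.map((record) => [record.id, hashText(record.text)]),
    ),
    null,
    2,
  ),
)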

Retrieval

Once we want to ask the LLM a question, we query ChromaDB to see if it has any documents close to the query text.

query.mjs
import { ChromaClient } from 'chromadb'
import { DefaultEmbeddingFunction } from '@chroma-core/default-embed'

const modelName = 'all-MiniLM-L6-v2'
const defaultEF = new DefaultEmbeddingFunction(modelName)

const client = new ChromaClient({
  ssl: false,
  host: 'localhost',
  port: 8000,
})

const query = process.argv[2]
if (!query) {
  console.error('Usage: node query.mjs "<search query>"')
  process.exit(1)
}

const collection = await client.getCollection({
  name: 'cypress-tips',
  embeddingFunction: defaultEF,
})

const results = await collection.query({
  queryTexts: [query], // Chroma will embed this for you
  nResults: 5, // how many results to return
})

// console.log(results)
// we sent a single query text, so take the first set of rows
const rows = results.rows()[0]
rows.forEach((result, k) => {
  console.log('====')
  console.log(`Result ${k + 1} with distance ${result.distance}`)
  console.log('====')
  console.log(result.metadata.markdown)
})

Let's pretend we want to find a code example before asking the LLM to implement the test.

$ node ./query.mjs "text updated to something else after click"

Result 1 with distance 0.6559537

In this example, we want to confirm that the text on the page changes after the user clicks the button. We do not know the initial text, just that it changes in response to the click.

<div id="output">Original text</div>
<button id="change">Do it</button>
  document
    .getElementById('change')
    .addEventListener('click', () => {
      // change the text, but do it after a random delay,
      // almost like the application is loading something from the backend
      setTimeout(() => {
        document.getElementById('output').innerText = 'Changed!'
      }, 1000 + 1000 * Math.random())
    })

cy.get('#output')
  .invoke('text')
  .then((text) => {
    cy.get('#change').click()
    cy.get('#output').should('not.have.text', text)
  })

Watch the explanation video Confirm The Text On The Page Changes After A Click.

See also Counter increments

Result 2 with distance 0.91608334
...

Nice. Notice how it "decided" that words like "text updated" are close in meaning to "the text on the page changes" - the match is not exact, but close in semantic meaning. The distance drop-off between 0.65 and 0.91 is quite large, so we know the first result is much closer than the second.
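Since the distance gap is so informative, one could filter the retrieved rows by a distance threshold before using them; continuing query.mjs above (the 0.8 cutoff is an arbitrary assumption):

// keep only the results that are semantically close to the query
const closeEnough = rows.filter((row) => row.distance < 0.8)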

Now we can insert the found example into the original LLM prompt and generate a good solution, either manually or via scripting:

// RAG() and myLLM() are placeholders for your retrieval and LLM calls
const examples = await RAG(prompt)
const fullPrompt = `
following the examples in ${examples}
answer the ${prompt}
`
const answer = await myLLM(fullPrompt)
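A slightly more concrete sketch, reusing the collection from query.mjs above (myLLM stays a placeholder for whatever chat-completion client you use):

// find the closest examples and join their original Markdown
const retrieveExamples = async (prompt, n = 3) => {
  const results = await collection.query({
    queryTexts: [prompt],
    nResults: n,
  })
  return results
    .rows()[0]
    .map((row) => row.metadata.markdown)
    .join('\n\n---\n\n')
}

const examples = await retrieveExamples(prompt)
const fullPrompt = `Following the examples below:\n\n${examples}\n\nImplement: ${prompt}`
const answer = await myLLM(fullPrompt) // placeholder LLM call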

Tip: ChromaDB can be used with other AI embeddings, see Embedding Integrations.

See also