Test your sitemap using Cypress

How to load and parse the sitemap XML resource and then confirm each page loads using the Cypress.io test runner.

Sometimes people ask me how to verify their sitemap.xml files using Cypress tests. While the Cypress test runner is not meant for crawling websites, it is quite capable of quickly checking your pages for silly "404: page not found" errors after a deployment.

🎁 You can find the source code for this blog post in the repo bahmutov/vuepress-cypress-test-example, which verifies the sitemap file https://vuepress-cypress-test-example.netlify.app/sitemap.xml

The sitemap resource

When I build the static site, I generate a sitemap.xml file that tells search crawlers about all available pages. A typical sitemap for a small site only has a few page URLs:

Example sitemap.xml resource

We want to check that all URLs listed in the sitemap are working. We need to load the XML resource, parse it, then iterate over the list of URLs. We can simply request each page and check that it successfully resolves with HTTP status code 200. If the site uses JavaScript, we probably also want to visit each page to make sure it does not throw a JavaScript error. Let's test it.
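To make the load, parse, iterate plan concrete, here is a minimal sketch in plain Node that pulls the page URLs out of a sitemap string. The sample XML uses placeholder example.com URLs, and extractLocs is a hypothetical helper based on a regular expression. It is an assumption made for illustration only, since the tests in this post use a real XML parser instead.

```javascript
// Hypothetical sitemap with placeholder URLs (not the real site's sitemap)
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/guide/</loc></url>
  <url><loc>https://example.com/config/</loc></url>
</urlset>`

// Illustrative helper (an assumption, not part of the original post):
// collect the text of every <loc> element using a regular expression.
// A real XML parser is more robust; this only shows the data we are after.
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1])
}

const locs = extractLocs(sampleSitemap)
console.log(locs)
```

Each entry in the resulting list is one page to check.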

Single test

The first way we can write a Cypress test is inside a single it callback. I will use the cy.request command to get the sitemap.xml resource, then use the NPM module x2js to parse XML text into a JavaScript object.

cypress.json
{
  "fixturesFolder": false,
  "supportFile": false,
  "baseUrl": "https://vuepress-cypress-test-example.netlify.app/"
}
cypress/integration/sitemap-spec.js
const X2JS = require('x2js')

describe('sitemap', () => {
  it('fetches the sitemap.xml', () => {
    // https://on.cypress.io/request
    cy.request('/sitemap.xml')
      .its('body')
      .then((body) => {
        const x2js = new X2JS()
        const json = x2js.xml2js(body)
        // get all URLs from the sitemap
        expect(json.urlset.url).to.be.an('array').and.have.length.gt(0)
      })
  })
})

I like adding assertions in the middle of the test. In the test above, the assertion on json.urlset.url confirms we have URLs to visit before we go any further. We can even click on the assertion to see the list of URLs in the DevTools:

We parsed the sitemap.xml and have URLs to check
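The array assertion assumes the sitemap lists more than one page. As far as I know, XML-to-object converters such as x2js typically represent a lone child element as a plain object rather than a one-item array, so a one-page sitemap might fail that check. A minimal sketch of a guard, where toArray is a hypothetical helper and both object shapes are assumptions about the parser output:

```javascript
// Hypothetical helper: always return an array, whatever shape the parser produced
function toArray(value) {
  if (value === undefined || value === null) return []
  return Array.isArray(value) ? value : [value]
}

// shape assumed for a sitemap with a single <url> entry
const single = { urlset: { url: { loc: 'https://example.com/' } } }
// shape assumed for a sitemap with several <url> entries
const several = {
  urlset: {
    url: [
      { loc: 'https://example.com/' },
      { loc: 'https://example.com/guide/' },
    ],
  },
}

const singleList = toArray(single.urlset.url)
const severalList = toArray(several.urlset.url)
console.log(singleList.length, severalList.length)
```

With a guard like this, the rest of the test can iterate over the list without caring how many pages the sitemap has.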

Let's verify each URL. We can check a URL in three ways:

  • check if the resource exists by fetching it using the HEAD HTTP method. This saves time by requesting only the response headers rather than the entire page. The savings can be substantial. For example, requesting the index of my blog takes about three times longer than fetching just the headers:
$ time http HEAD https://glebbahmutov.com/blog/
real 0m0.398s

$ time http GET https://glebbahmutov.com/blog/
real 0m1.209s
  • request the entire page using GET method
  • visit the page using cy.visit command

My test below will use all three ways.

// get all URLs from the sitemap
expect(json.urlset.url).to.be.an('array').and.have.length.gt(0)

json.urlset.url.forEach((url) => {
  const parsed = new URL(url.loc)
  cy.log(parsed.pathname)

  // check if the resource exists
  cy.request('HEAD', url.loc).its('status').should('eq', 200)
  // check if the resource exists AND download it
  cy.request(url.loc).its('status').should('eq', 200)
  // visit the page to check if it loads in the browser
  cy.visit(url.loc).wait(1000, { log: false })
})

I am using .wait(1000, {log: false}) after each cy.visit command so that each loaded page is clearly visible in the captured test run video.

Checking each URL in three different ways

You can watch me writing this test in the video below

Data-driven tests

All URLs are checked in the same test. If a single URL fails, then the entire test stops, and we do not know if there are any other broken URLs. We also have to look at the failure message or screenshot to figure out which URL failed to load. It would be nice if we had a separate test for each URL instead. This is where the plugin cypress-each can help us.
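A plain JavaScript analogy (not Cypress code) shows the difference between the two approaches. The paths and status codes are made up, and check, checkAllInOne, and checkEachSeparately are hypothetical names used only for this sketch.

```javascript
// made-up results standing in for the pages of a site
const pages = [
  { path: '/', status: 200 },
  { path: '/broken/', status: 404 },
  { path: '/guide/', status: 200 },
  { path: '/also-broken/', status: 404 },
]

// throws on the first page that did not respond with 200
function check({ path, status }) {
  if (status !== 200) throw new Error(`${path} responded with ${status}`)
}

// one "test" for all pages: the loop stops at the first failure
function checkAllInOne(list) {
  list.forEach(check)
}

// one "test" per page: every failure is recorded
function checkEachSeparately(list) {
  const failures = []
  for (const page of list) {
    try {
      check(page)
    } catch (e) {
      failures.push(e.message)
    }
  }
  return failures
}

let firstFailure
try {
  checkAllInOne(pages)
} catch (e) {
  firstFailure = e.message
}
const allFailures = checkEachSeparately(pages)
console.log(firstFailure)
console.log(allFailures)
```

The single loop reports only the first broken page, while the per-page version reports both.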

Before we can generate separate tests, we must have the URLs ready. We cannot use cy.request to fetch the sitemap first, then generate new tests to run. We must fetch the sitemap before the spec loads. The best way to do this is to fetch the sitemap from the plugin file and pass the list to the spec file using the Cypress.env object.

You can use any NPM module or plain Node code to fetch the sitemap. I will use got, then put the fetched list into the config.env object.

cypress/plugins/index.js
const got = require('got')
// use https://github.com/abdolence/x2js to parse XML to JSON
const X2JS = require('x2js')

module.exports = async (on, config) => {
  const sitemapUrl = `${config.baseUrl}/sitemap.xml`
  const xml = await got(sitemapUrl).text()
  const x2js = new X2JS()
  const json = x2js.xml2js(xml)
  const urls = json.urlset.url.map((url) => url.loc)
  console.log(urls)

  config.env.sitemapUrls = urls
  // make sure to return the changed config
  return config
}
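One detail to watch: the baseUrl in cypress.json ends with a slash, so the template string in the plugin file produces a double slash before sitemap.xml. Many servers tolerate that, but if yours does not, the standard URL constructor is one way to join the two parts. A small sketch you can run in Node, where the variable names are my own:

```javascript
// the baseUrl from cypress.json, including its trailing slash
const baseUrl = 'https://vuepress-cypress-test-example.netlify.app/'

// naive concatenation keeps both slashes
const concatenated = `${baseUrl}/sitemap.xml`
// the URL constructor resolves the path against the base instead
const resolved = new URL('/sitemap.xml', baseUrl).href

console.log(concatenated)
// https://vuepress-cypress-test-example.netlify.app//sitemap.xml
console.log(resolved)
// https://vuepress-cypress-test-example.netlify.app/sitemap.xml
```

The resolved form stays the same whether or not baseUrl has a trailing slash.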

When I open the Cypress project, I should see the list of URLs in the "Settings / Configuration" tab.

The URLs fetched in the plugin file are available to every spec

When the spec loads in the browser, the sitemapUrls list is already set and is immediately available using the Cypress.env('sitemapUrls') command. Now we can import the cypress-each plugin, which adds the it.each method to the global it function. We will have a separate test for each URL.

cypress/integration/sitemap-each.js
import 'cypress-each'

describe('Sitemap', () => {
  // I like testing the input list of URLs in its own test
  // you could also use "before" hook to confirm we have the URLs
  it('has urls', () => {
    expect(Cypress.env('sitemapUrls')).to.be.an('array').and.not.be.empty
  })

  const urls = Cypress.env('sitemapUrls').map((fullUrl) => {
    const parsed = new URL(fullUrl)
    return parsed.pathname
  })

  it.each(urls)('url %s', (url) => {
    // check if the resource exists
    cy.request('HEAD', url).its('status').should('eq', 200)
    // check if the resource exists AND download it
    cy.request(url).its('status').should('eq', 200)
    // visit the page to check if it loads in the browser
    cy.visit(url).wait(1000, { log: false })
  })
})
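The spec converts every full URL to its pathname before generating the tests. Relative paths passed to cy.request and cy.visit resolve against baseUrl, so my reading (an assumption, the post does not spell it out) is that this keeps the test titles short and lets the same list run against another deployment of the site. A minimal sketch of the conversion in plain Node, using placeholder example.com URLs:

```javascript
// hypothetical full URLs, as they might appear in a sitemap
const fullUrls = [
  'https://example.com/',
  'https://example.com/guide/',
  'https://example.com/config/?tab=basic#top',
]

// keep only the path portion of each URL
const paths = fullUrls.map((fullUrl) => new URL(fullUrl).pathname)
console.log(paths)
// [ '/', '/guide/', '/config/' ]
```

Note that pathname drops the query string and the hash, which is fine for typical sitemap entries but worth knowing if your pages depend on query parameters.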

The tests run and finish successfully

Checking each URL using its own separate test

We can always inspect each test using the time-traveling debugger

Inspect the page loaded by a previous test

You can watch me write the separate tests in the video below

Tip: for more tricks with data-driven tests using cypress-each plugin including running the tests in parallel, read the blog post Refactor Tests To Be Independent And Fast Using Cypress-Each Plugin.

Bonus

If you do not have a sitemap and need to crawl the local site pages by discovering the anchor links, you can do so by writing a recursive Cypress function. See the video Crawl Local Pages Using Cypress.