Sometimes people ask me how to verify their sitemap.xml files using Cypress tests. While the Cypress test runner is not meant for crawling websites, it is quite capable of quickly checking your pages for silly "404: page not found" errors after a deployment.
🎁 You can find the source code for this blog post in the repo bahmutov/vuepress-cypress-test-example, which verifies the sitemap file https://vuepress-cypress-test-example.netlify.app/sitemap.xml
The sitemap resource
When I build the static site, I generate a sitemap.xml file that tells search crawlers about all available pages. A typical sitemap for a small site lists only a few page URLs.
We want to check that all URLs listed in the sitemap are working. We need to load the XML resource, parse it, then iterate over the list of URLs. We can simply request each page and check if it successfully resolves with HTTP code 200. If the site has JavaScript, we probably want to visit each page to make sure it does not throw a JavaScript error. Let's test it.
Single test
The first way we can write a Cypress test is inside a single it callback. I will use the cy.request command to get the sitemap.xml resource, then use the NPM module x2js to parse the XML text into a JavaScript object.
const X2JS = require('x2js')
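The listing above survives only as its first line, so here is a minimal sketch of what such a test could look like. It is not the exact code from the repo. The helper name `urlsFromSitemap` is hypothetical, and the sketch assumes that `baseUrl` points at the deployed site and that the parsed object has the `urlset.url` shape described in this post.

```javascript
// Hypothetical helper (not from the original post): turn the object produced
// by x2js into a flat list of page URLs.
function urlsFromSitemap(json) {
  const entries = json && json.urlset ? json.urlset.url : undefined
  if (!entries) {
    return []
  }
  // x2js gives an array for repeated <url> elements, but a single object
  // when the sitemap lists just one page, so normalize both shapes
  const list = Array.isArray(entries) ? entries : [entries]
  return list.map((entry) => entry.loc)
}

// Only define the test when this file is loaded by Cypress
if (typeof Cypress !== 'undefined') {
  const X2JS = require('x2js')

  it('has page URLs in the sitemap', () => {
    // assumption: baseUrl is set, so a relative path is enough
    cy.request('/sitemap.xml')
      .its('body')
      .then((xml) => {
        const json = new X2JS().xml2js(xml)
        const urls = urlsFromSitemap(json)
        // assertion in the middle of the test: we have URLs to check
        expect(urls).to.be.an('array').and.have.length.gt(0)
      })
  })
}
```

Normalizing the single-entry case up front keeps the rest of the test free of shape checks.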
I like adding assertions in the middle of the test. In the test above, I am verifying the json.urlset.url variable in the middle of the test to confirm we have URLs to visit. We can even click on the assertion to see the list of URLs in the DevTools.
Let's verify each URL. We can check a URL in three ways:
- check if the resource exists by fetching it using the HEAD HTTP method. This saves time by requesting only the response headers rather than the entire page. The time savings can be substantial; for example, requesting the index of my blog takes three times longer than just getting the headers:
  $ time http HEAD https://glebbahmutov.com/blog/
- request the entire page using the GET method
- visit the page using the cy.visit command
My test below will use all three ways.
// get all URLs from the sitemap
I am using .wait(1000, {log: false}) after each cy.visit command so that each loaded page is clearly visible in the captured test run video.
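The three checks can be sketched as one small function. This is an illustration under assumptions, not the exact code from the post: the name `checkUrl` is hypothetical, and the function relies on the global `cy` object that Cypress provides inside a spec.

```javascript
// Hypothetical helper name; uses the global `cy` available in Cypress specs
function checkUrl(url) {
  // 1. HEAD: confirm the resource exists without downloading the body
  cy.request({ url, method: 'HEAD' }).its('status').should('equal', 200)
  // 2. GET: download the entire page
  cy.request(url).its('status').should('equal', 200)
  // 3. load the page in the browser; the short silent wait only makes
  //    each page visible in the recorded video
  cy.visit(url).wait(1000, { log: false })
}

// Usage inside a test, once the list of sitemap URLs is known:
//   urls.forEach((url) => checkUrl(url))
```

In practice you would probably keep only one or two of the checks. All three appear here because the post compares them.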
You can watch me writing this test in the video below
Data-driven tests
All URLs are checked in the same test. If a single URL fails, then the entire test stops, and we do not know if there are any other broken URLs. We also have to look at the failure message or screenshot to figure out which URL failed to load. It would be nice if we had a separate test for each URL instead. This is where the plugin cypress-each can help us.
Before we can generate separate tests, we must have the URLs ready. We cannot use cy.request to fetch the sitemap first, then generate new tests to run. We must fetch the sitemap before the spec loads. The best way to do this is to fetch the sitemap from the plugin file and pass the list to the spec file using the Cypress.env object.
You can use any NPM module or plain Node code to fetch the sitemap. I will use got and then put the fetched list into the config.env object.
const got = require('got')
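Since only the first line of the plugin file is shown, here is a rough sketch of the idea, written under several assumptions. The names `locsFromXml` and `sitemapPlugin` are hypothetical. The HTTP client is passed in as `fetchText` so the sketch does not depend on a particular library. A regular expression stands in for a real XML parser, so escaped characters such as `&amp;` inside a URL are not decoded.

```javascript
// Pull every <loc> value out of the sitemap text.
// Simplification: a regular expression instead of a real XML parser.
function locsFromXml(xml) {
  return Array.from(
    xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g),
    (match) => match[1]
  )
}

// Hypothetical factory: `fetchText(url)` must resolve with the sitemap text,
// for example (url) => got(url).text()
function sitemapPlugin(sitemapUrl, fetchText) {
  return async (on, config) => {
    const xml = await fetchText(sitemapUrl)
    config.env = config.env || {}
    config.env.sitemapUrl = locsFromXml(xml)
    // return the config so the changed env values reach the spec files
    return config
  }
}

// The plugin file could then export something like:
//   const got = require('got')
//   module.exports = sitemapPlugin(
//     'https://vuepress-cypress-test-example.netlify.app/sitemap.xml',
//     (url) => got(url).text()
//   )
```

Returning the modified `config` from the plugin function is what makes the new `env` values visible to the specs.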
When I open the Cypress project, I should see the list of URLs in the "Settings / Configuration" tab.
When the spec loads in the browser, the sitemapUrl list is already set and is immediately available using the Cypress.env('sitemapUrl') command. Now we can import the cypress-each plugin, which adds the it.each method to the global it function. We will have a separate test for each URL.
import 'cypress-each'
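The rest of the spec is not shown, so the following is only a sketch of how the per-URL tests could be generated. It assumes the spec has already imported cypress-each, as on the line above, and that the plugin file stored the list under `sitemapUrl`. The function name `registerSitemapTests` and the test title are hypothetical.

```javascript
// Registers one test per sitemap URL. Assumes `import 'cypress-each'` has
// already run, which adds `it.each` to the global `it` function.
function registerSitemapTests(urls) {
  // %s in the title is replaced with the current URL
  it.each(urls)('loads %s', (url) => {
    cy.request(url).its('status').should('equal', 200)
    cy.visit(url)
  })
}

// Inside Cypress the list is already available when the spec loads
if (typeof Cypress !== 'undefined') {
  registerSitemapTests(Cypress.env('sitemapUrl'))
}
```

Because each URL becomes its own test, a broken page fails only that test, and the test title tells you which URL it was.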
The tests run and finish successfully
We can always inspect each test using the time-traveling debugger
You can watch me write the separate tests in the video below
Tip: for more tricks with data-driven tests using the cypress-each plugin, including running the tests in parallel, read the blog post Refactor Tests To Be Independent And Fast Using Cypress-Each Plugin.
Bonus
If you do not have a sitemap and need to crawl the local site pages by discovering the anchor links, you can write a recursive Cypress function; see the video Crawl Local Pages Using Cypress.