Crawl Local Pages Using Cypress

How to visit every local page from a Cypress test to verify that each one loads successfully.

Sometimes you want to simply visit every local page on your site to make sure the links are correct and every page loads. Cypress is not a crawler, but it can definitely handle the crawl for smaller sites. In the videos below I show how to collect every anchor link, filter external links, and visit every collected URL once.

🎁 You can find the full source code in my repository bahmutov/cypress-crawl-example.

Collect the URLs

The best way to write a crawler is to think about the actions on every page. The crawler needs to:

  1. grab the first URL to visit from the queue
     • if there are no URLs to visit, we are done
  2. call cy.visit(url)
  3. collect all anchor elements
     • filter out external links
     • filter out links we have already visited
     • filter out links we have already queued up to visit
     • add the remaining URLs to the queue
  4. go to step 1
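
Here is a minimal sketch of that loop, assuming pages link to each other with root-relative URLs. The helper name crawl, the starting / page, and the "href starts with /" heuristic for skipping external links are my own simplifications, not the exact code from spec.js:

// a sketch of the crawl loop described above, not the exact spec.js code
function crawl(queue, visited) {
  // step 1: grab the first URL from the queue;
  // an empty queue means we are done
  if (queue.length === 0) {
    return
  }
  const url = queue.shift()
  visited.push(url)
  // step 2: visit the page
  cy.visit(url)
  // step 3: collect all anchor elements on the current page
  cy.document({ log: false }).then((doc) => {
    doc.querySelectorAll('a').forEach((anchor) => {
      const href = anchor.getAttribute('href')
      // simplification: treat root-relative links as local
      // and skip everything else as external
      if (!href || !href.startsWith('/')) {
        return
      }
      // skip links we have already visited or queued up
      if (!visited.includes(href) && !queue.includes(href)) {
        queue.push(href)
      }
    })
    // step 4: go back to step 1 with the updated queue
    crawl(queue, visited)
  })
}

it('visits every local page', () => {
  crawl(['/'], [])
})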

You can see my implementation of the above steps in the test file spec.js and watch the implementation in the video below:

Perfect: at the end of the test every URL has been visited, but some pages were visited twice, because the crawler does not know that the links /tos.html and /tos lead to the same page.

Visiting the same page via two different links

Resolving URLs

To prevent visiting the same page via different links, we need to check if a given URL leads to a page we have visited already. We can do this by using the cy.request command and inspecting the redirects array.

// check URLs by requesting them and following the redirects
const redirected = []
localUrls.forEach((url) => {
  cy.request({ url, log: false })
    .its('redirects', { log: false })
    // resources without redirects will
    // not have the property "redirects"
    // so prevent Cypress from throwing an error
    .should(Cypress._.noop)
    .then((redirects) => {
      if (Array.isArray(redirects) && redirects.length > 0) {
        // each redirect record is like "301: URL"
        // so grab the last redirect and parse it
        // that will be the final address
        const redirectedUrl =
          redirects[redirects.length - 1].split(' ')[1]
        // keep just the local part of the full URL
        const parsed = new URL(redirectedUrl)
        redirected.push(parsed.pathname)
      } else {
        redirected.push(url)
      }
    })
})
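
At the end of this loop the redirected list can still contain duplicate entries when several links resolve to the same final page. As a hedged sketch (spec2.js may handle this differently), one could deduplicate the list with Cypress._.uniq, the Lodash instance bundled with Cypress, before visiting the pages:

// by the time this callback runs, all the cy.request commands above
// have finished and the "redirected" list is fully populated
cy.then(() => {
  // Cypress._ is the Lodash instance bundled with Cypress
  const uniqueUrls = Cypress._.uniq(redirected)
  uniqueUrls.forEach((url) => {
    cy.visit(url)
  })
})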

You can find the full source code in the file spec2.js and the explanation in the video below.

Bonus: check the 404 resource

The crawl example has one additional test file 404-spec.js that shows how to verify the error page the site serves when you try to visit a non-existent URL. Again, we can use a combination of the cy.request and cy.visit commands to verify the status code and the error page served. We do need to let the commands tolerate a 4xx status code by using the failOnStatusCode: false option.

it('shows 404 error', () => {
  const url = '/does-not-exist'
  cy.request({ url, failOnStatusCode: false })
    .its('status', { timeout: 0 })
    .should('eq', 404)
  cy.visit(url, { failOnStatusCode: false })
  cy.contains('span', '404').should('be.visible')
})
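
As a small hedged variation (it is not part of 404-spec.js), the same check can also be done against the full response object, which additionally exposes the HTML body of the served error page, assuming that page contains the text 404:

// a variation: inspect the whole 404 response in one callback
it('serves the 404 page body', () => {
  cy.request({ url: '/does-not-exist', failOnStatusCode: false }).then(
    (response) => {
      expect(response.status).to.equal(404)
      // assumes the error page's HTML includes the text "404"
      expect(response.body).to.include('404')
    },
  )
})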

You can find the explanation in the video below.

Happy Crawling 🕷