How To Check Broken Image URLs In The New Blog Posts

Scrape new blog posts using Cypress and check if any images or URLs are broken.

Imagine you write and publish a blog post only to check it later and see a broken image link:

A broken image in my blog

This is the markup for the image link on my Hexo post.

Broken image markup

Let's check each newly published blog post and confirm the images actually load. We could use command-line scraping tools, but Cypress makes this check simple. As a bonus, Cypress reporting makes it easy to see which image or link is broken.

First, we need to limit ourselves to the newly published or edited blog posts. There is no need to constantly re-scrape and re-test the blog posts that stay the same. We can grab the newly edited blog posts from the sitemap XML file https://glebbahmutov.com/blog/sitemap.xml. The Hexo framework generates this XML file from the Markdown blog posts and the file timestamps.

Sitemap XML file

Let's grab the sitemap and look at the three most recent posts. We can parse the XML text using the x2js library, as I described in the blog post Test your sitemap using Cypress.

cypress/e2e/broken-images.cy.js
// use https://github.com/abdolence/x2js to parse XML to JSON
const X2JS = require('x2js')
const x2js = new X2JS()

describe('Broken images', () => {
  before(() => {
    cy.request('/sitemap.xml')
      .its('body')
      .then(x2js.xml2js.bind(x2js))
      .its('urlset.url')
      // the posts are sorted from the most recent to the oldest
      // let's take the first 3 posts, which are the newest ones
      .invoke('slice', 0, 3)
      .then(posts => {
        // each post is an object with two keys: loc and lastmod
        console.table(posts)
      })
  })

  it('checks each recent blog post', () => {
    // TODO
  })
})

This is what the test prints to the browser console:

Last three blog post links
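The spec above trusts the order of the entries in the sitemap. If you would rather not depend on that order, one option is to sort the parsed entries by their `lastmod` timestamp before slicing. The helper below is a minimal sketch and not part of the original spec: the name `mostRecentPosts` is hypothetical, and it assumes every entry has the `loc` and `lastmod` string keys and that `lastmod` is a date string `Date.parse` understands (for example ISO 8601).

```javascript
// Hypothetical helper: pick the `count` most recently modified sitemap entries.
// Assumes every entry looks like { loc: string, lastmod: string }
// and that lastmod is parseable by Date.parse (for example ISO 8601).
function mostRecentPosts(entries, count = 3) {
  return [...entries] // copy so the caller's array is not reordered
    .sort((a, b) => Date.parse(b.lastmod) - Date.parse(a.lastmod))
    .slice(0, count)
}
```

In the `before` hook, a step like `.then(posts => mostRecentPosts(posts, 3))` could then stand in for the `slice` call.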

Important: the sitemap.xml shows the deployed post URLs, even when running the blog locally using the npm start command. We can change the absolute production URLs to relative URLs:

const productionUrl = 'https://glebbahmutov.com/blog'
let postUrls

before(() => {
  cy.request('/sitemap.xml')
    .its('body')
    .then(x2js.xml2js.bind(x2js))
    .its('urlset.url')
    // the posts are sorted from the newest to the oldest
    // let's take the first 3 posts
    .invoke('slice', 0, 3)
    .then(posts => {
      postUrls = posts
        .map(post => post.loc)
        .map(loc => loc.replace(productionUrl, ''))
    })
})
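The URL conversion can also be pulled into a small pure function, which makes it easy to check outside of Cypress. This is only a sketch under stated assumptions: the name `toRelativeUrl` is hypothetical, and unlike the `replace` call above it strips the production URL only when it appears at the very start of the link.

```javascript
// Hypothetical helper: strip the production prefix from a sitemap URL.
// Only a leading prefix is removed; any other URL is returned unchanged.
function toRelativeUrl(loc, productionUrl) {
  if (!loc.startsWith(productionUrl)) {
    return loc
  }
  const relative = loc.slice(productionUrl.length)
  // keep a leading slash so the result is a path relative to the site root
  return relative.startsWith('/') ? relative : '/' + relative
}
```

With this helper, the mapping step could read `posts.map(post => toRelativeUrl(post.loc, productionUrl))`.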

Now let's visit each post and confirm the included images load. We can find all images inside the blog post itself using the selector `.article-entry img`:

it('checks each recent blog post', () => {
  postUrls.forEach(url => {
    cy.visit(url)
    cy.get('.article-entry img')
      // a blog post might have no images
      .should(Cypress._.noop)
  })
})

Blog post images

We can either check each image URL or, better, confirm that each image has actually loaded.

it('checks each recent blog post', () => {
  postUrls.forEach(url => {
    cy.visit(url)
    cy.get('.article-entry img')
      // a blog post might have no images
      .should(Cypress._.noop)
      .each($img => {
        const src = $img.attr('src')
        const alt = $img.attr('alt')
        if (!alt) {
          throw new Error(`missing alt attribute for image ${src}`)
        }
        // a loaded image has a positive natural width
        expect($img[0], `"${alt}" at ${src}`)
          .to.have.property('naturalWidth')
          .and.be.gt(0)
      })
  })
})

Each assertion shows the information about the image: the alt text and src property. If an image is missing, it would be simple to track it down. Let's say an image uses a broken link:

Broken image link fails the test
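The spec above stops at the first failing image. If you prefer a single report that lists every problem in a post, the check could be moved into a plain function. The sketch below is an illustration, not the code used in this post: the name `findImageProblems` and the message wording are hypothetical, and it assumes each item exposes the `src`, `alt`, and `naturalWidth` properties that a DOM image element provides.

```javascript
// Hypothetical helper: list every problem found in a collection of images.
// Each image needs `src`, `alt`, and `naturalWidth` properties,
// which DOM <img> elements provide.
function findImageProblems(images) {
  const problems = []
  Array.from(images).forEach((img) => {
    if (!img.alt) {
      problems.push(`missing alt attribute for image ${img.src}`)
    }
    // naturalWidth stays 0 when the browser could not load the image
    // (it is also 0 while an image is still loading)
    if (!(img.naturalWidth > 0)) {
      problems.push(`image did not load: ${img.src}`)
    }
  })
  return problems
}
```

Inside the test, the elements yielded by `cy.get('.article-entry img')` could be passed to this function and the result compared with an empty array, so a failure message would show all broken images at once.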

We probably want the CI output to be understandable without downloading the failed test screenshots. To do that, we can print a message to the terminal with the URL of the blog post we are testing. Let's use the plugin cypress-log-to-term. After installing it, we simply add cy.log commands:

postUrls.forEach((url, k) => {
  cy.log(`checking post ${k + 1} / ${postUrls.length} at ${url}`)
  ...
})

This prints the current post URL to the terminal:

checking post 1 / 3 at /check-urls-in-the-new-blog-posts/
checking post 2 / 3 at /check-broken-images/
checking post 3 / 3 at /cypress-map-should-read-assertion/

Here is how it looks when running on GitHub Actions:

Checking image links on CI

See also