Test PDFs By Converting Them To HTML

How to download a PDF file, convert it to HTML, and load it back into the Cypress browser for more testing.

Let's take Filip Hric's post "Testing a PDF file with Cypress" and see if we can play with it a little. In his example, Filip downloads a PDF and reads it as text using the pdf-parse NPM utility. The test then checks if the PDF text contains a sentence we expect it to have.

cypress/e2e/final.cy.ts
it('downloads a simple PDF file', () => {
  cy.visit('/')
  cy.contains('simple.pdf').click()
  // wait for the file to be downloaded
  cy.readFile('cypress/downloads/simple.pdf', 'utf8')
  cy.task('readPdf', 'cypress/downloads/simple.pdf')
    // yields the text from the PDF file
    .should('contain', 'Hello darkness my old friend')
})

The PDF download and text confirmation test

The cy.task command calls the readPdf task registered in the cypress.config.ts file, which runs Node code to parse the PDF file already on disk.
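
To make the task mechanics concrete, here is a minimal sketch of the contract, not Cypress's actual implementation. The names tasks and runTask and the stand-in readPdf handler are hypothetical. It assumes what the Cypress docs describe: a task handler receives a single argument, may return a value or a promise, and must not yield undefined.

```typescript
// A simplified, hypothetical model of how a task call is dispatched.
// This is NOT Cypress's code; `tasks` and `runTask` are made-up names.
type TaskHandler = (arg: unknown) => unknown

// the object passed to on('task', {...}) maps task names to handlers
const tasks: Record<string, TaskHandler> = {
  // stand-in for readPdf: the real one parses the file with pdf-parse
  readPdf: (pathToPdf) => Promise.resolve(`text of ${String(pathToPdf)}`),
}

async function runTask(name: string, arg?: unknown): Promise<unknown> {
  const handler = tasks[name]
  if (!handler) {
    throw new Error(`Task "${name}" is not registered`)
  }
  // a handler may return a plain value or a promise
  const result = await handler(arg)
  // a task must yield something; null is the way to signal "no value"
  if (result === undefined) {
    throw new Error(`Task "${name}" returned undefined`)
  }
  return result
}
```

In the real test, cy.task('readPdf', path) plays the role of runTask, and the value the handler resolves with becomes the subject of the .should assertion.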

cypress.config.ts
const { defineConfig } = require('cypress')
import { readPdf } from 'cypress/scripts/readPdf'

module.exports = defineConfig({
  e2e: {
    supportFile: false,
    setupNodeEvents(
      on: Cypress.PluginEvents,
      config: Cypress.PluginConfigOptions,
    ) {
      on('task', {
        readPdf,
      })
    },
    baseUrl: 'http://localhost:3000',
    trashAssetsBeforeRuns: false,
  },
})

cypress/scripts/readPdf.ts
const fs = require('fs')
const path = require('path')
const pdf = require('pdf-parse')

// reads the PDF file from disk and resolves with its text content
export const readPdf = (pathToPdf: string): Promise<string> => {
  const pdfPath = path.resolve(pathToPdf)
  const dataBuffer = fs.readFileSync(pdfPath)
  // pdf-parse returns a promise; returning it directly also
  // lets parsing errors fail the task instead of hanging it
  return pdf(dataBuffer).then(({ text }) => text)
}

🎁 I forked Filip's repo filiphric/testing-pdf-with-cypress into mine bahmutov/testing-pdf-with-cypress. In this blog post I show my versions and tweaks to the original specs.

PDF to HTML

But what if we could see the PDF file? What if we could load the PDF into the browser and then query it using "normal" Cypress commands like cy.contains, cy.get, etc? Wouldn't that be cool? We would see the PDF contents right in the screenshots and videos taken during the cypress run.

Let's do this!

Let's install pdf2html and call it via a task to get the HTML of the PDF.

cypress.config.ts
const { defineConfig } = require('cypress')
import { readPdf } from 'cypress/scripts/readPdf'
const { promisify } = require('util')
const pdf2html = require('pdf2html')

const toHtml = promisify(pdf2html.html)

module.exports = defineConfig({
  e2e: {
    supportFile: false,
    setupNodeEvents(
      on: Cypress.PluginEvents,
      config: Cypress.PluginConfigOptions,
    ) {
      on('task', {
        readPdf,
        toHtml,
      })
    },
    baseUrl: 'http://localhost:3000',
    trashAssetsBeforeRuns: false,
  },
})
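
The promisify call matters because a Cypress task has to return a value or a promise, while the converter here appears to use a Node-style callback (an assumption based on the promisify wrapper; check the pdf2html version you install). Below is a rough hand-written equivalent of what promisify does; fakeHtml and fakeToHtml are hypothetical stand-ins, not part of pdf2html.

```typescript
// Node-style callback signature: error first, then the result
type HtmlCallback = (err: Error | null, html?: string) => void

// hypothetical stand-in for a callback-style converter
function fakeHtml(pdfPath: string, callback: HtmlCallback): void {
  if (!pdfPath.endsWith('.pdf')) {
    callback(new Error(`Not a PDF file: ${pdfPath}`))
    return
  }
  callback(null, `<html><body><p>contents of ${pdfPath}</p></body></html>`)
}

// roughly what promisify(fakeHtml) gives us:
// a function that returns a promise instead of taking a callback
function fakeToHtml(pdfPath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    fakeHtml(pdfPath, (err, html) => {
      if (err) {
        // a callback error becomes a rejected promise
        reject(err)
      } else {
        resolve(html ?? '')
      }
    })
  })
}
```

A function shaped like fakeToHtml can be registered directly in the on('task', {...}) object, which is what the config does with toHtml.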

Instead of the readPdf task, let's call the toHtml task, which should yield the PDF converted into an HTML string.

cypress/e2e/final.cy.ts
it('converts downloaded PDF to HTML', () => {
  cy.visit('/')

  cy.contains('simple.pdf').click()

  // wait for the file to be downloaded
  cy.readFile('cypress/downloads/simple.pdf', 'utf8')
  cy.task('toHtml', 'cypress/downloads/simple.pdf')
    // yields the HTML from the PDF file
    .should('contain', 'Hello darkness my old friend')
})

OK, it works.

The downloaded PDF was converted to HTML string

Ughh, I don't like that long HTML string in the Cypress Command Log, even if the assertion is green. Is there a better place to drop the HTML?

Write HTML to the document

Just like I have done in my Full End-to-End Testing for Your HTML Email Workflows presentation and the blog post Testing HTML Emails using Cypress, let's write the produced HTML string into the document object - then the browser will show it.

cypress/e2e/final.cy.ts
it('converts downloaded PDF to HTML', () => {
  cy.visit('/')

  cy.contains('simple.pdf').click()

  // wait for the file to be downloaded
  cy.readFile('cypress/downloads/simple.pdf', 'utf8')
  cy.task('toHtml', 'cypress/downloads/simple.pdf').then((html) => {
    cy.document({ log: false }).invoke({ log: false }, 'write', html)
  })
  cy.contains('Hello darkness my old friend')
})

The test converts PDF to HTML and writes it into the browser before testing its content

Notice how we are using the cy.contains command to confirm the downloaded "simple.pdf" file has the text we are looking for?

Tip: if you want to see a small PDF better, change the viewport after writing the HTML string using the cy.viewport command.

Complex PDF

Let's click the "complex.pdf" link and confirm the file's contents.

it('tests the complex pdf', () => {
  cy.visit('/')

  cy.contains('complex.pdf').click()

  // wait for the file to be downloaded
  cy.readFile('cypress/downloads/complex.pdf', 'utf8')
  cy.task('toHtml', 'cypress/downloads/complex.pdf').then((html) => {
    cy.document({ log: false }).invoke({ log: false }, 'write', html)
  })
  cy.contains('Total €9 504,00')
})

Testing the complex PDF contents

Hmm, I promised we would be able to see the PDF file and test it using Cypress commands. I never promised to make it pretty.