Scrape Slides

How I scraped my Reveal.js presentations using Cypress.io and sent the results to an Algolia search index.

I have a lot of presentations at slides.com/bahmutov, and sometimes it is difficult to find a specific slide, even when I know there is one. I was looking for ways to scrape my presentations and send the search records to Algolia for indexing. In this blog post, I will show the scraper I have written for this purpose.

🎁 You can find the source code for this blog post at bahmutov/scrape-book-quotes. This blog post is necessarily long, since it needs to provide every relevant detail, but the ultimate truth is in the code.

The presentation

For this blog post I have created a small presentation slides.com/bahmutov/book-quotes with a few slides of famous book quotes. The slides are implemented with the Reveal.js framework that I enjoy using. Here is the overview of the slides: there is a main horizontal row, and a single column in the middle.

Presentation to be scraped

The slide format

When editing Reveal.js slides, you can add text and various headers. Commonly, I use "Heading 1" for each slide's title.

The slide title is Heading 1

The "Heading 1" text becomes the "H1" HTML element. The "Heading 2" becomes "H2" element, and so on. Regular text becomes "P" HTML element. You can see these elements marked in the HTML screenshot below.

Heading 1 becomes the H1 element
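
For reference, here is roughly the markup a single slide ends up with (a simplified sketch; the real slides.com markup carries many more attributes and wrapper elements):

<section>
  <h1>Anna Karenina</h1>
  <h2>Leo Tolstoy</h2>
  <p>Happy families are all alike; every unhappy family is unhappy in its own way.</p>
</section>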

The slide deck also has a separate element with the presentation's title and description.

The deck info markup

The title and the description could be considered the top-level information in the deck.
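
Based on the selectors used below, the deck info markup has roughly this shape (a sketch; only the classes are certain, the element types are my assumption):

<div class="deck-info">
  <h1>Book Quotes</h1>
  <div class="description">A test deck for practicing scraping slides.</div>
</div>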

The currently shown slide has the class "present". We can scrape each slide one by one.

The scrape selectors

The default Algolia scraper does not work very well with highly dynamic single-page applications like a Reveal.js presentation. Thus we need to browse the slides, grab the text from the elements, and send the records to the Algolia index ourselves.

In order to scrape each slide, we need to select the h1, h2, p elements. Algolia documents the various text levels in its config documentation. In our case, the selectors I picked are:

lvl0: ".deck-info h1"
lvl1: ".deck-info .description"
lvl2: ".slides .present h1"
lvl3: ".slides .present h2"
lvl4: ".slides .present h3"
content: ".slides .present p, .slides .present blockquote"

Note: potentially we could grab all document elements using the selectors without the .present class and form the individual slide URLs like bahmutov/book-quotes, bahmutov/book-quotes#/1, bahmutov/book-quotes#/2, bahmutov/book-quotes#/2/1, etc. ourselves. But I think actually browsing the slides is more fun, isn't it?

Browsing and scraping the slides

To load the presentation, browse the slides, and scrape the HTML elements, I will use the Cypress.io test runner. To go through each slide, I am using the cypress-recurse plugin. See the video below to learn how the test goes through the slides.
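
Both the test runner and the plugin are regular NPM packages; if you are following along, install them as dev dependencies (the exact versions will differ):

$ npm i -D cypress cypress-recurse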

While browsing, let's extract the matching elements from the slide and save them in an array to be uploaded to the Algolia index later. Here is the initial code:

cypress/integration/spec.js
/// <reference types="cypress" />

import { recurse } from 'cypress-recurse'

it('scrapes', () => {
  const records = []

  const scrape = () => {
    return cy.document().then((doc) => {
      const url = doc.location.href

      const lvl0El = doc.querySelector('.deck-info h1')
      const lvl0 = lvl0El ? lvl0El.innerText : null

      const lvl1El = doc.querySelector('.deck-info .description')
      const lvl1 = lvl1El ? lvl1El.innerText : null

      const lvl2El = doc.querySelector('.slides .present h1')
      const lvl2 = lvl2El ? lvl2El.innerText : null

      const lvl3El = doc.querySelector('.slides .present h2')
      const lvl3 = lvl3El ? lvl3El.innerText : null

      const lvl4El = doc.querySelector('.slides .present h3')
      const lvl4 = lvl4El ? lvl4El.innerText : null

      // TODO: consider ALL elements, not just the first one
      const textEl = doc.querySelector(
        '.slides .present p, .slides .present blockquote',
      )
      const content = textEl ? textEl.innerText : null

      const record = { url, lvl0, lvl1, lvl2, lvl3, lvl4, content }
      console.log(record)
      records.push(record)
    })
  }
  cy.visit('/')

  const goVertical = () => {
    return recurse(
      () => scrape().then(() => cy.get('.navigate-down')),
      ($button) => !$button.hasClass('enabled'),
      {
        log: false,
        delay: 1000,
        timeout: 200000,
        limit: 200,
        post() {
          cy.get('.navigate-down').click()
        },
      },
    )
  }

  recurse(
    () => goVertical().then(() => cy.get('.navigate-right')),
    ($button) => !$button.hasClass('enabled'),
    {
      log: false,
      delay: 1000,
      timeout: 200000,
      limit: 200,
      post() {
        cy.get('.navigate-right').click()
      },
    },
  )
})

The records accumulate in the records list with each slide, as you can see in the DevTools console.

Scraping each slide

We can save the records as a JSON file to be sent to Algolia next.

// recurse through the slides
const records = []
recurse(...)
  .then(() => {
    cy.writeFile('records.json', records)
  })
records.json
[
  {
    "url": "https://slides.com/bahmutov/book-quotes/",
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": "Anna Karenina",
    "lvl3": "Leo Tolstoy",
    "lvl4": null,
    "content": "Happy families are all alike; every unhappy family is unhappy in its own way."
  },
  {
    "url": "https://slides.com/bahmutov/book-quotes/#/1",
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": "Moby Dick",
    "lvl3": "Herman Melville",
    "lvl4": null,
    "content": "Call me Ishmael."
  },
  ...
]

Be careful about the stack

Reveal.js decks can have columns of slides. The column is called a stack, and it also has its own "present" class.

<section class="stack present">
  ... previous slides
  <section class="present">
    the current visible slide
  </section>
  ... future slides
</section>

Thus to grab the current slide we need to match the "present" class without the "stack" class. In CSS this can be expressed as the .present:not(.stack) selector. So our content selector that pulls the p, blockquote, and li items is:

const contentSelectors = [
  '.slides .present:not(.stack) p',
  '.slides .present:not(.stack) blockquote',
  '.slides .present:not(.stack) li',
]
const selector = contentSelectors.join(', ')

Scraping multiple elements

A single slide might have multiple paragraphs, list items, and block quotes, which are all separate content items. If the slide has any heading elements, the content items should all share the same "lvl0", "lvl1", etc. For example, the next slide produces 4 separate content records, all sharing the "Heading 2" at "lvl3":

A slide with 4 records

[
  {
    "url": "https://slides.com/bahmutov/book-quotes/#/5",
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": null,
    "lvl3": "A Bullet List",
    "lvl4": null,
    "content": "Bullet One"
  },
  {
    "url": "https://slides.com/bahmutov/book-quotes/#/5",
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": null,
    "lvl3": "A Bullet List",
    "lvl4": null,
    "content": "Bullet Two"
  },
  {
    "url": "https://slides.com/bahmutov/book-quotes/#/5",
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": null,
    "lvl3": "A Bullet List",
    "lvl4": null,
    "content": "Bullet Three"
  },
  {
    "url": "https://slides.com/bahmutov/book-quotes/#/5",
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": null,
    "lvl3": "A Bullet List",
    "lvl4": null,
    "content": "This slide has multiple list items, all should be scraped"
  }
]

We can create these records when scraping:

const contentSelectors = [
  '.slides .present:not(.stack) p',
  '.slides .present:not(.stack) blockquote',
  '.slides .present:not(.stack) li',
]
const contentSelector = contentSelectors.join(', ')
const textEls = Array.from(doc.querySelectorAll(contentSelector))

const record = { url, lvl0, lvl1, lvl2, lvl3, lvl4, content: null }
if (!textEls.length) {
  return [record]
}

const records = textEls.map((el) => {
  const r = {
    ...record,
    content: el.innerText.trim(),
  }
  return r
})

At the end we still have a flat list of individual records to upload. We can output messages for each scraped slide using cy.log.

scrapeOneSlide()
  .then((r) => {
    const url = r[0].url
    cy.log(url)
    cy.log(`**${r.length}** record(s)`)
    cy.task('print', `${url}: ${r.length} record(s)`)
    records.push(...r)
  })

The recorded video clearly shows the number of text records.

Each slide gets scraped

Watching Cypress browse the slides to scrape them is a lot of fun.

Filtering records

Some text elements should be ignored. For example, often my slides have my Twitter handle @bahmutov on them, or individual URLs. We want to filter such text elements out.

Records to be filtered include the individual URLs and my Twitter handle

We can filter such individual content fields using heuristics: the content should be longer than 10 letters, should not match a URL regular expression, and should not match our list of banned words.
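
Here is a minimal sketch of such a filter; the regular expression and the banned words list are my assumptions, tune them for your own slides:

// heuristics for dropping noisy content records (a sketch)
const bannedWords = ['@bahmutov']
const urlRegex = /https?:\/\/\S+/

const keepRecord = (r) =>
  // keep heading-only records that have no content
  r.content === null ||
  // otherwise the content must be long enough,
  // not be a URL, and not include a banned word
  (r.content.length > 10 &&
    !urlRegex.test(r.content) &&
    !bannedWords.some((word) => r.content.includes(word)))

const filtered = records.filter(keepRecord)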

Once scraped, we also need to ensure the records do not have duplicates. Duplicates can appear when the slide has animation blocks: the common text elements stay the same from step to step. Thus we make sure all records are unique by taking all content properties together. For example, the following deck of slides reveals new blocks in each slide.

The same slide will be scraped multiple times because its URL changes

The above deck will generate the content "Open real browser" four times. The other fields like "lvl0", "lvl1", etc. are also going to be exactly the same.

[
  {
    "url": "https://slides.com/bahmutov/automated-testing/#/3/0/3",
    "lvl0": "Automated Testing with Cypress.io",
    "lvl1": "This talk shows how quick and simple it can be to write end-to-end tests for web applications – if your testing tools are not fighting you all the time. I will go over writing E2E tests using Cypress.io (https://www.cypress.io/), controlling the network during tests, using visual testing and setting up continuous integration to perform E2E tests on each commit.",
    "lvl2": "Web application",
    "lvl3": null,
    "lvl4": null,
    "content": "Open real browser",
    "objectID": "https-slides-com-bahmutov-automated-testing-3-0-3-4"
  },
  {
    "url": "https://slides.com/bahmutov/automated-testing/#/3/0/4",
    "lvl0": "Automated Testing with Cypress.io",
    "lvl1": "This talk shows how quick and simple it can be to write end-to-end tests for web applications – if your testing tools are not fighting you all the time. I will go over writing E2E tests using Cypress.io (https://www.cypress.io/), controlling the network during tests, using visual testing and setting up continuous integration to perform E2E tests on each commit.",
    "lvl2": "Web application",
    "lvl3": null,
    "lvl4": null,
    "content": "Open real browser",
    "objectID": "https-slides-com-bahmutov-automated-testing-3-0-4-4"
  },
  ...
]

Thus we remove all duplicate records using the text fields concatenated together to check for uniqueness.

export const removeDuplicates = (records) => {
  // often when slides have animations, individual blocks
  // come in one by one. This leads to the text elements
  // on the slide being duplicated.
  // thus we check all records for duplicate content
  return Cypress._.uniqBy(records, (r) =>
    [r.content, r.lvl0, r.lvl1, r.lvl2, r.lvl3, r.lvl4].join('-'),
  )
}

Algolia application

Now let's send the records to the Algolia index. I have created a new Algolia application with a new index "quotes".

Algolia application with its new index

Each record sent to Algolia needs a "type" property. If a record has its content field filled, it has the type content. Otherwise, the type is the highest level present (if the record has lvl3, but no lvl4, then it has the type: lvl3). To send the records we can use the official algoliasearch NPM module.

$ npm i -D algoliasearch
+ algoliasearch@...

The script file send-records.js loads the records, sets the type and replaces the entire index with the new records.

send-records.js
const { scrapeToAlgoliaRecord } = require('./utils')
const records = require('./records.json').map(scrapeToAlgoliaRecord)

console.log(JSON.stringify(records, null, 2))

// https://www.algolia.com/doc/api-client/getting-started
const algoliasearch = require('algoliasearch')

// tip: use https://github.com/bahmutov/as-a
// to inject the environment variables when running
const client = algoliasearch(
  process.env.APPLICATION_ID,
  process.env.ADMIN_API_KEY,
)
const index = client.initIndex('quotes')
// for now replace all records in the index
index
  .replaceAllObjects(records, { autoGenerateObjectIDIfNotExist: true })
  .then(() => {
    console.log('uploaded %d records', records.length)
  })
  .catch((err) => console.error(err))

The utility function scrapeToAlgoliaRecord moves individual levels into a hierarchy object.

utils.js
/**
 * Converts a scrape record to an Algolia record
 * ready to be sent.
 */
const scrapeToAlgoliaRecord = (record) => {
  record.hierarchy = {
    lvl0: record.lvl0,
    lvl1: record.lvl1,
    lvl2: record.lvl2,
    lvl3: record.lvl3,
    lvl4: record.lvl4,
  }

  if (record.content) {
    record.type = 'content'
  } else {
    if (record.lvl4) {
      record.type = 'lvl4'
    } else if (record.lvl3) {
      record.type = 'lvl3'
    } else if (record.lvl2) {
      record.type = 'lvl2'
    } else if (record.lvl1) {
      record.type = 'lvl1'
    } else if (record.lvl0) {
      record.type = 'lvl0'
    }
  }

  // we moved the levels into hierarchy
  delete record.lvl0
  delete record.lvl1
  delete record.lvl2
  delete record.lvl3
  delete record.lvl4

  return record
}

module.exports = { scrapeToAlgoliaRecord }

The final record for a slide with just a "Heading 1" could be:

{
  "url": "https://slides.com/bahmutov/book-quotes/#/5",
  "content": null,
  "hierarchy": {
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": "The End",
    "lvl3": null,
    "lvl4": null
  },
  "type": "lvl2"
}

The final record with some content could be:

{
  "url": "https://slides.com/bahmutov/book-quotes/#/3",
  "content": "It was a pleasure to burn.",
  "hierarchy": {
    "lvl0": "Book Quotes",
    "lvl1": "A test deck for practicing scraping slides.",
    "lvl2": "Fahrenheit 451",
    "lvl3": "Ray Bradbury",
    "lvl4": null
  },
  "type": "content"
}

Once uploaded, the records are searchable from the Algolia UI.

Finding quotes from Algolia App Index page

Object IDs

Each object in an Algolia application should have a unique ID. Currently we let Algolia assign unique IDs to each uploaded record. In the future this approach will not scale. For example, we might need to replace the records for a given slide presentation, which means deleting some of the records first, before adding new ones. Let's form a unique record ID based on the presentation slug and the slide number.

const slideId = Cypress._.kebabCase(doc.location.href)
// single record
const record = {
  url,
  lvl0,
  lvl1,
  lvl2,
  lvl3,
  lvl4,
  content: null,
  objectID: slideId,
}
// multiple records: add the index
const records = textEls.map((el, k) => {
  const r = {
    ...record,
    content: el.innerText.trim(),
    // give each record extracted from the slide
    // its own id
    objectID: `${record.objectID}-${k}`,
  }
  return r
})

Our object IDs will be something like:

"https-slides-com-bahmutov-book-quotes-0"
"https-slides-com-bahmutov-book-quotes-1-0"
...

Replacing objects

Currently we are using index.replaceAllObjects, which removes all objects in the index before adding the updated records. If we have multiple presentations in the index, scraping each slide deck will remove all previous ones. Thus we cannot blindly remove all records.

We cannot simply add new records either, even with unique object IDs, because that might leave "orphan" records in the index. Imagine the following scenario:

  • we have a long presentation with 100 slides
  • we scrape the 100 slides into Algolia application
  • we change the presentation removing 99 slides, leaving just a single slide
  • we scrape the new presentation with one slide

Hmm, there are 99 records still sitting in the index, leading the user to non-existent URLs.

This is why I save the scraped objects and the derived Algolia records as JSON files before sending them to the Algolia index.

let slug
// derive the presentation slug from the pathname
cy.location('pathname')
  .then((pathname) => {
    slug = Cypress._.kebabCase(pathname)
  })
  // scrape the slides
  .then(() => {
    cy.writeFile(`${outputFolder}/${slug}-records.json`, records)
    const algoliaObjects = records.map(scrapeToAlgoliaRecord)
    cy.writeFile(`${outputFolder}/${slug}-algolia-objects.json`, algoliaObjects)
  })

Tip: Cypress command cy.writeFile automatically creates the output folder if one does not exist yet.

We commit the output JSON files to Git; you can find my scraped files in the folder scraped.

We can do the following "trick" before scraping the site: load the previous Algolia records and remove all objects using their unique objectID values from the file. That will clear the records for this particular presentation, and we will add the new records after scraping. See the Delete objects documentation.
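
Here is a sketch of that trick, reusing the index client from send-records.js; the file path is just an example, while index.deleteObjects is part of the official algoliasearch client:

// load the Algolia objects saved during the previous scrape run
const previous = require('./scraped/bahmutov-book-quotes-algolia-objects.json')
const objectIDs = previous.map((record) => record.objectID)
index
  .deleteObjects(objectIDs)
  .then(() => {
    console.log('removed %d previous records', objectIDs.length)
  })
  .catch((err) => console.error(err))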

Alternative method: when adding the new presentation slides we can use the presentation slug as a tag. Then, before adding the new records, we can use the Delete By method to remove any records related to this presentation.

const algoliaObjects = records.map(scrapeToAlgoliaRecord).map((r) => {
  // add the same presentation slug to each record
  // this attribute will be very useful for deleting
  // all old records before scraping the presentation again
  r._tags = [slug]
  return r
})

Now we can delete all slides belonging to a specific presentation; see the delete-records.js script.

delete-records.js
const presentationSlug = 'bahmutov-book-quotes'

index
  .deleteBy({
    filters: presentationSlug,
  })
  .then(() => {
    console.log('deleted records with presentation "%s"', presentationSlug)
  })
  .catch((err) => console.error(err))

The search page

Let's confirm the search works by using a simple HTML page and InstantSearch.js. You can find the full page at index.html.

<div class="container">
  <div id="searchbox"></div>
  <div id="hits"></div>
</div>

const searchClient = algoliasearch(
  'MYPSC2284D', // public application ID
  '1d382a9c7cdfa0b2c13664c9a6c75b73', // search-only public API key
)

const search = instantsearch({
  indexName: 'quotes',
  searchClient,
})

search.addWidgets([
  instantsearch.widgets.searchBox({
    container: '#searchbox',
    placeholder: 'Search for book quotes',
  }),

  instantsearch.widgets.hits({
    container: '#hits',
    templates: {
      item(hit) {
        console.log(hit)
        return `<p>${hit.content} - <a href="${hit.url}">${hit.url}</a></p>`
      },
    },
  }),
])

search.start()

Searching the index for "tale" brings two hits

Scraping any URL

Finally, I have refactored the code to make it portable: it can scrape any Reveal.js deck by just pointing at it via the CYPRESS_baseUrl environment variable and running Cypress headlessly. The scraped records are saved as a JSON file for inspection, and also uploaded to Algolia using the cypress/plugins/index.js code. We need to run Cypress with Algolia's app ID and the secret Admin API key to be able to upload the records after scraping.

$ CYPRESS_baseUrl=https://slides.com/bahmutov/slides-dark-mode \
  APPLICATION_ID=... ADMIN_API_KEY=... \
  npx cypress run --spec cypress/integration/spec.js
...
removing existing records for bahmutov-slides-dark-mode
adding 6 records
✓ scrapes (5280ms)

The scraped slides.com/bahmutov/slides-dark-mode has been added to the search index.

Scraped another deck

We can scrape multiple decks by calling Cypress with each URL via its NPM module API. You can find the full code at scrape-all.js.

scrape-all.js
const presentations = [...] // all presentation URLs
const cypress = require('cypress')

async function scrapePresentations(urls) {
  if (!urls.length) {
    return
  }

  const presentation = urls.shift()
  console.log(`Scraping ${presentation}`)
  await cypress.run({
    config: {
      baseUrl: presentation,
    },
    spec: 'cypress/integration/spec.js',
  })

  // scrape the rest of the presentations
  await scrapePresentations(urls)
}
scrapePresentations(presentations).then(() => console.log('all done'))

From now on, whenever we create another presentation and make it public, we should run the scrape job to make the deck searchable. You can see the search across my Cypress presentations tagged cypress-introduction and cypress.io at the cypress.tips/search page. Here is a typical search:

Searching across the slides

Nice!

See also