Apr 23 2021

Scraping the List of Presentation Slides

How to use Cypress to scrape the list of public presentations at slides.com

I love using slides.com to give my presentations; that is why you can see 150 of my public slide decks at slides.com/bahmutov. Lately I have run into a big problem: when someone asks me a question, or I need to explain a topic, I often know that I gave a presentation with the right content. But how do I find it? How do I find the right slide?

It is becoming an issue, so let's see what we can do. I have used documentation scraping very successfully before, so I know that if I can feed the text contents of the slide decks to Algolia, for example, I could quickly find the answers. Unfortunately, Slides.com does not expose an API to grab the slide text and URLs directly. Thus I need to scrape my slide decks myself. Let's do this!

The list of decks markup
Getting slide elements from the test
Use aliases
Save deck information into a file
Scrape the list periodically
Discussion

🎁 You can find the source code for this blog post in the bahmutov/scrape-slides repository.

The list of decks markup

First, we need to grab the list of all my public decks from "slides.com/bahmutov". The list of decks has very nice CSS classes, and by inspecting and trying them in the DevTools console we can find the right one '.decks.visible .deck.public':

We can select all public decks with a CSS selector

This selector returns 126 public decks. Can we grab the main properties of every deck from the DOM element, like the presentation's description, URL, etc? Yes! If you look at the properties of the DOM elements found, then you can locate the dataset property with everything I am interested in:

Presentation properties are stored in the dataset object
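
The keys in the dataset object come from the data-* attributes on the element: the browser drops the "data-" prefix and converts the dash-separated rest of the name to camelCase. Here is a minimal sketch that mimics this name mapping in plain JavaScript. The attribute names are assumptions for illustration; the actual markup on slides.com may differ.

```javascript
// Mimics how the DOM maps a data-* attribute name to a dataset key:
// drop the "data-" prefix, then turn each "-x" into "X".
// The attribute names used below are hypothetical examples.
const toDatasetKey = (attributeName) =>
  attributeName
    .replace(/^data-/, '')
    .replace(/-([a-z])/g, (match, letter) => letter.toUpperCase())

console.log(toDatasetKey('data-created-at'))
console.log(toDatasetKey('data-slug'))
```

Note that every value in the dataset object is a string, so dates and counts arrive as text.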

Getting slide elements from the test

Let's get the deck information using Cypress. Our configuration file is very bare-bones right now: we only use the baseUrl to visit the site directly.

cypress.json

{
  "fixturesFolder": false,
  "supportFile": false,
  "pluginsFile": false,
  "baseUrl": "https://slides.com/bahmutov"
}

Our first test grabs the decks using the selector we found:

cypress/integration/spec.js

/// <reference types="cypress" />
describe('Bahmutov slides', () => {
  it('has decks', () => {
    cy.visit('/')
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public').should('have.length.gt', 100)
  })
})

The test passes.

Decks are found during the test

Can we get the dataset property, let's say from the first presentation? Yes, by invoking the prop method of the jQuery wrapper returned by the cy.get command. Let's run just the second test:

cypress/integration/spec.js

/// <reference types="cypress" />
describe('Bahmutov slides', () => {
  it('has decks', () => {
    cy.visit('/')
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public').should('have.length.gt', 100)
  })

  it.only('has deck dataset', () => {
    cy.visit('/')
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public')
      .should('have.length.gt', 100)
      .first()
      .invoke('prop', 'dataset')
      .then((props) => cy.log(JSON.stringify(props)))
  })
})

The dataset from the first deck element

We are only interested in some properties from the dataset, so let's pick them using the Lodash library bundled with Cypress.

cypress/integration/spec.js

/// <reference types="cypress" />

/**
 * Picks only immutable (mostly) properties from the deck, like
 * when it was created (as UTC string), description, etc.
 * @param {object} dataset
 * @returns object
 */
const pickDeckProperties = (dataset) =>
  Cypress._.pick(dataset, [
    'createdAt',
    'description',
    'slug',
    'url',
    'username',
    'visibility',
  ])

describe('Bahmutov slides', () => {
  it('has decks', () => {
    cy.visit('/')
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public').should('have.length.gt', 100)
  })

  it.only('has deck dataset', () => {
    cy.visit('/')
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public')
      .should('have.length.gt', 100)
      .first()
      .invoke('prop', 'dataset')
      .then(pickDeckProperties)
      .then((props) => cy.log(JSON.stringify(props)))
  })
})

Beautiful.
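
If you are curious what the pick call does, here is a rough stand-in written in plain JavaScript. It is only a sketch of the behavior we rely on (copy the listed keys that exist, ignore the rest), not the Lodash implementation, and the sample dataset with its viewCount key is made up.

```javascript
// A plain JavaScript stand-in for Cypress._.pick (Lodash pick):
// copies only the listed keys that exist on the source object.
const pick = (source, keys) =>
  Object.fromEntries(
    keys.filter((key) => key in source).map((key) => [key, source[key]]),
  )

// hypothetical dataset with one extra key we do not want to keep
const dataset = {
  slug: 'no-excuses',
  url: '/bahmutov/no-excuses',
  viewCount: '10',
}
// "visibility" is requested but missing, so it is simply skipped
const picked = pick(dataset, ['slug', 'url', 'visibility'])
console.log(picked)
```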

Use aliases

Let's take a second to refactor our spec file. Every test needs the page, and every test needs the list of presentation DOM elements. We can visit the page before each test, or even visit it once using the before hook and have all tests work after that:

cypress/integration/spec.js

/// <reference types="cypress" />

/**
 * Picks only immutable (mostly) properties from the deck, like
 * when it was created (as UTC string), description, etc.
 * @param {object} dataset
 * @returns object
 */
const pickDeckProperties = (dataset) =>
  Cypress._.pick(dataset, [
    'createdAt',
    'description',
    'slug',
    'url',
    'username',
    'visibility',
  ])

describe('Bahmutov slides', () => {
  before(() => {
    cy.visit('/')
  })

  it('has decks', () => {
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public').should('have.length.gt', 100)
  })

  it('has deck dataset', () => {
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public')
      .should('have.length.gt', 100)
      .first()
      .invoke('prop', 'dataset')
      .then(pickDeckProperties)
      .then((props) => cy.log(JSON.stringify(props)))
  })
})

Hmm, every test starts with getting the list of deck elements. Can we move the cy.get command to be with cy.visit and save the result into an alias?

// 🔥 THIS WILL NOT WORK, JUST A DEMO
before(() => {
  cy.visit('/')
  // there are a lot of slide decks
  cy.get('.decks.visible .deck.public').as('decks')
})

it('has decks', () => {
  cy.get('@decks').should('have.length.gt', 100)
})

it('has deck dataset', () => {
  cy.get('@decks')
    .first()
    ...
})

Unfortunately the above code DOES NOT WORK because aliases are reset before each test, see the Variables and Aliases guide for details. Instead we can visit the page once, and then save the alias before each test by using both before and beforeEach hooks:

cypress/integration/spec.js

describe('Bahmutov slides', () => {
  before(() => {
    cy.visit('/')
  })

  beforeEach(() => {
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public')
      .should('have.length.gt', 100)
      .as('decks')
  })

  it('has decks', () => {
    cy.get('@decks').should('have.length.gt', 100)
  })

  it('has deck dataset', () => {
    cy.get('@decks')
      .first()
      ...
  })
})

Save deck information into a file

Now let's grab the dataset property from each found deck element, and then save the result into a JSON file. I will omit the first two tests we have already written; this is the test that writes the file using cy.writeFile:

/// <reference types="cypress" />

/**
 * Picks only immutable (mostly) properties from the deck, like
 * when it was created (as UTC string), description, etc.
 * @param {object} dataset
 * @returns object
 */
const pickDeckProperties = (dataset) =>
  Cypress._.pick(dataset, [
    'createdAt',
    'description',
    'slug',
    'url',
    'username',
    'visibility',
  ])

const getDeckProperties = (deck$) => {
  const dataset = deck$.prop('dataset')
  return pickDeckProperties(dataset)
}

describe('Bahmutov slides', () => {
  before(() => {
    cy.visit('/')
  })

  // grab all decks before each test because the aliases
  // are reset before every test
  beforeEach(() => {
    // there are a lot of slide decks
    cy.get('.decks.visible .deck.public')
      .should('have.length.gt', 100)
      .as('decks')
  })

  it('saves all deck props', () => {
    const decks = []
    cy.get('@decks')
      .each((deck$) => {
        const deckProps = getDeckProperties(deck$)
        decks.push(deckProps)
      })
      .then(() => {
        cy.writeFile('decks.json', decks)
      })
  })
})

Notice how we iterate over the DOM elements, saving the extracted and cleaned up dataset objects in an array to be saved later. The saved file decks.json can be found at the root of the project:

decks.json

[
  {
    "createdAt": "2021-04-09 19:31:53 UTC",
    "description": "In this presentation, Gleb will show how every commit and every pull request can run the full set of realistic end-to-end tests, ensuring the web application is going to work for the user. He will look at the modern CI setup, benefits of clean data environments, and parallelization speed-ups. Anyone looking to learn how awesome the modern automated testing pipeline can be would benefit from this presentation. Presented at BrightTALK 2021",
    "slug": "no-excuses",
    "url": "/bahmutov/no-excuses",
    "username": "bahmutov",
    "visibility": "all"
  },
  {
    "createdAt": "2021-04-01 17:19:27 UTC",
    "description": " Keeping the documentation up-to-date with the web application is hard. The screenshots and the videos showing the user how to perform some task quickly fall out of sync with the latest design and logic changes. In this presentation, I will show how to use end-to-end tests to generate the documentation. By keeping the tests in sync with the application, and by running them on every commit, we will update the documentation, ensuring our users never get confused by the obsolete docs. Presented at TestingStage 2021, video at https://youtu.be/H9VqsTZ9NME",
    "slug": "tests-are-docs",
    "url": "/bahmutov/tests-are-docs",
    "username": "bahmutov",
    "visibility": "all"
  },
  ...
]

Super.
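
Notice that the saved url values are relative to the site. Any later step that needs to visit a deck would have to resolve them first. Here is a small sketch of one way to do that with the standard URL class, assuming the decks live under https://slides.com; the two records are trimmed copies of the entries above.

```javascript
// Sample records shaped like the entries in decks.json
const decks = [
  { slug: 'no-excuses', url: '/bahmutov/no-excuses' },
  { slug: 'tests-are-docs', url: '/bahmutov/tests-are-docs' },
]

// Resolve each relative deck url against the site origin
// (the origin is an assumption based on the baseUrl used in this post)
const toAbsoluteUrl = (deck) => new URL(deck.url, 'https://slides.com').href

const links = decks.map(toAbsoluteUrl)
console.log(links)
```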

Scrape the list periodically

Before we get into the presentation text search, we need to make sure we can run our list scraping operation periodically. Since our decks.json file can be checked into source control, let's use GitHub Actions to run our Cypress tests. GitHub Actions have convenient access to the repo and can push any changed files back to it; see my blog post Trying GitHub Actions for details.

name: scrape
on:
  schedule:
    - cron: '0 3 * * *'
jobs:
  cypress-run:
    runs-on: ubuntu-20.04
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      # https://github.com/cypress-io/github-action
      - name: Cypress run
        uses: cypress-io/github-action@v1
        with:
          record: true
        env:
          # pass the Dashboard record key as an environment variable
          CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
          # pass GitHub token to allow accurately detecting
          # a build vs a re-run build
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      # if the decks.json file has been updated by the test
      # commit and push it to the repo
      - name: Commit decks.json if changed 💾
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: Updated decks.json file
          branch: main
          file_pattern: 'decks.json'

Now every night the decks.json will be recreated - and if it changed, then the updated file will be pushed back into the repository.
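
The workflow only tells us that the file changed, not what changed. If you want to know which decks appeared since the previous run, you could compare the two lists by slug. This is just a sketch and not part of the repository: the findNewSlugs helper and both sample snapshots are hypothetical.

```javascript
// Returns the slugs present in the current list but not in the previous one.
const findNewSlugs = (previousDecks, currentDecks) => {
  const known = new Set(previousDecks.map((deck) => deck.slug))
  return currentDecks
    .map((deck) => deck.slug)
    .filter((slug) => !known.has(slug))
}

// hypothetical before / after snapshots of decks.json
const previous = [{ slug: 'tests-are-docs' }]
const current = [{ slug: 'no-excuses' }, { slug: 'tests-are-docs' }]
console.log(findNewSlugs(previous, current))
```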

Discussion

This is just the start: we are scraping the list of presentations as the first step to scraping each presentation's content. By using Cypress to scrape, we can see what the algorithm does at each step. If something fails during scraping, we can inspect the screenshots and videos to determine what has changed. Follow this blog to read the future posts where we will look at each presentation and how to scrape its content.

For more information, see these blog posts and presentations

presentation Test-Driven Documentation
presentation Testing Your Documentation Search
presentation Find Me If You Can
blog post Scrape Static Site with Algolia
blog post Search across my blog posts and github projects

Better world by better software

Gleb Bahmutov PhD

