In the previous blog post Scrape Static Site with Algolia, I showed how to scrape a static site to make it instantly searchable. After each deploy, you run the scraper, which replaces the entire Algolia index with new content. That might work for smaller sites, but it quickly runs into Algolia usage limits as the site scales up. I have an Algolia index for all my Cypress blog posts, and scraping every blog post again and again generated way too many record operations, hitting the 10k limit on my free Algolia plan.
The worst part was that when I publish a new Cypress blog post, the rest of the posts stay unchanged, so all that scraping just replaces the existing search records with identical ones. We need to devise a way to scrape only the new and the changed blog posts. This is what I call "incremental" scraping.
Text records
Another problem one can run into is the number of records created per blog post. At first, I used a CSS selector that returns all paragraphs, list items, and even the code comments to create individual Algolia records:
```css
.article .article-inner .article-entry p,
```
For a typical blog post like Email Cypress Test Report the above selector returns 20 text records.
In addition to the text records, we scrape the H1, H2, and the blog description, creating hierarchical Algolia records. All this means that a typical post on my blog generates 25-80 Algolia records. Multiply that by the number of Cypress posts I have written over the years (180+ as of this writing), and each scraping session might use up 9k Algolia operations (roughly 180 posts times 50 records each). For comparison: the free Algolia monthly plan has a limit of 10k operations, and we are hitting it in a single scrape!
Changing the text records
After consulting with the Algolia engineers, I decided to change how the scraping records are formed. Instead of taking the individual `P`, `LI`, and `.comment` elements and creating a record for each one, I combine them all into a single text record. After all, you cannot individually target a `P` record. My blog posts only have anchor links for the header elements, thus all the `P`, `LI`, and other elements between two headers can become a single Algolia text record tied to the header anchor.
Unfortunately, the default Algolia scraper code does not allow merging the text records into one large record before uploading, so I needed to implement something myself. Luckily, Cypress can load the page and process it in every possible way. Time to write our own scraper.
Scraper inside Cypress
🎁 You can find my source code used to incrementally scrape the blog posts in the repository bahmutov/scrape-blog-post-page.
Using the `cy.get` command we grab the individual elements, including the headers. Then we combine them into larger text records until we see a header element, at which point we start a new text record. This is done by the Cypress code in the cypress/integration/utils.js file:
```js
export function hasAnchor($el) {
```
Whenever we see a header element with an anchor, we start a new text record. All the individual `P` and `.comment` elements after that are appended as text, forming one large chunk of text (the code later checks the total text record size; Algolia recommends keeping records below 10 KB, or 100 KB depending on the plan). Thus instead of 20 individual text records, the "Email Cypress Test Report" blog post will have just 3 text records plus a few header records.
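To make the merging described above concrete, here is a minimal sketch of how the `hasAnchor` helper and the merging loop could look. The `a.article-anchor` selector and the record field names are my assumptions for illustration; the actual implementation lives in cypress/integration/utils.js in the repository.

```js
// minimal sketch of the merging logic (not the exact code from cypress/integration/utils.js)
// assumption: Hexo renders header anchors as <a class="article-anchor"> inside H2/H3 elements
export function hasAnchor($el) {
  return $el.find('a.article-anchor').length > 0
}

// walk the scraped elements in document order, starting a new text record at every header
export function elementsToRecords($elements) {
  const records = []
  let current = null

  $elements.each((k, el) => {
    const $el = Cypress.$(el)
    if (hasAnchor($el)) {
      // a header with an anchor starts a new text record
      current = {
        title: $el.text().trim(),
        anchor: $el.find('a.article-anchor').attr('href'),
        text: '',
      }
      records.push(current)
    } else if (current) {
      // paragraphs, list items, and code comments are appended to the current record
      current.text += ' ' + $el.text().trim()
    }
  })

  return records
}
```

The spec can then call a function like this on the elements returned by `cy.get` and verify that each record's text stays under the Algolia size limit before uploading.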
A typical Algolia record has one or several paragraphs of text, an anchor, and the full URL that uses the anchor to direct the user straight to the right place. Here is a user searching for a part of the text above. You can try searching yourself by going to cypress.tips/search.
When the user clicks on the search result, they are directed to the section's URL.
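For illustration, a merged text record could look roughly like this; the field names and values below are hypothetical placeholders, and the real record shape is defined by the scraper code in the repository.

```js
// hypothetical example of one merged text record (all field names and values are placeholders)
const record = {
  objectID: 'some-post#section-anchor', // placeholder unique id per post section
  title: 'Section header text',
  anchor: '#section-anchor',
  url: 'https://glebbahmutov.com/blog/some-post/#section-anchor', // full URL with the anchor
  text: 'Several paragraphs, list items, and code comments merged into one chunk of text...',
}
```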
I use Cypress to scrape, even if something like cheerio.js would be faster. I can see and debug the scraper much better using the Cypress GUI: I can inspect each found DOM element (including in the DevTools Elements panel), step through the code, and save the intermediate records, all to understand what the scraping algorithm is doing. If I wanted, I could later transfer the scraper logic into a Node script using cheerio.js. But as you will see in the next section, there is no need to optimize the speed of the scraper at the expense of the debugging experience, because we scrape very few posts at a time.
The merged text and header records are then uploaded to Algolia using code inside the Cypress plugins file:
```js
const algoliasearch = require('algoliasearch')
```
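Here is a rough sketch of what that upload could look like with the algoliasearch client. The environment variable names and the index name are my assumptions, not necessarily what the repository uses.

```js
// sketch: upload the merged records from the Cypress plugins file (env variable names are assumptions)
const algoliasearch = require('algoliasearch')

async function uploadRecords(records) {
  const client = algoliasearch(
    process.env.ALGOLIA_APP_ID,
    process.env.ALGOLIA_ADMIN_API_KEY,
  )
  const index = client.initIndex('blog-posts') // hypothetical index name

  // saveObjects adds or replaces records; each record counts as one operation
  const { objectIDs } = await index.saveObjects(records, {
    autoGenerateObjectIDIfNotExist: true,
  })
  console.log('uploaded %d records', objectIDs.length)
}

module.exports = { uploadRecords }
```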
Note that if a blog post has been edited, we need to remove its existing records first, which I do by using the post slug:
```js
// take the last part of the url which is the post name
```
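Here is a sketch of that cleanup step, assuming each record stores the post slug in a `slug` attribute that is configured as a filter-only facet (both the attribute name and the facet setup are my assumptions):

```js
// sketch: remove previously uploaded records for an edited post before re-uploading it
function getSlug(url) {
  // take the last part of the url which is the post name
  // e.g. https://glebbahmutov.com/blog/some-post/ -> "some-post"
  return url.split('/').filter(Boolean).pop()
}

async function removeExistingRecords(index, url) {
  const slug = getSlug(url)
  // deleteBy requires "slug" to be listed in attributesForFaceting
  await index.deleteBy({ filters: `slug:${slug}` })
}
```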
Now we just need to make sure we only scrape the changed and the new blog posts.
Incremental scraping
On my Hexo blog, every published blog post has its "lastmod" date, which you can find in the sitemap.xml file.
We can get the list of Cypress blog posts from the /tags/cypress/ page.
Tip: if you use pagination in a Hexo blog, the "tag" page only shows the first N blog posts for that tag. I removed this limit by cloning the pagination plugin; you can find my fork at bahmutov/hexo-generator-gleb.
Getting the list of URLs from the tag page is simple to do using cheerio.js in get-post-urls.js:
```js
const cheerio = require('cheerio')
```
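Here is a minimal sketch of what such a script could look like. The tag-page URL, the use of got for fetching, and the `a.article-title` selector are my assumptions about the blog's markup, not the exact code from get-post-urls.js.

```js
// sketch of get-post-urls.js: collect the Cypress post URLs from the tag page
const got = require('got')
const cheerio = require('cheerio')

async function getPostUrls() {
  const html = await got('https://glebbahmutov.com/blog/tags/cypress/').text()
  const $ = cheerio.load(html)

  // assumption: each post link on the tag page has the class "article-title"
  const urls = []
  $('a.article-title').each((k, el) => {
    urls.push(new URL($(el).attr('href'), 'https://glebbahmutov.com').href)
  })
  return urls
}

module.exports = { getPostUrls }
```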
Any time we want to get the list of blog posts to scrape, we fetch the sitemap and parse it into URLs and their last modified dates. At the same time, we get the Cypress post URLs from the tag page and intersect the two lists.
```js
const fs = require('fs')
```
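A sketch of that step, assuming the standard sitemap format with `<loc>` and `<lastmod>` elements. The regular-expression parsing, the output filename, and the reuse of the hypothetical `getPostUrls` helper from the previous sketch are illustration choices, not necessarily what the repository does.

```js
// sketch: parse sitemap.xml into { url, lastmod } pairs and keep only the Cypress posts
const fs = require('fs')
const got = require('got')
const { getPostUrls } = require('./get-post-urls')

async function getCypressPostsWithDates() {
  const xml = await got('https://glebbahmutov.com/blog/sitemap.xml').text()

  // every <url> entry in the sitemap has <loc> and <lastmod> children
  const entries = [...xml.matchAll(
    /<url>\s*<loc>(.+?)<\/loc>\s*<lastmod>(.+?)<\/lastmod>/g,
  )].map(([, url, lastmod]) => ({ url, lastmod }))

  const cypressUrls = new Set(await getPostUrls())
  const cypressPosts = entries.filter((entry) => cypressUrls.has(entry.url))

  // save the intermediate result so we can inspect it and continue from a known state
  fs.writeFileSync('cypress-posts.json', JSON.stringify(cypressPosts, null, 2))
  return cypressPosts
}
```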
I like saving the intermediate results as JSON files, because that lets me inspect the data and continue from a known state. Now we need to decide for each URL whether it needs scraping. At first, I tried to use Algolia to tell me the scraped timestamps, but later decided to simplify the logic and just keep a database of records and their scrape timestamps. I created the `was-it-scraped` NPM module to abstract away tracking which records were scraped already. Under the hood it uses an external Supabase database, but you could use a local JSON file as well. Now we can write a script to filter all the found Cypress blog posts and only leave the ones that need scraping:
```js
const fs = require('fs')
```
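A sketch of that filter. I am assuming `was-it-scraped` exposes a check like `shouldScrape(url, modifiedDate)`; see the module's README for its exact API. The JSON filenames carry over from the previous sketch and are also illustration choices.

```js
// sketch: keep only the posts that were modified after they were last scraped
// assumption: was-it-scraped exposes a shouldScrape(url, modifiedDate) check
const fs = require('fs')
const { shouldScrape } = require('was-it-scraped')

async function filterPostsToScrape() {
  const posts = JSON.parse(fs.readFileSync('cypress-posts.json', 'utf8'))

  const needScraping = []
  for (const { url, lastmod } of posts) {
    if (await shouldScrape(url, new Date(lastmod))) {
      needScraping.push(url)
    }
  }

  fs.writeFileSync('posts-to-scrape.json', JSON.stringify(needScraping, null, 2))
  console.log('%d of %d posts need scraping', needScraping.length, posts.length)
}
```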
A typical run quickly goes through hundreds of URLs and finds only the new and modified blog posts.
In the above run, only a single blog post URL will require scraping.
The last Node script goes through the list of URLs to scrape and fires up Cypress via its NPM module API. After scraping, it marks the last scraped timestamp in the database for those blog posts using the `was-it-scraped` module and its `markScraped` function.
```js
const { markScraped } = require('was-it-scraped')
```
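A sketch of that driver script. The spec filename, passing the target URL to the test via a Cypress environment variable, and calling `markScraped` with just the URL are my assumptions about how the pieces fit together.

```js
// sketch: scrape each pending URL with Cypress, then record the scrape timestamp
const fs = require('fs')
const cypress = require('cypress')
const { markScraped } = require('was-it-scraped')

async function scrapePendingPosts() {
  const urls = JSON.parse(fs.readFileSync('posts-to-scrape.json', 'utf8'))

  for (const url of urls) {
    // assumption: the scraper spec reads the target URL from Cypress.env('url')
    const results = await cypress.run({
      spec: 'cypress/integration/scrape.js', // hypothetical spec filename
      env: { url },
    })
    if (results.totalFailed === 0) {
      // only mark the post as scraped if the scraping spec passed
      await markScraped(url)
    }
  }
}

scrapePendingPosts()
```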
Continuous incremental scraping
We can do the scraping locally, but a more consistent way is to let CI run the scraper every night. I am using GitHub Actions to call the above scripts; see the code in the .github/workflows/scrape.yml file.
```yaml
name: Scrape
```
The above workflow is fast. For example, a recent run with one blog post to scrape took 35 seconds.
When scraping the blog post, Cypress outputs the main messages about the scraping progress.
Even this blog post will be scraped automatically, as it is tagged "cypress" too. And here it is - scraped by the CI 🎉
See also
- Scrape Static Site with Algolia is a good introduction to scraping
- Scrape Slides shows how I scrape my [slides.com/bahmutov](https://slides.com/bahmutov) presentations using an approach similar to this one
- I scrape my YouTube Cypress Tips & Tricks playlist using the code in bahmutov/scrape-youtube-videos