Build a web-scraped time-series application with AWS CDK in TypeScript — Part 4.2

Build scrapers with CDK: Create our scraper

Previous: Part 4.1. Build scrapers with CDK: Setup bundling with webpack and create our first lambda in TypeScript

It’s time to start scraping!

This project will focus on building web scrapers with TypeScript (or, more accurately, JavaScript). I know it is more common to do this in other languages like Python, which is exactly why I would like to dig into the possibilities and difficulties of using JavaScript.

Create the record type

First, let’s create the IndexPriceRecord type we drafted in the previous section:

// src/models/IndexPriceRecord.type.ts
interface IndexPriceRecord {
  symbol: string; // Basically an ID
  name: string;
  lastPrice: number;
  change: number;
  changeRate: number; // in %
  time: string; // Let's use ISO timestamp
}
export default IndexPriceRecord
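
To make the shape concrete, here is what one scraped row might look like (the values below are made up purely for illustration):

// Illustrative only — not real market data
const example: IndexPriceRecord = {
  symbol: '^GSPC',
  name: 'S&P 500',
  lastPrice: 4000,
  change: -12.34,
  changeRate: -0.31, // in %
  time: new Date().toISOString(),
}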

Create the scraper

We will use a helper from the library simply-utils, which does the Puppeteer Chromium instance launching work for us and lets us focus on the scraping logic.

// src/services/cron/handlers/scrape.ts
import { ScheduledHandler } from 'aws-lambda'
import launchPuppeteerBrowserSession, { GetDataWithPage } from 'simply-utils/dist/scraping/launchPuppeteerBrowserSession'
import IndexPriceRecord from 'src/models/IndexPriceRecord.type'

const getDataFromPage: GetDataWithPage<IndexPriceRecord[]> = async page => {
  // Our scraping logic goes here
}

export const handler: ScheduledHandler = async () => {
  const results = await launchPuppeteerBrowserSession([getDataFromPage])
  console.log(JSON.stringify(results, null, 2))
}

What launchPuppeteerBrowserSession does is launch the Chromium instance using the chrome-aws-lambda library and hand us a puppeteer.Page object. The page object is like a cursor pointing at the page you are currently on in the browser session; you can use it to navigate between pages.

So all we need to do is return our scraped records from the getDataFromPage function. launchPuppeteerBrowserSession accepts an array of such functions and returns an array of their results in the same order.
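
For example, imagine a hypothetical second extractor (getPageTitle is not part of this project, and the exact result typing in simply-utils may differ; this is just to illustrate the ordering):

const getPageTitle: GetDataWithPage<string> = async page => page.title()

const results = await launchPuppeteerBrowserSession([getDataFromPage, getPageTitle])
// results[0] holds the IndexPriceRecord[] from getDataFromPage,
// results[1] holds the page title string from getPageTitle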

Our Scraping Logic with the DOM

We have to utilize the page object in our getDataFromPage:

// src/services/cron/handlers/scrape.ts
...
const PAGE_URL = 'https://finance.yahoo.com/world-indices/'

const getDataFromPage: GetDataWithPage<IndexPriceRecord[]> = async page => {
  // Navigate to the page we want to scrape
  await page.goto(PAGE_URL)
  // Wait for some element to be rendered completely
  await page.waitForSelector('.yfinlist-table > tbody > tr:last-child > td:last-child')
  return page.evaluate((): IndexPriceRecord[] => {
    // Perform normal DOM queries like in a normal browser
  })
}
...

Here we first navigate to the page we want to scrape. Then we wait for the last cell of the table to be rendered, since there is a chance the elements are rendered by client-side JavaScript.
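
If the table renders slowly, page.waitForSelector will eventually throw (Puppeteer's default timeout is 30 seconds). Here is a minimal sketch of surfacing a clearer error in that case, assuming we simply want the lambda invocation to fail loudly:

try {
  await page.waitForSelector('.yfinlist-table > tbody > tr:last-child > td:last-child', { timeout: 30000 })
} catch (err) {
  // The selector never appeared; the page layout may have changed
  throw new Error(`Timed out waiting for the index table at ${PAGE_URL}`)
}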

Okay, let’s finally hook up our scraping logic:

return page.evaluate((): IndexPriceRecord[] => {
  const tableRows: NodeListOf<HTMLTableRowElement> = document
    .querySelectorAll('.yfinlist-table > tbody > tr')

  /**
   * We have to put this helper inside the
   * callback function of page.evaluate,
   * because it needs to be in the virtual client-side scope.
   */
  const parseNum = (text: string): number => parseFloat(text.replace(/\+|,|%/g, ''))

  return Array.from(tableRows).map(tableRow => {
    // Tell TypeScript that it is a list of data cells
    const cols = tableRow.children as HTMLCollectionOf<HTMLTableDataCellElement>
    const symbol = cols[0].innerText
    const name = cols[1].innerText
    const lastPrice = parseNum(cols[2].innerText)
    const change = parseNum(cols[3].innerText)
    const changeRate = parseNum(cols[4].innerText)
    const time = new Date().toISOString()
    return {
      symbol,
      name,
      lastPrice,
      change,
      changeRate,
      time,
    }
  })
})
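
Note how the parseNum helper strips the leading plus sign, thousands separators, and percent signs before parsing, so the raw cell text parses into plain numbers:

parseNum('+1,234.56') // 1234.56
parseNum('-0.31%') // -0.31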

Now you should be able to scrape an array of IndexPriceRecord items!

Next: Part 5. Create the create-table / update-table handler (COMING SOON)
