...

Web Scraping Tutorial

Web scraping is an important tool for aggregating data from the Internet. It is true that gathering data from curated datasets and APIs is always the recommended route. However, given that a lot of data is available online for free and APIs can be costly, having a scraping script of your own may come in handy.

With an API, you just read the documentation, authenticate with your token, and send your request through a few simple lines of code; the data comes out clean in JSON format. Web scraping, on the other hand, is the “Rambo” of the situation: you go out into the jungle, hide in the bushes and catch your prey. Jokes aside, it is free, since it gets the data directly from the browser and automates operations you would otherwise do by hand: copying, pasting, selecting data, clicking “next page”, hovering, and getting past cookie banners and captchas.

While learning to web scrape may be slightly more demanding than reading and implementing an API’s documentation, it pays off in the long run, since every API has its own documentation to read and learn. The downside is that when a website changes its formatting, the scraping algorithm has to be updated to remain effective.

Ok so let’s begin.

Planning

For this example, I want to extract all the values in the table located on this website: Milano area, company reports (https://reportaziende.it/Milano)

It is simple, publicly available information about companies in Milan: their revenues and locations. Scraping does not unlock secret or protected data; it just helps you collect information that is already public, faster.

If the website used in this example is not available when you try this tutorial, you can follow along with any other website that contains a table element (<table></table>) in its HTML, or better yet, a website you already need data from.

Our final algorithm should do the following:

  • Extract all the information from the main table.
  • Hover over each row so that the dynamic table on the right updates, and extract that data as well for every row.
  • Click on the “Next page” button, wait until the table updates and repeat extraction until the last page.
  • Save everything in a spreadsheet format.

Step 0. Setup

For this project, we will be using the “Puppeteer” package for Node.js, since it is one of the most capable scraping packages: it lets us run a browser instance in the background and scrape dynamically updated pages, which are the norm nowadays. Other advantages of using Node.js are its asynchronous features, which let us run multiple operations concurrently, and the fact that JavaScript has become a mainstream language for full-stack developers.

If a website does not use any dynamic rendering (e.g. AJAX), you can use simpler packages like Cheerio. However, most websites nowadays are not static, so we use Puppeteer: it emulates a browser instance and can wait for and interact with the page the way a human would.
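
As a point of comparison, here is a minimal sketch of static scraping with Cheerio. It assumes a hypothetical page whose table is already present in the raw HTML (no JavaScript rendering) and uses the fetch built into Node 18+; the URL and selectors are illustrative only.

// Minimal static-scraping sketch with Cheerio (requires: npm install cheerio)
const cheerio = require('cheerio');

(async () => {
  // fetch is built into Node 18+; the URL below is a placeholder
  const html = await (await fetch('https://example.com/static-table-page')).text();
  const $ = cheerio.load(html);

  // collect the text of every cell, row by row
  const rows = [];
  $('table tbody tr').each((_, tr) => {
    rows.push($(tr).find('td').map((_, td) => $(td).text().trim()).get());
  });
  console.log(rows);
})();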

First of all, we create a new folder; I’ll name it “scraper”. Then I open it with my code editor (for simplicity, I use Visual Studio Code) and initialize a simple Node.js project by entering the following in the VS Code terminal:

npm init -y

If the command does not work, you probably have not installed VS Code or Node.js yet; please do so by following the links.
Let’s resume. We install the packages we’ll need for the task: puppeteer and xlsx (for saving data in Excel format). We will also use fs (to interact with system files), which ships with Node.js and does not need to be installed.

npm install puppeteer xlsx

Now that we have a Node.js project, let’s create a JavaScript file: create a new file in VS Code and name it script.js. Once it is created, import the modules required for this project with the following lines (no longer in the terminal, but in the .js file):

const puppeteer = require('puppeteer');
const fs = require('fs');
const XLSX = require('xlsx');

You should always check a website’s rules for automated requests, located at www.domain.com/robots.txt; in our specific case: https://www.reportaziende.it/robots.txt. There are no rate limits specified there, only “Disallow:” rules, which are instructions telling crawlers such as Google’s not to crawl and index the listed pages. So, green light: we can use this website as the example for our algorithm.

In any case, it is important to always be respectful and keep your requests to roughly one per second so as not to overload the servers.
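
If you want to check a robots.txt programmatically, a quick sketch like the following (again using the fetch built into Node 18+) prints the file so you can read the rules before scraping:

// Print a site's robots.txt so you can review its crawling rules
(async () => {
  const res = await fetch('https://www.reportaziende.it/robots.txt');
  console.log(await res.text());
})();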

For troubleshooting, ChatGPT can provide a quick fix. However, given its training cutoff, some answers will not be up to date (a lot can happen in software development in one or two years), so I also recommend searching the error message online to find a more recent solution.

Step 1. Scraping the Data from the Table

By checking the website’s HTML with your browser’s “inspect” function, you can see that the table is structured as follows:

<table>
  <thead>                 <!-- This tag holds the headers of the table -->
    <tr>                  <!-- tr holds a row of elements -->
      <th></th>           <!-- th holds one column label -->
      <th></th> <th></th> <th></th> <th></th>
    </tr>
  </thead>
  <tbody>                 <!-- This tag holds the rows with the elements of the table -->
    <tr>
      <td></td>           <!-- td holds one element -->
      <td></td>
      <td></td>
      <td></td>
      <td></td>
    </tr>

    <!-- ... 24 more rows like the one above ... -->
  </tbody>
</table>

We need our Puppeteer script to open the website and gather this data; we will use these tags as anchors to select and scrape the elements.
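
Before automating anything, you can sanity-check these selectors by pasting a couple of lines into your browser’s DevTools console on the page itself (optional, and not part of the script):

// Run in the browser console: preview the headers and the first data row of the table
const headers = Array.from(document.querySelectorAll('thead tr th')).map(th => th.textContent.trim());
const firstRow = Array.from(document.querySelector('tbody tr').querySelectorAll('td')).map(td => td.textContent.trim());
console.log(headers, firstRow);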

I will add comments (using JavaScript’s “//” notation) to make it more understandable.
In this snippet, we start the browser, go to the page and take a screenshot to check whether we made it.

// asynchronous function (an async IIFE that runs as soon as it is defined)
(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });  // puppeteer starts the browser
  const page = await browser.newPage();                         // puppeteer opens a new page (tab)

  // "await" tells the script to wait for the operation to finish before executing the next line
  await page.goto(`https://reportaziende.it/Milano`);        // navigate puppeteer to our URL
  await page.screenshot({ path: `example.png` });            // take a screenshot

  // We will write the rest of the script HERE

})();

As often happens, things do not go the way we want. Cookie banner!

Getting it out of the way has an easy solution: automate clicking the “Accept All” button. By inspecting the HTML, we learn that this button has “pt-accept-all” as its id.

// Select the "Accept All" button using the CSS selector
const acceptButton = await page.waitForSelector("#pt-accept-all", { timeout: 3000 }).catch(() => null);             // IF waitForSelector does not find it in 3seconds, no errors

if (acceptButton) {              // If there is a banner with a button, it clicks on it                    
  await acceptButton.click();
} else {
  console.log('No cookie banner found, continuing...');
}

Ok, so now I want to loop through the table and extract the elements in a variable called tableData:

// Extract elements inside table
// .evaluate makes the code run within the context of the puppeteer browser page
const tableData = await page.evaluate(() => {

  // document is the DOM (the HTML code) and querySelector method finds what's in the tags
  // this selects the <tr></tr> element 
  const headerRow = document.querySelector('thead tr');      
  // it creates an array selecting the text within the <th></th> objects
  const headers = Array.from(headerRow.querySelectorAll('th')).map(th => th.textContent.trim());    // .map loops through the 'th' elements and applies formatting
  
  // in the same way as before, we select the rows in the body of the table
  const rows = Array.from(document.querySelectorAll('tbody tr'));
  
  // this returns all the data in the rows as the tableData object
  return rows.map(row => {
    const rowData = {};              // for each row an empty dictionary is created
    const columns = row.querySelectorAll('td');    // each element in the grid is selected
    
    // pairs headers as "keys" and cell contents as "values"
    columns.forEach((column, index) => {
      rowData[headers[index]] = column.textContent.trim();
    });
   
    // each rowData is formatted as a dictionary and tableData is an array of dictionaries
    return rowData;
  });
});

So far, so good: we now have a script that gets past a cookie banner and extracts values from a page. Sometimes the page takes a moment to load, so it is good practice to wait a couple of seconds before proceeding any further. We can do that by adding this function:

// waits for specified amount of milliseconds
function sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// within the async function we use it as follows
await sleep(3000);
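
A fixed sleep works, but when you know which element you are waiting for, Puppeteer’s built-in waitForSelector is a more robust alternative, since it resolves as soon as the element appears:

// wait until the table rows are present (up to 10 seconds) instead of a fixed delay
await page.waitForSelector('tbody tr', { timeout: 10000 });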

Adding a console.log(rowData) just before the return will output each row in this format if everything is correct:

{
 "#":"",
 "Azienda":"ENGIE ITALIA SPA",
 "Città":"Milano",
 "Prov.":"MI",
 "Fatturato":"€ 13'686'802'590"
}

You could have done this manually with a click, a copy/paste and a bit of formatting. So what’s the point? Before spiralling into nihilism and self-loathing, let’s remember this has to give us a concrete benefit. In this case, it is reducing our manual work and getting the data we need quickly and automatically.

Step 2. Collect the Data from all the Tables

As we can see, there are 40 pages of 25 rows each to copy, and the way to move from one page to the next is by clicking the “Prossimo” button, which means “Next” in Italian. So the obvious conclusion is to automate the click of this button and repeat the scraping for every page.

I will loop through the pages; the loop will contain the scraping algorithm we have already seen, a line of code to stack the tables together, and a function that clicks the next button to take us to the next table. This is that function:

// Asynchronous function that clicks on the Next button
async function clickNext(page) {
  const nextButtonXPath = `//*[starts-with(text(), "Prossimo")]`;    // XPath of the button

  // .$x() tells puppeteer to return the elements matching an XPath (as an array)
  const nextButton = await page.$x(nextButtonXPath);
  await nextButton[0].click();                                       // click the first match
}

So far we have used CSS selectors, but in this case we want to select one single, unique item, and an XPath makes that easier. So what is an XPath? It is an XML Path expression; we use it here because it lets us select elements by criteria a CSS selector cannot express, such as their text content. The drawback is that it is more complex, and some browsers may process it more slowly.
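
One caveat about your setup: newer Puppeteer releases have deprecated and, in the most recent versions, removed page.$x. If your version no longer has it, the same XPath expression can be passed to the regular selector methods through the xpath/ prefix; something along these lines should work:

// Alternative for newer Puppeteer versions without page.$x:
// the same XPath can be used through the xpath/ selector prefix
async function clickNext(page) {
  const nextButtonXPath = `//*[starts-with(text(), "Prossimo")]`;
  const nextButton = await page.waitForSelector('xpath/' + nextButtonXPath, { timeout: 5000 });
  await nextButton.click();
}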

Anyways, let’s continue with the code. So now we wrap our previous table extraction algorithm in a function:

async function extractTableData(page) {
  return await page.evaluate(() => {
    const headerRow = document.querySelector('thead tr');
    const headers = Array.from(headerRow.querySelectorAll('th')).map(th => th.textContent.trim());
    const rows = Array.from(document.querySelectorAll('tbody tr'));
    return rows.map(row => {
      const rowData = {};
      const columns = row.querySelectorAll('td');
      columns.forEach((column, index) => {
        rowData[headers[index]] = column.textContent.trim();
      });
      return rowData;
    });
  });
}

In the end, we wrap the page extraction in a for loop that repeats it for all 40 pages.

// declare the variable that will hold all the data
let data = [];

// JavaScript for loop, 40 iterations starting at 1
for (let i = 1; i < 41; i++) {

  // wait 3s for the page to load
  await sleep(3000);

  // table extraction algorithm
  const tableData = await extractTableData(page);

  // add the data from each table to the main data variable
  data.push(...tableData);

  // go to the next page (skip the click on the last page, where there is no Next button)
  if (i < 40) {
    await clickNext(page);
  }
}

After this, the data variable should hold all the data we want to put in our spreadsheet.

Step 3. Convert and save in Excel

This part is quite simple: we just have to polish the data a bit. We convert it to JSON format and remove some useless rows; this website adds an empty one at the end of each page that looks like this: {“#”:””}. We can get rid of it with a find-and-replace function.

// Convert the data in JSON format
let jsonData = JSON.stringify(data);

// function to find and replace a string with another
function findReplaceAll(inputString, find, replaceWith) {
    const regex = new RegExp(find, 'g');
    return inputString.replace(regex, replaceWith);
}

// clean JSON file removing "{"#":""}," and "{"#":""}"
jsonData = findReplaceAll(jsonData, '{"#":""},', '');
jsonData = findReplaceAll(jsonData, ',{"#":""}', '');
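
An alternative that avoids regex escaping issues altogether, sketched here under the assumption that the filler rows contain only the empty “#” column, is to filter the array before stringifying it:

// Alternative cleanup (sketch): keep only rows that have more than the empty "#" column
const cleanedData = data.filter(row => Object.keys(row).length > 1);
// jsonData would then simply be JSON.stringify(cleanedData)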

Then, finally, we convert the data and save it as a spreadsheet using xlsx. This is also where fs, imported at the beginning, comes into play: we use it to make sure the output folder exists before writing the file.

// make sure the ./output folder exists (this is where fs comes in)
fs.mkdirSync('./output', { recursive: true });

// create a new workbook
const wb = XLSX.utils.book_new();

// parse the cleaned JSON back into an array of objects and convert it to a worksheet
const ws = XLSX.utils.json_to_sheet(JSON.parse(jsonData));

// add the worksheet to the workbook
XLSX.utils.book_append_sheet(wb, ws, 'Sheet1');

// write the workbook to an XLSX file
XLSX.writeFile(wb, `./output/report.xlsx`);

// close the browser now that we are done
await browser.close();
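
With everything assembled in script.js, run the scraper from the terminal:

node script.js

After it finishes, you should find example.png in the project folder and report.xlsx in the output folder.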

Finally, the sheet is ready to work with, and you can browse through the companies with ease.

Conclusion

This is just a starting point for understanding web scraping and implementing it. Even though it is legal, and the request rate of this algorithm is laughably low, not nearly enough to slow a website down, most organisations are hostile to scraping. There are workarounds that keep the script from eventually getting blocked, but those methods are not discussed further here.

You can build on this data by integrating it with scraping from other websites and with existing datasets; the more you aggregate, the more meaningful it becomes. For example, you could scrape additional information from each individual company page, or combine it with information from other websites such as LinkedIn or Glassdoor. I cover some advanced scraping in the second part of this tutorial.

Anyway, it was a fun learning experience. I remember doing this manually a couple of years ago and it took forever, so the benefits are definitely there.
If you have any tips on doing it better, questions about the tutorial, or just want to start a conversation on the topic, feel free to write below.

Cheers!
