Web scraping

Web scraping is the automated process of extracting data from a website into a new format, such as CSV (comma-separated values), JSON (JavaScript Object Notation), or Excel. Websites hold tons of structured, semi-structured, and unstructured data, and web scraping is an important way of obtaining data for data projects. In this article, I will explain, and demonstrate with code, how to scrape Wikipedia tables in a very simple and fun way.

One of the tasks of a data professional is to gather enough data for different data projects. Getting the right data at the required time is a critical aspect of any data-driven project. For instance, a machine learning engineer needs data to build a feasible model, and a BI developer needs data to provide valuable insights. Furthermore, one of the ways any aspiring data scientist or analyst builds data skills is by working on projects, and for that they need real-world data.

There are instances where data is provided from various data collection exercises, but in other cases you may need to build your own datasets. One of the best ways of building such datasets is web scraping. So, let us journey together and learn one of the simplest ways of building structured datasets from Wikipedia. We will use the Wikipedia page listing universities in South Africa to extract all of its tables, combine them, and save them on our local machines.

Requirements and steps

Here, we will require the pandas library and its read_html parser (from pandas.io.html) to read the HTML page and extract the required tables. The general process for scraping Wikipedia tables is as follows:

  • Import the required libraries: pandas for data manipulation and the read_html parser for extracting the tables.
  • Inspect the HTML code, identify the table classes, and extract the tables from the webpage. We will also print the number of tables extracted from the URL.
  • Concatenate all the tables to form one dataframe.
  • Finally, save the dataframe as a CSV file.

These few steps will take us from viewing the tables on a webpage to saving the dataframe on our local machines.

Coding

Now that we have looked at the requirements and the steps we will follow, let’s get our hands dirty and build a scraper for Wikipedia tables. For simplicity, I will show the steps in separate code snippets and then show the full code at the end.

Let’s import the libraries as shown below:

# import required libraries

import pandas as pd
from pandas.io.html import read_html

The read_html parser will help us extract the HTML tables from the web page. We will use the URL of the Wikipedia page on universities in South Africa and get all the tables on it.

Before we proceed to extract the tables, we will inspect the HTML structure of the page and identify the tags where the tables are found. In this case, right-click one of the tables you want to scrape and click Inspect. This will open the HTML code with the tags for the contents of the webpage.

The figure above shows the HTML code on the right side of the page. Our interest is in identifying the class of the <table> tag. In this case, it is “wikitable sortable”. We will use this class in our code to extract the tables, as shown below.


# extract all tables on the page

url_page = 'https://en.wikipedia.org/wiki/List_of_universities_in_South_Africa'
tables = read_html(url_page, attrs={'class':'wikitable sortable'})

Let us proceed and find out how many tables we have extracted from the page, using the code below:

# Print the number of tables extracted from the page

print('Extracted {num} tables'.format(num=len(tables)))

This code returns “Extracted 6 tables”, which is indeed the number of tables on our web page.

Visualize the tables

Let us now visualize the top rows of each table, one by one. Note that the tables are indexed from 0 to 5. To do this, we will use the following code:

# View the first 5 rows of the first table. 

tables[0].head()

Output:

You can use the same code to view any of the tables by just changing the index. For instance, to view the second table, we will do the following:
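
# View the first 5 rows of the second table

tables[1].head()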

Try and view all the tables by customizing the code.
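Alternatively, a simple loop over the list of extracted tables lets you preview them all at once; here is a minimal sketch:

# preview the first rows of every extracted table

for i, table in enumerate(tables):
    print('Table', i)
    print(table.head())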

Let us now concatenate the tables to form one dataframe. This is very simple and is done as follows:

# concatenating tables

df = pd.concat(tables, axis=0)
df.head()

Output:

Wow, we have done it!

Saving the dataframe

Finally, let us save our dataframe as a CSV file and name it “Universities.csv”.

# save the dataframe as CSV file

df.to_csv('Universities.csv', index=False)
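
Putting it all together, the snippets above combine into the following script:

# scrape the Wikipedia tables and save them as a CSV file

import pandas as pd
from pandas.io.html import read_html

# extract all tables on the page
url_page = 'https://en.wikipedia.org/wiki/List_of_universities_in_South_Africa'
tables = read_html(url_page, attrs={'class':'wikitable sortable'})
print('Extracted {num} tables'.format(num=len(tables)))

# concatenate the tables into one dataframe
df = pd.concat(tables, axis=0)

# save the dataframe as a CSV file
df.to_csv('Universities.csv', index=False)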

Great! In summary, we have gone through the entire process of scraping Wikipedia tables, concatenating them, and saving them as a CSV file. From here, you can go ahead and clean and visualize the data as you wish. I hope this tutorial was useful. You can get the full code on my GitHub account here.