According to Taweh Beysolow, “Natural Language Processing (NLP) is a subfield of computer science that is focused on allowing computers to understand language in a ‘natural’ way, as humans do.” NLP has evolved rapidly and is gaining traction in artificial intelligence (AI) applications. Some of the common applications of NLP include:

  • Sentiment Analysis – Used to detect sentiments or emotions expressed in a text.
  • Text Summarization – Used to create a summary of a document.
  • Speech Recognition – Used to translate speech into text.
  • Text Classification – Used to assign texts to categories based on their content and features.
  • Development of Chatbots – Used to create software with conversational attributes of humans.
  • Automatic Text Generation – Used to predict upcoming text in sentences.

In this project, we will explore one of the most exciting NLP applications: sentiment analysis. We will build a machine learning model that categorizes tweets as positive (pro-vaccine), negative (anti-vaccine), or neutral. This is a three-part tutorial series: this article deals with text processing (data cleaning), the second will cover data visualization and exploration, and the last will cover feature extraction and building the model. Stay tuned and let’s jump into the project.

Project: To Vaccinate or Not to Vaccinate: It’s not a Question

Before we dive into coding, let’s look at what the project is all about and the problem we have at hand. This project is adapted from Zindi, a data science and AI platform that gives AI enthusiasts an opportunity to apply their skills to real problems through data science hackathons and competitions. I participated in the hackathon during #ZindiWeekendz and I would like to share how I did my work.

Problem Statement

The task was to create a machine learning model that classifies a Twitter post related to vaccination as positive, negative, or neutral. This was meant to help governments and public health actors assess public sentiment on COVID-19 vaccinations and improve public health policy, vaccine communication strategies, and vaccination programs across the world.


The dataset was provided with both train and test sets. The data and data description can be found HERE. Feel free to download and follow along as we explore this exciting project.

Setting up the workflow

It is always important to understand the task at hand and plan how you will approach it. First and foremost, this is a supervised machine learning task, specifically a classification problem, where our goal is to classify sentiments as 1: Positive, 0: Neutral, and -1: Negative. It is supervised because we have been given a labeled dataset (train set) on which to train our model, along with an unlabeled dataset (test set) on which the model’s predictions are evaluated. We will, therefore, use supervised machine learning algorithms. Since we are working with text data, we will also use text processing techniques to prepare our dataset.

In this tutorial, I will be using a Jupyter notebook with Python 3. Feel free to use your preferred IDE.

You will also need to import the modules and libraries required for your project. For this project, I imported the following:

I imported all of them at the beginning so as to keep my notebook clean.

Let’s dive in

We will begin our work by loading the datasets provided and doing some basic checks on our data as follows.
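The original loading cell is not reproduced here, so the following is only a minimal sketch of how the loading step might look with pandas. The column names come from the data description discussed in this article; the few rows of text are made up for illustration, and the inline CSV strings simply stand in for the downloaded files so that the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny inline CSVs stand in for the downloaded train and test files
# (the rows are invented for illustration). In the real project,
# pass the path of each CSV file to pd.read_csv instead.
TRAIN_CSV = io.StringIO(
    "tweet_id,safe_text,label,agreement\n"
    "A1,<user> Vaccines save lives!,1.0,1.0\n"
    "B2,Not sure about this shot...,0.0,0.666667\n"
    "C3,I will never vaccinate my kids,-1.0,1.0\n"
)
TEST_CSV = io.StringIO(
    "tweet_id,safe_text\n"
    "D4,Getting my MMR booster today\n"
    "E5,<user> Are vaccines safe?\n"
)

train = pd.read_csv(TRAIN_CSV)
test = pd.read_csv(TEST_CSV)

# Basic check: look at the first few rows of each dataset.
print(train.head())
print(test.head())
```

Calling head() on each DataFrame is a quick way to confirm that the files were parsed into the columns you expect.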

The output is as follows:

It is also important to read through the data description file to thoroughly understand what each dataset is and what each column represents. A deep understanding of your dataset is a step towards success in building a good machine learning model.

Let us now check the shape of each dataset and columns.
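As a hedged illustration of this check, the sketch below builds two toy DataFrames with the same columns as the real datasets (the rows are invented) and inspects their shape and column types. On the real data, the same attributes are used on the loaded train and test sets.

```python
import pandas as pd

# Toy stand-ins for the real train and test sets (invented rows).
train = pd.DataFrame(
    {
        "tweet_id": ["A1", "B2", "C3"],
        "safe_text": ["Vaccines save lives!", "Not sure about this", "Never again"],
        "label": [1.0, 0.0, -1.0],
        "agreement": [1.0, 0.666667, 1.0],
    }
)
test = pd.DataFrame(
    {
        "tweet_id": ["D4", "E5"],
        "safe_text": ["Getting my booster today", "Are vaccines safe?"],
    }
)

# .shape is a (rows, columns) tuple, i.e. (samples, features).
print("Train shape:", train.shape)
print("Test shape:", test.shape)

# .dtypes lists the data type of every column.
print(train.dtypes)
print(test.dtypes)
```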

We can note that our train set has 1001 samples and 4 features, while the test set has 5177 samples and 2 features. A closer look at the train set shows that the tweet_id and safe_text columns are of type object, while the label and agreement columns are of type float64.

Text Processing

We will now prepare our data for the model. Here we will combine our datasets by appending the test set to the train set and then perform the following:

  • Change the safe_text column to lower case
  • Remove Twitter handles and other unwanted characters
  • Remove punctuation marks
  • Remove stop words
  • Tokenize the text
  • Stem the tokens

Now, let’s combine the datasets.
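One way to do this, sketched here with toy data and pd.concat (an assumption on my part, since the original cell is not shown), is to stack the two DataFrames on top of each other. Columns that exist only in the train set, such as label and agreement, are filled with NaN for the test rows.

```python
import pandas as pd

# Toy stand-ins for the real train and test sets (invented rows).
train = pd.DataFrame(
    {
        "tweet_id": ["A1", "B2", "C3"],
        "safe_text": ["Vaccines save lives!", "Not sure about this", "Never again"],
        "label": [1.0, 0.0, -1.0],
        "agreement": [1.0, 0.666667, 1.0],
    }
)
test = pd.DataFrame(
    {
        "tweet_id": ["D4", "E5"],
        "safe_text": ["Getting my booster today", "Are vaccines safe?"],
    }
)

# Stack the test rows under the train rows and rebuild the index.
combined = pd.concat([train, test], ignore_index=True)
print(combined)

# The first len(train) rows are the train set, so the two parts
# can be separated again after cleaning.
train_part = combined.iloc[: len(train)]
test_part = combined.iloc[len(train):]
```

Cleaning the combined DataFrame once guarantees that both sets go through exactly the same text processing steps.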

1. Change text to lower case

Next, we will change the safe_text column to lower case. We want everything in lower case so that the same word does not appear in both upper-case and lower-case forms and get treated as two different words. The code to perform this task is shown below.
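A minimal sketch of this step, assuming the text lives in a pandas column named safe_text and the cleaned version goes into a new column named clean_safe_text (the two example rows are invented):

```python
import pandas as pd

# Toy DataFrame standing in for the combined dataset (invented rows).
df = pd.DataFrame(
    {"safe_text": ["<user> Vaccines SAVE lives!", "Getting my MMR booster"]}
)

# .str.lower() lower-cases every string in the column.
df["clean_safe_text"] = df["safe_text"].str.lower()

print(df[["safe_text", "clean_safe_text"]])
```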

From the output, we can see that our newly created column, clean_safe_text, has everything in lower case.

2. Remove Twitter handles

Let us now remove Twitter handles, marked as <user>, from our clean_safe_text column using the following function.
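Here is a minimal sketch of such a function, assuming handles appear either as the <user> placeholder or as raw @names. The name remove_handles is illustrative and not necessarily the one used in the original notebook.

```python
import re


def remove_handles(text: str) -> str:
    """Drop <user> placeholders and raw @handles, then tidy whitespace."""
    # Replace each handle with a space so neighbouring words stay separated.
    text = re.sub(r"<user>|@\w+", " ", text)
    # Collapse runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", text).strip()


print(remove_handles("<user> <user> vaccines save lives"))
print(remove_handles("thanks @who for the update"))
```

With a pandas column, the function can be applied row by row, for example df["clean_safe_text"].apply(remove_handles).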

3. Remove punctuation marks and other special characters

We will proceed with our data cleaning by removing punctuation marks, numbers and other special characters. The code for this task is shown below.
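A simple way to do this, sketched below, is a regular expression that keeps only letters and whitespace. The function name remove_special_characters is illustrative. Note that this approach also splits contractions, so “don’t” becomes “don t”.

```python
import re


def remove_special_characters(text: str) -> str:
    """Keep only letters and spaces; drop punctuation, digits and symbols."""
    # Anything that is not a letter or whitespace becomes a space.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse the extra spaces created above.
    return re.sub(r"\s+", " ", text).strip()


print(remove_special_characters("vaccines work!!! 100% #science"))
```

As with the previous step, this can be applied to the whole column with df["clean_safe_text"].apply(remove_special_characters).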

Comparing the safe_text column with the clean_safe_text column, we can see that we have successfully removed placeholders such as <user>, punctuation marks, and other special characters. Let us now remove stop words.

4. Remove stopwords

Words such as ‘this’, ‘my’, ‘am’, and ‘any’ may not provide valuable information for classifying the tweets. These words are generally referred to as stopwords. They increase the number of features, which may slow down the training process and can ultimately make the model underperform. We will therefore remove the stopwords using the nltk module.

First let us see the English stopwords available using the following code.

So our task now is to remove all those stopwords from the clean_safe_text column.
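The sketch below shows the idea with a tiny hand-written stopword set, used here only so that the snippet runs without any downloads. It is an assumption for illustration, not NLTK’s list. With NLTK installed, the full English list is available from stopwords.words("english") in nltk.corpus after running nltk.download("stopwords"), and it can be passed in as the stop_words argument.

```python
# Tiny stand-in stopword set for illustration only (NOT the NLTK list).
SAMPLE_STOPWORDS = {"this", "my", "am", "any", "i", "the", "is", "a", "to", "and"}


def remove_stopwords(text: str, stop_words=SAMPLE_STOPWORDS) -> str:
    """Return the text with every word found in stop_words removed."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)


print(remove_stopwords("i am getting my vaccine this week"))
```

Passing the stopwords as a set keeps each membership check fast, which matters when the function is applied to thousands of tweets.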

5. Tokenization

Basically, tokenization involves splitting a phrase, sentence, paragraph or document into individual words, called tokens. This is important because the meaning of a phrase, sentence or paragraph can easily be found by analyzing the tokens. The following code shows how we will tokenize our clean_safe_text column.
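Since the text has already been stripped of punctuation at this point, a minimal sketch can simply split on whitespace. This is an assumption on my part; the original notebook may have used a dedicated tokenizer instead.

```python
def tokenize(text: str) -> list:
    """Split cleaned text into a list of word tokens on whitespace."""
    return text.split()


print(tokenize("getting vaccine week"))
```

For a pandas column, df["clean_safe_text"].apply(tokenize) turns every tweet into a list of tokens.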

6. Stemming

Stemming is the process of reducing words to their root form (the stem). For example, ‘vaccinate’, ‘vaccinated’, and ‘vaccinating’ all reduce to the same stem. Stemming is usually applied in NLP to reduce the number of distinct features used to build the model: without it, several forms of the same word would each be treated as a separate feature even though they carry the same meaning. It is therefore important to stem your data. There are different word stemmers, such as PorterStemmer and SnowballStemmer. In this case, we will use PorterStemmer. Here is the code:
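A minimal sketch using NLTK’s PorterStemmer on a list of tokens is shown below. It assumes the nltk package is installed (the stemmer itself needs no extra downloads), and the helper name stem_tokens is illustrative.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Return the Porter stem of every token in the list."""
    return [stemmer.stem(token) for token in tokens]


tokens = ["vaccinate", "vaccinated", "vaccinating"]
stems = stem_tokens(tokens)
print(stems)
```

Note that a stem is not always a dictionary word; the goal is only that related word forms end up with the same representation.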

An alternative to stemming is lemmatization. Lemmatization is the process of grouping the different inflected forms of a word so that they can be analyzed as a single term. It differs from stemming in that it performs a morphological analysis of the words and considers their context, so it returns a valid dictionary word (the lemma). You can check it out and try it.

That is all for text processing. Text processing is a very important step in building an NLP model. In the next article, we will look at data visualization and exploration. Happy learning!