This second article in a series on Twitter Sentiment Analysis, following the first one on text processing, explores another interesting stage of a machine learning project: data exploration and visualization. Data exploration involves taking a deeper look at the data to understand its structure, features, and relationships, going beyond the general overview. It helps in detecting missing values and deciding how to handle them, spotting outliers and choosing the best way to treat them, and understanding the relationships between features, among other things. At this stage, one needs to be very analytical and inquisitive so that the most benefit can be drawn from the exploration.
Data visualization, on the other hand, is the process of representing data in a visual format. It is important in a machine learning project because it provides more insight into the data by exposing distributions, relationships, and descriptive statistics of the features. A better understanding of the data goes a long way in building the model. Python has a good number of great libraries for data visualization, such as matplotlib, seaborn, plotly, and cufflinks, among others.
Here we will use our cleaned dataset to do the exploration and create nice visualizations that will help us in gaining useful insights for our NLP model. Let’s dive in.
We will start with a basic exploration of the structure of our dataset. We will begin by finding out whether there are missing values, and then we will look at the shape of our dataset. The code for these operations is shown below:
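A minimal sketch of these checks is shown here. The small DataFrame is a stand-in sample I constructed for illustration; in the article, `df` would be the cleaned tweets loaded with `pd.read_csv`, and the column names (`clean_safe_text`, `label`) follow the ones used later in the article.

```python
import pandas as pd

# Stand-in sample; in practice df is the cleaned dataset loaded from file,
# e.g. df = pd.read_csv(...) with your own path
df = pd.DataFrame({
    "clean_safe_text": ["vaccine work great", "not sure about vaccine", "new vaccine news"],
    "label": [1, -1, 0],
})

# Count missing values per column
missing = df.isnull().sum()
print(missing)

# Shape of the dataset: (rows, columns)
print(df.shape)
```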
We notice that our data has no missing values. If it did, we would have decided on the best way to treat that scenario. Ways of handling missing data range from simple deletion to more complex imputation using machine learning models. The right approach depends on a number of factors, so no single treatment fits all datasets. I have written an article on some of the ways of handling missing values in your data; feel free to look at it.
It’s also worth noting that our data has 15,175 rows and 5 columns.
Length of the tweets
Let us now find out the length of our clean_safe_text column. We want to determine how long the tweets are. We will create a new column called clean_safe_text_length to hold the length of each tweet as follows;
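A sketch of that step, again using a small hypothetical sample in place of the real dataset. Here length is measured in characters; `str.split().str.len()` could be used instead if a word count is preferred.

```python
import pandas as pd

# Stand-in sample; in the article df is the cleaned tweets dataset
df = pd.DataFrame({"clean_safe_text": ["vaccine work great", "no vaccine", "news today"]})

# New column holding the character length of each cleaned tweet
df["clean_safe_text_length"] = df["clean_safe_text"].str.len()
print(df.head())
```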
Great! We can now see the length of each tweet in the clean_safe_text column. Remember that at this point we had already removed stopwords and other characters during text processing, which means the raw tweets were obviously longer than this.
Distribution of the target variable.
It will be important to know the distribution of our target variable, i.e. the label column. Note that this column is only found in our train data; we will check the distribution using the value_counts() function as follows:
From the output, the labels of the target variable are not uniformly distributed: 0 (neutral) and 1 (positive) each have above 4,000 cases, while -1 (negative) has about 1,000 cases.
Visualizing Common Words
In our last article, we tackled stemming, where we transformed the words used in the text to their root words. We now want to visualize the most common words in the tweets. We will begin by visualizing all the words, and then we will proceed to visualizing tweets labeled as pro-vaccine, anti-vaccine, and neutral respectively. We will use a Python library called wordcloud. A word cloud is a great way of displaying how popular a word is in a text document: the more prominent the word, the bigger it is displayed. It is a powerful tool for exploring text data, making analysis easy and reports cool and appealing.
The code and the output for visualizing the various labels are as shown below.
i. Common words in all tweets
ii. Common words in Pro-vaccine tweets
iii. Common words in Anti-vaccine Tweets
iv. Common words in Neutral Tweets
This is cool. We can now see the most common words in each of the labels of the data, which helps us know what the tweets look like. We will then proceed to visualize the hashtags used in the tweets.
In this part we will start by writing a function to extract all the hashtags and then plot the top ten hashtags used in each label. A hashtag is simply a word prefixed with the hash symbol (#); it is commonly used to categorize a tweet and also eases the search for related tweets.
To begin with, let's write a function to extract the hashtags:
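One way such a function could look, using a regular expression; this is a sketch rather than the article's exact code, and it assumes the hash symbols are still present in the tweet text being scanned.

```python
import re

def extract_hashtags(tweets):
    """Collect all hashtags (words prefixed with #) from an iterable of tweets."""
    hashtags = []
    for tweet in tweets:
        # findall with a capture group returns the word without the # symbol
        hashtags.extend(re.findall(r"#(\w+)", tweet))
    return hashtags

# Hypothetical sample tweets for illustration
print(extract_hashtags(["#VaccinesWork for everyone", "no #vaccine for me"]))
```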
Let us now extract hashtags from each label of the tweets as follows;
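This per-label extraction could be sketched as below, filtering the DataFrame on the label column before calling the extraction function. The DataFrame and variable names are my own illustrative assumptions.

```python
import re
import pandas as pd

def extract_hashtags(tweets):
    """Collect all hashtags (words prefixed with #) from an iterable of tweets."""
    hashtags = []
    for tweet in tweets:
        hashtags.extend(re.findall(r"#(\w+)", tweet))
    return hashtags

# Stand-in sample; in the article these rows come from the real dataset
df = pd.DataFrame({
    "clean_safe_text": ["#VaccinesWork great", "#vaccineinjury bad", "#news today"],
    "label": [1, -1, 0],
})

# Filter by label, then extract hashtags for each group
pro_hashtags = extract_hashtags(df[df["label"] == 1]["clean_safe_text"])
anti_hashtags = extract_hashtags(df[df["label"] == -1]["clean_safe_text"])
neutral_hashtags = extract_hashtags(df[df["label"] == 0]["clean_safe_text"])
```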
Now, we will visualize the top ten hashtags in each label. The code and output for these operations are shown below.
i. Top 10 hashtags in pro-vaccine tweets
ii. Top 10 hashtags in anti-vaccine tweets
iii. Top 10 hashtags in neutral tweets
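The three plots above share the same recipe, which could be sketched like this: count hashtag occurrences with `collections.Counter`, keep the ten most frequent, and draw a bar chart. The hashtag list here is a made-up example; in the article it would be one of the per-label lists extracted earlier.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical list of hashtags extracted from one label's tweets
pro_hashtags = ["VaccinesWork", "vaccine", "VaccinesWork",
                "health", "vaccine", "VaccinesWork"]

# Count occurrences and keep the ten most frequent
top10 = Counter(pro_hashtags).most_common(10)
freq = pd.DataFrame(top10, columns=["hashtag", "count"])

plt.figure(figsize=(10, 5))
plt.bar(freq["hashtag"], freq["count"])
plt.title("Top 10 hashtags in pro-vaccine tweets")
plt.xlabel("Hashtag")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("top10_pro_hashtags.png")
```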
Awesome. This marks the end of our data exploration and visualization for this project. Feel free to explore and visualize more. There are tons of ways of exploration that can be done for this data. I hope what I have covered in this article helps.
In the final article of this series, I will cover feature extraction and machine learning algorithms. Stay tuned. Meanwhile, keep learning and practicing.