Data visualization is a key part of data science and analytics. It helps you draw insights from data through the patterns a chart reveals. For any Natural Language Processing (NLP) project, it is just as important to visualize your text data before modeling, and texthero is very helpful for that. Texthero is a Python package for preprocessing, representing, visualizing and running some NLP tasks on text data held in a pandas DataFrame or Series. In the previous article, I explored its preprocessing functionality. In this article, we will dive into how to easily visualize your text data using texthero.
Texthero is very useful for exploratory data analysis (EDA). It offers a simple, easy-to-use API that allows for quick visualization of text data. Its visualization functionality provides three main ways of visualizing text data:
- top words: You can use texthero to visualize the N most frequent (top) words in your text data.
- word cloud: Texthero makes it possible to create beautiful and fancy word clouds.
- scatter plots: Scatter plots can easily be created using texthero. This is especially useful when visualizing categories in the dataset or when using clustering algorithms.
We will endeavour to go through the three use cases one by one to see how exciting it is to use texthero.
We will start by making sure the required libraries are in place. In this case, we need texthero and pandas. If you haven’t installed texthero, run pip install texthero to install it. After that, let us import them as follows:
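In texthero's own examples the package is usually imported as `import texthero as hero`, next to `import pandas as pd`. As a small sketch, the snippet below only checks that both packages can be found before you run the rest of the tutorial. The helper name `missing_packages` is made up for this illustration and is not part of texthero.

```python
import importlib.util


def missing_packages(names=("texthero", "pandas")):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # Suggest the install command for whatever is not available yet.
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("texthero and pandas are available")
```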
We will use a YouTube reviews dataset that I downloaded from the UCI Machine Learning Repository.
Let us now create our preprocessing pipeline by specifying the functions we want to use as follows:
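Conceptually, a custom pipeline is just an ordered list of functions, each taking text and returning cleaned text. Here is a minimal plain-Python sketch of that idea. The helpers below are simple string stand-ins written for this example; they are not texthero's own functions, and the choice of steps is an assumption.

```python
import re
import string


def lowercase(text):
    return text.lower()


def remove_digits(text):
    # Replace every run of digits with a single space.
    return re.sub(r"\d+", " ", text)


def remove_punctuation(text):
    # Map each punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_whitespace(text):
    # Collapse repeated whitespace and trim both ends.
    return " ".join(text.split())


# Order matters: whitespace is collapsed last, after the earlier steps
# have replaced characters with spaces.
custom_pipeline = [lowercase, remove_digits, remove_punctuation, remove_whitespace]


def clean(text, pipeline):
    """Apply each step of the pipeline to the text, in order."""
    for step in pipeline:
        text = step(text)
    return text


print(clean("Check out my CHANNEL!!!  100% free", custom_pipeline))
# prints: check out my channel free
```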
Next, we will use our custom pipeline to clean the text.
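With texthero, this step can be sketched roughly as below. Treat it as an illustration under assumptions: the text column name `CONTENT`, the new column name `clean_text`, the wrapper function `clean_reviews` and the exact list of preprocessing steps are all my choices for the example. `hero.clean` and the functions in `texthero.preprocessing` are the library's own.

```python
def clean_reviews(df, text_column="CONTENT"):
    """Return a copy of df with an added 'clean_text' column.

    Needs texthero and pandas installed. texthero is imported inside the
    function so the dependency is only required when the function is called.
    """
    import texthero as hero
    from texthero import preprocessing

    # Assumed selection of steps, applied in this order.
    custom_pipeline = [
        preprocessing.fillna,
        preprocessing.lowercase,
        preprocessing.remove_digits,
        preprocessing.remove_punctuation,
        preprocessing.remove_diacritics,
        preprocessing.remove_stopwords,
        preprocessing.remove_whitespace,
    ]

    out = df.copy()
    out["clean_text"] = hero.clean(out[text_column], pipeline=custom_pipeline)
    return out
```

You would call it as `df = clean_reviews(df)` once the dataset is loaded into a DataFrame.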
We have now cleaned our data as we needed using the custom preprocessing pipeline.
Visualizing Most Frequent Words
In this section, we will use texthero to visualize the most frequent words in the dataset. Let us start by checking the top 12 words used in our dataset and then plot them on a chart.
The display is very nice. You can play around with the number of words you want to display. This is a very useful feature while exploring your data.
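Here is a hedged sketch of this step. The first function shows in plain Python what counting top words amounts to. The second wraps texthero's `hero.top_words`, which returns a pandas Series of word counts that can be sliced and plotted. The function names `most_frequent_words` and `plot_top_words` and the sample sentences are made up for the example.

```python
from collections import Counter


def most_frequent_words(texts, n=12):
    """Return the n most frequent whitespace-separated words as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)


def plot_top_words(series, n=12):
    """Bar chart of the n most frequent words in a pandas Series (needs texthero)."""
    import texthero as hero

    # hero.top_words returns counts sorted from most to least frequent.
    hero.top_words(series).head(n).plot.bar()


# Made-up sample of already cleaned text.
cleaned = [
    "check out my channel",
    "love this song",
    "please check my channel",
]
for word, count in most_frequent_words(cleaned, n=3):
    print(f"{word:<10}{count}")
```

Changing `n` is all it takes to show more or fewer words.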
In addition, you can display the most frequent words for each value of a categorical column. For example, we can display the top words for each CLASS as follows:
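The idea of splitting the counts by category can be sketched in plain Python as below. The labels 0 and 1 and the sample rows are invented for the illustration, and `top_words_by_class` is a hypothetical helper, not a texthero function.

```python
from collections import Counter, defaultdict


def top_words_by_class(rows, n=5):
    """rows: iterable of (label, text) pairs. Returns {label: [(word, count), ...]}."""
    counts = defaultdict(Counter)
    for label, text in rows:
        counts[label].update(text.split())
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Made-up rows: (CLASS value, cleaned text).
rows = [
    (1, "check out my channel"),
    (1, "subscribe to my channel"),
    (0, "love this song"),
    (0, "this song is great"),
]
for label, words in top_words_by_class(rows, n=2).items():
    print(label, words)
```

With a DataFrame, the same split can be done by filtering on the CLASS column and passing each subset of the cleaned text to `hero.top_words`.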
Next, let us look at how we can create word clouds using texthero.
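A minimal sketch of a word cloud call, assuming the cleaned text lives in a pandas Series. `hero.wordcloud` is texthero's function; the wrapper name `show_wordcloud` and the value of 100 for `max_words` are my own choices for the example.

```python
def show_wordcloud(series, max_words=100):
    """Draw a word cloud for a pandas Series of cleaned text (needs texthero)."""
    import texthero as hero

    # max_words caps how many words are drawn; 100 is an arbitrary choice.
    hero.wordcloud(series, max_words=max_words)
```

For example, `show_wordcloud(df["clean_text"])` would draw the cloud for a column named `clean_text`, if that is what you called your cleaned column.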
Visualizing Scatter Plots
Finally, let us draw a scatter plot of our data based on the ‘CLASS’ column. The CLASS column has two unique values, so the scatter plot will show two categories of data.
To create a scatter plot, we need to go through three short steps:
- Clean the data. The data must be preprocessed before it is used to create a scatter plot.
- Transform the data to vectors. This is done using Term Frequency-Inverse Document Frequency (TF-IDF), a text representation method that evaluates how relevant a word is to a document in a collection of documents.
- Transform the data into two dimensions using Principal Component Analysis (PCA). This is a dimensionality reduction technique.
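To make the TF-IDF step less abstract, here is a small worked sketch in plain Python. It uses the textbook formula (term frequency times the log of the inverse document frequency). Libraries typically apply smoothing and normalization on top of this, so their exact numbers will differ. The function name `tfidf_vectors` and the sample sentences are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Textbook TF-IDF: tf = count / document length, idf = log(N / document frequency)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


docs = ["check my channel", "love this song", "check this song"]
vectors = tfidf_vectors(docs)
# "my" appears in one document only, so it scores higher than "check",
# which appears in two of the three documents.
print(vectors[0])
```

A word that appears in every document gets a score of zero under this formula, which is exactly the point: it tells you nothing about any one document.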
Thankfully, texthero provides an easy way to perform the above tasks as follows:
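The three steps can be chained roughly like this. It is a sketch under assumptions: the column names `clean_text` and `pca`, the plot title and the wrapper function `plot_class_scatter` are mine, and the exact return types of `hero.tfidf` and `hero.pca` may vary between texthero versions.

```python
def plot_class_scatter(df, text_column="clean_text", label_column="CLASS"):
    """Scatter plot of documents in 2-D PCA space, colored by class (needs texthero)."""
    import texthero as hero

    out = df.copy()
    # TF-IDF turns each document into a vector; PCA reduces it to two components.
    out["pca"] = out[text_column].pipe(hero.tfidf).pipe(hero.pca)
    # Each point is one document, colored by the value in label_column.
    hero.scatterplot(out, col="pca", color=label_column, title="Reviews by CLASS")
    return out
```

The function expects the text column to be cleaned already, which covers the first of the three steps.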
The resulting plot shows the two classes of reviews in two distinct colors, along with how they are distributed.
In conclusion, we have seen how to use texthero to visualize our text data: we plotted the most frequent words, created word clouds and drew scatter plots. Here is the full notebook with code for this tutorial. Follow along and learn how to use texthero for visualizations. Please share some love by sharing this article.