Machine learning projects often rely on large amounts of unstructured data, and one of the main challenges is preprocessing that data properly before any analysis. Text cleaning (preprocessing) is a critical part of a machine learning project and tends to consume a lot of time. To do it efficiently, you need good tools and a well-designed pipeline. Texthero is one tool that makes preprocessing text data easy, and it can speed up your work while still giving good results.

What is Texthero?

Texthero is a Python package for preprocessing, visualizing, and representing text, and for performing some NLP tasks on text data held in a pandas DataFrame or Series. In this article, I will focus on the text preprocessing functionality of texthero. I have also written another tutorial on how to use texthero for visualization.

Texthero makes text preprocessing easy and efficient with a very easy-to-use API. It integrates with the pandas library to make text cleaning and visualization fun. I suggest you try it on your own text datasets and see how much time it saves you. Here is the documentation for texthero. It is free and open-source.

Let us get into the actual work by writing some Python code. I will take you through a step-by-step guide on how you can use texthero to preprocess your text. As with any other Python library, we will start by installing it.

The easiest way to do this is to run pip install texthero, which installs texthero and its dependencies on your machine. In this tutorial, I will use a Jupyter notebook to write and run my code.

After successful installation, we will import texthero and pandas library to help us work on our dataset.
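A minimal sketch of the import cell. The alias hero is the one texthero's own examples use; the try/except guard and the HAS_TEXTHERO flag are additions for this sketch, so that the cell still runs on a machine where texthero is missing or fails to import.

```python
import pandas as pd

try:
    import texthero as hero              # conventional alias for the package
    from texthero import preprocessing   # individual cleaning functions
    HAS_TEXTHERO = True
except Exception:
    # texthero is missing or could not be imported in this environment
    hero = preprocessing = None
    HAS_TEXTHERO = False

print("pandas", pd.__version__, "| texthero available:", HAS_TEXTHERO)
```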

Load Data

After importing the libraries, we will load our dataset. In this tutorial, I will use a dataset I downloaded from the UCI Machine Learning Repository. It contains YouTube reviews. We will use it to demonstrate how to preprocess text data using texthero.

So let us import the dataset and perform a few checks before we dive into text preprocessing.

We will first create a data frame that has only one column ‘CONTENT’ by removing the columns we don’t need.
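Here is a sketch of the loading step. A small in-memory CSV stands in for the downloaded file: only the 'CONTENT' column name comes from the dataset described above, while the other column names and all rows are made up for illustration. With the real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV. Apart from 'CONTENT', the column names
# and every row below are hypothetical.
csv_text = io.StringIO(
    "COMMENT_ID,AUTHOR,CONTENT,CLASS\n"
    '1,user_a,"Check out my NEW channel!!! http://example.com/watch?v=1",1\n'
    '2,user_b,"I love this song, 10/10 :)",0\n'
    "3,user_c,,0\n"
)
raw = pd.read_csv(csv_text)

# A few quick checks before cleaning
print(raw.shape)           # rows and columns
print(raw.isnull().sum())  # missing values per column
print(raw.head())

# Keep only the text column by dropping the columns we don't need
df = raw.drop(columns=["COMMENT_ID", "AUTHOR", "CLASS"])
print(df.columns.tolist())
```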

Text Preprocessing in Python

Let us now use texthero to clean our text. One of the greatest advantages of using texthero is its simplicity. With one line of code, you can get your dataset clean enough to be used for analysis. It employs a pipeline to make sure that the cleaning steps you are likely to need are covered. This can be done by calling the .clean() function on a pandas Series (such as a data frame column), which returns the cleaned text, or you can customize the pipeline to specify exactly what you need.

Let us start by using .clean() function as below.
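A minimal sketch of that one-line call on a few made-up comments. The 'clean' column name is my choice, and the else branch is only a rough plain-pandas stand-in (it skips stopwords and accents) so that the snippet still runs where texthero cannot be imported.

```python
import pandas as pd

try:
    import texthero as hero
except Exception:  # texthero missing or not importable here
    hero = None

df = pd.DataFrame({"CONTENT": [
    "Check out my NEW channel!!! http://example.com",
    "I love this song, 10 out of 10 :)",
    None,
]})

if hero is not None:
    # One call runs the whole default pipeline on the Series
    df["clean"] = hero.clean(df["CONTENT"])
else:
    # Rough stand-in, not texthero's implementation
    df["clean"] = (
        df["CONTENT"].fillna("")
        .str.lower()
        .str.replace(r"\b\d+\b", " ", regex=True)   # blocks of digits
        .str.replace(r"[^\w\s]+", " ", regex=True)  # punctuation
        .str.replace(r"\s+", " ", regex=True)       # repeated white space
        .str.strip()
    )

print(df)
```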

From the dataframe above, we can see that several things have happened to give us clean data. The .clean() function runs the text through a pipeline; when you do not pass one, it uses the default preprocessing pipeline. That pipeline includes the following tasks:

  • Replaces unassigned (missing) values with empty strings using the fillna(s) function.
  • Converts the text to lower case using the lowercase(s) function.
  • Removes blocks of digits using the remove_digits() function.
  • Removes punctuation using the remove_punctuation() function.
  • Removes accents (diacritics) using the remove_diacritics() function.
  • Removes stopwords using the remove_stopwords() function.
  • Removes extra white space between words using the remove_whitespace() function.

This makes the text clean as we can see above.
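To make these stages concrete, here is a plain-pandas sketch of roughly what each one does. It is not texthero's implementation: the regular expressions, the helper names and the tiny stopword set are assumptions for illustration, and the comments name the texthero step each line stands in for, as I understand the default pipeline.

```python
import re
import string
import unicodedata

import pandas as pd

STOPWORDS = {"the", "a", "is", "in", "by", "of"}  # tiny illustrative set, not texthero's list
PUNCT = f"[{re.escape(string.punctuation)}]+"


def strip_accents(text):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def drop_stopwords(text):
    return " ".join(word for word in text.split(" ") if word not in STOPWORDS)


def default_like_clean(s):
    s = s.fillna("")                                         # fillna
    s = s.str.lower()                                        # lowercase
    s = s.str.replace(r"\b\d+\b", " ", regex=True)           # remove_digits
    s = s.str.replace(PUNCT, " ", regex=True)                # remove_punctuation
    s = s.map(strip_accents)                                 # remove_diacritics
    s = s.map(drop_stopwords)                                # remove_stopwords
    return s.str.replace(r"\s+", " ", regex=True).str.strip()  # remove_whitespace


sample = pd.Series(["The Café is OPEN in 2020!!", None])
print(default_like_clean(sample).tolist())
# ['cafe open', '']
```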

Custom Preprocessing

Sometimes you may want to customize your preprocessing depending on the task you are doing. For instance, you may not want to remove a block of digits from your text for a particular reason.

So to achieve this, you can still use texthero to customize your preprocessing. Let us explore different functions to preprocess our text one by one.

a. Lower Casing.

Here, we want to convert our text to lower case without performing any other preprocessing task on it. We will use the lowercase(s) function as follows:
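A short sketch on two made-up comments, writing the result to the 'lower' column. The else branch is a plain-pandas stand-in used only when texthero cannot be imported.

```python
import pandas as pd

try:
    from texthero import preprocessing
except Exception:  # texthero missing or not importable here
    preprocessing = None

df = pd.DataFrame({"CONTENT": ["Check out my NEW channel!!!", "I LOVE this song."]})

if preprocessing is not None:
    df["lower"] = preprocessing.lowercase(df["CONTENT"])
else:
    df["lower"] = df["CONTENT"].str.lower()  # plain-pandas stand-in

print(df["lower"].tolist())
# ['check out my new channel!!!', 'i love this song.']
```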

In the ‘lower’ column, the text has been converted to lower case without any other changes. For instance, stopwords and punctuation are still there.

b. Remove URLs

Let us remove the URLs present in the text using the remove_urls() function as follows:
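A sketch on made-up comments; the 'no_urls' column name is my choice. The stand-in branch assumes that a URL is anything starting with "http" up to the next white space, which may be simpler than what your texthero version matches.

```python
import pandas as pd

try:
    from texthero import preprocessing
except Exception:  # texthero missing or not importable here
    preprocessing = None

df = pd.DataFrame({"CONTENT": [
    "Subscribe at http://example.com/channel please",
    "No link in this comment!",
]})

if preprocessing is not None:
    df["no_urls"] = preprocessing.remove_urls(df["CONTENT"])
else:
    # Plain-pandas stand-in: replace each URL with a space
    df["no_urls"] = df["CONTENT"].str.replace(r"http\S+", " ", regex=True)

print(df["no_urls"].tolist())
```

The URL is replaced by white space rather than deleted outright, so you may still want a white-space cleanup step afterwards.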

Great. We have removed all the URLs in the text without making any further changes.

c. Remove Punctuations

Next, we will remove punctuation only. Let us use the remove_punctuation() function as follows:
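A sketch that writes the result to the 'no_punctuations' column. The sample comments are made up, and the stand-in branch treats every character in Python's string.punctuation as punctuation, which is an assumption about what the texthero function matches.

```python
import re
import string

import pandas as pd

try:
    from texthero import preprocessing
except Exception:  # texthero missing or not importable here
    preprocessing = None

df = pd.DataFrame({"CONTENT": [
    "Wow!!! Best song ever, isn't it?",
    "Check http://example.com",
]})

if preprocessing is not None:
    df["no_punctuations"] = preprocessing.remove_punctuation(df["CONTENT"])
else:
    # Plain-pandas stand-in: replace runs of punctuation with a space
    punct = f"[{re.escape(string.punctuation)}]+"
    df["no_punctuations"] = df["CONTENT"].str.replace(punct, " ", regex=True)

print(df["no_punctuations"].tolist())
```

Note that the URL in the second comment is broken into pieces rather than removed, since only punctuation is touched here.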

Our ‘no_punctuations’ column has the text with punctuation removed and nothing else changed.

d. Custom Pipeline

From the examples above, we can apply individual texthero functions to our text data as needed. In addition, we can create a custom pipeline that combines two or more of the functions we need. This makes preprocessing faster and saves a lot of time compared with applying the functions one by one.

In the example below, we will combine the three functions above to get a clean text in lower case, and without URLs and punctuations.

Now that we have created a custom pipeline, let us apply it to our data frame to clean the text by using the three functions.
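A sketch of both steps: building the pipeline and applying it. The sample comments and the 'clean_custom' column name are made up. The else branch shows the same idea in plain pandas, a list of Series-to-Series functions applied in order, and is only a stand-in for when texthero cannot be imported.

```python
import re
import string

import pandas as pd

try:
    import texthero as hero
    from texthero import preprocessing
except Exception:  # texthero missing or not importable here
    hero = preprocessing = None

df = pd.DataFrame({"CONTENT": [
    "Check out http://example.com NOW!!!",
    "Great Song, love it.",
]})

if hero is not None:
    custom_pipeline = [
        preprocessing.lowercase,
        preprocessing.remove_urls,
        preprocessing.remove_punctuation,
    ]
    df["clean_custom"] = hero.clean(df["CONTENT"], pipeline=custom_pipeline)
else:
    punct = f"[{re.escape(string.punctuation)}]+"
    custom_pipeline = [
        lambda s: s.str.lower(),
        lambda s: s.str.replace(r"http\S+", " ", regex=True),
        lambda s: s.str.replace(punct, " ", regex=True),
    ]
    cleaned = df["CONTENT"]
    for step in custom_pipeline:  # apply each step to the previous result
        cleaned = step(cleaned)
    df["clean_custom"] = cleaned

print(df["clean_custom"].tolist())
```

The order of the steps matters: URLs are removed before punctuation, because stripping the punctuation first would break each URL into separate words that the URL step can no longer recognize.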

The custom_pipeline has done what we specified: it has removed the URLs and punctuation and converted the text to lower case. This shows the potential of texthero for text cleaning.

e. Additional Functions in Texthero

As you can see, text preprocessing using texthero is awesome. Here is a list of functions in texthero that you can use to create a custom preprocessing pipeline to clean your data:

| Texthero function | What it does | Sample text | Clean text |
| --- | --- | --- | --- |
| remove_html_tags() | Removes HTML tags from a text. | <H1>This is my story</H1> | This is my story |
| remove_angle_brackets() | Removes content within angle brackets <> and the angle brackets themselves. | <user> I can’t agree more. | I can’t agree more. |
| remove_brackets() | Removes content within brackets () and the brackets themselves. | None of them (plans) can be ignored. | None of them can be ignored. |
| remove_curly_brackets() | Removes content within curly brackets {} and the curly brackets themselves. | Most users visit {the site} on Mondays | Most users visit on Mondays |
| remove_square_brackets() | Removes content within square brackets [] and the square brackets themselves. | None of them [plans] can be ignored. | None of them can be ignored. |
| replace_urls(s, symbol) | Replaces all URLs with the given symbol (here, #). | You can get us on https://www.yourdataguy.org/ | You can get us on # |
| replace_punctuation(input, symbol) | Replaces all punctuation with the given symbol (here, *). | Jack is a great techie. He is cool, bright, and dedicated. | Jack is a great techie* He is cool* bright* and dedicated* |
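As a quick sketch, the two replace_* functions from the list can be called directly on a Series, with the symbol passed as the second argument. The sample sentences are the ones from the list; the else branch is a plain-pandas stand-in with assumed patterns, used only when texthero cannot be imported.

```python
import re
import string

import pandas as pd

try:
    from texthero import preprocessing
except Exception:  # texthero missing or not importable here
    preprocessing = None

links = pd.Series(["You can get us on https://www.yourdataguy.org/"])
praise = pd.Series(["Jack is a great techie. He is cool, bright, and dedicated."])

if preprocessing is not None:
    urls_replaced = preprocessing.replace_urls(links, "#")
    punct_replaced = preprocessing.replace_punctuation(praise, "*")
else:
    # Plain-pandas stand-ins, not texthero's implementation
    punct = f"[{re.escape(string.punctuation)}]+"
    urls_replaced = links.str.replace(r"http\S+", "#", regex=True)
    punct_replaced = praise.str.replace(punct, "*", regex=True)

print(urls_replaced.iloc[0])   # You can get us on #
print(punct_replaced.iloc[0])  # Jack is a great techie* He is cool* bright* and dedicated*
```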

Conclusion

Texthero is a very efficient text cleaning package that you can use easily in your NLP projects to quickly clean your data. With texthero, you have the freedom to create your own custom pipelines that suit your project's needs.

Text preprocessing with texthero is fast and efficient. Check the full notebook and dataset in my GitHub account. If this tutorial has been useful to you, please share it with others, and let’s keep learning.