Data is central to machine learning. A common analogy holds that “data is the oil, while the model is the engine”, and the two depend heavily on each other. Because data is one of the most important inputs to any machine learning model, its quality matters.

Most real-world projects involve a lot of messy data, which must be prepared carefully before it is fed into machine learning models.

Why data quality matters

“Poor data quality is enemy number one to the widespread, profitable use of machine learning”. This was a statement by Thomas C. Redman in his article in the Harvard Business Review. In the age of artificial intelligence and machine learning, the old adage of “garbage in, garbage out” is more relevant than ever.

Much has been written about state-of-the-art machine learning algorithms and how they can be tuned to perform better. That work is valuable. However, these algorithms only perform at their best when they are given quality data.

For a model to reach the performance we want, it needs training data that is both plentiful and of high quality. In general, the better the quality and the larger the quantity, the better the algorithm performs and the better the resulting AI solution.

Machine learning models learn from data in order to make the right predictions. It follows that, other things being equal, the more examples they learn from, the better they can predict cases they have not seen.

Data scientists and data preparation

For this reason, data professionals spend most of their time on data preparation. According to a survey by CrowdFlower, provider of a “data enrichment” platform for data scientists, data preparation accounts for about 80% of data scientists’ time. The survey results were published in Forbes.

The work covers collecting or acquiring data, labeling and annotating it, augmenting it, organizing it, and cleaning it. Cleaning in turn means dealing with missing data, duplicate data, and outliers. This is often the least enjoyable part of the data science pipeline, and it can take a lot of time and energy depending on the data and what needs to be done with it.
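The cleaning steps named above (duplicates, missing data, outliers) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a recommended pipeline: the records, the `clean` helper, the median fill, and the 1.5 × IQR outlier rule are all assumptions chosen for the example, and real projects would pick strategies that suit their data.

```python
from statistics import median, quantiles

# Hypothetical toy records: (id, age). None marks a missing value.
raw = [(1, 34), (2, None), (3, 29), (3, 29), (4, 41), (5, 230), (6, 38)]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the interquartile range rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean(rows):
    """Deduplicate, set outliers aside, and fill missing ages with the median."""
    rows = drop_duplicates(rows)
    observed = [age for _, age in rows if age is not None]
    low, high = iqr_bounds(observed)
    # Compute the fill value from inliers only, so outliers do not skew it.
    fill = median(a for a in observed if low <= a <= high)

    cleaned, outliers = [], []
    for row_id, age in rows:
        if age is None:
            cleaned.append((row_id, fill))
        elif low <= age <= high:
            cleaned.append((row_id, age))
        else:
            outliers.append((row_id, age))
    return cleaned, outliers


cleaned, outliers = clean(raw)
print(cleaned)
print(outliers)
```

Outliers are set aside for review here instead of being deleted silently, since an extreme value may be a data-entry error or a genuine observation.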

Despite the demanding work that preparation involves, the rewards are satisfying.

Therefore, as we dive into machine learning projects, a close look at data quality should be our top priority. Machine learning models are very sensitive to data quality. In supervised learning, for instance, even small errors in the training set can lead to large errors in the system’s output.

Overfitting, where a model fits the noise in its training set instead of the underlying pattern, is more likely when the data is scarce or of low quality. Assessing and improving data quality should be the first step of any machine learning pipeline. This process includes checking for accuracy, consistency, completeness, compatibility, and timeliness to ensure that what goes into the model is right for the model.
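Some of the checks listed above can be expressed as simple per-record rules and summarized as the share of records that pass. The sketch below is only an illustration: the field names, the allowed country codes, the 0 to 120 age range, the reference date, and the one-year freshness limit are all hypothetical. Compatibility checks are left out because they depend on the target system.

```python
from datetime import date

# Hypothetical records; field names and thresholds are assumptions.
records = [
    {"age": 34, "country": "KE", "updated": date(2024, 5, 1)},
    {"age": None, "country": "ke", "updated": date(2024, 4, 20)},
    {"age": -3, "country": "UG", "updated": date(2021, 1, 15)},
    {"age": 52, "country": "TZ", "updated": date(2024, 3, 9)},
]

ALLOWED_COUNTRIES = {"KE", "UG", "TZ"}  # expected code format
AS_OF = date(2024, 6, 1)                # reference date for freshness
MAX_AGE_DAYS = 365                      # records older than this are stale


def share(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def quality_report(rows):
    """Return the share of rows passing each data quality check."""
    return {
        # Completeness: the value is present.
        "completeness": share(r["age"] is not None for r in rows),
        # Accuracy (proxy): the value falls in a plausible range.
        "accuracy": share(
            r["age"] is not None and 0 <= r["age"] <= 120 for r in rows
        ),
        # Consistency: the value uses the agreed representation.
        "consistency": share(r["country"] in ALLOWED_COUNTRIES for r in rows),
        # Timeliness: the record was updated recently enough.
        "timeliness": share(
            (AS_OF - r["updated"]).days <= MAX_AGE_DAYS for r in rows
        ),
    }


report = quality_report(records)
for check, score in report.items():
    print(f"{check}: {score:.0%}")
```

A report like this can be run before training, and a low score on any check points to the part of the data that needs attention first.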

In conclusion, data quality is a very important aspect of the data profession. Ensuring it takes hard work, but its benefits are rewarding.