Treating missing values is a core skill for a data analyst, since missing values are a common occurrence in data analysis. A dataset can contain missing values from the point of data extraction or data collection. Left untreated, they can reduce the power of your model and lead to wrong predictions or classifications, because the behavior of a variable and its relationships with other variables in your dataset cannot be analyzed correctly. In this article, I briefly explore four methods that you can use to treat missing values in your dataset.
- Deletion
Deletion is the most common and easiest way of treating missing values in a data set, and as a data analyst you will often be tempted to pick it first. There are two types of deletion: listwise deletion and pairwise deletion.
In listwise deletion, every observation in which any variable is missing is deleted. In other words, the entire row is dropped. The disadvantage of this type of deletion is that it reduces the sample size and hence can reduce the accuracy of the model.
Pairwise deletion, on the other hand, leaves an observation out only of the analyses that involve the variable whose value is missing, while still using its other variables wherever they have values. The implication is that different analyses end up with different sample sizes, which can complicate the building of models.
Although deletion is simple and convenient, it is not the best method for treating missing values in your data.
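The difference between the two types of deletion can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy rows, the column names, and the helper functions `listwise_delete` and `pairwise_values` are all hypothetical, and `None` stands for a missing value.

```python
# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 25, "income": 50000, "score": 7.0},
    {"age": 32, "income": None, "score": 6.5},
    {"age": None, "income": 61000, "score": 8.0},
    {"age": 41, "income": 72000, "score": None},
    {"age": 29, "income": 58000, "score": 7.5},
    {"age": None, "income": 45000, "score": 6.0},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_values(rows, var_a, var_b):
    """Keep the rows in which both variables of one analysis are present."""
    return [
        (r[var_a], r[var_b])
        for r in rows
        if r[var_a] is not None and r[var_b] is not None
    ]


complete = listwise_delete(rows)
age_income = pairwise_values(rows, "age", "income")
income_score = pairwise_values(rows, "income", "score")

print(len(complete))      # rows left after listwise deletion
print(len(age_income))    # rows usable for an age/income analysis
print(len(income_score))  # rows usable for an income/score analysis
```

With this toy data, listwise deletion keeps fewer rows than either pairwise analysis, and the two pairwise analyses work with different numbers of rows, which is the sample-size issue described above.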
- Mean/Mode/Median Imputation
Imputation is the process of using valid (non-missing) values to estimate missing values in a data set. The idea is to use relationships identified in the known values to estimate the missing ones. In mean/median/mode imputation, the missing values of a variable are replaced with the mean, median, or mode of its known values. Mean and median are most appropriate for quantitative data, while mode works well with qualitative data. This kind of imputation assumes the statistical homogeneity of the sample.
Mean/Median/Mode imputation can be generalized or specialized (similar case).
i. Generalized imputation
The missing value is replaced by the mean or median of all the non-missing values of that variable.
ii. Specialized imputation
The missing value is replaced by the mean or median of the non-missing values of the variable within the group of similar cases that the observation belongs to. For example, when replacing missing heights in a population, you can use the mean height of the males whose height is known to fill in the missing values for males. Similarly, for females, use the mean or median of the known heights of females.
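Both variants can be sketched with Python's standard `statistics` module. The toy height data and the helper functions `impute_generalized` and `impute_specialized` are hypothetical, chosen only to mirror the height example above, and `None` stands for a missing value.

```python
from statistics import mean

# Hypothetical toy data: heights in cm, None marks a missing value.
people = [
    {"sex": "male", "height": 180.0},
    {"sex": "male", "height": 176.0},
    {"sex": "male", "height": None},
    {"sex": "female", "height": 164.0},
    {"sex": "female", "height": 160.0},
    {"sex": "female", "height": None},
]


def impute_generalized(records, column, stat=mean):
    """Fill missing values with one statistic over all known values."""
    known = [r[column] for r in records if r[column] is not None]
    fill = stat(known)
    return [
        {**r, column: fill} if r[column] is None else dict(r)
        for r in records
    ]


def impute_specialized(records, column, group, stat=mean):
    """Fill missing values with the statistic of the record's own group."""
    fills = {}
    for g in {r[group] for r in records}:
        known = [
            r[column]
            for r in records
            if r[group] == g and r[column] is not None
        ]
        # Raises statistics.StatisticsError if a group has no known values.
        fills[g] = stat(known)
    return [
        {**r, column: fills[r[group]]} if r[column] is None else dict(r)
        for r in records
    ]


general = impute_generalized(people, "height")
special = impute_specialized(people, "height", "sex")

print(general[2]["height"], general[5]["height"])  # same fill for everyone
print(special[2]["height"], special[5]["height"])  # one fill per group
```

Passing `statistics.median` or `statistics.mode` as `stat` gives the median and mode variants. Both helpers return new records and leave the original list untouched.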
- Predictive Model
A predictive model can also be used to treat missing values in a data set. In this case, you build a model to estimate the values that will replace the missing ones. Techniques such as linear regression and ANOVA can be used for the prediction.
While creating the predictive model, divide the data set into two sets. The first set, the training set, contains the observations in which the variable is not missing. The second set, the test set, contains the observations in which it is missing. The variable with missing values is treated as the target variable. The model is then fitted on the training set and used to predict the missing values of the target variable in the test set.
When creating this model, take into account the other attributes of the training data set. This means you should explore your data well enough to understand which attributes may influence the prediction.
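As a minimal sketch of this approach, the example below fits a simple linear regression with one predictor by ordinary least squares and uses it to fill in the missing target values. The toy data, the choice of height as the only predictor of weight, and the helper `fit_line` are assumptions made for illustration; a real data set would usually call for more predictors and a proper modeling library.

```python
# Hypothetical toy data: predict a missing weight (kg) from height (cm).
records = [
    {"height": 150.0, "weight": 50.0},
    {"height": 160.0, "weight": 60.0},
    {"height": 170.0, "weight": 70.0},
    {"height": 180.0, "weight": 80.0},
    {"height": 165.0, "weight": None},
]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Training set: target known. Set to fill: target missing.
train = [r for r in records if r["weight"] is not None]
to_fill = [r for r in records if r["weight"] is None]

slope, intercept = fit_line(
    [r["height"] for r in train],
    [r["weight"] for r in train],
)

# Known rows are copied as they are; missing targets get the model's prediction.
imputed = [dict(r) for r in train] + [
    {**r, "weight": slope * r["height"] + intercept} for r in to_fill
]

print(imputed[-1]["weight"])
```

Because the model is fitted only on rows where the target is known, the quality of the imputed values depends on how well the chosen predictors actually explain the target.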
- k-Nearest Neighbor (KNN) Imputation
This method of treating missing values is based on the kNN algorithm. Here, a missing value of an attribute is imputed using a given number (k) of observations that are most similar to the observation whose value is missing. Similarity is determined with a distance metric computed over the other attributes, and the value used to replace the missing one is derived from those nearest neighbors. The advantage of KNN imputation is that it is able to predict both discrete and continuous attributes in a data set.
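The idea can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the toy data and the helper `knn_impute` are hypothetical, it assumes the feature columns are numeric, complete, and on comparable scales, and it uses Euclidean distance with the mean of the neighbors' values.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data: None marks the missing value to impute.
rows = [
    {"height": 150.0, "age": 20.0, "weight": 50.0},
    {"height": 152.0, "age": 22.0, "weight": 54.0},
    {"height": 180.0, "age": 40.0, "weight": 82.0},
    {"height": 182.0, "age": 42.0, "weight": 86.0},
    {"height": 151.0, "age": 21.0, "weight": None},
]


def knn_impute(rows, target, features, k=2):
    """Fill each missing target with the mean target of its k nearest donors."""
    # Donors are the rows in which the target value is known.
    donors = [r for r in rows if r[target] is not None]

    def distance(a, b):
        # Euclidean distance over the feature columns only.
        return sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    result = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: distance(r, d))[:k]
            fill = mean([d[target] for d in nearest])
            result.append({**r, target: fill})
        else:
            result.append(dict(r))
    return result


filled = knn_impute(rows, "weight", ["height", "age"], k=2)
print(filled[-1]["weight"])
```

For a discrete attribute, the mean of the neighbors' values would be replaced by their most frequent value, for example with `statistics.mode`.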
In conclusion, the treatment of missing values is critical when doing data exploration and preparation. It takes due diligence to end up with clean data that you can use to create better predictive models. The choice of method depends on a number of factors, including the type of data (quantitative or qualitative), the nature of the business problem, and the sample size.