When we are done with web scraping (the process of extracting data from a source), we have our data, but in an unstructured form. What we need to do now is clean that data so that we can explore it further.
According to IBM Data Analytics, you can expect to spend up to 80% of your time cleaning data.
The reason this process is so time-consuming is that raw data can contain abnormal values (also known as outliers), inconsistencies, missing values, and so on. The presence of missing values in a dataset can lead to various problems. First, the absence of data reduces statistical power, which refers to the probability that a test will reject the null hypothesis when it is false. Second, lost data can bias the estimation of parameters. Third, it can reduce the representativeness of the samples. Fourth, it may complicate the analysis of the study. Each of these distortions may threaten the validity of the trials and can lead to invalid conclusions. Let’s see how to detect missing values and replace them with an estimate.
Identification of missing values in a dataset
Before proceeding with any EDA (Exploratory Data Analysis) on our data, we need to check whether our dataset has missing values or not. Let’s create a data frame manually for our convenience.
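A small frame like the following is enough to work with. This is a sketch: the exact values are assumptions reconstructed from the discussion later in this article (10 observations, three missing entries per feature, five ‘Male’ and two ‘Female’ values, and ages summing to 360):

```python
import numpy as np
import pandas as pd

# Toy dataset: 10 observations, 3 missing values (np.nan) in each feature
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})
print(df)
```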
Now the data frame looks something like this:
Our data frame consists of two features. One is ‘Gender’, a categorical feature (a feature whose values fall into a fixed set of groups), and the other is ‘Age’, a continuous feature (one that can take an infinite number of values between any two values). We can see that there are some missing values (NaN, Not a Number) in our dataset, and we’ve spotted them easily. But some datasets have thousands of observations (rows, or records), and it is not possible to count the missing values manually for each feature (column, or field). This is where the Pandas library comes into play. It is pretty easy to find the missing values in a larger dataset with just a single line of Pandas code.
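That one line is `isnull().sum()`. A self-contained sketch, recreating the same toy frame as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Count the missing values in each feature with a single line
print(df.isnull().sum())
# Gender    3
# Age       3
```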
That’s it. Hurray! We did it! A limitation of this method is that it works only when the missing value is np.nan. There are cases where missing values are encoded as 0 or some other sentinel, depending on the features of the dataset. In that case, we can use the value_counts() method of Pandas.
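A sketch of that approach on the same toy frame. `value_counts(dropna=False)` lists every distinct value, including NaN, so sentinel values such as 0 or “?” would also show up in the tally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Tally every distinct value, keeping NaN in the count
print(df["Gender"].value_counts(dropna=False))
# Male      5
# NaN       3
# Female    2
```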
This is one method I could come up with; there may well be easier ones.
Handling and Imputation of missing values
There are several ways to handle and impute missing values in a dataset, depending on the type of the feature. First, we need to check the proportion of values that are missing in a feature relative to the total number of values in that feature. I would suggest that if more than 40% of the values in a feature are missing, you simply drop that feature from your dataset. This is not a hard rule; it’s just something from my own observation. Again, Pandas helps us find the proportion of missing values in a feature.
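One way to get that proportion is `value_counts` with `normalize=True` (or, for a single number, the mean of the null mask). A sketch on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Fraction of each distinct value, NaN included
print(df["Gender"].value_counts(dropna=False, normalize=True))
# Male      0.5
# NaN       0.3
# Female    0.2

# Or the fraction of missing values directly
print(df["Age"].isnull().mean())  # 0.3
```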
In our case, the output would be like:
By referring to the index ‘NaN’, our dataset contains 30% missing values in both features. Our job is done easily using Pandas; that’s the beauty of Pandas. If any feature turns out to have more than 40% of its data missing, we can simply drop that feature from our dataset. Again, this is not a hard rule. If you are confident that you can impute those 40% or more missing values with reasonably precise estimates, go with your strategy.
Now there are some strategies to impute missing data with some meaningful values in our feature. We can’t impute missing data with some random values just like that. Scikit-Learn has some predefined strategies for imputing missing values. They are ‘mean’, ‘median’, ‘most_frequent’, and ‘constant’. We need to implement these strategies based on the type of feature that we are going to handle. There are two cases of handling missing values.
Case 1: Handling with Categorical feature
Let’s consider the same dataset that we used in the earlier section. One of the features, ‘Gender’, with 10 observations whose values are either Male or Female, is a categorical feature. A few observations are missing in the ‘Gender’ feature, and we want to fill those missing values (NaN) with some strategy. As our feature is categorical, we fill the missing values using the ‘most_frequent’ strategy, i.e., imputing missing values with the most frequently occurring group. In our case, there are 5 values in the ‘Male’ group and 2 in ‘Female’, so the missing (NaN) values are replaced by ‘Male’. Pretty easy, isn’t it?
We can achieve this using Scikit-Learn’s SimpleImputer (sklearn.impute.SimpleImputer)
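A minimal sketch with SimpleImputer on the toy frame. SimpleImputer returns a 2-D array, so we pass the column as a one-column frame and flatten the result back:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Replace every NaN in 'Gender' with the most frequent value ('Male')
imputer = SimpleImputer(strategy="most_frequent")
df["Gender"] = imputer.fit_transform(df[["Gender"]]).ravel()
print(df["Gender"].value_counts())
# Male      8
# Female    2
```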
Case 2: Handling with Continuous feature
In this case, we are going to try two strategies: ‘mean’ and ‘median’. Let’s discuss ‘mean’ first, then the limitation of using ‘mean’ and how ‘median’ overcomes it.
(i) ‘mean’ as the imputation strategy:
As we all know, the “mean” is just the “average” we’re used to: add up all the numbers, then divide by how many numbers there are. Here we do the same thing: we compute the mean of the existing values and replace the missing values with that computed mean. Let’s find the mean of our ‘Age’ feature.
The mean is around 51.42. (i.e) (23+25+27+22+26+25+212)/7 = 360/7 = 51.428.
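Pandas can confirm that arithmetic directly (`mean()` ignores NaN by default), sketched here on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# NaN values are skipped, so this is 360 / 7
print(df["Age"].mean())  # approximately 51.43
```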
Now we can use the same impute library of Scikit-Learn by setting the strategy parameter of SimpleImputer as ‘mean’.
Now the output after imputing missing values (in both the features) would be like:
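A sketch of that step on the toy frame, imputing both features: ‘most_frequent’ for the categorical ‘Gender’ (the ‘mean’ strategy works only on numeric columns) and ‘mean’ for ‘Age’:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Each feature gets the strategy suited to its type
df["Gender"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["Gender"]]).ravel()
df["Age"] = SimpleImputer(strategy="mean").fit_transform(df[["Age"]]).ravel()
print(df)
# The three missing ages all become ~51.43, the mean skewed by the 212 outlier
```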
Wait, what? Is that so? The mean is 51.42? Yes, it is. But how could this be meaningful? The ages of most of the observations are around 22–26, except the last one, which is 212. Whoa! That’s huge! This kind of error is often seen in real-world datasets, where there is some chance of human error. The age of the last observation might really be 21; due to a typographical error, it became 212. Such values are known as outliers. If we used the correct age (21) and computed the mean, we would get somewhere around 24–25, which is pretty reasonable. A single abnormal value in a feature affects the entire distribution of that feature. This is the limitation of using ‘mean’ as the imputation strategy. Let’s see how the ‘median’ overcomes this problem.
(ii) ‘median’ as the imputation strategy:
The “median” is the “middle” value in a list of numbers. To find the median, the numbers have to be listed in numerical order from smallest to largest, so we may have to rewrite the list before we can find it. If the number of values in the list is odd, we simply take the middle value as the median. If the number of values is even, we take the mean of the two middle values as the median. This is how the median works. Let’s find the median value for our ‘Age’ feature.
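In Pandas this is a one-liner (`median()`, like `mean()`, skips NaN), sketched on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Middle value of the 7 non-missing ages, sorted
print(df["Age"].median())  # 25.0
```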
The median is 25.0 for the ‘Age’ feature. Ah, we did it! It makes much more sense compared with the mean of 51.42. Let’s visualize how we got this result. First, we arrange the values in ascending order (i.e., smallest to largest):
22, 23, 25, 25, 26, 27, 212
Since the number of values is odd, we take the middle value, 25, as the median. Thus the median gives less weight to extreme values.
Now, let’s impute the missing values with the ‘median’ as a strategy.
The resultant data frame would be:
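A sketch of median imputation on the toy frame; the three missing ages become 25.0 rather than the outlier-skewed 51.43:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male",
               np.nan, "Male", "Female", np.nan, "Male"],
    "Age": [23, 25, 27, np.nan, 22, np.nan, 26, 25, np.nan, 212],
})

# Replace each missing age with the median (25.0), which ignores the 212 outlier
df["Age"] = SimpleImputer(strategy="median").fit_transform(df[["Age"]]).ravel()
print(df["Age"])
```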
So, it always depends on the feature of the dataset. If a dataset has extreme outliers, I’ll prefer the ‘median’. E.g., 99% of household incomes are below 100 and 1% are above 500. On the other hand, if we work with the wear of clothes that customers bring to the dry-cleaner (assuming the dry-cleaners’ operators fill this field intuitively), I’ll fill the missing values with the ‘mean’ wear.
The latest version of Scikit-Learn, version 0.22, has introduced a new way of imputing missing values: predicting them with a machine learning algorithm known as k-Nearest Neighbors.
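A minimal sketch of that approach using sklearn.impute.KNNImputer. KNNImputer needs purely numeric data and is most useful with more than one feature, so this example uses a small hypothetical array with an assumed second ‘height’ column (not part of our toy frame); each missing entry is filled with the mean of its k nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data: columns are age and height (height is assumed
# here purely for illustration, so the imputer has a second feature to use)
X = np.array([
    [23.0, 160.0],
    [25.0, 165.0],
    [27.0, np.nan],
    [np.nan, 162.0],
    [22.0, 158.0],
])

# Fill each missing value with the mean of the 2 nearest neighbors,
# where distance is computed on the features both rows have in common
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```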
Want to learn about k-Nearest Neighbors? Take a look at this — Machine Learning Basics with the K-Nearest Neighbors Algorithm
If you are interested in learning Pandas, I would suggest the quality content of Kevin Markham’s playlist on Data analysis in Python with pandas, which contains 30+ videos. At the end of every video, he gives some bonus tips, with his crystal-clear explanations, and I love his video on the Top 25 tricks in Pandas.
If you have any questions, suggestions, or ideas to share, please contact me at email@example.com. Also, you can check my Github Damodhar sai & my Linkedin Profile Damodhar sai for more code and projects in machine learning. Any suggestions are welcome.