3 Advanced Techniques for Successfully Handling Missing Data — Part 2

Parth Vichare
6 min read · Mar 8, 2024

Before diving into Part-2, don't forget to review Part-1 on missing data. It's crucial for understanding what problems may occur, and the basic concepts behind handling missing data efficiently!
Handling Missing Data: Part-1

Image credit: Canva

In this article we cover Part-2: advanced techniques for handling missing data, which are most useful when it comes to larger and more complicated sets of missing values.
We are going to cover multivariate methods, which fill in missing values by considering multiple columns simultaneously.

The Problem We Need to Solve

In Part-1, we ran into problems when handling columns with more than 10% missing values.

Image by Author

By looking at the graph, you can see significant changes in the data distribution after filling in the missing values. It's essential to always use a kdeplot to compare the distribution before and after filling the missing values.
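If you want a quick pattern for that before/after comparison, here is a minimal sketch. The names df, df_imputed, and col are placeholders for your own DataFrame, its imputed copy, and the column being filled:

import seaborn as sns
import matplotlib.pyplot as plt

# Overlay the column's distribution before and after imputation
sns.kdeplot(df[col].dropna(), label='Before imputation')
sns.kdeplot(df_imputed[col], label='After imputation')
plt.legend()
plt.show()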

Multivariate Imputation Method

Multivariate imputation methods are used when a missing value in one column can impact other columns. In such cases, we need advanced techniques that fill in missing values without altering the shape or variability of the data.

Image by author

When more than 10% of the data in a column is missing, using mean or median imputation can alter the distribution. Complete Case Analysis (CCA), where we remove rows with missing values, can also hurt data quality, because it discards the values in all the other columns of the removed rows.

Therefore, we need methods that don't affect the variability, standard deviation, or shape of the data. There are three methods available to achieve this goal.
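To see one reason why, remember that mean imputation keeps the mean unchanged but shrinks the variance, since every filled cell lands exactly at the center of the distribution. A quick toy check (the series here is made up purely for illustration):

import numpy as np
import pandas as pd

# Toy series with two missing values
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
filled = s.fillna(s.mean())

print(s.var())       # ~4.92 on the observed values
print(filled.var())  # 2.95 after mean imputation: the spread shrinks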

Random Imputation Method

In the random imputation method, we fill in missing values by randomly sampling from the existing (non-missing) values in the same column, drawing as many values as there are missing entries.

import seaborn as sns
import matplotlib.pyplot as plt

# Randomly sample observed values to fill the missing ones
missing_mask = zscores_0['Glucose'].isnull()
samples = zscores_0['Glucose'].dropna().sample(missing_mask.sum(), random_state=42)
samples.index = zscores_0.index[missing_mask]  # align sampled values with the missing rows
zscores_0.loc[missing_mask, 'Glucose'] = samples

# Plotting after filling missing values with the random imputation method
plt.figure(figsize=(5, 6))
sns.kdeplot(zscores_0['Glucose'])
plt.show()

This code fills in the missing values in the 'Glucose' column of the DataFrame zscores_0 with random values drawn from the existing non-missing values in that column. First, it builds a mask of where the missing values are, then it samples that many values from the observed data, re-indexes the sample so it lines up with the missing rows, and assigns it back.

Image by author

Yes, based on the output, we have successfully implemented the random imputation method and filled all the missing values. Exciting, right?

That said, random imputation is not always the best choice for handling a large amount of missing values, as it can introduce noise and disturbance into ML models.

When to use the random imputation method

- Data is Missing Completely at Random (MCAR)
- Minimal impact on results (e.g., data distribution, variability) is acceptable
- Exploratory Data Analysis (EDA)

When it comes to handling large amounts of data, building predictive models, or working on neural network projects, there are other techniques we should know to handle missing values in an efficient manner.

KNN-Imputation Method

This topic is fascinating because it fills in missing values by considering the characteristics of nearby data points. The algorithm fills the gaps using the traits of the nearest neighboring data points.

Credit: GeeksforGeeks example

Based on the table provided, every feature has missing values, so the KNN imputer is the best choice here: it fills those values using the nearest neighboring data points with similar characteristics and features.
A data point here is a row, and each point carries all the characteristics (features).

Euclidean Distance Equation

In KNN imputation, we use the Euclidean distance formula to find the nearest data points with similar features when a column has missing values.
Essentially, the KNN imputer calculates the NaN-Euclidean distance between the row with the missing value and every other data point.

KNN Imputer Equation

In our table, the 2nd row has a missing value in the Feature-1 column:
Feature-1 (col) = 33, ___, 23, 40, 35
So we will calculate the NaN-Euclidean distance between the row with the missing value and every other row.

NaN-Euclidean formula:
dist(x, y) = sqrt(weight * Σ(xi - yi)^2)
where,
weight = (total no. of coordinates) / (no. of present coordinates)
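scikit-learn exposes this exact formula as nan_euclidean_distances, so you can verify the weighting yourself. A minimal sketch with made-up points:

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# One of the three coordinates is missing, so weight = 3/2
x = [[3, np.nan, 5]]
y = [[1, 0, 0]]

# sqrt(3/2 * ((3-1)^2 + (5-0)^2)) = sqrt(43.5) ≈ 6.60
print(nan_euclidean_distances(x, y))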

Distance from Point 1 (two coordinates present, so weight = 3/2):
d = sqrt(3/2 * (4 + 81)) = sqrt(127.5)
distance = 11.29

Distance from Point 3 (all three coordinates present, so weight = 3/3):
d = sqrt(3/3 * ((51-45)^2 + (71-68)^2 + (18-12)^2)) = sqrt(36 + 9 + 36)
distance = 9

Distance from Point 4 (one coordinate present, so weight = 3/1):
d = sqrt(3/1 * 9) = sqrt(27)
distance = 5.19

Distance from Point 5 (two coordinates present, so weight = 3/2):
d = sqrt(3/2 * (25 + 121)) = sqrt(219)
distance = 14.79

We now have the distances of all points = [11.29, __, 9, 5.19, 14.79].
K = 2, which represents the number of nearest neighbors to consider.
The smallest distances are 5.19 and 9, so the nearest points are those at indices [2, 3].
We take the mean of the Feature-1 values at indices 2 and 3:
mean = (23 + 40) / 2 = 31.5

So the 2nd-row missing value is filled with 31.5 according to the NaN-Euclidean distance.
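To sanity-check these mechanics in code, here is a small sketch using KNNImputer. The Feature-1 column matches the worked example, but Feature-2 and Feature-3 are hypothetical values I picked so that points 3 and 4 come out as the two nearest neighbors; they are not the original table:

import numpy as np
from sklearn.impute import KNNImputer

# Feature-1 matches the worked example; Feature-2/3 are hypothetical
X = np.array([
    [33.,    67., 12.],
    [np.nan, 68., 18.],  # row 2: Feature-1 is missing
    [23.,    71., 18.],
    [40.,    71., 15.],
    [35.,    79., 10.],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed[1, 0])  # 31.5, the mean of the two nearest rows' Feature-1 (23 and 40)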

Hopefully, working through this equation gives you a better idea of how the Euclidean distance helps fill in missing values based on the features and characteristics around the missing data points.

Code Implementation:

# KNN imputation method

from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt

# Impute missing 'BloodPressure' values using the 2 nearest rows
X = zscores_0[['Glucose', 'BloodPressure']]
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
X_imputed[0:3]  # preview the first three imputed rows

zscores_0['BloodPressure'] = X_imputed[:, 1]  # column 1 is the imputed 'BloodPressure'

# Compare the distribution before and after imputation
plt.hist(zscores_0['BloodPressure'], alpha=0.5, label='After imputation')
plt.hist(X['BloodPressure'].dropna(), alpha=0.5, label='Before imputation')
plt.xlabel('Blood Pressure')
plt.legend()
plt.show()
Image by author

Based on the histogram, variance, and mean values, we can conclude that the KNN imputation method was implemented successfully. We can also observe slight differences in the distribution and other statistics, which are expected and manageable.
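To put numbers on those "slight differences", compare summary statistics before and after imputation. A minimal sketch, assuming X still holds the pre-imputation copy from the code above:

# pandas skips NaNs by default, so 'before' uses only the observed values
before = X['BloodPressure']         # still contains the original NaNs
after = zscores_0['BloodPressure']  # the imputed column

print(f"mean: {before.mean():.2f} -> {after.mean():.2f}")
print(f"std:  {before.std():.2f} -> {after.std():.2f}")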

Effect of Handling Missing Values on Analysis

When a business conducts market research or surveys to build predictive models, it's crucial to deal with missing values and outliers in the dataset, because these issues can greatly affect the accuracy of the analysis. To avoid such mistakes, it's important to learn how to handle a large amount of missing data in a huge dataset.

Handling missing values is an ongoing process, and there are many techniques to learn. In our upcoming article series, "Getting your DATA ML-ready," we'll cover more methods for handling missing values.

For now, in Part-1 and Part-2, we’ve discussed efficient techniques that you can use during data processing to prepare your data.

Conclusion

We've successfully covered the fundamental concepts of data preparation, focusing on the univariate and multivariate imputation methods used for handling missing values.

Follow my Getting Your Data ML-Ready series for weekly articles full of valuable tips on preparing your data for ML. Learn how to transform raw data into the ideal format for training your models. Don't miss out on the latest techniques to supercharge your data prep!

If you enjoyed this article, show your support by giving it a clap! Share your thoughts on the content and your experiences with handling missing values! 🙏
If you have any doubts regarding Part-1 or Part-2, feel free to connect with me directly. I'd love to help you out and grow together.


Parth Vichare

Hi all, I'm an aspiring Data Scientist eager to share insights on ML algorithms. Let's connect and grow together: https://www.linkedin.com/in/parth-vichare-29a407281