Machine Learning

How to deal with imbalanced datasets

Ioana Zaman
Published in Towards AI
3 min read · Jul 25, 2021


What is an imbalanced dataset?

It is a dataset in which the examples are unequally distributed across classes: most examples belong to one class, while the other class or classes contain far fewer. Typical examples are fraud detection or the detection of a rare disease.

The problem with an imbalanced dataset is that the model tends to classify all examples as belonging to the majority class. In this situation, it is generally more important that the examples in the minority class are classified correctly than those in the majority class.

How to get the best out of an imbalanced dataset?

Resampling

Resampling the training set is one way to transform an imbalanced dataset into a balanced one. It can be done with the following two approaches:

  • Under-sampling: the majority class is reduced so that its size is similar to that of the minority class. This tactic is suitable for large datasets because the new balanced dataset consists only of the samples from the minority class and randomly selected elements from the majority one.
  • Oversampling: the opposite of the method described above, as the name suggests. In this case, the number of rare samples is increased. New samples can be generated by bootstrapping, repetition, or specific algorithms such as SMOTE or ADASYN. Compared to under-sampling, over-sampling can be computationally expensive.
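Both approaches can be sketched with plain NumPy; the toy data and the 90/10 class split below are illustrative assumptions, not part of the article.

```python
# Random under- and over-sampling on a toy imbalanced dataset.
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy set: 90 majority (class 0) and 10 minority (class 1) samples.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Under-sampling: keep all minority samples plus an equally sized
# random subset of the majority class.
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep_maj, min_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Over-sampling: repeat minority samples (sampling with replacement)
# until both classes have the same size.
extra_min = rng.choice(min_idx, size=len(maj_idx), replace=True)
over_idx = np.concatenate([maj_idx, extra_min])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [10 10]
print(np.bincount(y_over))   # [90 90]
```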

Creating an ensemble

Another method is to create an ensemble of models, by which the imbalance can be eliminated without losing or adding data. In this approach, the abundant class is split into chunks. For each chunk, a model is trained on that chunk together with the rare class, and the final result is the one given by the majority of votes from all models. This method is flexible because in most cases the chunks contain as many samples as the rare class, but some of them can have a 2:1 or even 3:1 ratio.
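The chunk-ensemble idea above can be sketched as follows, assuming scikit-learn and a well-separated toy dataset (both are my assumptions for illustration):

```python
# Split the majority class into minority-sized chunks, train one model
# per chunk (each paired with the full minority class), and combine the
# predictions by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, size=(90, 2))  # majority class, centered at 0
X_min = rng.normal(loc=3.0, size=(10, 2))  # minority class, centered at 3

chunks = np.array_split(rng.permutation(X_maj), 9)  # 9 chunks of ~10 samples

models = []
for chunk in chunks:
    X_train = np.vstack([chunk, X_min])
    y_train = np.array([0] * len(chunk) + [1] * len(X_min))
    models.append(DecisionTreeClassifier(random_state=0).fit(X_train, y_train))

def vote(X_new):
    # Majority vote across all chunk models.
    preds = np.stack([m.predict(X_new) for m in models])
    return (preds.mean(axis=0) >= 0.5).astype(int)

# Points near each cluster centre.
print(vote(np.array([[3.0, 3.0], [0.0, 0.0]])))
```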

Boosting

This is one of the most popular categories of ensemble learning. Boosting is a very useful technique in general, and even more so for imbalanced data. In boosting, weak models are combined into a strong model by learning from previous mistakes. Why is this suited for imbalanced datasets? Because each successive iteration gives higher weight to misclassified examples, which often belong to the minority class. Examples of boosting methods are AdaBoost, XGBoost, and Gradient Boosting.
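As a minimal sketch, AdaBoost is available directly in scikit-learn; the dataset and parameters below are illustrative assumptions.

```python
# Boosting on an imbalanced toy problem with scikit-learn's AdaBoost.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

# A roughly 95:5 imbalanced binary problem.
X, y = make_classification(n_samples=400, weights=[0.95, 0.05], random_state=0)

# Each boosting round re-weights the samples the previous weak learners
# misclassified, so hard (often minority) examples get more attention.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```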

Not just random

An original idea is proposed on Quora. The author explains how to under-sample in a less random way. The majority class is divided into clusters, the number of clusters being approximately the number of samples in the minority class. Next, the samples chosen from the majority class are just the medoids. A medoid is like a centroid but restricted to being an actual point of the dataset. The resulting dataset is balanced and contains the most relevant samples.
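A sketch of this cluster-based under-sampling, assuming scikit-learn and using the point nearest each k-means centroid as the medoid (the toy data is my own):

```python
# Cluster the majority class into as many clusters as there are minority
# samples, then keep one medoid per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(90, 2))           # majority class
X_min = rng.normal(loc=3.0, size=(10, 2))  # minority class

km = KMeans(n_clusters=len(X_min), n_init=10, random_state=0).fit(X_maj)

# The medoid of each cluster: the majority sample nearest its centroid,
# so every kept point is a real sample from the dataset.
medoid_idx = pairwise_distances_argmin(km.cluster_centers_, X_maj)
X_balanced = np.vstack([X_maj[medoid_idx], X_min])
y_balanced = np.array([0] * len(medoid_idx) + [1] * len(X_min))

print(np.bincount(y_balanced))  # [10 10]
```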

A proper Python package

All these techniques can be applied individually or combined. One simple way to do that is by using the imbalanced-learn Python package. It is a very useful tool for problems caused by imbalanced datasets, with methods created specifically for this kind of data.

A suitable algorithm

Naïve Bayes probabilistic classifiers (especially ComplementNB) are known to perform better on imbalanced data. Another solution is an SVM. A classic SVM may not give the expected results, but by adjusting the class weights, which scale the C misclassification penalty per class, results can improve.
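Both options are available in scikit-learn; the data, the feature shift, and `class_weight="balanced"` below are illustrative assumptions rather than the article's exact setup.

```python
# ComplementNB and a class-weighted SVM on imbalanced toy data.
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# ComplementNB expects non-negative features (e.g., counts), so shift.
cnb = ComplementNB().fit(X - X.min(), y)

# class_weight='balanced' scales C inversely to class frequency, so
# minority-class errors cost more; a dict such as {1: 10} also works.
svm = SVC(class_weight="balanced").fit(X, y)
print(svm.score(X, y))
```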

How to evaluate a model based on an imbalanced dataset?

An essential aspect of a machine learning model is its evaluation. Even if accuracy is a popular choice for some models, for a model trained on an imbalanced dataset it is a big NO. Accuracy measures how many samples are correctly classified, so it will be very high even if the model always predicts the majority class. A more suitable metric is the F1 score, defined as the harmonic mean of precision and recall.
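A small illustration of the accuracy trap, assuming scikit-learn's metrics: a model that always predicts the majority class scores high accuracy but an F1 of zero on the minority class.

```python
# Accuracy vs. F1 for a degenerate majority-class predictor.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 95:5 imbalance
y_pred = np.zeros(100, dtype=int)       # always predict the majority class

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(f1_score(y_true, y_pred))         # 0.0  -- reveals the problem
```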

It is difficult to work with imbalanced data, but in some cases it is inevitable. The two most important decisions in such a situation are how to handle the data and which strategies to apply.

