Yes, You Can Build Your Own Custom Sklearn Transformers. Here Is How
Transformers for any preprocessing scenario
Learn to write custom Sklearn preprocessing transformers that make your code exceptional.
Introduction
Single fit, single predict: how awesome would that be?
You get the data, fit your pipeline just one time, and it takes care of everything: preprocessing, feature engineering, modeling, everything. All you have to do is call predict and collect the output.
What kind of pipeline is that powerful? Sklearn has many transformers, but it doesn’t have one for every imaginable preprocessing scenario. So, is such a pipeline a pipe dream?
Absolutely not. Today, we will learn how to create custom Sklearn transformers that enable you to integrate virtually any function or data transformation into Sklearn’s Pipeline classes.
What are Sklearn pipelines?
Below is a simple pipeline that imputes the missing values in numeric data, scales them, and fits an XGBRegressor to X, y:
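Here is a minimal sketch of what such a pipeline could look like. The toy data, step names, and hyperparameters are illustrative choices of mine, not a prescription:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

# Toy numeric data with some artificial missing values
X, y = make_regression(n_samples=200, n_features=5, random_state=42)
X[::10, 0] = np.nan

pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),  # fill in missing values
    ("scale", StandardScaler()),                 # scale the features
    ("model", XGBRegressor()),                   # fit the regressor
])

pipeline.fit(X, y)           # single fit
preds = pipeline.predict(X)  # single predict
```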
I have talked at length about the nitty-gritty of Sklearn pipelines and their benefits in an older post.
The most notable advantages of pipelines are their ability to collapse all preprocessing and modeling steps into a single estimator, the protection against data leakage that comes from never calling fit on validation sets, and, as an added bonus, code that is concise, reproducible, and modular.
But this whole idea of atomic, neat pipelines breaks when we need to perform operations that are not built into Sklearn as estimators. For example, what if you need to extract regex patterns to clean text data? What do you do if you want to create a new feature combining existing ones based on domain knowledge?
To keep all the benefits that come with pipelines, you need a way to integrate your custom preprocessing and feature engineering logic into Sklearn. That’s where custom transformers come into play.
Integrating simple functions with FunctionTransformer
In the September 2021 TPS competition on Kaggle, one of the ideas that significantly boosted model performance was adding the number of missing values in a row as a new feature. This is a custom operation, not implemented in Sklearn, so we will have to implement it ourselves. First, let's import the data:
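Something along these lines; the file name and the target column are assumptions on my part, so adjust them to your copy of the competition data:

```python
import pandas as pd

# Assumed file and target column names for the TPS September 2021 data
X = pd.read_csv("train.csv", index_col="id")
y = X.pop("claim")
```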
Let’s create a function that takes a DataFrame as input and implements the above operation:
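A sketch of such a function (the function and feature names are my own):

```python
def num_missing_row(X: pd.DataFrame, y=None) -> pd.DataFrame:
    # Count the missing values in each row and append them as a new feature
    num_missing = X.isnull().sum(axis=1)
    return X.assign(num_missing=num_missing)
```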
Now, adding this function into a pipeline is just as easy as passing it to FunctionTransformer:
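Like so (the estimator name is arbitrary):

```python
from sklearn.preprocessing import FunctionTransformer

num_missing_estimator = FunctionTransformer(num_missing_row)
```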
Passing a custom function to FunctionTransformer creates an estimator with fit, transform, and fit_transform methods:
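Continuing with the data we loaded above:

```python
# fit is a no-op for a stateless function, but it keeps the API consistent
num_missing_estimator.fit(X)
X_with_feature = num_missing_estimator.transform(X)

# Or do both in one call
X_with_feature = num_missing_estimator.fit_transform(X)
```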
Since we have a simple function, there is no need to call fit, as it just returns the estimator untouched. The only requirement of FunctionTransformer is that the passed function should accept the data as its first argument. Optionally, you can pass the target array as well if you need it inside the function:
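For instance, with a placeholder function of my own, one way to hand over the target is FunctionTransformer's kw_args argument:

```python
def tweak_features(X, y=None):
    # X must be the first argument; y is optional
    if y is not None:
        ...  # use the target here, e.g., for target-based statistics
    return X

# transform() only receives X, so the target goes in through kw_args
custom_estimator = FunctionTransformer(tweak_features, kw_args={"y": y})
```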
FunctionTransformer also accepts an inverse of the passed function if you ever need to revert the changes:
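For example, a self-reverting log transformer built from NumPy's log1p/expm1 pair (my choice here, since it also sidesteps zeros):

```python
import numpy as np

X_num = np.array([[0.0, 1.0], [2.0, 3.0]])  # toy non-negative data

log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X_logged = log_transformer.fit_transform(X_num)
X_restored = log_transformer.inverse_transform(X_logged)  # back to X_num
```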
Check out the documentation for details on other arguments.
Integrating more complex preprocessing steps with custom transformers
One of the most common scaling options for skewed data is a logarithmic transform. But here is a caveat: if a feature contains even a single 0, the common np.log function breaks down, because the logarithm of 0 is undefined. So, as a workaround, Kagglers add 1 to all samples and then apply the logarithmic transform.
Custom transformations like that require inverse transformations as well. For logarithms, you need to apply the exponential function to the transformed array and subtract 1. Here is what it looks like in code:
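A sketch with plain NumPy, where X_num stands in for any non-negative numeric array:

```python
import numpy as np

X_num = np.array([[0.0, 1.0], [2.0, 3.0]])  # placeholder data containing a zero

X_logged = np.log(X_num + 1)       # forward: shift by 1, then take the log
X_restored = np.exp(X_logged) - 1  # inverse: exponentiate, then shift back
```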
This works, but we have the same old problem: we can't include it in a pipeline out of the box. Sure, we could use our newfound friend FunctionTransformer, but it is not well-suited for more complex preprocessing steps such as this.
Instead, we will write a custom transformer class and implement the fit and transform methods manually. In the end, we will again have a Sklearn-compatible estimator that we can pass into a pipeline. Let's start:
We first create a class that inherits from the BaseEstimator and TransformerMixin classes of sklearn.base. Inheriting from these classes allows Sklearn pipelines to recognize our class as a custom estimator.
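In its barest form, with a class name I picked for this walkthrough:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    pass
```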
Then, we will write the __init__ method, where we initialize an instance of PowerTransformer:
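A sketch; the attribute name _estimator is my own choice:

```python
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # PowerTransformer will do the heavy lifting of the scaling
        self._estimator = PowerTransformer()
```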
Next, we write the fit method, where we add 1 to all features in the data and fit the PowerTransformer:
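Continuing the class body from above:

```python
# (goes inside the CustomLogTransformer class)
def fit(self, X, y=None):
    X_copy = np.copy(X) + 1      # shift so zeros become valid for the transform
    self._estimator.fit(X_copy)  # learn the transform parameters
    return self
```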
The fit method should return the transformer itself, which is done by returning self. Let's test what we have done so far:
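Trying it on some toy numeric data (X_num here is a stand-in I generate just for the check):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
X_num = rng.random((100, 3))  # toy non-negative data

custom_log = CustomLogTransformer()
custom_log.fit(X_num)  # returns the fitted transformer itself
```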
Working as expected, so far.
Next, we have the transform method, in which we use the transform method of PowerTransformer after adding 1 to the passed data:
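Again as a sketch, continuing the class body:

```python
# (goes inside the CustomLogTransformer class)
def transform(self, X):
    X_copy = np.copy(X) + 1                   # same shift as in fit
    return self._estimator.transform(X_copy)
```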
Let’s make another check:
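With the toy X_num from before:

```python
custom_log = CustomLogTransformer()
custom_log.fit(X_num)
X_transformed = custom_log.transform(X_num)
```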
Working as expected. Now, as I said earlier, we need a method for reverting the transform:
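A sketch that undoes both the power transform and the +1 shift:

```python
# (goes inside the CustomLogTransformer class)
def inverse_transform(self, X):
    X_reversed = self._estimator.inverse_transform(np.copy(X))
    return X_reversed - 1  # undo the +1 shift from fit/transform
```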
We also could have used np.exp instead of inverse_transform. Now, let's make a final check:
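One more round trip on the toy data:

```python
custom_log = CustomLogTransformer()
X_transformed = custom_log.fit_transform(X_num)
X_restored = custom_log.inverse_transform(X_transformed)
```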
But wait! We didn't write fit_transform, so where did that come from? It is simple: when you inherit from BaseEstimator and TransformerMixin, you get a fit_transform method for free.
After the inverse transform, you can compare it with the original data:
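Since floating-point round trips are rarely bit-exact, np.allclose is a handy way to compare:

```python
np.allclose(X_num, X_restored)  # True, up to floating-point precision
```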
Now, we have a custom transformer ready to be included in a pipeline. Let’s put everything together:
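A sketch of the assembled pipeline; the step order and the choice of imputer are my assumptions, so plug in your own:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

full_pipeline = Pipeline(steps=[
    ("num_missing", FunctionTransformer(num_missing_row)),  # custom function
    ("impute", SimpleImputer(strategy="mean")),
    ("log", CustomLogTransformer()),                        # custom transformer
    ("model", XGBRegressor()),
])

full_pipeline.fit(X, y)
```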
Even though the log transform hurt the score, we got our custom pipeline working!
In short, the signature of your custom transformer class should be like this:
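In skeleton form, with placeholder bodies:

```python
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X

    def inverse_transform(self, X, y=None):
        return X
```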
This way, you get fit_transform for free. If you don't need any of the __init__, fit, transform, or inverse_transform methods, omit them and the parent Sklearn classes will take care of everything. The logic of these methods is entirely up to your needs and your skills.
Wrapping up…
Writing good code is a skill developed over time. You will realize that a big part of it comes from using the existing tools and libraries at the right time and place without reinventing the wheel.
One such tool is Sklearn pipelines, and custom transformers are just extensions of them. Use them well, and you will produce quality code with little effort.
Thank you for reading!