
Yes, You Can Build Your Own Custom Sklearn Transformers. Here Is How

Transformers for any preprocessing scenario

Bex T.
Towards AI


Learn to write custom Sklearn preprocessing transformers that make your code exceptional.

Image by me with Midjourney

Introduction

Single fit, single predict—how awesome would that be?

You get the data, fit your pipeline just one time, and it takes care of everything — preprocessing, feature engineering, modeling, everything. All you have to do is call predict and have the output.

What kind of pipeline is that powerful? Sklearn has many transformers, but it doesn’t have one for every imaginable preprocessing scenario. So, is such a pipeline a pipe dream?

Absolutely not. Today, we will learn how to create custom Sklearn transformers that enable you to integrate virtually any function or data transformation into Sklearn’s Pipeline classes.

What are Sklearn pipelines?

Below is a simple pipeline that imputes the missing values in numeric data, scales them, and fits an XGBRegressor to X, y:

A sample Sklearn pipeline.

I have talked at length about the nitty-gritty of Sklearn pipelines and their benefits in an older post.

The most notable advantages of pipelines are that they collapse all preprocessing and modeling steps into a single estimator, prevent data leakage by never calling fit on validation sets, and, as an added bonus, make the code concise, reproducible, and modular.

But this whole idea of atomic, neat pipelines breaks when we need to perform operations that are not built into Sklearn as estimators. For example, what if you need to extract regex patterns to clean text data? What do you do if you want to create a new feature combining existing ones based on domain knowledge?

To keep all the benefits that come with pipelines, you need a way to integrate your custom preprocessing and feature engineering logic into Sklearn. That’s where custom transformers come into play.

Integrating simple functions with FunctionTransformer

In the September 2021 Tabular Playground Series (TPS) competition on Kaggle, one of the ideas that boosted model performance significantly was adding the number of missing values in a row as a new feature. This is a custom operation, not implemented in Sklearn, so let’s create a function to achieve that after importing the data:

Data source: Kaggle
Find the number of missing values across rows.

Let’s create a function that takes a DataFrame as input and implements the above operation:

A function that creates two new features to record the number of missing values in a row and row-wise standard deviation.
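A sketch of such a function (the exact statistics in the original screenshot may differ; here I record the row-wise count of missing values and the row-wise standard deviation of the missingness indicator):

```python
import pandas as pd

def num_missing_row(X: pd.DataFrame, y=None) -> pd.DataFrame:
    # Compute the missingness mask once, before adding new columns,
    # so the new features don't influence each other
    X = X.copy()
    missing = X.isnull()
    X["num_missing"] = missing.sum(axis=1)
    X["num_missing_std"] = missing.std(axis=1)
    return X
```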

Now, adding this function into a pipeline is just as easy as passing it to the FunctionTransformer:

Converting a native Python function into an Sklearn transformer.

Passing a custom function to FunctionTransformer creates an estimator with fit, transform and fit_transform methods:

Applying the new custom Sklearn transformer `num_missing_estimator`
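Wrapping and applying the function might look like this (the feature function is redefined in simplified form so the snippet is self-contained):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def num_missing_row(X, y=None):
    # Add the row-wise count of missing values as a new feature
    X = X.copy()
    X["num_missing"] = X.isnull().sum(axis=1)
    return X

# One line turns the plain function into a pipeline-compatible estimator
num_missing_estimator = FunctionTransformer(num_missing_row)

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, 2.0, 3.0]})
transformed = num_missing_estimator.fit_transform(df)  # fit is a no-op here
```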

Since we have a simple function, there is no need to call fit as it just returns the estimator untouched. The only requirement of FunctionTransformer is that the passed function should accept the data as its first argument. Optionally, you can pass the target array as well if you need it inside the function:

The signature for functions that can be converted into Sklearn transformers.

FunctionTransformer also accepts an inverse of the passed function if you ever need to revert the changes:

A signature for inverse transformation functions for custom transformers.
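In code, the expected signatures look like this (log1p/expm1 are just illustrative choices for a function and its inverse):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def custom_function(X, y=None):
    # The data must be the first argument; the target is optional
    return np.log1p(X)

def inverse_of_custom(X, y=None):
    # Reverts what custom_function did
    return np.expm1(X)

transformer = FunctionTransformer(func=custom_function,
                                  inverse_func=inverse_of_custom)

X = np.array([[0.0, 1.0], [2.0, 3.0]])
round_trip = transformer.inverse_transform(transformer.fit_transform(X))
```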

Check out the documentation for details on other arguments.

Integrating more complex preprocessing steps with custom transformers

One of the most common scaling options for skewed data is a logarithmic transform. But here is a caveat: if a feature contains even a single 0, the common np.log function produces -inf for that sample (along with a divide-by-zero warning), which breaks downstream modeling. So, as a workaround, Kagglers add 1 to all samples and then apply the logarithmic transform.

Custom transformations like that require inverse transformations as well. For logarithms, you apply the exponential function to the transformed array and subtract 1. Here is what it looks like in code:

Reversing a logarithmic transformation using the exponential function and subtracting 1 afterwards.
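The workaround and its inverse, sketched with a small array:

```python
import numpy as np

X = np.array([0.0, 1.0, 10.0, 100.0])

X_log = np.log(X + 1)           # add 1 so that zeros don't produce -inf
X_restored = np.exp(X_log) - 1  # inverse: exponentiate, then subtract 1
```

NumPy also ships np.log1p and np.expm1, which compute exactly this pair with better numerical accuracy for small values.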

This works, but we have the same old problem — we can’t include this into a pipeline out of the box. Sure, we could use our newfound friend FunctionTransformer, but it is not well-suited for more complex preprocessing steps such as this.

Instead, we will write a custom transformer class and implement the fit and transform methods manually. In the end, we will again have a Sklearn-compatible estimator that we can pass into a pipeline. Let's start:

The class definition for a complex custom transformer.

We first create a class that inherits from the BaseEstimator and TransformerMixin classes of sklearn.base. Inheriting from these classes allows Sklearn pipelines to recognize our class as a custom estimator.

Then, we will write the __init__ method, where we initialize an instance of PowerTransformer:

Writing the __init__ method for the custom class.
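A sketch of the class so far (the name CustomLogTransformer follows the article's later references; the private attribute name is my own choice):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        # Hold an inner PowerTransformer that will do the heavy lifting
        self._estimator = PowerTransformer()
```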

Next, we write the fit where we add 1 to all features in the data and fit the PowerTransformer:

Writing the logic of the fit function where we fit the estimator to a copy of X

The fit method should return the transformer itself, which is done by returning self. Let's test what we have done so far:

Testing the current class.
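Continuing the sketch, fit and a quick check might look like this:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1        # shift so zeros survive the log
        self._estimator.fit(X_copy)
        return self                    # fit must return the transformer itself

custom_log = CustomLogTransformer()
fitted = custom_log.fit(np.random.rand(10, 3))
```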

Working as expected, so far.

Next, we have the transform, in which we use the transform method of PowerTransformer after adding 1 to the passed data:

Writing the transform method.

Let’s make another check:

Checking again.
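With transform added, the class and the check might look like this:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        # Apply the same +1 shift before delegating to PowerTransformer
        return self._estimator.transform(np.copy(X) + 1)

X = np.random.rand(10, 3)
transformed = CustomLogTransformer().fit(X).transform(X)
```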

Working as expected. Now, as I said earlier, we need a method for reverting the transform:

Writing the inverse transform method using the `inverse_transform` method of PowerTransformer.

Note that since we fit a PowerTransformer rather than a plain logarithm, its inverse_transform (not np.exp) is the correct way to revert the transformation. Now, let's make a final check:

Checking.
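The complete class and a round-trip check, as a self-contained sketch:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

    def inverse_transform(self, X):
        # Undo PowerTransformer first, then undo the +1 shift
        return self._estimator.inverse_transform(np.copy(X)) - 1

X = np.random.rand(20, 3)
custom_log = CustomLogTransformer()
X_trans = custom_log.fit_transform(X)   # fit_transform comes free from TransformerMixin
X_back = custom_log.inverse_transform(X_trans)
```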

But wait! We didn’t write fit_transform - where did that come from?

It is simple — when you inherit from BaseEstimator and TransformerMixin, you get a fit_transform method for free.

After the inverse transform, you can compare it with the original data:

Now, we have a custom transformer ready to be included in a pipeline. Let’s put everything together:

Adding the two custom transformers, `num_missing_row` and `CustomLogTransformer` into a pipeline.
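A sketch of the combined pipeline. I insert a SimpleImputer between the two custom steps, since PowerTransformer cannot handle NaNs, and leave the final model step commented out so the snippet focuses on the transformers:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
from sklearn.base import BaseEstimator, TransformerMixin

def num_missing_row(X, y=None):
    X = X.copy()
    X["num_missing"] = X.isnull().sum(axis=1)
    return X

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()

    def fit(self, X, y=None):
        self._estimator.fit(np.copy(X) + 1)
        return self

    def transform(self, X):
        return self._estimator.transform(np.copy(X) + 1)

pipeline = Pipeline(steps=[
    ("num_missing", FunctionTransformer(num_missing_row)),
    ("impute", SimpleImputer(strategy="mean")),  # PowerTransformer can't take NaNs
    ("log", CustomLogTransformer()),
    # ("model", XGBRegressor()),  # append the model as the final step
])

df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0],
                   "b": [0.0, 2.0, None, 4.0]})
out = pipeline.fit_transform(df)  # 2 original columns + num_missing = 3
```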

Even though the log transform hurt the score, we got our custom pipeline working!

In short, the signature of your custom transformer class should be like this:

The final class signature for writing custom complex transformers in Sklearn.
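As a skeleton, with every method reduced to a no-op you can fill in:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass                     # store hyperparameters or inner estimators here

    def fit(self, X, y=None):
        return self              # always return self

    def transform(self, X):
        return X                 # your transformation logic goes here

    def inverse_transform(self, X):
        return X                 # optional: revert the transformation

identity = CustomTransformer()
X = np.arange(6).reshape(2, 3)
out = identity.fit_transform(X)  # provided for free by TransformerMixin
```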

This way, you get fit_transform for free. If you don't need one of the __init__, fit, transform, or inverse_transform methods, omit it, and the parent Sklearn classes will take care of the rest. The logic of these methods is entirely up to your needs and skills.

Wrapping up…

Writing good code is a skill developed over time. You will realize that a big part of it comes from using the existing tools and libraries at the right time and place without reinventing the wheel.

One such tool is Sklearn pipelines, and custom transformers are just extensions of them. Use them well, and you will produce quality code with little effort.

Thank you for reading!
