Encoding Categorical Data: A Step-by-Step Guide
Imagine you’re baking a cake, but instead of sugar, flour, and eggs, you have words like “vanilla,” “chocolate,” and “strawberry” on your countertop. As much as you’d like to start, there’s a problem — your recipe can only follow numeric measurements, not words. This is exactly what happens when you try to feed categorical data into a machine-learning model. The model needs numbers to work its magic, not strings of text.
In this hands-on tutorial, we’ll unravel the mystery of encoding categorical data so your models can process it with ease. We’ll break down the types of categorical data, discuss when and why each encoding method is used, and dive into Python code examples that show exactly how to get the job done.
Understanding Categorical Data
Before we start transforming data, let’s get our definitions straight. In the world of data, you generally have two types: numerical and categorical. Machine learning models can easily understand numbers — no surprise there! But when it comes to words or labels, we need to convert these into numbers to help our models “understand” the data.
Types of Categorical Data
- Ordinal Data: Ordinal data is like your favorite Netflix ranking list — it's ordered, but the intervals between the ranks aren't necessarily equal. For instance, if you have a dataset of student grades (Poor, Average, Good), you can see that "Good" is better than "Average," and "Average" is better than "Poor." This inherent order is what makes it "ordinal."
- Nominal Data: On the other hand, nominal data is like choosing your favorite ice cream flavor — there's no logical order to the choices. Whether it's "Vanilla," "Chocolate," or "Strawberry," one isn't inherently better or worse than the others. Here, the categories are simply different, without any ranking or comparison.
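To make the distinction concrete, here's a quick sketch of a table holding both kinds of data. The dataset is made up for illustration, and it assumes you have pandas installed:

import pandas as pd

# Hypothetical toy dataset: one ordinal column, one nominal column
df = pd.DataFrame({
    "grade": ["Poor", "Good", "Average", "Good"],                 # ordinal: ordered categories
    "flavor": ["Vanilla", "Chocolate", "Strawberry", "Vanilla"],  # nominal: no order
})
print(df)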
Why Encoding is Necessary
Machine learning models can’t work directly with categorical data — especially when that data comes in the form of words or labels.
The models require numeric input, so we must convert those categories into numbers. This process is known as encoding categorical data.
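At its simplest, encoding just means building a mapping from categories to numbers. Here's a minimal hand-rolled sketch (the mapping itself is invented) before we reach for any library:

# Hypothetical mapping: each grade gets a number that respects its order
grade_map = {"Poor": 0, "Average": 1, "Good": 2}
grades = ["Good", "Poor", "Average"]
encoded = [grade_map[g] for g in grades]
print(encoded)  # [2, 0, 1]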
Types of Encoding Techniques
To handle different types of categorical data, there are specific encoding techniques you can use:
- Ordinal Encoding
- One Hot Encoding
- Label Encoding
Let’s break down each of these with Python code examples.
1. Ordinal Encoding
Use Case:
Ordinal Encoding is the go-to technique for transforming ordinal data — categories with a meaningful order but no fixed interval between them.
Example:
Let’s say you have a column in your dataset representing education levels: “High School,” “Bachelor’s,” and “Master’s.” We know that “Master’s” is higher than “Bachelor’s,” which is higher than “High School.” Here’s how you can encode it:
from sklearn.preprocessing import OrdinalEncoder

# Define the data and spell out the intended order of the categories.
# Without an explicit order, OrdinalEncoder sorts categories alphabetically,
# which would scramble the meaning here.
education_levels = [["High School"], ["Bachelor's"], ["Master's"]]
encoder = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's"]])
encoded_levels = encoder.fit_transform(education_levels)
print(encoded_levels)  # High School -> 0, Bachelor's -> 1, Master's -> 2
Step-by-Step Explanation:
- Import the library: You need OrdinalEncoder from sklearn.preprocessing.
- Define your data: List out the categories in your column.
- Initialize the encoder: Create an instance of OrdinalEncoder, passing the categories in their meaningful order.
- Fit and transform: Apply the encoder to your data, converting categories into numbers.
Output:
This code gives you a numerical representation of the education levels: because we passed the categories in order, "High School" is encoded as 0, "Bachelor's" as 1, and "Master's" as 2. (If you omit the categories argument, scikit-learn orders categories alphabetically, which rarely matches their real-world ranking.)
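In practice your categories usually live in a DataFrame column rather than a bare list. Here's a sketch of the same idea applied to a hypothetical pandas DataFrame (the column name and data are invented for illustration):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical dataset with an education column
df = pd.DataFrame({"education": ["Bachelor's", "High School", "Master's", "Bachelor's"]})

# Spell out the order so the numbers reflect the real-world ranking
encoder = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's"]])
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()
print(df)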
2. One Hot Encoding
Use Case:
One Hot Encoding is your best friend when dealing with nominal data — categories without any order.
Example:
Consider a dataset with a “Color” column containing values like “Red,” “Green,” and “Blue.” Since there’s no inherent order, you’d use One Hot Encoding:
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a regular (dense) NumPy array instead of a sparse matrix.
# Note: in scikit-learn versions before 1.2, this parameter was called sparse.
colors = [["Red"], ["Green"], ["Blue"]]
encoder = OneHotEncoder(sparse_output=False)
encoded_colors = encoder.fit_transform(colors)
print(encoded_colors)
Step-by-Step Explanation:
- Import the library: Use OneHotEncoder from sklearn.preprocessing.
- Define your data: List out the categories in your column.
- Initialize the encoder: Create an instance of OneHotEncoder and set sparse_output=False to get a dense array output.
- Fit and transform: Apply the encoder, which will create a binary column for each category.
Output:
The output will be a matrix where each row corresponds to a color and each column is a binary indicator (0 or 1) for one of the categories. Note that OneHotEncoder orders its columns alphabetically, so here the columns represent "Blue," "Green," and "Red," in that order.
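As a side note, if your data already lives in a pandas DataFrame, pd.get_dummies offers a quick alternative route to the same one-hot result, shown here on the same made-up colors:

import pandas as pd

# One-hot encoding with pandas instead of scikit-learn
df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})
dummies = pd.get_dummies(df["Color"], prefix="Color")
print(dummies)  # columns: Color_Blue, Color_Green, Color_Red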
Why sparse_output=False?
Alright, let's pause for a second. You might be wondering, "What's up with this sparse_output=False parameter?" It's like a tiny switch in your code, but it can make a big difference depending on your situation. (On scikit-learn versions older than 1.2, the same switch goes by the name sparse.)
By default, One Hot Encoding produces something called a sparse matrix — a matrix where most of the elements are zeros.
Now, this is super efficient in terms of memory if you're dealing with large datasets, especially when there are tons of categories. But here's the catch: if your dataset is small or you're just playing around with some code, dealing with sparse matrices can be a bit like reading fine print. It's there, but it's hard to work with directly.
When you set sparse_output=False, you're telling scikit-learn, "Give me the full picture." Instead of a compact matrix filled mostly with zeros, you get a dense array where all those zeros are visible and accounted for. This makes it easier to see and work with your data, especially if you're more concerned with readability and simplicity than with saving a bit of memory.
In short, if you want to directly see your encoded data without worrying about the technical nuances of sparse matrices, flipping that sparse_output=False switch is the way to go!
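To see the difference for yourself, here's a small sketch (assuming scikit-learn 1.2 or newer, where sparse output is the default):

from sklearn.preprocessing import OneHotEncoder

colors = [["Red"], ["Green"], ["Blue"]]

# Default behavior: the result is a SciPy sparse matrix
encoder = OneHotEncoder()
sparse_result = encoder.fit_transform(colors)
print(type(sparse_result))      # a SciPy CSR sparse matrix
print(sparse_result.toarray())  # .toarray() converts it to a dense array for inspection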
3. Label Encoding
Use Case:
Label Encoding is meant for the target variable in your dataset (the labels your model predicts), whether those labels are ordinal or nominal; for input features, prefer the two techniques above.
Example:
Suppose you have a target variable like “Yes” and “No” in a binary classification task:
from sklearn.preprocessing import LabelEncoder

# Target labels for a binary classification task
labels = ["Yes", "No", "Yes", "No"]
encoder = LabelEncoder()
# LabelEncoder assigns integers to classes in alphabetical order: No -> 0, Yes -> 1
encoded_labels = encoder.fit_transform(labels)
print(encoded_labels)  # [1 0 1 0]
Step-by-Step Explanation:
- Import the library: Use LabelEncoder from sklearn.preprocessing.
- Define your data: List out the labels in your target variable.
- Initialize the encoder: Create an instance of LabelEncoder.
- Fit and transform: Apply the encoder to your labels.
Output:
This code converts the labels into integers: LabelEncoder sorts classes alphabetically, so "No" becomes 0 and "Yes" becomes 1, making the target ready for model training.
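A handy follow-up: LabelEncoder remembers the mapping it learned, so you can inspect it via classes_ and reverse it with inverse_transform, which is useful for turning model predictions back into readable labels:

from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "No", "Yes", "No"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                      # ['No' 'Yes']: index = encoded value
print(encoder.inverse_transform([1, 0, 1]))  # ['Yes' 'No' 'Yes']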
Conclusion
In this guide, we’ve walked through the essential steps to encode categorical data, turning those strings and labels into numbers that machine learning models can understand. Whether you’re working with ordinal or nominal data, there’s an encoding technique tailored to your needs. Ordinal Encoding, One Hot Encoding, and Label Encoding each serve a distinct purpose, ensuring your models are fed the right kind of data.
Remember, the choice of encoding technique can significantly impact the performance of your machine-learning model, so choose wisely based on the nature of your data. Now that you’ve got the basics down, you’re ready to start encoding like a pro!