Encoding Categorical Data: A Step-by-Step Guide
Imagine you’re baking a cake, but instead of sugar, flour, and eggs, you have words like “vanilla,” “chocolate,” and “strawberry” on your countertop. As much as you’d like to start, there’s a problem — your recipe can only follow numeric measurements, not words. This is exactly what happens when you try to feed categorical data into a machine-learning model. The model needs numbers to work its magic, not strings of text.
In this hands-on tutorial, we’ll unravel the mystery of encoding categorical data so your models can process it with ease. We’ll break down the types of categorical data, discuss when and why each encoding method is used, and dive into Python code examples that show exactly how to get the job done.
Understanding Categorical Data
Before we start transforming data, let’s get our definitions straight. In the world of data, you generally have two types: numerical and categorical. Machine learning models can easily understand numbers — no surprise there! But when it comes to words or labels, we need to convert these into numbers to help our models “understand” the data.
Types of Categorical Data
- Ordinal Data: Ordinal data is like your favorite Netflix ranking list — it's ordered, but the intervals between the ranks aren't necessarily equal. For instance, if you have a dataset of student grades (Poor, Average, Good), you can see that "Good" is better than "Average," and "Average" is better than "Poor." This inherent order is what makes it "ordinal."
- Nominal Data: On the other hand, nominal data is like choosing your favorite ice cream flavor — there's no logical order to the choices. Whether it's "Vanilla," "Chocolate," or "Strawberry," one isn't inherently better or worse than the others. Here, the categories are simply different, without any ranking or comparison.
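To make the distinction concrete, here's a quick sketch of a table holding both kinds of data. The dataset is made up for illustration, and it assumes you have pandas installed:

import pandas as pd

# Hypothetical toy dataset: one ordinal column, one nominal column
df = pd.DataFrame({
    "grade": ["Poor", "Good", "Average", "Good"],                 # ordinal: ordered categories
    "flavor": ["Vanilla", "Chocolate", "Strawberry", "Vanilla"],  # nominal: no order
})
print(df)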
Why Encoding is Necessary
Machine learning models can’t work directly with categorical data — especially when that data comes in the form of words or labels.
The models require numeric input, so we must convert those categories into numbers. This process is known as encoding categorical data.
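At its simplest, encoding just means building a mapping from categories to numbers. Here's a minimal hand-rolled sketch (the mapping itself is invented) before we reach for any library:

# Hypothetical mapping: each grade gets a number that respects its order
grade_map = {"Poor": 0, "Average": 1, "Good": 2}
grades = ["Good", "Poor", "Average"]
encoded = [grade_map[g] for g in grades]
print(encoded)  # [2, 0, 1]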
Types of Encoding Techniques
To handle different types of categorical data, there are specific encoding techniques you can use:
- Ordinal Encoding
- One Hot Encoding
- Label Encoding
Let’s break down each of these with Python code examples.
1. Ordinal Encoding
Use Case:
Ordinal Encoding is the go-to technique for transforming ordinal data — categories with a meaningful order but no fixed interval between them.
Example:
Let’s say you have a column in your dataset representing education levels: “High School,” “Bachelor’s,” and “Master’s.” We know that “Master’s” is higher than “Bachelor’s,” which is higher than “High School.” Here’s how you can encode it:
from sklearn.preprocessing import OrdinalEncoder

# Define the data and spell out the intended order of the categories.
# Without an explicit order, OrdinalEncoder sorts categories alphabetically,
# which would scramble the meaning here.
education_levels = [["High School"], ["Bachelor's"], ["Master's"]]
encoder = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's"]])
encoded_levels = encoder.fit_transform(education_levels)
print(encoded_levels)  # High School -> 0, Bachelor's -> 1, Master's -> 2
Step-by-Step Explanation:
- Import the library: You need OrdinalEncoder from sklearn.preprocessing.
- Define your data: List out the categories in your column.
- Initialize the encoder: Create an instance of OrdinalEncoder, passing the categories in their meaningful order.
- Fit and transform: Apply the encoder to your data, converting categories into numbers.
Output:
This code gives you a numerical representation of the education levels: because we passed the categories in order, "High School" is encoded as 0, "Bachelor's" as 1, and "Master's" as 2. (If you omit the categories argument, scikit-learn orders categories alphabetically, which rarely matches their real-world ranking.)
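In practice your categories usually live in a DataFrame column rather than a bare list. Here's a sketch of the same idea applied to a hypothetical pandas DataFrame (the column name and data are invented for illustration):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical dataset with an education column
df = pd.DataFrame({"education": ["Bachelor's", "High School", "Master's", "Bachelor's"]})

# Spell out the order so the numbers reflect the real-world ranking
encoder = OrdinalEncoder(categories=[["High School", "Bachelor's", "Master's"]])
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()
print(df)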
2. One Hot Encoding
Use Case:
One Hot Encoding is your best friend when dealing with nominal data — categories without any order.
Example:
Consider a dataset with a “Color” column containing values like “Red,” “Green,” and “Blue.” Since there’s no inherent order, you’d use One Hot Encoding:
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a regular (dense) NumPy array instead of a sparse matrix.
# Note: in scikit-learn versions before 1.2, this parameter was called sparse.
colors = [["Red"], ["Green"], ["Blue"]]
encoder = OneHotEncoder(sparse_output=False)
encoded_colors = encoder.fit_transform(colors)
print(encoded_colors)
Step-by-Step Explanation:
- Import the library: Use OneHotEncoder from sklearn.preprocessing.
- Define your data: List out the categories in your column.
- Initialize the encoder: Create an instance of OneHotEncoder and set sparse_output=False to get a dense array output.
- Fit and transform: Apply the encoder, which will create a binary column for each category.
Output:
The output will be a matrix where each row corresponds to a color and each column is a binary indicator (0 or 1) for one of the categories. Note that OneHotEncoder orders its columns alphabetically, so here the columns represent "Blue," "Green," and "Red," in that order.
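As a side note, if your data already lives in a pandas DataFrame, pd.get_dummies offers a quick alternative route to the same one-hot result, shown here on the same made-up colors:

import pandas as pd

# One-hot encoding with pandas instead of scikit-learn
df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})
dummies = pd.get_dummies(df["Color"], prefix="Color")
print(dummies)  # columns: Color_Blue, Color_Green, Color_Red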
Why sparse_output=False?
Alright, let's pause for a second. You might be wondering, "What's up with this sparse_output=False parameter?" It's like a tiny switch in your code, but it can make a big difference depending on your situation. (On scikit-learn versions older than 1.2, the same switch goes by the name sparse.)
By default, One Hot Encoding produces something called a sparse matrix — a matrix where most of the elements are zeros.
Now, this is super efficient in terms of memory if you're dealing with large datasets, especially when there are tons of categories. But here's the catch: if your dataset is small or you're just playing around with some code, dealing with sparse matrices can be a bit like reading fine print. It's there, but it's hard to work with directly.
When you set sparse_output=False, you're telling scikit-learn, "Give me the full picture." Instead of a compact matrix filled mostly with zeros, you get a dense array where all those zeros are visible and accounted for. This makes it easier to see and work with your data, especially if you're more concerned with readability and simplicity than with saving a bit of memory.
In short, if you want to directly see your encoded data without worrying about the technical nuances of sparse matrices, flipping that sparse_output=False switch is the way to go!
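To see the difference for yourself, here's a small sketch (assuming scikit-learn 1.2 or newer, where sparse output is the default):

from sklearn.preprocessing import OneHotEncoder

colors = [["Red"], ["Green"], ["Blue"]]

# Default behavior: the result is a SciPy sparse matrix
encoder = OneHotEncoder()
sparse_result = encoder.fit_transform(colors)
print(type(sparse_result))      # a SciPy CSR sparse matrix
print(sparse_result.toarray())  # .toarray() converts it to a dense array for inspection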
3. Label Encoding
Use Case:
Label Encoding is meant for the target variable in your dataset (the labels your model predicts), whether those labels are ordinal or nominal; for input features, prefer the two techniques above.
Example:
Suppose you have a target variable like “Yes” and “No” in a binary classification task:
from sklearn.preprocessing import LabelEncoder

# Target labels for a binary classification task
labels = ["Yes", "No", "Yes", "No"]
encoder = LabelEncoder()
# LabelEncoder assigns integers to classes in alphabetical order: No -> 0, Yes -> 1
encoded_labels = encoder.fit_transform(labels)
print(encoded_labels)  # [1 0 1 0]
Step-by-Step Explanation:
- Import the library: Use LabelEncoder from sklearn.preprocessing.
- Define your data: List out the labels in your target variable.
- Initialize the encoder: Create an instance of LabelEncoder.
- Fit and transform: Apply the encoder to your labels.
Output:
This code converts the labels into integers: LabelEncoder sorts classes alphabetically, so "No" becomes 0 and "Yes" becomes 1, making the target ready for model training.
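A handy follow-up: LabelEncoder remembers the mapping it learned, so you can inspect it via classes_ and reverse it with inverse_transform, which is useful for turning model predictions back into readable labels:

from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "No", "Yes", "No"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)                      # ['No' 'Yes']: index = encoded value
print(encoder.inverse_transform([1, 0, 1]))  # ['Yes' 'No' 'Yes']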
Conclusion
In this guide, we’ve walked through the essential steps to encode categorical data, turning those strings and labels into numbers that machine learning models can understand. Whether you’re working with ordinal or nominal data, there’s an encoding technique tailored to your needs. Ordinal Encoding, One Hot Encoding, and Label Encoding each serve a distinct purpose, ensuring your models are fed the right kind of data.
Remember, the choice of encoding technique can significantly impact the performance of your machine-learning model, so choose wisely based on the nature of your data. Now that you’ve got the basics down, you’re ready to start encoding like a pro!