Beginner's Guide to Machine Learning: Data Preprocessing Using Python

Anushkad · Towards AI · Apr 25, 2020

The journey of learning machine learning is long, but very exciting!

If you are a beginner, this blog is just what you need to get a head start! Data preprocessing is the first tool you learn as a machine learning practitioner. Let’s get started!

Table of contents:

  1. Introduction
  2. Importing Libraries
  3. Importing Datasets
  4. Handling Missing Data
  5. Encoding Categorical Data
  6. Feature Scaling
  7. Splitting the dataset into train data and test data

Data preprocessing is an essential part of this journey. As you move ahead, you will realize the critical role it plays in building, training, and testing your models accurately.

It is necessary to preprocess your data in the right way so that the machine learning model you build can be trained properly on that data! This may seem like an unremarkable step at the beginning. However, once you learn to do it efficiently, you can rapidly get a hold on the various branches of Machine Learning (ML).

Step 1] Importing Libraries

What are libraries?

Libraries in Python are collections of tools, functions, and modules that make the desired task easier.

Pandas library: Allows us to read and work with datasets as data frames.

NumPy library: Deals with mathematical operations and multi-dimensional arrays.

Matplotlib Library: Helps us in visualizing data in the form of bar charts, line charts, pie charts, etc.

To call the above library functions with ease, we import the libraries under short aliases such as “np”, “pd”, and “plt”.

These libraries can be imported by following the steps, as shown in the code snippet below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2] Importing Datasets

What are Datasets?

A Dataset is a set containing all our data that helps us to design and train our machine learning model.

Dataset from Udemy: Data.csv

The above dataset has 3 independent variables, or features (Country, Age, and Salary), and 1 dependent variable, or target (Purchased Product), which takes a binary value of Yes or No depending on the independent variables.
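To give you an idea of its format, here are a few illustrative rows (the values are placeholders, not the exact contents of the Udemy file); notice the blank cells representing missing Age and Salary values:

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,38,61000,No
France,,52000,Yes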

To import the dataset, we use the pandas library.

The first thing to do is to create a variable that will hold the dataset as a data frame.

Then declare two variables to contain features and targets, respectively. The code snippet below shows how to import a dataset.

dataset = pd.read_csv("Path of your csv file location")
# read_csv is a pandas function that retrieves the data as a data frame.
X = dataset.iloc[:, :-1].values  # X stores the features
# iloc is a pandas indexer that selects rows and columns by their index location.
y = dataset.iloc[:, -1].values   # y stores the target
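With rows like the illustrative ones above, X holds the Country, Age, and Salary columns, and y holds the Purchased column. A quick check might look like this (the exact values depend on your own Data.csv):

print(X[:2])
# e.g. [['France' 44.0 72000.0]
#       ['Spain' 27.0 48000.0]]
print(y[:2])
# e.g. ['No' 'Yes']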

Step 3] Handling Missing Data

As you can observe, the above dataset has some missing data in the Age and Salary columns, which may cause errors while training the model if not handled. Thus, it is necessary to take care of missing values in a dataset.

In the case of large datasets, the entries with missing data can simply be ignored or removed. However, if your dataset is small or restricted, the missing data needs to be taken care of.
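For the first option, removing the incomplete rows, a minimal sketch using pandas could look like this (dropna simply drops every row that contains at least one missing value; whether that is acceptable depends on how much data you can afford to lose):

# Drop every row that contains a missing value,
# then rebuild X and y from the cleaned data frame.
dataset_clean = dataset.dropna()
X = dataset_clean.iloc[:, :-1].values
y = dataset_clean.iloc[:, -1].values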

Thus, the method we will use to handle missing data is replacing each missing entry with the average of all the entries in that column. To do this, we import a famous ML library called scikit-learn, which contains data preprocessing tools for handling missing values. We will use the class SimpleImputer.

Code snippet:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])  # fit only on the columns that contain real numbers
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

Step 4] Encoding Categorical Data

Almost all datasets contain a categorical data column. This column of strings needs to be processed and converted into real numbers. However, simply assigning increasing integers to the categories may give a false impression of an order or relationship between them, causing errors in the model.

Thus, we will use one-hot encoding to convert the categorical feature into numerical data. For the target variable, however, we will simply convert Yes and No into 1 and 0, respectively; since it has only two values, this does not harm the accuracy of the model.

# Encoding the independent variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

# Encoding the dependent variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

As shown in the above code snippet, to encode the independent categorical data, OneHotEncoder is used, and remainder is set to ‘passthrough’ so that the remaining feature columns are not dropped. The result is converted into a NumPy array.
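To see what this does, assume the Country column contains France, Germany, and Spain (as in the illustrative rows above). By default, OneHotEncoder orders the categories alphabetically, so a row is transformed roughly as follows:

# Before encoding:         ['France', 44.0, 72000.0]
# After one-hot encoding:  [1.0, 0.0, 0.0, 44.0, 72000.0]
# (the first three columns correspond to France, Germany, and Spain)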

For encoding the Dependent variable, LabelEncoder from the scikit-learn module is used.

Step 5] Feature Scaling

Feature scaling helps put the values of the independent variables, or features, on the same scale. It is not necessary to apply feature scaling for all models; some models compensate for features with large values on their own, for example by learning correspondingly small coefficients.

Two techniques used are Standardization and Normalization.

I will be using Standardization. (Using either technique does not cause a significant change in the output.)

# Formula for Standardization:
# x_stand = (x - mean(x)) / standard_deviation(x)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
print(X)

Feature scaling returns the features scaled to the same range.
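For completeness, the Normalization technique mentioned above could be applied with scikit-learn’s MinMaxScaler, which rescales each feature to the [0, 1] range. A minimal sketch, as an alternative to the StandardScaler code above:

# Formula for Normalization (min-max scaling):
# x_norm = (x - min(x)) / (max(x) - min(x))
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
X = mm.fit_transform(X)
print(X)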

Step 6] Splitting dataset into train and test set

This is an important step. We will always need to split the data into a train set and a test set to train and test our model, respectively. Usually, 80% of the data is used for training, and 20% is held out to test the model on unseen data.

To avoid overfitting, the model should not see the test data during training. However, if the training set is too small, the model will perform poorly on the test set. Hence, the split should be chosen appropriately.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
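A quick sanity check on the split (the exact counts depend on how many rows your Data.csv contains):

# With test_size=0.2, roughly 80% of the rows end up in the training set.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)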

This sums up all the tools required to preprocess the data. For almost all models, we will require importing libraries, importing datasets, and splitting the data into train and test sets. We may only sometimes require the other tools mentioned.

