One Stop For Logistic Regression

Priyansh Soni
Published in Towards AI · Feb 15, 2022


Logistic Regression? Why is it called Regression? Is it linear? Why is it so popular? And what are log odds?

Well, all these questions cross the mind of every person who starts with Logistic Regression. To make things simpler, I prepared this article to act as a one-stop destination for all the Logistic Regression you could need.

OUTLINE :

  • What is Logistic Regression?
  • Why not use Linear Regression for Classification
  • Sigmoid curves and Logistic Regression
  • Logit and Probit function
  • Why call it Regression when it is used for Classification
  • Why is it a Linear model?
  • Why the Logit Function in particular?
  • The Loss function for Logistic Regression
  • The Cost function for Logistic Regression

1. What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for classification problems. It was developed to predict the probability of an event occurring. Unlike many classification algorithms, Logistic Regression actually outputs this probability rather than only a discrete outcome of whether the event will happen (1) or not (0). It is a transformed version of the Linear Regression equation, hence the name Logistic Regression.
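To make the "probability first, label second" idea concrete, here is a minimal sketch using scikit-learn; the study-hours numbers are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# predict_proba returns [P(fail), P(pass)] for each input
print(model.predict_proba([[4.5]]))
# predict thresholds that probability (at 0.5 by default) into a hard 0/1 label
print(model.predict([[4.5]]))
```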

2. Why not just use Linear Regression for classification as well?

Assuming you are familiar with Linear Regression and with what a regression best-fit line is, let us consider an example of a classification problem where we need to predict a binary outcome.

Q: We need to predict whether a student will pass or fail based on the hours he/she studies.
Considering Study Hours on the x-axis and Probability of Passing on the y-axis, the outcomes are either 0 (failed) or 1 (passed), so the data points lie either along y = 0 or along y = 1. Now, if we fit a regression model with the line equation y = ax + b to such a problem, the best-fit line would look something like this:

Linear Regression fit on classification problem

Now, let's set a threshold for passing the exam, say 5 hours. Points above the threshold will be predicted as 1 (passed), and the ones below the threshold will be predicted as 0 (failed). Such a model predicts fine in many cases: as we can see, the points below 5 hours are predicted as 0, so a person who studied 4 hours will be predicted as failed (0), and one who studied 8 hours will be predicted as passed (1).

But let's consider what a change in study hours does to the prediction. An increase from 4.5 to 5.5 study hours flips the prediction from 0 (failed) to 1 (passed), whereas an increase from 7 to 10 hours changes nothing: the prediction stays 1 (passed).

  • Intuitively, then, a single extra hour near the threshold (4.5 to 5.5) matters far more for the chances of passing than three extra hours far from it (7 to 10). A straight line cannot express this: its slope is constant, so every one-unit change in study hours produces exactly the same change in the predicted value, wherever on the scale it happens.

Hence the marginal rise is constant when a straight-line fit is used for a classification problem

  • Another problem with a Linear Regression fit is that the line keeps going beyond the data: it can predict values greater than 1 and less than 0. A prediction greater than 1 or less than 0 makes no sense in terms of probability, since probabilities can only lie between 0 and 1. This means the straight line is force-fitting the data.

Hence a Linear Regression model is a force fit to the classification problem

So we now know that a straight line won't fit the problem because:

  1. The marginal rise is constant when a straight-line fit is used for a classification problem.
  2. A linear regression model predicts values below 0 and above 1 and hence is a force fit to the classification problem (both points show up in the short numeric sketch below).
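Here is that sketch: an ordinary least-squares line fitted to made-up 0/1 study-hours data (all numbers are invented for illustration):

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares straight line y = a*x + b fitted to 0/1 labels
a, b = np.polyfit(hours, passed, deg=1)
line = lambda x: a * x + b

print(line(0.5), line(10))     # below 0 and above 1 -- not valid probabilities
print(line(5.5) - line(4.5))   # effect of one extra hour near the threshold...
print(line(10) - line(9))      # ...is exactly the same far from it (constant slope a)
```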

Thus, we need a curve that fits the data well and overcomes all the above-mentioned limitations of the linear model.
This is where Logistic regression comes into play.

The best model we can think of for the above situation would supposedly be an ‘S-curve’.

3. What is an S-curve now?

Well, the answer is simple: an S-curve is a curve shaped like the letter 'S'. It looks like this:

Sigmoid curve

These curves are called Sigmoid Curves

If we fit such an S-curve into our classification problem, it will look something like this :

Sigmoid function on a classification problem

And this curve solves most of our problems.

  • The curve fits the data almost perfectly. It starts out from 0 and flattens out at 1.
  • The marginal change is no longer constant: since the curve is non-linear, its slope is steep near the middle and nearly flat at the extremes, giving a smooth interpolation between the two classes.
  • The curve ranges from 0 to 1 (as can be seen from the image). Hence, the predictions can be well turned into probabilities (without any negative values) and thus can be used for classification.

So, instead of fitting a regression line with the equation y = ax + b, we will fit some function of this equation that gives out a sigmoid curve: f(ax + b)

There are many functions that give out a sigmoid curve, but the most popular ones are the Logit and Probit functions.

Logit function: y = 1 / (1 + e^-(ax + b))

Probit function: y = Φ(ax + b)

Here Φ is the cumulative distribution function (CDF) of the standard normal distribution, so the probit curve is simply the normal CDF applied to the linear term.

When we fit the Logit function that best describes our data and use it to make predictions, we call it Logistic Regression
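As a rough sketch of the two link functions (using scipy's standard normal CDF for the probit; the coefficients a and b here are arbitrary illustrative values):

```python
import numpy as np
from scipy.stats import norm

def logistic(x, a=1.0, b=0.0):
    """Logistic (sigmoid) curve: 1 / (1 + e^-(ax + b))."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))

def probit(x, a=1.0, b=0.0):
    """Probit curve: standard normal CDF applied to ax + b."""
    return norm.cdf(a * x + b)

x = np.linspace(-6, 6, 7)
print(logistic(x))  # S-shaped values squeezed between 0 and 1
print(probit(x))    # also S-shaped, a bit steeper around the middle
```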

4. But why are we talking about the Logit function so much?

Well, the answer is simple: it is built directly on the linear model equation, produces a solid sigmoid curve, and is easy to differentiate. Because it is easy to differentiate, gradient descent is easier to compute and the global minimum of the cost function is easier to find.
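The convenience comes from the fact that the sigmoid's derivative can be written in terms of the sigmoid itself, σ'(z) = σ(z)·(1 - σ(z)). A quick numeric check of that identity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference slope
analytical = sigmoid(z) * (1 - sigmoid(z))                     # closed-form derivative

print(numerical, analytical)  # the two values agree to several decimal places
```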

5. Why is it called Regression?

The logit function produces a sigmoid curve (S-curve) by wrapping the linear regression equation, y = ax + b, so the coefficients it estimates are exactly the coefficients of that line.

This means that under the hood, Logistic Regression performs the same task as a Linear Regression model

Most people familiar with Linear Regression know that the aim of the algorithm is to estimate the model coefficients, i.e. to calculate the values of a₁, a₂, …, aₙ and b for the function Y = b + a₁X₁ + a₂X₂ + a₃X₃ + … + aₙXₙ, fit the training data with minimal error (MSE, RMSE, etc.), and predict the output Y.

Well, Logistic Regression does the same thing under the hood, but with a slight addition. It computes the linear output 'y' from the model coefficients and then runs it through a function (Logit, Probit, etc.) to produce a sigmoid curve, which results in a predicted probability of the event.

Since the Logistic Regression model evaluates the coefficients of a linear regression equation under the hood, which are then passed through a function, it is termed a Regression rather than a Classification algorithm.
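This "linear equation under the hood" view is easy to verify with scikit-learn: the fitted coef_ and intercept_ are the linear coefficients, and passing them through the sigmoid by hand reproduces predict_proba. (The study-hours data below is again made up for illustration.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
a, b = model.coef_[0][0], model.intercept_[0]   # linear coefficients learned under the hood

z = a * hours.ravel() + b                       # the linear part: ax + b
manual_probs = 1.0 / (1.0 + np.exp(-z))         # run it through the sigmoid

print(np.allclose(manual_probs, model.predict_proba(hours)[:, 1]))  # True
```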

6. Is it a Linear or a Non-Linear model?

This is a very frequently asked question and often causes a lot of confusion.

The sigmoid curve is non-linear, so one might expect the model that produces it to be non-linear as well.
Regardless of that intuition, Logistic Regression is considered a Linear model.
This can be shown with a little rearrangement of the logit function.

Starting from y = 1 / (1 + e^-(ax + b)), rearranging gives y / (1 - y) = e^(ax + b). Taking the log on both sides, we get:

log( y / (1 - y) ) = ax + b

The ratio y / (1 - y) is called the odds: the probability of the event happening (y) divided by the probability of the event not happening (1 - y). Now looking at the right side of the equation, it forms a linear relation (ax + b).

Since the logarithm of the odds is a linear function of x, Logistic Regression is termed a linear model.

However, the curve of Logistic Regression is non-linear, so the function governing the prediction (y) is non-linear, but the model itself is linear.
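The same idea can be checked numerically: converting a fitted model's predicted probabilities to log-odds gives values that rise by a constant amount per extra hour, i.e. a straight line in x. (A small sketch on the same made-up study-hours data.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LogisticRegression().fit(hours, passed)

p = model.predict_proba(hours)[:, 1]   # predicted probabilities of passing
log_odds = np.log(p / (1 - p))         # logit of each prediction

# Equal steps in hours give equal steps in log-odds: linear in x
print(np.diff(log_odds))
```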

7. Loss Function for Logistic Regression

A loss function, as we know, is a function that accounts for the error in our prediction. When computed for one data point it is termed the Loss function, and when computed for the entire dataset it is termed the Cost function.

So, practically, the loss function for Logistic Regression should give a large error when we predict a point far from its actual label, and a small error when we predict it close to its actual label.
By this I mean: if the actual label (y) is 1 and the predicted label (ŷ) is 0.14, the error should be huge, and if the predicted label is 0.98, the error should be very small.
In other words, for y = 1, the error should be small if ŷ is close to 1 and large if ŷ is close to 0.

The above statement can be formulated as:

For y = 1: the loss should be close to 0 when ŷ is close to 1 and very large when ŷ is close to 0
For y = 0: the loss should be close to 0 when ŷ is close to 0 and very large when ŷ is close to 1

For this behaviour to make sense mathematically, we define the Log-Loss.

For the actual label (y) = 1, the loss will be -log(ŷ), and for y = 0, the loss will be -log(1 - ŷ).

Example: for y = 1, a prediction of ŷ = 1 gives a loss of -log(1) = 0, while a prediction close to ŷ = 0 gives a loss of -log(ŷ), which grows towards infinity. The case y = 0 mirrors this with -log(1 - ŷ).

The two cases can be combined into the single formula below:

Loss(y, ŷ) = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

The above equation is the Loss function for Logistic Regression and is called LogLoss
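A short sketch of LogLoss implemented directly from this formula, evaluated at the example values used above:

```python
import numpy as np

def log_loss_point(y, y_hat):
    """LogLoss for a single data point: -[y*log(yhat) + (1-y)*log(1-yhat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Actual label y = 1: a confident correct prediction is punished lightly,
# a confident wrong one is punished heavily
print(log_loss_point(1, 0.98))   # ~0.02
print(log_loss_point(1, 0.14))   # ~1.97

# Actual label y = 0: the behaviour mirrors the above
print(log_loss_point(0, 0.14))   # ~0.15
print(log_loss_point(0, 0.98))   # ~3.91
```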

8. Cost Function for Logistic Regression

The loss function, when averaged over the entire dataset, is called the Cost Function:

Cost = (1/m) Σᵢ Loss(yᵢ, ŷᵢ)

where m is the number of data samples. Hence, the Cost Function for Logistic Regression can be written as:

J = -(1/m) Σᵢ [ yᵢ·log(ŷᵢ) + (1 - yᵢ)·log(1 - ŷᵢ) ]
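As a sanity check, the averaged LogLoss can be computed directly from this formula and compared against scikit-learn's log_loss (the labels and predicted probabilities below are arbitrary illustrative values):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # predicted probabilities of the positive class

m = len(y_true)
cost = -(1.0 / m) * np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(cost)                      # ~0.341
print(log_loss(y_true, y_prob))  # scikit-learn computes the same average LogLoss
```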

Well, this was all about Logistic Regression. The algorithm has become very popular for Machine Learning classification problems such as medical diagnosis, churn prediction, and credit risk management because it is easy to implement and to interpret.
I hope this article helps.
