Proving the Convexity of Log-Loss for Logistic Regression
Unpacking Log Loss Error Function’s Impact on Logistic Regression
Author(s): Pratik Shukla
“Courage is like a muscle. We strengthen it by use.” — Ruth Gordon
Introduction
In this tutorial, we will see why the log-loss function is well suited to logistic regression. Our goal is to prove that the log-loss function is a convex function for logistic regression. Convexity matters because every local minimum of a convex function is also a global minimum, so a gradient-based optimizer cannot get stuck in a poor local minimum. Once we prove that the log-loss function is convex for logistic regression, we can establish that it’s a sound choice for the loss function.
Logistic regression is a widely used statistical technique for modeling binary classification problems. In this method, the log-odds of the outcome variable are modeled as a linear combination of the predictor variables. To estimate the parameters of the model, the maximum likelihood method is used, which involves maximizing the log-likelihood function. Equivalently, we can minimize the negative log-likelihood, which is the negative sum of the log-likelihoods of each observation. This function is known as the log-loss function or binary cross-entropy loss. In this blog post, we will explore the convexity of the log-loss function and why it is an essential property for the optimization algorithms used in logistic regression. We will also provide a proof of the convexity of the log-loss function.
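To make the link between likelihood and loss concrete, here is a minimal Python sketch. The function names and the toy labels and probabilities are made up for illustration and are not from the original post. It computes the Bernoulli log-likelihood of a few predictions and shows that the log-loss is its negative. The loss is averaged over the observations here, which only rescales the sum.

```python
import math

def bernoulli_log_likelihood(y_true, y_prob):
    """Sum of the log-likelihoods of each observation under a Bernoulli model."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, y_prob)
    )

def log_loss(y_true, y_prob):
    """Binary cross-entropy: the negative log-likelihood, averaged over observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.8]

# The two printed numbers should agree up to floating-point rounding.
print(log_loss(y_true, y_prob))
print(-bernoulli_log_likelihood(y_true, y_prob) / len(y_true))
```

Because the two quantities differ only by sign and a constant factor, maximizing the likelihood and minimizing the log-loss pick out the same parameters.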
Proof of convexity of the log-loss function for logistic regression:
Let’s mathematically prove that the log-loss function for logistic regression is convex.
We saw in the previous tutorial that a twice-differentiable function of a single variable is convex if its second derivative is ≥0 everywhere. So, here we’ll take the log-loss function and find its second derivative to see whether it’s ≥0 or not. If it’s ≥0 everywhere, then we can say that it is a convex function.
Here we are going to consider the case of a single trial to simplify the calculations.
Step — 1:
The following is a mathematical definition of the binary cross-entropy loss function (for a single trial).
Step — 2:
The following is the predicted value (ŷ) for logistic regression.
Step — 3:
In the following image, z represents the linear transformation.
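For reference, the three definitions above can be written out as follows. This is a sketch that assumes a single feature x and a single weight w with no bias term, so that z = wx. That choice is an assumption, made because it matches the e^(-wx) term used later in the proof.

```latex
f(x) = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right],
\qquad
\hat{y} = \frac{1}{1 + e^{-z}},
\qquad
z = wx
```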
Step — 4:
After that, we are modifying Step — 1 to reflect the values of Step — 3 and Step — 2.
Step — 5:
Next, we are simplifying the terms in Step — 4.
Step — 6:
Next, we are further simplifying the terms in Step — 5.
Step — 7:
The following is the quotient rule for logarithms.
Step — 8:
Next, we are using the equation from Step — 7 to further simplify Step — 6.
Step — 9:
In Step — 8, the value of log(1) is going to be 0.
Step — 10:
Next, we are rewriting Step — 8 with the remaining terms.
Step — 11:
The following is the power rule for logarithms.
Step — 12:
Next, we will use the power rule of logarithms to simplify the equation in Step — 10.
Step — 13:
Next, we are replacing the values in Step — 10 with the values in Step — 12.
Step — 14:
Next, we are substituting the value of Step — 13 into Step — 10.
Step — 15:
Next, we are multiplying Step — 14 by (-1) on both sides.
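The simplification steps above can be summarized in one chain of equations. This is a reconstruction under the assumption that z = wx (one feature, one weight, no bias), and the intermediate terms may be grouped differently from the original step-by-step figures. It uses the quotient rule log(a/b) = log(a) - log(b), the fact that log(1) = 0, and the power rule log(e^a) = a.

```latex
\begin{aligned}
\log(\hat{y}) &= \log\!\left(\frac{1}{1 + e^{-wx}}\right) = \log(1) - \log\!\left(1 + e^{-wx}\right) = -\log\!\left(1 + e^{-wx}\right) \\
\log(1 - \hat{y}) &= \log\!\left(\frac{e^{-wx}}{1 + e^{-wx}}\right) = -wx - \log\!\left(1 + e^{-wx}\right) \\
-f(x) &= -y \log\!\left(1 + e^{-wx}\right) + (1 - y)\left[-wx - \log\!\left(1 + e^{-wx}\right)\right] \\
      &= -(1 - y)\,wx - \log\!\left(1 + e^{-wx}\right) \\
f(x) &= (1 - y)\,wx + \log\!\left(1 + e^{-wx}\right)
\end{aligned}
```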
Finding the First Derivative:
Step — 16:
Next, we are going to find the first derivative of f(x).
Step — 17:
Here we are distributing the partial differentiation sign to each term.
Step — 18:
Here we are applying the derivative rules.
Step — 19:
Here we are finding the partial derivative of the last term of Step — 18.
Step — 20:
Here we are finding the partial derivative of the first term of Step — 18.
Step — 21:
Here we are putting together the results of Step — 19 and Step — 20.
Step — 22:
Next, we are rearranging the terms of the equation in Step — 21.
Step — 23:
Next, we are rewriting the equation in Step — 22.
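Written out, the first derivative can be sketched as follows. Two assumptions are made here: the simplified loss is f(x) = (1 - y)wx + log(1 + e^(-wx)), which is what the log-loss reduces to when z = wx, and the derivative is taken with respect to the weight w, since w is the quantity the optimizer changes.

```latex
\begin{aligned}
\frac{\partial f}{\partial w}
  &= \frac{\partial}{\partial w}\Big[(1 - y)\,wx\Big] + \frac{\partial}{\partial w}\Big[\log\!\left(1 + e^{-wx}\right)\Big] \\
  &= (1 - y)\,x - \frac{x\,e^{-wx}}{1 + e^{-wx}} \\
  &= x - xy - x\left(1 - \frac{1}{1 + e^{-wx}}\right) \\
  &= -xy + \frac{x}{1 + e^{-wx}} = x\,(\hat{y} - y)
\end{aligned}
```

The last form, x(ŷ - y), is the familiar "input times prediction error" gradient of logistic regression.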
Finding the Second Derivative:
Step — 24:
Next, we are going to find the second derivative of the function f(x).
Step — 25:
Here we are distributing the partial derivative to each term.
Step — 26:
Next, we are simplifying the equation in Step — 25 to remove redundant terms.
Step — 27:
Here is the derivative rule for 1/f(x).
Step — 28:
Next, we are finding the relevant term to plug into Step — 27.
Step — 29:
Here we are finding the partial derivative term for Step — 27.
Step — 30:
Here we are finding the squared term for Step — 27.
Step — 31:
Here we are putting together all the terms of Step — 27.
Step — 32:
Here we are simplifying the equation in Step — 31.
Step — 33:
Next, we are putting together all the values in Step — 26.
Step — 34:
Next, we are further simplifying the terms in Step — 33.
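The second-derivative steps can be condensed into the sketch below. It assumes the first derivative has the form -xy + x/(1 + e^(-wx)), which follows from the log-loss when z = wx and the derivative is taken with respect to w. The term -xy does not depend on w, so it drops out, and the rule d/dw [1/g(w)] = -g'(w)/g(w)^2 handles the rest.

```latex
\begin{aligned}
\frac{\partial^2 f}{\partial w^2}
  &= \frac{\partial}{\partial w}\left[-xy + \frac{x}{1 + e^{-wx}}\right]
   = x \cdot \frac{\partial}{\partial w}\left[\frac{1}{1 + e^{-wx}}\right] \\
  &= x \cdot \left(-\frac{\frac{\partial}{\partial w}\left(1 + e^{-wx}\right)}{\left(1 + e^{-wx}\right)^2}\right)
   = x \cdot \frac{x\,e^{-wx}}{\left(1 + e^{-wx}\right)^2} \\
  &= \frac{x^2\, e^{-wx}}{\left(1 + e^{-wx}\right)^2}
\end{aligned}
```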
Alright! So, now we have the second derivative of the function f(x). Next, we need to find out whether this will be ≥0 for all the values of x or not. If it is ≥0 for all the values of x, then we can say that the binary cross-entropy loss is convex for logistic regression.
The squared terms in Step — 34 are always going to be ≥0, because the square of any real number is always ≥0.
Now, we need to determine whether or not the value of e^(-wx) is ≥0. To do that, let’s first find the range of the function e^(-wx) over the domain (-∞,+∞). To simplify the calculations, we will consider the function e^-x instead of e^-wx. Multiplying the argument by a nonzero constant w maps the domain (-∞,+∞) onto itself, so it does not change the range of the function. If w=0, the function is constantly 1, which is still positive. Let’s first plot the graph of e^-x to understand its range.
From the above graph, we can derive the following conclusions:
1. As the value of x moves towards negative infinity (-∞), the value of e^-x moves towards infinity (+∞).
2. As the value of x moves towards 0, the value of e^-x moves towards 1.
3. As the value of x moves towards positive infinity (+∞), the value of e^-x moves towards 0.
So, we can say that the range of the function f(x)=e^-x is (0,+∞): the function gets arbitrarily close to 0 but never reaches it. Based on this, we can say that the function f(x)=e^-wx is always going to be >0, and therefore ≥0.
Alright! So, we have concluded that all the terms of the equation in Step — 34 are ≥0. Hence the second derivative is ≥0 everywhere, and we can say that the function f(x) is a convex function for logistic regression.
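As a numerical sanity check of this conclusion, the sketch below compares a closed-form second derivative with a finite-difference estimate. It rests on stated assumptions rather than on the original figures: the model is z = wx with no bias term, the derivative is taken with respect to w, and the closed form is x^2 e^(-wx) / (1 + e^(-wx))^2. The grid of weights and the feature values are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def log_loss(w, x, y):
    """Single-example log-loss with z = w * x (no bias term, by assumption)."""
    z = w * x
    # log(1 + e^(-z)), written so that exp never overflows.
    log1p_exp_neg_z = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return (1.0 - y) * z + log1p_exp_neg_z

def second_derivative(w, x):
    """Closed form x^2 * e^(-wx) / (1 + e^(-wx))^2, rewritten as x^2 * s * (1 - s)."""
    s = sigmoid(w * x)
    return x * x * s * (1.0 - s)

def finite_difference_second_derivative(w, x, y, h=1e-3):
    """Central-difference estimate of the second derivative with respect to w."""
    return (log_loss(w + h, x, y) - 2.0 * log_loss(w, x, y) + log_loss(w - h, x, y)) / (h * h)

weights = [i / 10.0 for i in range(-50, 51)]  # w from -5 to 5
inputs = [-3.0, -1.0, 0.0, 0.5, 2.0]          # hypothetical feature values
labels = [0, 1]

min_second_derivative = min(second_derivative(w, x) for w in weights for x in inputs)
max_gap = max(
    abs(second_derivative(w, x) - finite_difference_second_derivative(w, x, y))
    for w in weights for x in inputs for y in labels
)

# The minimum should never be negative, and the gap should be tiny.
print(min_second_derivative, max_gap)
```

A check like this is not a proof, but a negative minimum or a large gap would indicate an algebra mistake somewhere in the derivation.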
Important Note:
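A second derivative that is exactly 0 does not break convexity, because the condition is ≥0 rather than >0. A straight line, for example, has a second derivative of 0 everywhere and is convex (and also concave). Only strict convexity, which is the property that guarantees a unique minimizer, needs a stronger condition; a second derivative that is >0 everywhere is sufficient for it.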
A Visual Look at BCE for Logistic Regression:
The binary cross entropy function for logistic regression is given by…
Now, we know that this is a binary classification problem. So, there can be only two possible values for Yi (0 or 1).
Step — 1:
The value of cost function when Yi=0.
Step — 2:
Now, let’s consider only one training example.
Step — 3:
Now, let’s say we have only one training example. It means that n=1. So, the value of the cost function when Y=0,
Step — 4:
Now, let’s say we have only one training example. It means that n=1. So, the value of the cost function when Y=1,
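The cost function and its two single-example cases can be written out as follows. This is a sketch in which the symbol J, the hat notation for the predicted value, and the 1/n averaging factor are assumptions; with n=1 the averaging factor makes no difference.

```latex
\begin{aligned}
J &= -\frac{1}{n}\sum_{i=1}^{n}\Big[\,Y_i \log(\hat{Y}_i) + (1 - Y_i)\log(1 - \hat{Y}_i)\,\Big] \\
n = 1,\; Y = 0: \quad J &= -\log(1 - \hat{Y}) \\
n = 1,\; Y = 1: \quad J &= -\log(\hat{Y})
\end{aligned}
```

In each case one of the two terms vanishes, because it is multiplied by either Y = 0 or (1 - Y) = 0.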
Step — 5:
Now, let’s plot the function graph in Step — 3.
Step — 6:
Now, let’s plot the function graph in Step — 4.
Step — 7:
Let’s put the graphs in Step — 5 and Step — 6 together.
The above graphs follow the definition of the convex function (“A function of a single variable is called a convex function if no line segments joining two points on the graph lie below the graph at any point”). So, we can say that the function is convex.
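The chord definition quoted above can also be checked numerically. The sketch below is an illustration, not part of the original post: the helper name, the grid of predicted values, and the tolerance are all hypothetical. It tests whether each single-example cost curve, viewed as a function of the predicted probability, stays on or below the line segment joining any two of its points.

```python
import math

def cost_when_y_is_1(p):
    """Single-example cost -log(p) for a positive label."""
    return -math.log(p)

def cost_when_y_is_0(p):
    """Single-example cost -log(1 - p) for a negative label."""
    return -math.log(1.0 - p)

def lies_on_or_below_chords(f, points, ts=(0.25, 0.5, 0.75), tol=1e-12):
    """Return True if f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) for every tested pair and t."""
    for a in points:
        for b in points:
            for t in ts:
                between = t * a + (1.0 - t) * b
                chord = t * f(a) + (1.0 - t) * f(b)
                if f(between) > chord + tol:
                    return False
    return True

# Predicted probabilities strictly inside (0, 1), where the logarithms are defined.
grid = [i / 100.0 for i in range(1, 100)]

print(lies_on_or_below_chords(cost_when_y_is_1, grid))
print(lies_on_or_below_chords(cost_when_y_is_0, grid))
```

Passing this check on a finite grid is evidence rather than proof, but a single failing pair would be enough to show that a function is not convex.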
Conclusion:
In conclusion, we have explored the concept of convexity and its importance in optimization algorithms used in logistic regression. We have demonstrated that the log-loss function is convex, which implies that every local minimum of its optimization problem is also a global minimum. This property is crucial for ensuring the stability and convergence of optimization algorithms used in logistic regression. By proving the convexity of the log-loss function, we have shown that the optimization problem in logistic regression can be efficiently solved using standard convex optimization methods. Moreover, our proof provides a deeper understanding of the mathematical foundations of logistic regression.
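To illustrate what convexity buys in practice, here is a small gradient descent sketch. It is an illustration under stated assumptions and not the method of the original post: the model is ŷ = sigmoid(w·x) with one weight and no bias, and the dataset, learning rate, and step count are hypothetical. Two runs start from very different weights and should end at essentially the same value.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def gradient(w, xs, ys):
    """Derivative of the mean log-loss with respect to w, for y_hat = sigmoid(w * x)."""
    return sum(x * (sigmoid(w * x) - y) for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(w_start, xs, ys, learning_rate=0.5, steps=5000):
    """Plain fixed-step gradient descent on the mean log-loss."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * gradient(w, xs, ys)
    return w

# Hypothetical one-feature dataset. The point (-0.5, 1) keeps the classes from
# being perfectly separable, so the minimum is attained at a finite weight.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 1, 1, 1]

w_from_left = gradient_descent(-5.0, xs, ys)
w_from_right = gradient_descent(8.0, xs, ys)

# Both starting points should lead to (nearly) the same weight.
print(w_from_left, w_from_right)
```

With a non-convex loss, runs started from different points could settle in different local minima; here they agree because the loss is convex in w.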
Citation:
For attribution in academic contexts, please cite this work as:
Shukla, et al., “Proving the Convexity of Log Loss for Logistic Regression”, Towards AI, 2023
BibTeX Citation:
@article{pratik_2023,
title={Proving the Convexity of Log Loss for Logistic Regression},
url={https://pub.towardsai.net/proving-the-convexity-of-log-loss-for-logistic-regression-49161798d0f3},
journal={Towards AI},
publisher={Towards AI Co.},
author={Pratik, Shukla},
editor={Binal, Dave},
year={2023},
month={Feb}
}