Let’s Really, Really Talk About Linear Regression.
Linear Regression. It’s one of the most basic forms of regression modeling, and it is best suited when the target value is expected to be a roughly linear combination of the features. Some people might say that this is too naive an algorithm to be used effectively. However, I believe it catches many people off guard when they realize the power this simple model has to offer. If there is any sign of a linear relationship between the features and the target, linear regression and its variants are some of the most powerful tools you can have in your toolbox. Furthermore, these variants are tailored to specific needs of the user, providing plenty of flexibility while remaining robust.
Due to the seemingly simple logic of the algorithm, I think a lot of the people using it do not truly understand what is going on under the hood. I believe most think of it as simply fitting a line (a plane, or a hyperplane in higher dimensions) that best fits the data points.

I wouldn’t blame them, as this is what we were taught in middle school when learning about graphs and plotting the ‘best-fit’ line. However, there is so much more going on under the hood to provide users with mathematically rigorous and powerful results.
In this article, I wish to walk you through a simple understanding of linear regression and its variants and, if you wish, dive into the mathematical rigor of these algorithms and how they are fitted to a dataset. I hope that with an understanding of this mathematical rigor, you will develop a new sense of appreciation for them and be closer to truly understanding and mastering them. Let's jump right into it then!
Linear Regression
As a mathematical equation, linear regression can be written as follows:
ŷ(w, x) = w0 + w1x1 + w2x2 + ... + wpxp
where w0 is the intercept and w1 to wp are the coefficients. The x values are the different features of the data. As we can see, this is a linear relationship: the model is linear in the coefficients, with no polynomial or other non-linear terms. Once you've understood this, the next question you could have is: how is this algorithm trained to find the optimal coefficients? I mean, that is the basis of any machine learning algorithm, right?
So how does it do so? Well, to understand that, you first need to realize that linear regression models are fitted to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the model. Mathematically, it can be shown as such:
RSS(w) = ||Xw - y||^2 = Σi (yi - ŷi)^2
where X is the matrix whose rows are the samples' feature vectors (with a leading column of ones for the intercept) and y is the vector of observed targets.
The algorithm's aim is to find the set of coefficients w that minimizes this residual sum of squares. Now the question is how this minimization is performed. The key here is Ordinary Least Squares (OLS). Before I talk about how this is done, let me shed some light on the assumptions that need to hold true for this minimization to work.
Key assumptions
The main assumption is that the features are linearly independent, i.e., the feature matrix X has full column rank. When features are correlated with one another and the columns become linearly dependent, the matrix X^TX becomes singular, also known as non-invertible. If this happens, there is no closed-form solution to the linear equation, rendering our approach to solving linear regression powerless. The phenomenon of having linearly dependent (or nearly dependent) columns is also known as multicollinearity, where two or more columns are highly correlated. If this assumption is not met, our estimates become highly sensitive to random errors in the observed target.
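To make this concrete, here is a small NumPy sketch on made-up data (the variable names are my own, purely for illustration) showing how a nearly duplicated feature makes X^TX almost singular:

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=1e-6, size=100)   # x2 is almost an exact copy of x1
X = np.column_stack([np.ones(100), x1, x2])  # intercept column plus two collinear features

# A huge condition number means X^T X is close to singular, so inverting it
# (and therefore the OLS solution) becomes numerically unreliable.
print(np.linalg.cond(X.T @ X))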
Ordinary Least Squares
Now let us talk about the actual workings of OLS. We have to remember that our aim is to minimize the residual sum of squares, which was written out mathematically in the section above. To do so, we have to find the value of w that minimizes it, and for that we employ the handy partial derivative: we differentiate the residual sum of squares with respect to w and set the derivative to 0.
d/dw ||Xw - y||^2 = 2X^T(Xw - y) = 0
To find the coefficients w, we rearrange the terms:
X^TXw = X^Ty
w = (X^TX)^-1 X^Ty
By applying fundamental linear algebra, we are able to find w, the estimated regression coefficients that minimize the residual sum of squares. The reason for the linear independence assumption explained earlier is that X^TX has to be non-singular to be inverted. Now, with this newly found w, we are able to predict for new data points by multiplying the new X data with w.
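As a quick sanity check, here is a minimal NumPy sketch on toy data (names like true_w are hypothetical, for illustration only) of the closed-form solution, compared against np.linalg.lstsq, which solves the same least-squares problem in a numerically safer way:

import numpy as np

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Closed-form OLS: w = (X^T X)^-1 X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The same solution via a dedicated least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)  # both close to [1, 2, -3]

# Predicting for new data points is just a matrix-vector product
X_new = np.column_stack([np.ones(3), rng.normal(size=(3, 2))])
y_pred = X_new @ w_hat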
I believe that if you have learnt basic linear algebra in high school or university, none of this will sound too foreign to you. However, there is more interesting knowledge to be discovered when you dive into linear regression's two well-known variants, Ridge and Lasso regression. If you wish to know how these algorithms work too, keep on reading!
Ridge Regression
So what is this variant of linear regression? Ridge regression's equation does not look too different from the one above, except for an added regularization term. This regularization imposes a penalty on the size of the coefficients, and by doing so it aims to tackle multicollinearity among the features. The new function we aim to minimize can be shown as such:
||Xw - y||^2 + λ||w||^2
The λ (lambda) represents the regularization strength: the greater the lambda, the greater the regularization, and the more strongly it combats multicollinearity. Solving for the coefficients is very similar too:
w = (X^TX + λI)^-1 X^Ty
Solving this will provide you with the necessary weight vector to predict on future data. The difference between vanilla linear regression and ridge regression, however, is the power to handle multicollinearity, the power to combat possible linear dependence within the features. Linear dependence or multicollinearity makes the term X^TX nearly singular, and when this happens, directly computing the inverse in the above equation can lead to numerical instability.
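As a sketch (assuming, for simplicity, that the intercept is a column of ones and gets penalized along with everything else, which a real implementation would avoid), the ridge solution only differs from OLS by the λI term:

import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge coefficients: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    # np.linalg.solve avoids forming the inverse explicitly, which is cheaper and more stable
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# lam = 0 recovers ordinary least squares; a larger lam shrinks the coefficients towards 0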
To combat this possible scenario, we can utilize the Singular Value Decomposition (SVD) to obtain a closed-form solution for the coefficient vector. Additionally, by not explicitly inverting the matrix, we save computation time too, as matrix inversion is known to be computationally expensive, O(n³).
Singular Value Decomposition
To understand how SVD is used, one must first understand the theory behind it. Simply put, SVD decomposes a single matrix into three different ones. It tells us that every linear transformation is a composition of three fundamental actions: a rotation/reflection, followed by a dilation/contraction along orthogonal directions, and ending with another rotation/reflection. It is commonly represented mathematically as:
X = UΣV^T
U is an m by m matrix of the orthonormal eigenvectors of XX^T. V^T is the transpose of an n by n matrix containing the orthonormal eigenvectors of X^TX. Finally, Σ (sigma) is an m by n matrix whose only non-zero elements lie on the diagonal and equal the square roots of the positive eigenvalues of XX^T (or, equivalently, of X^TX); these are called the singular values.
This technique is used in many different applications, and one of them is finding the pseudo-inverse of a matrix. The pseudo-inverse is a generalization of the matrix inverse to matrices that may not be invertible (such as singular matrices). With it, we are able to solve linear systems that otherwise could not be solved exactly. This is only a brief overview of SVD, so I encourage you to read more about it. The SVD covers a significant portion of linear algebra, and understanding this technique thoroughly will benefit you in the long run.
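Here is a short NumPy sketch of the decomposition itself and of the pseudo-inverse built from it (np.linalg.pinv uses the SVD internally, so the two results should match):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(s) @ Vt))  # True

# Pseudo-inverse: X^+ = V @ diag(1/s) @ U^T (a real implementation drops tiny singular values first)
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(X_pinv, np.linalg.pinv(X)))  # True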
Now, let's see how we can use SVD to arrive at a solution for our coefficient vector, w. Working from the ridge solution and using the thin ('economy') SVD, where Σ is a square diagonal matrix of the singular values and U^TU = V^TV = VV^T = I (assuming at least as many samples as features):
w = (X^TX + λI)^-1 X^Ty
X^TX = VΣU^TUΣV^T = VΣ²V^T
X^TX + λI = VΣ²V^T + λVV^T = V(Σ² + λI)V^T
(X^TX + λI)^-1 = V(Σ² + λI)^-1 V^T
X^Ty = VΣU^Ty
w = V(Σ² + λI)^-1 V^TVΣU^Ty = V(Σ² + λI)^-1 ΣU^Ty
It may look daunting at first, but I have laid it out step by step, so don't be too worried. Patiently look through it and try to understand how the last equation was derived.
As you can see, the only inverse left in the final equation is that of Σ² + λI, a diagonal matrix, so "inverting" it is just an elementwise division by σi² + λ. By using SVD, we obtain a closed-form solution that saves computation and also prevents numerical instability in cases where X^TX is ill-conditioned or non-invertible.
This way of solving for the coefficients can also be seen in the source code of scikit-learn's Ridge regression.
def _solve_svd(X, y, alpha, xp=None):
    xp, _ = get_namespace(X, xp=xp)
    # Thin SVD of the feature matrix: X = U @ diag(s) @ Vt
    U, s, Vt = xp.linalg.svd(X, full_matrices=False)
    # Keep only the non-negligible singular values
    idx = s > 1e-15  # same default value as scipy.linalg.pinv
    s_nnz = s[idx][:, None]
    UTy = U.T @ y
    # d holds the factor sigma / (sigma^2 + alpha) for each kept singular value
    d = xp.zeros((s.shape[0], alpha.shape[0]), dtype=X.dtype, device=device(X))
    d[idx] = s_nnz / (s_nnz**2 + alpha)
    d_UT_y = d * UTy
    # Coefficients: Vt.T @ (d * U.T @ y), transposed so each row matches one target
    return (Vt.T @ d_UT_y).T
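In practice you would not call this private helper yourself; you can simply ask the Ridge estimator to use its SVD-based solver, for example on some made-up data:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, -2.0, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# solver="svd" routes the fit through the SVD-based path shown above;
# alpha is the regularization strength (the lambda in our equations).
model = Ridge(alpha=1.0, solver="svd")
model.fit(X, y)
print(model.coef_, model.intercept_)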
Lasso Regression
Remember that the previous algorithm, Ridge regression, is mainly meant to combat collinearity. Lasso regression, on the other hand, is meant for feature selection. It prefers solutions with fewer non-zero coefficients, effectively keeping only the necessary features. This algorithm tries to minimize the following cost function:
(1/(2n)) ||Xw - y||^2 + λ||w||1
where n is the number of samples and ||w||1 = |w1| + |w2| + ... + |wp| is the sum of the absolute values of the coefficients.
This time, the regularization term introduced is known as the L1 penalty. For this cost function, we have to use iterative optimization methods, because the L1 penalty is not differentiable at zero, so there is no neat closed-form solution like before. This is where coordinate descent comes in. Coordinate descent is carried out by fixing all other coordinates and minimizing the cost function over a single coordinate at a time, cycling through the coordinates iteratively. This is extremely useful when not all terms are differentiable.
Coordinate Descent
For those of you who are already familiar with gradient descent, a very famous algorithm in the deep learning community, this is just a small change, so following along with the math will not be too challenging. For those of you who are not confident with calculus or linear algebra, it might take a while to grasp, but please follow through, because once you see the beauty of the workings, you will really understand what this 'simple' algorithm is doing.
Fixing every coordinate except wj, the cost function becomes a one-dimensional problem in wj:
min over wj of (1/(2n)) Σi (ri - xij wj)^2 + λ|wj|
where ri = yi - Σ(k≠j) xik wk is the residual with the contribution of all the other coordinates removed. Differentiating the smooth squared term is easy, but the |wj| term is not differentiable at 0, so the first-order condition has to be written with a subdifferential.
To find the necessary wj values, we therefore need a way to deal with the subdifferential, which is required because of the non-differentiable L1 penalty. Away from zero, the subdifferential of |wj| is just the ordinary derivative, sign(wj); at 0, it becomes non-trivial and can take any value in [-1, 1]. To handle this, we use soft thresholding.
Soft Thresholding
S(z, γ) = sign(z) · max(|z| - γ, 0)
By applying this operator, we obtain a clear value for each coordinate: the soft thresholding operator shrinks z towards 0, and sets it exactly to 0 if the absolute value of z is small enough (|z| ≤ γ). In our coordinate descent step, the update becomes wj = S(xj^T r, nλ) / (xj^T xj), where xj is the j-th column of X and r is the vector of partial residuals ri from above. This is where you can see the role of λ: the threshold γ grows with the regularization strength, so a larger λ means a larger γ, which in turn sets more coefficients to 0 and encourages a sparser solution.
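Putting the pieces together, here is a minimal, unoptimized NumPy sketch of lasso coordinate descent (my own illustration of the idea, not scikit-learn's actual implementation, which is written in Cython and adds convergence checks, an intercept, and more):

import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/(2n)) * ||Xw - y||^2 + lam * ||w||1 one coordinate at a time."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # Partial residual: what remains of y once the other coordinates' contribution is removed
            r = y - X @ w + X[:, j] * w[j]
            # One-dimensional update: soft-threshold the unregularized solution
            w[j] = soft_threshold(X[:, j] @ r, n_samples * lam) / (X[:, j] @ X[:, j])
    return w

# With a large enough lam, many coefficients come out exactly 0 (feature selection in action)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(lasso_coordinate_descent(X, y, lam=0.1))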
Conclusion
In this article, I have talked about three different types of linear regression: the vanilla version, Ridge regression and Lasso regression. Due to their seemingly simplistic nature, I often see these models getting sidelined for fancier models such as gradient boosting and bagging ensembles. However, in the presence of linear relationships in a tabular dataset, a lightweight model such as linear regression might be all one needs to provide a quick and robust solution. I hope the mathematical rigor I have provided sheds some light on the inner workings of these models and inspires readers to truly understand these algorithms.
If you have liked this story, please make sure to leave a clap and share it with your friends! I have written other interesting stories, so please feel free to take a look at them! Any comments or suggestions are welcome too!