Regression is a machine learning technique in which we build a hypothesis model that predicts continuous values after analyzing patterns in the data.
To understand this, let's work with a simple dataset that shows how these ideas play out...
Independent Variable (feature, X) | Dependent Variable (target, y) |
---|---|
1 | 6 |
2 | 12 |
3 | 18 |
4 | 24 |
5 | 30 |
"What are these independent variable (X)
and dependent variable (y)? 😲"
"Independent Variables / features: " So anything that helps me in
formulating my equation is
Independent Variable. We say it as features. Too technical? Understand it by this example:
"I have a cat." How will I know that this is a cat? Obviously, we know by their features:
spooky eyes, meow sound, rat killer, right? In the same way, features are important for
prediction. From this example, you also understand that correct selection of features
improves my algorithm...
"Dependent Variable / Target: " My prediction values. If we
consider the same example, after
getting all the info about eyes, tail, sound, and other things, it will help me to
categorize whether this animal is a cat or not, right? So the prediction of whether it is a
cat or not is my dependent feature. As the output is completely dependent on features,
that's why we call it a Dependent variable.
"First understand what hypothesis is?" Hypothesis in case of regression just means to establish a mapping function. Think of like creating F(x) or y for every x that responds closer to actual output.
If we have to create a hypothesis, in my opinion the first step is to understand the relationship between the variables, right? So let's have a look at a scatterplot of the given data.
"I know you are confused how I am doing this, I will tell you in the upcoming journey."
If you have noticed, machine learning algorithms work in much the same way that human minds interpret things. We get stressed by complex calculations; can we say that is why ML was introduced? It is as if I understand how my own mind works toward a problem, formulate that thought process, and then use machines to generalize it across many tasks. Right? I am also amazed at how cool our scientists and engineers are...
My hypothesis function always tries to formulate the equation in the form y = mx + c.
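As a tiny sketch (the function name here is my own, just for illustration), the hypothesis is nothing more than a function of x parameterized by a slope m and an intercept c:

# a minimal sketch of the hypothesis y = m*x + c
def hypothesis(x, m, c):
    return m * x + c

# example: with m = 6 and c = 0 (the values we derive below), x = 4 gives 24
print(hypothesis(4, 6, 0))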
There are multiple models for regression analysis, but broadly we can categorize them into three parts:
Linear regression is the one you really have to understand. Once you grasp the concepts used in linear regression, the other models differ only in a few details.
First things first, right? "Linear regression, as the name suggests, means that the hypothesis made by the function must be linear (of the form y = mx + c)."
Second thing: "Linear regression, in its simple form, is applied when there is only one independent feature used to predict the target variable."
To make our work easier, we focus on visuals to understand how it works and break it into steps.
We have some given data:
X | y |
---|---|
1 | 6 |
2 | 12 |
3 | 18 |
4 | 24 |
5 | 30 |
We plot the data on a 2D plane and then compute the quantities needed for the least-squares formulas:
Sum of X: ΣX = 15.00
Sum of y: Σy = 90.00
Sum of Xy: ΣXy = 330.00
Sum of X²: ΣX² = 55.00
Mean of X: 3.00
Mean of y: 18.00
The formula for the slope (m) is:
m = [n * Σ(Xy) - ΣX * Σy] / [n * Σ(X²) - (ΣX)²]
m = [5 * 330.00 - 15.00 * 90.00] / [5 * 55.00 - (15.00)²] = 6.00
The formula for the intercept (c) is:
c = [Σy - m * ΣX] / n
c = [90.00 - 6.00 * 15.00] / 5 = 0.00
Slope (m) = 6.00
Intercept (c) = 0.00
Finally, we use the calculated values to draw the best fit line:
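A minimal sketch of this step (the variable names are my own), computing m and c with NumPy and drawing the fitted line:

# least-squares slope and intercept for the toy data, then plot the best-fit line
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([6, 12, 18, 24, 30], dtype=float)
n = len(X)

# closed-form formulas from above
m = (n * np.sum(X * y) - np.sum(X) * np.sum(y)) / (n * np.sum(X**2) - np.sum(X)**2)
c = (np.sum(y) - m * np.sum(X)) / n
print(m, c)  # 6.0 0.0

# scatter the points and draw y = m*x + c
plt.scatter(X, y, label="data")
plt.plot(X, m * X + c, color="red", label="best fit line")
plt.legend()
plt.show()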
Multiple regression is used when we have multiple independent features (predictors) for predicting an outcome variable (dependent variable). Instead of a simple line of best fit, we are finding a hyperplane in higher dimensions.
The equation for multiple regression is:
y = a + b1x1 + b2x2 + ... + bnxn.
We can write this equation in matrix form. Stacking all the observations row by row gives a system of the form b = Ax, where A is the matrix of feature values (with a leading column of ones to absorb the intercept a), x is the vector of coefficients (a, b1, ..., bn) we want to learn, and b is the vector of target values.
A regression problem therefore reduces to solving a system of linear equations of the form Ax = b. Here, A and b are known, and x is the unknown. We can think of x as our model: we want to solve the system for x, and hence x is the variable that relates the observations in A to the measurements in b.
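As a quick sketch (the numbers here are made up purely for illustration), setting up and solving Ax = b with NumPy's least-squares solver looks like this:

# a small made-up example: 4 observations, 2 features, plus a column of ones for the intercept
import numpy as np

features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])
b = np.array([5.0, 4.0, 7.0, 12.0])

# build A by prepending the intercept column of ones
A = np.column_stack([np.ones(features.shape[0]), features])

# solve Ax = b in the least-squares sense; x holds (a, b1, b2)
x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)
print(x)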
Also, note that the matrix A might have different shapes. First, A could be a square matrix. Yes, it is very unlikely (for the situations we usually encounter in data science) but otherwise possible.
Second, A could have more columns than rows. In this scenario, A has a short and wide shape. And lastly (and this is the most common case in data science), A takes the form of a tall and skinny matrix, with many more rows than columns. But why should I care about the shape of the matrix A? Interestingly, the shape of A dictates whether the linear system of equations has exactly one solution, infinitely many solutions, or no solution at all. Let's start with the boring case. If the matrix is square (the number of rows equals the number of columns) and invertible, meaning A has full rank (all columns are linearly independent), that pretty much solves the problem:
Ax = b  ⟹  x = A⁻¹b
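For instance, a minimal sketch with a made-up 2×2 invertible system:

# a made-up square, invertible system: both columns of A are linearly independent
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# np.linalg.solve returns the unique solution x = A^{-1} b
x = np.linalg.solve(A, b)
print(x)                      # [1. 3.]
print(np.allclose(A @ x, b))  # True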
However, if the matrix has more columns than it has rows, we are likely dealing with the case where there are infinitely many solutions. To visualize this curious scenario, picture a 3×6 matrix, i.e., 3 rows and 6 columns. We can think of it as having a 3D space and 6 different vectors that we can use to span the 3D space. However, to span a 3D space, we only need 3 linearly independent vectors, but we have 6! This leaves 3 dependent vectors that can be used to formulate infinitely many solutions.
Finally, by analogy, if we have a matrix A with more rows than columns, we can view it as trying to span a very high-dimensional space with fewer vectors than we would need. For instance, picture a matrix with 6 rows and 2 columns. Here, we have a 6D space, but only 2 vectors to span it. No matter how hard we try, at best we can only span a plane inside that 6D space. And that is crucial, because Ax = b only has a solution if the vector b lies in the column space of A. But here, the column space of A spans a 2D subspace (a plane) of a much larger 6D space, which makes it very unlikely that b lies in the subspace spanned by the columns of A.
To visualize how unlikely it is, picture a 3D space and a subspace spanned by two vectors (a plane in 3D). Now, imagine you choose 3 values at random. This will give you a point on the 3D space. Now, ask yourself: what is the probability that my randomly chosen point will be on the plane?
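As a rough numerical illustration (random numbers, made up for this point), a random b in 6D essentially never lies exactly in the 2D column space of a 6×2 matrix, so the system has no exact solution and we settle for the best approximation:

# random tall matrix (6 rows, 2 columns) and a random 6D target vector
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
b = rng.standard_normal(6)

# best approximate solution in the least-squares sense
x, residuals, rank, s = np.linalg.lstsq(A, b, rcond=None)

# the residual is (almost surely) non-zero: b is not in the column space of A
print(np.linalg.norm(A @ x - b))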
Nonetheless, in situations where we do not have a solution for a linear system of equations Ax=b (or we have infinitely many solutions), we still want to do our best. And to do this, we need to find the best approximate solution. Here is where the SVD (Singular Value Decomposition) kicks in.
The main idea of the singular value decomposition, or SVD, is that we can decompose a matrix A, of any shape, into the product of three other matrices: A = UΣVᵀ.
The matrices U and Vᵀ have a very special property: they are unitary. One of the main benefits of unitary matrices like U and Vᵀ is that multiplying one of them by its transpose (or the other way around) gives the identity matrix. The matrix Σ, on the other hand, is diagonal, and it stores non-negative singular values ordered by relevance. Given A = UΣVᵀ, the best approximate (least-squares) solution of Ax = b is x̂ = VΣ⁻¹Uᵀb, inverting the non-zero singular values, which is exactly what the code below computes.
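A quick sketch (random matrix, just for illustration) of these properties with NumPy:

# verify the SVD properties on a random tall matrix
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 2))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(2)))      # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(2)))    # rows of Vt are orthonormal
print(np.allclose(U @ np.diag(S) @ Vt, A))  # A is reconstructed exactly
print(S)                                    # singular values, largest first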
# import libraries
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from numpy.linalg import svd

# load the data: A holds the features, b the target values
data = load_diabetes()
A = data.data
b = data.target

# add a bias (intercept) column of ones to A
A = np.column_stack([np.ones(A.shape[0]), A])

# train/test split of the data
X_train, X_test, y_train, y_test = train_test_split(A, b, test_size=0.50, random_state=42)

# SVD of the training matrix: X_train = U @ diag(S) @ Vt
U, S, Vt = svd(X_train, full_matrices=False)

# pseudo-inverse calculation: x_hat = V @ Sigma^{-1} @ U^T @ y_train is the least-squares model
S_inv = np.diag(1 / S)
x_hat = Vt.T @ S_inv @ U.T @ y_train

# predictions and mean squared error on both splits
train_predictions = X_train @ x_hat
test_predictions = X_test @ x_hat
train_mse = np.mean((train_predictions - y_train) ** 2)
test_mse = np.mean((test_predictions - y_test) ** 2)
print("Train Mean Squared Error:", train_mse)
print("Test Mean Squared Error:", test_mse)
In machine learning, the journey to finding the optimal fit line is intriguing! Let's explore how we minimize error to achieve this.
Imagine you're trying to walk to a certain point but don't know the exact distance. The key question is: how do you figure out how far you need to walk to reach your destination? In machine learning, we calculate how far off our predictions are from the actual values. This is where the squared error function comes in.
We calculate the difference between predicted and actual points, square it to ensure all errors are positive, and sum these squared differences to obtain the total error or "cost." Our goal is to minimize this total cost by adjusting the model's parameters, gradually getting closer to the best fit line.
Explore how we minimize the squared error between actual and predicted points!
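As a small sketch (the candidate slope values are my own picks, just to illustrate), here is the squared-error cost of a few candidate lines y = m·x on the toy data; the cost is smallest at the true slope m = 6:

# squared-error cost of a candidate line y = m*x + c on the toy dataset
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([6, 12, 18, 24, 30], dtype=float)

def cost(m, c=0.0):
    predictions = m * X + c
    return np.sum((predictions - y) ** 2)

# the cost shrinks as the candidate slope approaches the true slope of 6
for m in [2.0, 4.0, 5.0, 6.0, 7.0]:
    print(m, cost(m))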
The learning rate is a critical hyperparameter in the journey of gradient descent. Let's explore how different learning rates affect convergence!
In gradient descent, the learning rate determines the size of the steps taken towards the minimum of the loss function. If the learning rate is too high, we may overshoot the minimum, while a very low learning rate can lead to a long convergence time.
Visualize how different learning rates impact the optimization process!
Each gradient descent step updates the parameters as: m_new = m_old − α · ∂(cost)/∂m and c_new = c_old − α · ∂(cost)/∂c, where α is the learning rate and the partial derivatives are the gradients of the cost with respect to each parameter.
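A minimal, hedged sketch (the learning-rate values are chosen only for illustration) of gradient descent fitting y = m·x + c on the toy data with three different learning rates:

# gradient descent on the toy data for y = m*x + c, comparing learning rates
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([6, 12, 18, 24, 30], dtype=float)
n = len(X)

def gradient_descent(learning_rate, steps=500):
    m, c = 0.0, 0.0
    for _ in range(steps):
        error = m * X + c - y
        # gradients of the mean squared error with respect to m and c
        grad_m = (2.0 / n) * np.sum(error * X)
        grad_c = (2.0 / n) * np.sum(error)
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# a tiny rate crawls toward the answer, a moderate rate lands near m = 6, c = 0,
# and a rate that is too large overshoots and the parameters blow up
for lr in [0.001, 0.05, 0.1]:
    print(lr, gradient_descent(lr))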