Regression is a machine learning technique in which we build a hypothesis model that predicts continuous values after analyzing patterns in the data.
To understand this, let's work with a simple dataset that shows how these ideas play out...
Independent Variable (feature, X) | Dependent Variable (target, y) |
---|---|
1 | 6 |
2 | 12 |
3 | 18 |
4 | 24 |
5 | 30 |
"What are these independent variable (X)
and dependent variable (y)? 😲"
"Independent Variables / features: " So anything that helps me in
formulating my equation is
Independent Variable. We say it as features. Too technical? Understand it by this example:
"I have a cat." How will I know that this is a cat? Obviously, we know by their features:
spooky eyes, meow sound, rat killer, right? In the same way, features are important for
prediction. From this example, you also understand that correct selection of features
improves my algorithm...
"Dependent Variable / Target: " My prediction values. If we
consider the same example, after
getting all the info about eyes, tail, sound, and other things, it will help me to
categorize whether this animal is a cat or not, right? So the prediction of whether it is a
cat or not is my dependent feature. As the output is completely dependent on features,
that's why we call it a Dependent variable.
"First understand what hypothesis is?" Hypothesis in case of regression just means to establish a mapping function. Think of like creating F(x) or y for every x that responds closer to actual output.
If we have to create a hypothesis, in my opinion the first step is to understand the relationship between the variables, right? So let's have a look at a scatterplot of the given data.
"I know you are confused how I am doing this, I will tell you in the upcoming journey."
If you have noticed, machine learning algorithms work in much the same way that human minds interpret things. We get stressed by complex calculations; can we say that is why ML was introduced? It is as if I understand how my own mind works toward a problem, formulate that thought process, and then use machines to generalize it across many tasks. Right? I am also amazed at how cool our scientists and engineers are...
My hypothesis function always tries to formulate the equation in the form y = mx + c.
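As a tiny sketch (the function name here is my own, just for illustration), the hypothesis is nothing more than a function of x parameterized by a slope m and an intercept c:

# a minimal sketch of the hypothesis y = m*x + c
def hypothesis(x, m, c):
    return m * x + c

# example: with m = 6 and c = 0 (the values we derive below), x = 4 gives 24
print(hypothesis(4, 6, 0))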
There are multiple models for regression analysis, but broadly we can categorize them into three parts:
Linear regression is the one you really have to understand. Once you grasp the concepts used in linear regression, the other models differ only in a few details.
First things first, right? "Linear regression, as the name suggests, means that the hypothesis made by the function must be linear (of the form y = mx + c)."
Second thing: "Linear regression, in its simple form, is applied when there is only one independent feature used to predict the target variable."
To make our work easier, we focus on visuals to understand how it works and break it into steps.
We have some given data:
X | y |
---|---|
1 | 6 |
2 | 12 |
3 | 18 |
4 | 24 |
5 | 30 |
We plot the data on a 2D plane and then compute the quantities needed for the least-squares formulas:
Sum of X: ΣX = 15.00
Sum of y: Σy = 90.00
Sum of Xy: ΣXy = 330.00
Sum of X²: ΣX² = 55.00
Mean of X: 3.00
Mean of y: 18.00
The formula for the slope (m) is:
m = [n * Σ(Xy) - ΣX * Σy] / [n * Σ(X²) - (ΣX)²]
m = [5 * 330.00 - 15.00 * 90.00] / [5 * 55.00 - (15.00)²] = 6.00
The formula for the intercept (c) is:
c = [Σy - m * ΣX] / n
c = [90.00 - 6.00 * 15.00] / 5 = 0.00
Slope (m) = 6.00
Intercept (c) = 0.00
Finally, we use the calculated values to draw the best fit line:
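A minimal sketch of this step (the variable names are my own), computing m and c with NumPy and drawing the fitted line:

# least-squares slope and intercept for the toy data, then plot the best-fit line
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([6, 12, 18, 24, 30], dtype=float)
n = len(X)

# closed-form formulas from above
m = (n * np.sum(X * y) - np.sum(X) * np.sum(y)) / (n * np.sum(X**2) - np.sum(X)**2)
c = (np.sum(y) - m * np.sum(X)) / n
print(m, c)  # 6.0 0.0

# scatter the points and draw y = m*x + c
plt.scatter(X, y, label="data")
plt.plot(X, m * X + c, color="red", label="best fit line")
plt.legend()
plt.show()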
Multiple regression is used when we have multiple independent features (predictors) for predicting an outcome variable (dependent variable). Instead of a simple line of best fit, we are finding a hyperplane in higher dimensions.
The equation for multiple regression is:
y = a + b1x1 + b2x2 + ... + bnxn.
We can write this equation in matrix form. Stacking all the observations row by row gives a system of the form b = Ax, where A is the matrix of feature values (with a leading column of ones to absorb the intercept a), x is the vector of coefficients (a, b1, ..., bn) we want to learn, and b is the vector of target values.
A regression problem therefore reduces to solving a system of linear equations of the form Ax = b. Here, A and b are known, and x is the unknown. We can think of x as our model: we want to solve the system for x, and hence x is the variable that relates the observations in A to the measurements in b.
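As a quick sketch (the numbers here are made up purely for illustration), setting up and solving Ax = b with NumPy's least-squares solver looks like this:

# a small made-up example: 4 observations, 2 features, plus a column of ones for the intercept
import numpy as np

features = np.array([[1.0, 2.0],
                     [2.0, 0.0],
                     [3.0, 1.0],
                     [4.0, 3.0]])
b = np.array([5.0, 4.0, 7.0, 12.0])

# build A by prepending the intercept column of ones
A = np.column_stack([np.ones(features.shape[0]), features])

# solve Ax = b in the least-squares sense; x holds (a, b1, b2)
x, residuals, rank, singular_values = np.linalg.lstsq(A, b, rcond=None)
print(x)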
Also, note that the matrix A might have different shapes. First, A could be a square matrix. Yes, it is very unlikely (for the situations we usually encounter in data science) but otherwise possible.
Second, A could have more columns than rows. In this scenario, A has a short and wide shape. And lastly (and this is the most common case in data science), A takes the form of a tall and skinny matrix, with many more rows than columns. But why should I care about the shape of the matrix A? Interestingly, the shape of A dictates whether the linear system of equations has exactly one solution, infinitely many solutions, or no solution at all. Let's start with the boring case. If the matrix is square (the number of rows equals the number of columns) and invertible, meaning A has full rank (all columns are linearly independent), that pretty much solves the problem:
Ax = b  ⟹  x = A⁻¹b
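For instance, a minimal sketch with a made-up 2×2 invertible system:

# a made-up square, invertible system: both columns of A are linearly independent
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# np.linalg.solve returns the unique solution x = A^{-1} b
x = np.linalg.solve(A, b)
print(x)                      # [1. 3.]
print(np.allclose(A @ x, b))  # True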
However, if the matrix has more columns than it has rows, we are likely dealing with the case where there are infinitely many solutions. To visualize this curious scenario, picture a 3×6 matrix, i.e., 3 rows and 6 columns. We can think of it as having a 3D space and 6 different vectors that we can use to span the 3D space. However, to span a 3D space, we only need 3 linearly independent vectors, but we have 6! This leaves 3 dependent vectors that can be used to formulate infinitely many solutions.
Finally, by analogy, if we have a matrix A with more rows than columns, we can view it as trying to span a very high-dimensional space with fewer vectors than we would need. For instance, picture a matrix with 6 rows and 2 columns. Here, we have a 6D space, but only 2 vectors to span it. No matter how hard we try, at best we can only span a plane inside that 6D space. And that is crucial, because Ax = b only has a solution if the vector b lies in the column space of A. But here, the column space of A spans a 2D subspace (a plane) of a much larger 6D space, which makes it very unlikely that b lies in the subspace spanned by the columns of A.
To visualize how unlikely it is, picture a 3D space and a subspace spanned by two vectors (a plane in 3D). Now, imagine you choose 3 values at random. This will give you a point on the 3D space. Now, ask yourself: what is the probability that my randomly chosen point will be on the plane?
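As a rough numerical illustration (random numbers, made up for this point), a random b in 6D essentially never lies exactly in the 2D column space of a 6×2 matrix, so the system has no exact solution and we settle for the best approximation:

# random tall matrix (6 rows, 2 columns) and a random 6D target vector
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))
b = rng.standard_normal(6)

# best approximate solution in the least-squares sense
x, residuals, rank, s = np.linalg.lstsq(A, b, rcond=None)

# the residual is (almost surely) non-zero: b is not in the column space of A
print(np.linalg.norm(A @ x - b))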
Nonetheless, in situations where we do not have a solution for a linear system of equations Ax=b (or we have infinitely many solutions), we still want to do our best. And to do this, we need to find the best approximate solution. Here is where the SVD (Singular Value Decomposition) kicks in.
The main idea of the singular value decomposition, or SVD, is that we can decompose a matrix A, of any shape, into the product of three other matrices: A = UΣVᵀ.
The matrices U and Vᵀ have a very special property: they are unitary. One of the main benefits of unitary matrices like U and Vᵀ is that multiplying one of them by its transpose (or the other way around) gives the identity matrix. The matrix Σ, on the other hand, is diagonal, and it stores non-negative singular values ordered by relevance. Given A = UΣVᵀ, the best approximate (least-squares) solution of Ax = b is x̂ = VΣ⁻¹Uᵀb, inverting the non-zero singular values, which is exactly what the code below computes.
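A quick sketch (random matrix, just for illustration) of these properties with NumPy:

# verify the SVD properties on a random tall matrix
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 2))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(2)))      # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(2)))    # rows of Vt are orthonormal
print(np.allclose(U @ np.diag(S) @ Vt, A))  # A is reconstructed exactly
print(S)                                    # singular values, largest first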
# import libraries
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from numpy.linalg import svd

# load the data: A holds the features, b the target values
data = load_diabetes()
A = data.data
b = data.target

# add a bias (intercept) column of ones to A
A = np.column_stack([np.ones(A.shape[0]), A])

# train/test split of the data
X_train, X_test, y_train, y_test = train_test_split(A, b, test_size=0.50, random_state=42)

# SVD of the training matrix: X_train = U @ diag(S) @ Vt
U, S, Vt = svd(X_train, full_matrices=False)

# pseudo-inverse calculation: x_hat = V @ Sigma^{-1} @ U^T @ y_train is the least-squares model
S_inv = np.diag(1 / S)
x_hat = Vt.T @ S_inv @ U.T @ y_train

# predictions and mean squared error on both splits
train_predictions = X_train @ x_hat
test_predictions = X_test @ x_hat
train_mse = np.mean((train_predictions - y_train) ** 2)
test_mse = np.mean((test_predictions - y_test) ** 2)
print("Train Mean Squared Error:", train_mse)
print("Test Mean Squared Error:", test_mse)
In machine learning, the journey to finding the optimal fit line is intriguing! Let's explore how we minimize error to achieve this.
Imagine you're trying to walk to a certain point but don't know the exact distance. The key question is: how do you figure out how far you need to walk to reach your destination? In machine learning, we calculate how far off our predictions are from the actual values. This is where the squared error function comes in.
We calculate the difference between predicted and actual points, square it to ensure all errors are positive, and sum these squared differences to obtain the total error or "cost." Our goal is to minimize this total cost by adjusting the model's parameters, gradually getting closer to the best fit line.
Explore how we minimize the squared error between actual and predicted points!
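As a small sketch (the candidate slope values are my own picks, just to illustrate), here is the squared-error cost of a few candidate lines y = m·x on the toy data; the cost is smallest at the true slope m = 6:

# squared-error cost of a candidate line y = m*x + c on the toy dataset
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([6, 12, 18, 24, 30], dtype=float)

def cost(m, c=0.0):
    predictions = m * X + c
    return np.sum((predictions - y) ** 2)

# the cost shrinks as the candidate slope approaches the true slope of 6
for m in [2.0, 4.0, 5.0, 6.0, 7.0]:
    print(m, cost(m))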
The learning rate is a critical hyperparameter in the journey of gradient descent. Let's explore how different learning rates affect convergence!
In gradient descent, the learning rate determines the size of the steps taken towards the minimum of the loss function. If the learning rate is too high, we may overshoot the minimum, while a very low learning rate can lead to a long convergence time.
Visualize how different learning rates impact the optimization process!
Each gradient descent step updates the parameters as: m_new = m_old − α · ∂(cost)/∂m and c_new = c_old − α · ∂(cost)/∂c, where α is the learning rate and the partial derivatives are the gradients of the cost with respect to each parameter.
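A minimal, hedged sketch (the learning-rate values are chosen only for illustration) of gradient descent fitting y = m·x + c on the toy data with three different learning rates:

# gradient descent on the toy data for y = m*x + c, comparing learning rates
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([6, 12, 18, 24, 30], dtype=float)
n = len(X)

def gradient_descent(learning_rate, steps=500):
    m, c = 0.0, 0.0
    for _ in range(steps):
        error = m * X + c - y
        # gradients of the mean squared error with respect to m and c
        grad_m = (2.0 / n) * np.sum(error * X)
        grad_c = (2.0 / n) * np.sum(error)
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# a tiny rate crawls toward the answer, a moderate rate lands near m = 6, c = 0,
# and a rate that is too large overshoots and the parameters blow up
for lr in [0.001, 0.05, 0.1]:
    print(lr, gradient_descent(lr))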