
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (y), a single numerical value measuring some quantity, and one or more explanatory variables (x). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
When you plot data points on a graph and then draw a line through them that summarizes their trend, you are making a linear approximation of the distribution of those data points.
The formula for such a line is:
y = mx + b
y is the value we are trying to predict, such as a temperature in Celsius. m is the slope of the line. x is the value of our input feature. b is the y-intercept.
In machine learning, you’ll write the equation for a model slightly differently:
y’ = b + w1x1
y’ is the predicted label (the desired output). b is the bias (the y-intercept). w1 is the weight of feature 1; weight is the same concept as the “slope” m in the traditional equation of a line. x1 is a feature (a known input).
A linear regression model with two predictor variables can be expressed with the following equation:
Y = B0 + B1*X1 + B2*X2 + e.
The variables in the model are:
Y, the response variable; X1, the first predictor variable; X2, the second predictor variable; and e, the residual error, which is an unmeasured variable. The parameters in the model are:
B0, the Y-intercept; B1, the first regression coefficient; and B2, the second regression coefficient. One example would be a model of the height of a shrub (Y) based on the amount of bacteria in the soil (X1) and whether the plant is located in partial or full sun (X2). Y is the dependent variable you are trying to predict; X1, X2, and so on are the independent variables you are using to predict it; and B1, B2, and so on are the coefficients, or multipliers, that describe the size of the effect each independent variable has on your dependent variable Y.
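As a quick illustration, the two-predictor equation can be evaluated directly in Python; the coefficient values below are made up purely for this sketch:

```python
# Hypothetical fitted model for the shrub example:
# height = B0 + B1 * bacteria + B2 * sun (made-up coefficients)
B0 = 42.0   # Y-intercept: baseline height
B1 = 2.5    # effect of each unit of soil bacteria (X1)
B2 = 11.0   # effect of full sun (X2 = 1) vs partial sun (X2 = 0)

def predict_height(bacteria, full_sun):
    """Evaluate Y = B0 + B1*X1 + B2*X2, ignoring the residual e."""
    return B0 + B1 * bacteria + B2 * full_sun

# A shrub with 5 units of bacteria growing in full sun:
print(predict_height(5, 1))  # 42.0 + 12.5 + 11.0 = 65.5
```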
Here is an example of how you can create a Pandas DataFrame from a CSV file and use scikit-learn to perform a basic linear regression on test data:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Load the data into a Pandas DataFrame
df = pd.read_csv('data.csv')
# Split the data into features and target
features = df.drop('target_column', axis=1)
target = df['target_column']
# Split the data into training and testing sets (the train/test split)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
# Train the linear regression model
reg = LinearRegression().fit(X_train, y_train)
# Predict using the trained model
y_pred = reg.predict(X_test)
# Evaluate the model's performance
score = reg.score(X_test, y_test)
print('Test score: ', score)
In this example, we start by loading the data from a CSV file into a Pandas DataFrame using the pd.read_csv function. Then, we split the data into features and target variables. The features are the independent variables, while the target is the dependent variable that we want to predict. Next, we split the data into training and testing sets using the train_test_split function from scikit-learn’s model selection module. This allows us to evaluate the performance of our model on a set of data it hasn’t seen before. Once we have the training and testing sets, we train the linear regression model using the fit method from the LinearRegression class. Then, we use the trained model to make predictions on the test data using the predict method. Finally, we evaluate the model’s performance using the score method, which returns the coefficient of determination (R²). The R² value indicates the proportion of the variance in the target variable that is explained by the features. A value of 1.0 means that the model perfectly fits the data, while a value close to 0.0 means that the model does not fit the data well. See below for more information on measuring a model and its fit to make accurate predictions, where the concepts of overfitting and underfitting are also taken into consideration.
Features and Labels
In machine learning, a feature is a characteristic or attribute of an instance or observation that can be used as an input for a model. A feature can be thought of as a variable that describes an aspect of an instance or observation. For example, in a machine learning model to predict the price of a house, features might include the number of bedrooms, the square footage of the house, the neighborhood it is located in, etc. Features play a crucial role in building machine learning models as they provide the model with the information it needs to make predictions. A good set of features is often the key to building a high-performing model, while poor or irrelevant features can lead to under-performance.
The process of selecting and engineering features is often referred to as feature engineering, and it can be a time-consuming and iterative process. In many cases, the feature engineering process involves transforming or combining raw data into a more useful format for the model, as well as selecting a subset of the available features that are most relevant to the task at hand.
In machine learning, a label is a dependent variable or the target variable that you want to predict based on the input features. The label is the output that the machine learning model is trying to predict. In supervised learning, which is the most common type of machine learning, the model is trained on a labeled dataset that consists of instances (also known as samples or observations) with both features and corresponding labels. The model then learns to map the input features to the correct output label based on the relationship between the features and the labels in the training data.
The relationship between the features and the label can be thought of as the underlying function or decision boundary that the model is trying to learn. Once the model is trained, it can be applied to new, unseen instances with the same features to make predictions about the label. It is important to note that the label is not a feature, but it is related to the features in that the features are used as input to predict the label. The choice of features and the representation of the features can have a significant impact on the performance of the machine learning model.
In Python code for machine learning, X is often used to represent the input features or the independent variables, while y is used to represent the target or dependent variable. For example, in a simple regression problem, X might represent the predictor variables, such as the size of a house, while y might represent the target variable, such as the price of the house. In a classification problem, X might represent the features of instances, such as the characteristics of a patient, while y might represent the class labels, such as whether a patient has a certain disease or not. In general, X is a 2D array-like object that holds the feature values for each instance in the dataset, while y is a 1D array-like object that holds the corresponding target values for each instance. When training a machine learning model, the goal is to fit a function that maps the input features X to the target values y, as seen in the train/test split example above, where X and y are used as variable names.
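A minimal sketch of those shapes, with made-up numbers:

```python
import numpy as np

# X is 2D: one row per instance, one column per feature
# (hypothetical house sizes in square feet and bedroom counts)
X = np.array([[1400, 3],
              [1600, 3],
              [2100, 4]])

# y is 1D: one target value per instance (hypothetical prices)
y = np.array([245000, 312000, 408000])

print(X.shape)  # (3, 2): 3 instances, 2 features
print(y.shape)  # (3,): one label per instance
```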
Often you will see the equation f(x) which represents a mathematical function that maps input features x to the target or predicted value. The function f can be thought of as a model that is learned from the training data and used to make predictions for new, unseen instances. In the context of supervised learning, the goal is to find the best possible function f that fits the relationship between the input features x and the target values y in the training data. The quality of the function f is usually evaluated based on how well it predicts the target values for instances in the test data. For example, in a linear regression problem, the function f might be a linear combination of the input features, such as f(x) = w_0 + w_1 * x_1 + w_2 * x_2 + … + w_n * x_n, where w are the coefficients that are learned from the training data. In a decision tree, the function f might be a tree structure that represents a series of decisions based on the input features.
In general, the function f represents the learned relationship between the input features and the target values, and it is used to make predictions for new, unseen instances.
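As a small sketch, the linear form f(x) = w_0 + w_1 * x_1 + … + w_n * x_n can be written as a dot product; the weight values here are arbitrary illustration values, not learned from data:

```python
import numpy as np

w0 = 1.5                   # intercept w_0 (made-up value)
w = np.array([2.0, -0.5])  # weights w_1, w_2 (made-up values)

def f(x):
    """f(x) = w_0 + w_1*x_1 + ... + w_n*x_n as a dot product."""
    return w0 + np.dot(w, x)

print(f(np.array([3.0, 4.0])))  # 1.5 + 6.0 - 2.0 = 5.5
```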
Gradient Descent
A typical definition of gradient descent is:
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
However, to a mechanic this definition is still opaque.

The gist of gradient descent is that you want to find the lowest point along the curve, like the creek bed in the forest: water always finds the lowest point to run along, the local minimum. Dealing with gradient descent is like searching for a piece of gold in that creek: you need to step along the creek bed, and if you step too far you will run right past it, but if you go too slowly it will take forever. The size of this stepping through the creek is what is known as the ‘learning rate’ or ‘step size’. The sweet spot, or goldilocks zone, is the point where there is a low value of loss versus weights; you are trying to find that spot without overshooting it. If the bottom of the arc is broad, then your learning rate can be larger; if the gradient of the arc is steep, then a small learning rate is needed. Goodfellow et al. note on gradient-based optimization, “Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).” (Goodfellow et al)
They continue: “The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. … Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms” (Goodfellow et al., Ch 4.3).
Gradient descent is an optimization algorithm used to minimize a function, typically a cost function in machine learning. The algorithm works by iteratively updating the parameters of the model in the opposite direction of the gradient of the cost function with respect to those parameters. The mathematical equation for gradient descent is:
θ = θ − α * ∇θJ(θ)
where θ is the set of parameters being optimized, J(θ) is the cost function, and α is the learning rate, a small positive value that determines the step size at each iteration. The gradient, ∇θJ(θ), is a vector of partial derivatives of J(θ) with respect to each parameter in θ.
The code for θ = θ − α * ∇θJ(θ) in Python would depend on the specific implementation of the variables and function. Here’s an example of how this code could be written, assuming that θ is a numpy array, α is a scalar value, and ∇θJ(θ) is a function that returns a numpy array:
import numpy as np
# Define the function to calculate the gradient of J(θ) with respect to θ (theta)
def grad_J(theta):
    # ... implementation of the gradient calculation ...
    return gradient
# Define the initial value of theta
theta = np.array([1.0, 2.0, 3.0])
# Define the learning rate alpha
alpha = 0.01
# Calculate the gradient of J(θ) with respect to θ
gradient = grad_J(theta)
# Update the value of theta
theta = theta - alpha * gradient
In this example, the grad_J function takes the current value of theta as input and returns the gradient of J(θ) with respect to theta. The theta variable is then updated using the learning rate alpha and the gradient of J(θ) with respect to theta, which is calculated using the grad_J function. The updated value of theta is stored back in the theta variable. Note that you would need to replace grad_J with the actual implementation of the gradient calculation for your specific problem.
The algorithm stops when a local or global minimum of the cost function is reached. The terms “local minimum” and “global minimum” refer to the minima of a cost function that is being optimized. The cost function is used to evaluate the performance of a machine learning model and its goal is to minimize this function. A local minimum is a minimum value of the cost function that is only the minimum within a certain region, or “neighborhood”, of the parameter space. In other words, it’s a minimum that is only the lowest compared to its nearby values. A global minimum is the minimum value of the cost function that is the minimum value throughout the entire parameter space. It’s the absolute lowest value that the cost function can take. In optimization problems, it’s often desirable to find the global minimum, as this corresponds to the best possible solution to the problem. However, finding the global minimum can be difficult, especially for complex and high-dimensional problems. In some cases, optimization algorithms may get stuck in a local minimum and fail to find the global minimum, leading to suboptimal solutions.
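To make the local-versus-global distinction concrete, here is a small sketch using a non-convex function chosen purely for illustration; depending on where it starts, plain gradient descent settles into a different minimum:

```python
def grad(theta):
    # Derivative of f(theta) = theta**4 - 3*theta**2 + theta, a
    # non-convex function with one local and one global minimum
    return 4 * theta**3 - 6 * theta + 1

def descend(theta, alpha=0.01, n_iter=1000):
    # Plain gradient descent: theta = theta - alpha * grad(theta)
    for _ in range(n_iter):
        theta = theta - alpha * grad(theta)
    return theta

# Starting on the right, we get stuck in the local minimum (~1.13);
# starting on the left, we reach the global minimum (~-1.30).
print(descend(2.0))
print(descend(-2.0))
```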
There are different variants of gradient descent, such as batch gradient descent, which uses the entire training set to compute the gradient at each iteration, and stochastic gradient descent, which uses a single example to compute the gradient at each iteration.
Linear gradient descent is a specific variant of the general gradient descent algorithm that is used to optimize linear models such as linear regression. The algorithm works by iteratively updating the parameters of the model, typically the coefficients of the linear equation, in the opposite direction of the gradient of the cost function with respect to those parameters. The cost function used in linear gradient descent is typically the mean squared error (MSE) between the predicted output and the true output. Linear gradient descent is a simple and efficient algorithm for linear models, but it can be sensitive to the choice of the learning rate and can get stuck in local minima. Alternative optimization techniques like batch gradient descent and stochastic gradient descent can also be applied to linear regression models.
Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm that is used for optimization of large datasets. Unlike batch gradient descent, which computes the gradient based on the average of the gradients of all the data points, SGD computes the gradient based on a single data point at a time. This makes it computationally more efficient and can also help with escaping local minima. The mathematical equation for stochastic gradient descent is the same as that for gradient descent.
Here is an example of how the SGD algorithm can be implemented in Python for linear regression:
import numpy as np
# Initialize parameters
theta = np.random.randn(2)
# Define the learning rate
alpha = 0.01
# Number of iterations
n_iter = 1000
# Loop through the number of iterations
for i in range(n_iter):
    # Select a random data point
    rand_index = np.random.randint(len(X))
    xi = X[rand_index:rand_index+1]
    yi = y[rand_index:rand_index+1]
    # Compute the gradient of the squared error for this data point
    gradient = 2 * xi.T.dot(xi.dot(theta) - yi)
    # Update the parameters
    theta = theta - alpha * gradient
In this example, X is the feature matrix and y is the target vector, theta is the parameters vector, alpha is the learning rate and n_iter is the number of iterations.
It is worth noting that, in practice, the learning rate is often decreased over time to help the algorithm converge more efficiently; see more on the importance of the learning rate below. Additionally, there are more advanced versions of the SGD algorithm, like Mini-batch Gradient Descent, which uses a small batch of examples at each iteration instead of a single example, and adaptive gradient descent methods like Adagrad, Adadelta, and Adam, which automatically adapt the learning rate on a per-parameter basis; in PyTorch these are available as optimizers and learning-rate schedulers. You can use any of the above methods to improve the performance of your model.
Adagrad, Adadelta, and Adam are three popular optimization algorithms used in machine learning to adjust the learning rate and improve the efficiency of gradient descent.
Adagrad: Adagrad stands for “adaptive gradient” and is an optimization algorithm that adapts the learning rate for each parameter based on its historical gradient information. It gives more weight to parameters with sparse gradients and less weight to parameters with frequent gradients. Adagrad has proven to be effective in handling sparse data and is commonly used in natural language processing and recommendation systems.
Adadelta: Adadelta is an extension of Adagrad that seeks to address its aggressive and monotonically decreasing learning rate. Adadelta replaces the learning rate with a running average of the past gradients and past squared gradients, which allows it to adapt to changing gradients and keep the learning rate from becoming too small. Adadelta has been shown to be effective in training deep neural networks.
Adam: Adam stands for “adaptive moment estimation” and is a popular optimization algorithm that combines the benefits of both Adagrad and Adadelta. Adam uses a combination of the first and second moments of the gradients to adapt the learning rate for each parameter. It also introduces bias correction to reduce the effect of the initial gradient estimates. Adam has been shown to be effective in training deep neural networks and is commonly used in computer vision, natural language processing, and other areas of machine learning.
In general, these optimization algorithms are used to overcome the limitations of traditional gradient descent methods, which can suffer from slow convergence or get stuck in local minima. By adapting the learning rate and taking into account historical gradient information, Adagrad, Adadelta, and Adam can accelerate the learning process and improve the accuracy of the resulting models.
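As a rough sketch of the core idea (not a faithful reimplementation of any library’s optimizer), the Adagrad update accumulates each parameter’s squared gradients and divides the learning rate by their square root, so parameters with large, frequent gradients take smaller steps:

```python
import numpy as np

def adagrad_step(theta, gradient, accum, alpha=0.1, eps=1e-8):
    """One Adagrad update: each parameter gets its own effective
    learning rate alpha / sqrt(sum of its past squared gradients)."""
    accum = accum + gradient ** 2
    theta = theta - alpha * gradient / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.array([1.0, 1.0])
accum = np.zeros_like(theta)
# The parameter with the large gradient is damped far more heavily
# than a plain SGD step of alpha * gradient would be:
theta, accum = adagrad_step(theta, np.array([10.0, 0.1]), accum)
print(theta)
```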
Convergence in gradient descent refers to the point at which the optimization algorithm has found an optimal set of parameter values such that the loss function is minimized or optimized to a satisfactory degree. In other words, it is the point where further iterations of the algorithm no longer lead to significant improvement in the loss function. Gradient descent is an iterative optimization algorithm that seeks to find the optimal set of parameter values by adjusting the values in the direction of the steepest descent of the loss function. At each iteration, the algorithm calculates the gradient of the loss function with respect to the parameters and updates the parameter values by taking a step in the opposite direction of the gradient. This process is repeated until convergence is reached.
There are several ways to determine convergence in gradient descent. One common approach is to monitor the change in the loss function or the parameter values between iterations. If the change falls below a certain threshold, the algorithm is considered to have converged. Another approach is to set a maximum number of iterations and terminate the algorithm when that limit is reached. It is important to note that reaching convergence does not necessarily mean that the solution found is the global minimum of the loss function. In fact, in many cases, gradient descent may only converge to a local minimum or a saddle point. To address this issue, various modifications to the gradient descent algorithm have been proposed, such as using different initialization values, optimizing with respect to a subset of the parameters, or using more sophisticated optimization algorithms like Adagrad, Adadelta, and Adam.
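A minimal sketch of the first approach, monitoring the size of the update and stopping once it falls below a threshold, on a simple quadratic chosen purely for illustration:

```python
def grad(theta):
    # Gradient of J(theta) = (theta - 3)**2, minimized at theta = 3
    return 2 * (theta - 3)

theta, alpha, tol, max_iter = 0.0, 0.1, 1e-6, 10000
for _ in range(max_iter):
    step = alpha * grad(theta)
    theta = theta - step
    # Converged: further iterations no longer change theta significantly
    if abs(step) < tol:
        break

print(round(theta, 4))  # ~3.0, reached long before max_iter
```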
Loss Functions
Loss function (also known as a cost function or objective function) is a function that measures the difference between the predicted values of a model and the actual values of the training data. The goal of a machine learning model is to minimize the loss function in order to improve its predictive accuracy. There are many different types of loss functions, and the choice of which one to use depends on the specific problem being solved. Some of the most common loss functions include:
Mean Squared Error (MSE): This is the most commonly used loss function for regression problems. It measures the average squared difference between the predicted and actual values.
Binary Cross-Entropy: This loss function is used for binary classification problems, where the output is either 0 or 1. It measures the difference between the predicted and actual probabilities.
Categorical Cross-Entropy: This loss function is used for multi-class classification problems, where the output can take on more than two values. It measures the difference between the predicted and actual probability distributions.
Hinge Loss: This loss function is commonly used for support vector machines (SVMs) in binary classification problems. It penalizes misclassifications and encourages the model to separate the data points with a large margin.
KL Divergence: This loss function is used in generative models such as variational autoencoders (VAEs) to measure the difference between the predicted distribution and the actual distribution of the data.
Huber Loss: This loss function is a combination of mean squared error and mean absolute error. It is more robust to outliers than MSE and is often used in regression problems.
The choice of which loss function to use is an important consideration when designing a machine learning model. Different loss functions may lead to different optimal parameter values and affect the performance of the model. It is important to select a loss function that is appropriate for the problem being solved and to monitor its behavior during training in order to adjust the model accordingly.
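As a sketch, two of these losses written out directly in NumPy, following the standard definitions (sample values are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference
    return np.mean((y_true - y_pred) ** 2)

def huber(y_true, y_pred, delta=1.0):
    # Huber loss: quadratic for errors below delta, linear above it,
    # so a single outlier is penalized less harshly than under MSE
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 6.0])
print(mse(y_true, y_pred))    # (0.25 + 0 + 9) / 3
print(huber(y_true, y_pred))  # the outlier at index 2 enters linearly
```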
Models
A model defines the relationship between features and label. Two phases of a model’s life:
Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y’). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
A model in machine learning is a mathematical representation of a system or process that is used to make predictions or decisions. A model is trained using a set of input data and corresponding output data, and the goal of training is to find the set of parameters that best fit the data.
Hyperparameters are parameters that are set before the training of a model, unlike the parameters that are learned during the training process. They are used to control the behavior of the model and the learning algorithm. Examples of hyperparameters include the learning rate, the number of hidden layers in a neural network, and the number of trees in a random forest. Hyperparameters can have a significant impact on the performance of a model. Choosing appropriate hyperparameters can lead to a better-performing model, while choosing inappropriate ones can result in a model that underfits or overfits the data. Hyperparameter tuning is the process of finding the best set of hyperparameters for a given task and data. Hyperparameter tuning is an important aspect of effective ML. You can use GridSearchCV to optimize boosted decision trees, and in PyTorch you can use a library like Optuna to optimize PyTorch training hyperparameters.
Here is an example of training a model with hyperparameters and optimizing the hyperparameters with a grid search algorithm in Python using the scikit-learn library:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define the hyperparameter grid
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 5, 10]}
# Create the model
rf = RandomForestClassifier()
# Create the grid search object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
# Print the best hyperparameters
print("Best Hyperparameters: ",grid_search.best_params_)
In this example, we are using a random forest classifier and tuning the n_estimators and max_depth hyperparameters. The GridSearchCV object is used to perform a grid search over the defined hyperparameter grid and find the best set of hyperparameters. The fit method is used to fit the grid search to the training data, and the best_params_ attribute is used to print the best hyperparameters found.
Supervised vs. Unsupervised Learning
In supervised learning a label is the thing we’re predicting—the y variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.
One example of unsupervised learning is clustering (e.g., K-Means). Cluster analysis seeks to ascertain, on the basis of X1,…,Xn, whether the observations fall into relatively distinct groups. In semi-supervised learning, some instances have labels and others do not.
Variables (Features, Parameters, Coefficients)
A feature is an input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features.
Variables are either quantitative (numerical, associated with regression) or qualitative (categorical, associated with classification). There are exceptions: least squares linear regression is used with quantitative responses, while logistic regression is used with a qualitative (two-class, or binary) response such as male or female; because it estimates class probabilities, it can be thought of as a regression method as well. Whether the predictors are qualitative or quantitative is considered less important.
Hyperparameters
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. The number of iterations is another important hyperparameter in linear regression. It is important not to confuse hyperparameters with model parameters. A very good explanation of hyperparameters is given by Xavier Amatriain:
In machine learning, we use the term hyperparameter to distinguish from standard model parameters, so it is worth first understanding what those are. Whereas ordinary variables are used in the logic of an algorithm, hyperparameters are the variables that govern how the algorithm itself runs.
A machine learning model is the definition of a mathematical formula with a number of parameters that need to be learned from the data. That is the crux of machine learning: fitting a model to the data. This is done through a process known as model training. In other words, by training a model with existing data, we are able to fit the model parameters.
However, there is another kind of parameters that cannot be directly learned from the regular training process. These parameters express “higher-level” properties of the model such as its complexity or how fast it should learn. They are called hyperparameters. Hyperparameters are usually fixed before the actual training process begins.
So, how are hyperparameters decided? That is probably beyond the scope of this question, but suffice to say that, broadly speaking, this is done by setting different values for those hyperparameters, training different models, and deciding which ones work best by testing them.
So, to summarize. Hyperparameters: define higher-level concepts about the model, such as complexity or capacity to learn. They cannot be learned directly from the data in the standard model training process and need to be predefined. They can be decided by setting different values, training different models, and choosing the values that test best. Some examples of hyperparameters:
- Number of leaves or depth of a tree
- Number of latent factors in a matrix factorization
- Learning rate (in many models), a very important hyperparameter that can make or break a model.
- Number of hidden layers in a deep neural network
- Number of clusters in a k-means clustering
Loss and Measuring Quality of Fit
Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
In order to evaluate the performance of a statistical learning method on a given data set, we need a measure that quantifies the extent to which the predicted response value for a given observation is close to the true response value for that observation. There are different kinds of evaluation metrics: MAE, the Mean Absolute Error, is the mean of the absolute errors; MSE, the Mean Squared Error, is the mean of the squared errors; and RMSE, the Root Mean Squared Error, is the square root of the MSE. Also see regularization below.
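A minimal sketch of these three metrics computed directly in NumPy (values are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # actual responses
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # model predictions

mae = np.mean(np.abs(y_true - y_pred))  # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)   # Mean Squared Error
rmse = np.sqrt(mse)                     # Root Mean Squared Error

print(mae, mse, rmse)
```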
The Bias-Variance Trade off
In order to minimize the expected error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.
Bias refers to the error that is introduced by approximating a real-life problem with a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X1, X2, …, Xn, and it is unlikely that any real-life problem truly has such a simple linear relationship. Variance, by contrast, refers to the amount by which the estimated model would change if it were estimated using a different training data set.
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends initially to decrease faster than the variance increases; consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens, the test MSE increases.
The relationship between bias, variance, and MSE is known as the bias-variance trade-off.
Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. One of the many applications of Bayes’ Theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes’ theorem may have different probabilistic interpretations. With the Bayesian probability interpretation, the theorem expresses how a subjective degree of belief should rationally change to account for the availability of related evidence.
Bias-Unbiased
The term “bias” refers to the error introduced by approximating a real-world problem with a simplified model. The bias of a model is the difference between the expected value of the predictions and the true values of the data. A model with high bias tends to be too simplistic and may underfit the data, while a model with low bias may overfit the data. On the other hand, the term “unbiased” in machine learning typically refers to an estimator that has an expected value that is equal to the true value of the parameter being estimated. In other words, an unbiased estimator does not systematically overestimate or underestimate the parameter.
The relationship between these two concepts and the parameter μ depends on the specific problem being addressed. In some cases, the parameter μ may represent the true value of a population parameter, such as the mean or variance of a distribution. In other cases, μ may represent the optimal parameter values for a machine learning model. If the goal is to estimate the parameter μ using a machine learning model, the bias of the model will depend on how well the model captures the underlying structure of the data. A model with high bias will tend to underestimate or overestimate μ, leading to a biased estimator. An unbiased estimator, on the other hand, will have an expected value that is equal to μ, regardless of the complexity of the model.
If we use the sample mean μ̂ to estimate μ, this estimate is unbiased in the sense that, on average, we expect μ̂ to equal μ. On one set of observations y1, …, yn, μ̂ might overestimate μ; on another set of observations it might underestimate it; but averaged over a large number of sets of observations, the estimates center on the true value.
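This can be checked with a small simulation; the population parameters (μ = 5, σ = 2) and sample sizes below are illustrative:

```python
# Unbiasedness of the sample mean: each individual estimate scatters
# around mu, but the average over many samples is very close to mu.
import numpy as np

rng = np.random.default_rng(42)
mu = 5.0
# 10,000 independent samples of size 30 from a N(mu, 2^2) population
estimates = [rng.normal(mu, 2.0, size=30).mean() for _ in range(10_000)]
print(min(estimates), max(estimates))   # individual estimates over- and under-shoot
print(np.mean(estimates))               # their average is very close to mu
```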
It is important to realize that one can get stuck in optimization problems in what are known as local and global maxima and minima. Machine learning algorithms such as gradient descent may get stuck in local minima during the training of models. Gradient descent typically finds local minima rather than global minima because the gradient only describes the loss surface in the immediate neighborhood of the current point; following the locally steepest descent direction says nothing about whether a deeper minimum exists elsewhere. Current techniques to find global minima either require extremely high iteration counts or a large number of random restarts for good performance. Global optimization problems can also be quite difficult when high loss barriers exist between local minima.
Any greedy algorithm, meaning one that tries to optimize the objective by choosing whichever option is locally best, may get stuck in local optima.
These include, but are not limited to:
- gradient descent optimization, as in neural networks, which gradually minimizes the loss function by making small changes to the weights in the direction that minimizes the loss. This also includes algorithms like gradient boosting. Such problems can be mitigated by using second-order derivatives (the Hessian) or smart heuristics.
- one-step look-ahead algorithms like decision trees, which locally choose the feature x and threshold t whose split x<t maximizes/minimizes a given metric like Gini impurity or entropy (and, by extension, random forests). If a decision tree were able to see the final impact of all possible combinations of splits, it might make different choices, but that would be computationally infeasible.
- expectation-maximization (as used by e.g. k-means) is also heavily influenced by the initialization.
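The gradient-descent failure mode above can be seen on a one-dimensional toy function; the function, learning rate, and starting points are illustrative choices:

```python
# Plain gradient descent trapped in a local minimum.
# f(x) = x^4 - 3x^2 + x has a local minimum near x ~ 1.13 and a lower
# global minimum near x ~ -1.30; which one we reach depends entirely
# on the starting point.
def grad(x):                        # f'(x) = 4x^3 - 6x + 1
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(2.0))     # settles in the local minimum (~1.13)
print(descend(-2.0))    # settles in the global minimum (~-1.30)
```

The descent from x = 2.0 never learns that a deeper minimum exists on the other side of the barrier; a random restart from x = -2.0 finds it.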
Predictions
Three kinds of uncertainty are associated with prediction:
a. The coefficient estimates define the least squares plane, which is only an estimate of the true population regression plane. The inaccuracy in the coefficient estimates is related to the reducible error. We can compute a confidence interval in order to determine how close ŷ will be to f(X).
b. In practice, assuming a linear model for f(X) is almost always an approximation of reality, which introduces model bias: when we use a linear model, we are in fact estimating the best linear approximation to the true surface, ignoring its real topology.
c. Even if we knew f(X), the response value could not be predicted perfectly because of the random error ε, the irreducible error.
Prep and Clean the Data
A key step before running a linear regression on data is known as EDA. Exploratory Data Analysis (EDA) is an essential step in the machine learning process that provides valuable insights into the underlying patterns and relationships within a given dataset. It involves a thorough examination of the data to understand its characteristics and identify any potential issues or biases that may impact the modeling process. The goal of EDA is to gain a deeper understanding of the data and to inform the selection of appropriate models and algorithms for a given problem. One of the key advantages of EDA is that it helps to identify any missing or incorrect data, which can be crucial in avoiding incorrect or suboptimal results. This is particularly important in the case of large and complex datasets, where manual checks may not be feasible. By identifying and correcting any issues in the data, EDA ensures that the modeling process is based on accurate and reliable data, leading to more accurate predictions and improved model performance.
EDA also helps to identify patterns and relationships within the data, which can inform the selection of appropriate algorithms and models. For example, it may reveal correlations between variables that can be leveraged to improve the accuracy of predictive models. In addition, EDA can also provide insights into the distribution of the data, allowing for appropriate pre-processing and normalization steps to be taken. Another important aspect of EDA is that it helps to gain a deeper understanding of the problem being solved. This can be particularly useful in real-world applications where domain knowledge and expert insights are essential for effective problem-solving. By exploring the data and identifying patterns and relationships, EDA can provide valuable insights into the problem and help to guide the modeling process.
Visualizations are a good way to see the big picture of your data and understand the relationships in the data and identify any patterns or correlations. This information can be used to inform and guide the modeling process, leading to more accurate and effective results.
Matplotlib is a widely-used data visualization library in the Python programming language that provides a wide range of visualization capabilities. With its rich set of plotting functions and customization options, Matplotlib enables the creation of a wide range of visual representations, including line plots, scatter plots, bar charts, histograms, and more.
One of the key benefits of using Matplotlib for data visualization is its versatility. Matplotlib can be used to create simple plots and charts, as well as more complex visualizations, making it an ideal tool for exploring and understanding data in a wide range of applications. For example, scatter plots can be used to identify correlations between variables, while histograms can provide insights into the distribution of the data.
In addition to its versatility, Matplotlib also provides a high level of customization, enabling users to create visualizations that are tailored to their specific needs. This can include customizing the appearance of the plot, such as the axis labels, title, and color palette, as well as adding annotations and highlighting specific data points. Another important aspect of using Matplotlib for data visualization is its integration with other libraries and tools. For example, Matplotlib can be used in conjunction with other libraries such as NumPy and Pandas to create complex visualizations that leverage the full capabilities of these libraries. Additionally, Matplotlib provides the ability to export visualizations in a wide range of formats, including PNG, PDF, and SVG, making it easy to share visualizations and communicate insights.

Line plot example using matplotlib, from https://www.kaggle.com/code/kanncaa1/data-sciencetutorial-for-beginners. Such a plot can be coded as follows:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# The Kaggle tutorial loads the Pokemon dataset; the file path is an assumption
data = pd.read_csv('pokemon.csv')
data.Speed.plot(kind='line', color='g', label='Speed', linewidth=1, alpha=0.5, grid=True, linestyle=':')
data.Defense.plot(color='r', label='Defense', linewidth=1, alpha=0.5, grid=True, linestyle='-.')
plt.legend(loc='upper right') # legend = puts label into plot
plt.xlabel('x axis') # label = name of label
plt.ylabel('y axis')
plt.title('Line Plot') # title = title of plot
plt.show()
Here are some common steps in exploratory data analysis (EDA) for machine learning:
- Understand the problem and the data: Before beginning the EDA process, it is important to have a clear understanding of the problem you are trying to solve and the data you are working with. This includes understanding the features, the target variable, and any domain-specific considerations.
- Import the data: The first step in EDA is to import the data into your environment. This can be done using Python libraries like Pandas, NumPy, or Scikit-learn.
- Clean the data: Once the data is imported, it is important to clean it to remove any missing or irrelevant data, and correct any errors or inconsistencies.
- Visualize the data: Data visualization is a powerful tool for understanding the distribution of the data, identifying patterns, and detecting outliers or anomalies. Common visualization techniques include histograms, scatterplots, boxplots, and heatmaps.
- Check for correlations: Correlations between features can help to identify relationships between the variables and guide feature selection. Pearson correlation coefficient, Spearman rank correlation, and Kendall rank correlation are some of the commonly used correlation measures.
- Feature engineering: Feature engineering is the process of creating new features from existing ones or selecting a subset of features for the model. It is important to select relevant features and remove redundant or noisy ones.
- Check for imbalanced data: In some cases, the target variable may be imbalanced, with one class having many more instances than the other. This can lead to biased models and reduced predictive power. It is important to identify and address any imbalances in the data.
- Check for outliers: Outliers are extreme values that can have a disproportionate impact on the model. It is important to identify and handle outliers appropriately to ensure that they do not adversely affect the model’s performance.
- Summarize the findings: Finally, it is important to summarize the findings of the EDA process in a clear and concise manner, including any insights gained, any issues identified, and any recommendations for future work.
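A few of the steps above (cleaning, distribution summaries, correlations, imbalance checks) can be sketched in pandas; the tiny DataFrame and column names below are placeholders for real data loaded with pd.read_csv:

```python
# Minimal EDA sketch: inspect missing values, clean, summarize,
# check correlations and class balance.
import numpy as np
import pandas as pd

# Placeholder dataset; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "feature_b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "target":    [0, 0, 0, 1, 1, 1],
})

print(df.isna().sum())              # missing data per column
df = df.dropna()                    # one simple cleaning choice
print(df.describe())                # distribution summary
print(df.corr())                    # check for (Pearson) correlations
print(df["target"].value_counts())  # check for class imbalance
```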
After you have performed this initial step you are then ready to train a linear regression model using a regression algorithm.
A Recipe for a Linear Regression Algorithm
1. Train test split: The first step is to split the data into a training set and a test set. The training set is used to train the model, while the test set is used to evaluate the model’s performance on unseen data. The scikit-learn library in Python provides a convenient way to split the data into training and test sets.
2. Create and train the model: The next step is to create a linear regression model and train it on the training data. In Python, this can be done using the scikit-learn library.
3. Fit the data: Once the model is created, it needs to be fit to the training data. This involves finding the optimal values of the model parameters that minimize the difference between the predicted and actual values.
4. Evaluate the model: After fitting the model to the training data, the next step is to evaluate its performance on the test data by comparing its predictions against the actual values it has not seen.
5. Check coefficients: The coefficients of the model represent the strength and direction of the relationship between each feature in the dataset and the target variable. By examining these coefficients, we can gain insights into the factors that are driving the model’s predictions.
6. Make predictions: Once the model is trained and evaluated, we can use it to make predictions on new data. This involves applying the model to the feature values of the new data and predicting the target variable.
7. Check metrics: Finally, we need to check the metrics of the model to evaluate its performance on the test data. For a regression model these are measures such as mean squared error, root mean squared error, and R²; metrics such as accuracy, precision, and the confusion matrix apply to classification models. This will give us a sense of how well the model is performing and whether it needs to be improved or refined.
An example of a simple linear regression algorithm in Python using the scikit-learn library:
# Import libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create a linear regression model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
# Evaluate the model on the test data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
# Print the coefficients and metrics
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean squared error:", mse)
In this example, we first load a small dataset with five points, where the input feature X is the integers 1 through 5 and the target variable y is the corresponding even integers. We then split the data into training and test sets using the train_test_split function from scikit-learn. Next, we create a LinearRegression model and fit it to the training data using the fit method. We then use the model to make predictions on the test data using the predict method, and evaluate the performance of the model using the mean squared error metric. Finally, we print the coefficients of the model and the mean squared error.
Potential Problems of Linear Regression
There are several tricky areas related to linear regression, some of which are covered below. It is important to understand that linear regression is a mathematical methodology for dealing with quantifiable objects: it relates numerical quantities to other numerical quantities.
1. Non-linearity of the response-predictor relationships
The phrase “non-linearity of the response-predictor relationships” refers to the fact that the relationship between the input features (also called predictors or independent variables) and the target variable (also called the response or dependent variable) may not be linear, but rather nonlinear.
Linear models assume that the relationship between the input features and the target variable is linear, which means that the target variable can be modeled as a linear combination of the input features, with a fixed slope and intercept. However, in many real-world scenarios, the relationship between the input features and the target variable is more complex and nonlinear. For example, in image recognition, the relationship between the pixel values of an image and the object in the image is highly nonlinear.
Nonlinear models can capture these complex relationships by using nonlinear functions of the input features, such as polynomial or exponential functions, or by using more complex models like decision trees or neural networks. These models can capture the nonlinear relationships between the input features and the target variable, and may provide better predictive power than linear models.
If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are questionable, thus the prediction accuracy can be reduced.
Residual plots are a useful tool for detecting non-linearity. If the residual plots indicate non-linear associations in the data, a simple approach is to use a non-linear transformation of the predictors, such as log X, √X, or X², in the regression model.
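A sketch of this idea on synthetic data whose true relationship is quadratic; adding a transformed predictor markedly reduces the (in-sample) error, while the choice of transformations and data is illustrative:

```python
# Comparing a plain linear fit against fits with transformed predictors
# when the true relationship is quadratic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 3.0 + 0.5 * x**2 + rng.normal(0, 1.0, 200)   # truly quadratic

mses = {}
for name, feats in {
    "X only":       x.reshape(-1, 1),
    "X and X^2":    np.column_stack([x, x**2]),
    "X and log(X)": np.column_stack([x, np.log(x)]),
}.items():
    model = LinearRegression().fit(feats, y)
    mses[name] = mean_squared_error(y, model.predict(feats))
    print(name, round(mses[name], 3))
```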
2. Correlation of Error Terms
An important assumption in linear regression is that the error terms ε1, ε2, …, εn are uncorrelated. If there is correlation, the estimated standard errors will underestimate the true standard errors. Correlated errors often occur in time series data, where they can be detected as tracking, i.e. adjacent residuals having similar values.
3. Non-Constant Variance of Error Terms
Another important assumption of the linear regression model is that the error terms have constant variance. The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption, but in practice the variance is often non-constant. The assumption of constant variance of error terms is important in many models, especially linear regression models. The error term is the difference between the predicted values of the target variable and the actual values. If the error term has a constant variance, it means that the spread of the errors is roughly the same across all values of the predictor variables.
The constant variance assumption is also known as homoscedasticity. Homoscedasticity is important because it ensures that the model is consistent across the entire range of predictor variables, and that the model is not overly sensitive to outliers or extreme values.
In contrast, when the variance of the error term changes across different levels of predictor variables, it is known as heteroscedasticity. Heteroscedasticity can lead to biased and inconsistent estimates of the model parameters, and can also affect the accuracy and reliability of the predictions.
To check for homoscedasticity, analysts can plot the residuals (the difference between the actual and predicted values) against the predicted values. A scatter plot of the residuals should show no obvious pattern, and the spread of the residuals should be roughly constant across all levels of the predictor variables.
If heteroscedasticity is detected, analysts may need to use techniques such as weighted least squares or generalized least squares to correct for it. Alternatively, they may need to transform the data to achieve a more constant variance, or use different models that are more robust to heteroscedasticity.
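The residual-vs-fitted check described above can be sketched as follows, using synthetic data whose noise deliberately grows with the predictor (producing the classic “funnel” pattern):

```python
# Checking homoscedasticity with a residual-vs-fitted plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200).reshape(-1, 1)
# Noise standard deviation grows with x: deliberate heteroscedasticity
y = 1.0 + 2.0 * x.ravel() + rng.normal(0, 0.5 * x.ravel())

model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs fitted (funnel shape = heteroscedasticity)')
plt.show()
```

A homoscedastic model would show a band of roughly constant width; here the spread of the residuals visibly widens as the fitted values grow.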
4. Outliers
An outlier is a point for which the observed response y is far from the value predicted by the model. Residual plots help identify outliers; we can plot the studentized residuals, computed by dividing each residual e_i by its estimated standard error. Observations whose studentized residuals exceed 3 in absolute value are possible outliers. If an outlier is due to a data recording error, it can simply be removed. More generally, an outlier is an observation that differs significantly from other observations in the dataset: a data point that is unusually far from the majority, either in its values or its behavior.
Outliers can occur for many reasons, including measurement errors, data entry errors, or legitimate but extreme values. Outliers can have a significant impact on the results of machine learning models, particularly on models that are sensitive to the presence of extreme values. Outliers can be detected using various statistical methods, such as the z-score or the interquartile range (IQR). The z-score is a measure of how many standard deviations an observation is away from the mean of the data, while the IQR is a measure of the spread of the data that is less sensitive to extreme values than the standard deviation.
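The z-score and IQR rules just described can be sketched as follows; the data values and thresholds (|z| > 3, 1.5 × IQR) are conventional but illustrative:

```python
# Outlier detection with z-scores and the interquartile range (IQR).
import numpy as np

# Seven typical values plus one suspicious extreme value
data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 25.0])

# z-score rule: flag |z| > 3. With so few points, the outlier itself
# inflates the standard deviation, so this rule can miss it here —
# exactly why the IQR is described as less sensitive to extreme values.
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])
```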
Once outliers are identified, there are several options for how to handle them in machine learning. One approach is to remove the outliers from the dataset entirely, which can improve the performance of some models. Another approach is to treat the outliers as a separate category, which can be useful in some cases where the outliers represent a distinct group of observations. Alternatively, some models can be designed to be more robust to the presence of outliers, for example by using techniques like robust regression or decision trees.
5. High Leverage Points
Observations with high leverage have an unusual value for a predictor x, and they can have a large impact on the estimated regression line. The leverage statistic is used to compute an observation’s leverage: “leverage” refers to the extent to which an individual observation affects the model’s estimates of the parameters. An observation with high leverage has a value for a predictor variable that is unusual compared to the other observations in the dataset. High leverage observations can have a significant impact on the results of a model, particularly if they are also outliers.
The presence of high leverage points can affect the accuracy and reliability of a model, particularly linear regression models. In linear regression, the influence of an observation on the model’s estimates of the parameters is proportional to the distance of the observation’s predictor value from the mean predictor value. Therefore, high leverage points can have a large influence on the model’s estimates, particularly if they are also outliers.
To identify high leverage points, analysts can examine the studentized residuals, which are residuals that have been adjusted for their leverage. A plot of studentized residuals against the leverages can help identify observations with high leverage. If high leverage points are identified, there are several options for how to handle them in machine learning. One approach is to remove the high leverage points from the dataset entirely, which can improve the performance of some models. Another approach is to transform the predictor variables to reduce the influence of high leverage points, or to use robust regression techniques that are less sensitive to the presence of high leverage points.
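Leverage can be computed directly as the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ. A small NumPy sketch with one deliberately unusual predictor value (the data is synthetic):

```python
# Leverage as the diagonal of the hat matrix. The diagonal entries sum
# to the number of parameters (here 2: intercept and slope).
import numpy as np

rng = np.random.default_rng(3)
x = np.append(rng.uniform(0, 1, 20), 10.0)   # 10.0 is a high-leverage x value
X = np.column_stack([np.ones_like(x), x])    # add intercept column

H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
print(leverage.round(3))
print("highest leverage at index", leverage.argmax())   # the unusual point
```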
6. Collinearity
Collinearity (also known as multicollinearity) refers to the situation where two or more predictor variables in a model are highly correlated with each other. When two or more predictors are highly correlated, it becomes difficult for the model to determine the individual effects of each predictor variable on the outcome variable, because the effects of each predictor are confounded with the effects of the other predictors. Collinearity can lead to unreliable and unstable estimates of the regression coefficients in a linear regression model, because small changes in the data can lead to large changes in the estimates. In some cases, the collinearity may be so severe that the model cannot be estimated at all.
Collinearity can be detected using several techniques, such as correlation matrices, variance inflation factors (VIFs), and eigenvalues. If collinearity is detected, one option is to remove one or more of the highly correlated predictors from the model. Alternatively, techniques such as ridge regression or principal component regression can be used to reduce the impact of collinearity on the model’s estimates. It is important to note, however, that removing variables or using complex techniques to account for collinearity should be done with care, as they can sometimes result in loss of important information or overfitting the model.
When two or more predictor variables are closely related to each other, the presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. Collinearity reduces the accuracy of the estimates of the regression coefficients: it causes the standard error for β̂ to increase, and the power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is reduced. Detection: look at the correlation matrix of the predictors, where a large value indicates collinearity between a pair of variables; for multicollinearity, assess the variance inflation factor (VIF). As a rule of thumb, a VIF value above 5 or 10 indicates a problematic amount of collinearity.
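The VIF can be sketched by regressing each predictor on the others and computing VIF_j = 1 / (1 − R_j²). The data below is synthetic, with x2 built to be nearly collinear with x1:

```python
# Manual VIF computation: one auxiliary regression per predictor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = []
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)        # all predictors except x_j
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))
    print(f"VIF for x{j + 1}: {vifs[-1]:.2f}")
```

The collinear pair x1/x2 produces very large VIFs, while the independent x3 stays near 1.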
Overfitting and Underfitting
Overfitting and underfitting are two common problems that can occur in machine learning. They occur when a model is trained on a limited dataset and then tested on new data, resulting in poor performance. Overfitting occurs when a model is trained too well on the training data, and as a result, it performs poorly on new data. This happens when the model is too complex and is able to fit the noise in the training data. The model ends up memorizing the training data and is not able to generalize to new examples. This can be identified by a high training accuracy and a low test accuracy.
Underfitting occurs when a model is not trained well enough on the training data and as a result, it performs poorly on new data. This happens when the model is too simple and is not able to capture the underlying pattern in the data. This can be identified by a low training accuracy and a low test accuracy.
To avoid overfitting and underfitting, several techniques can be used such as cross-validation, regularization, and early stopping.
Cross-validation is a technique used to evaluate a model’s performance by dividing the data into training and test sets. The model is trained on the training set and its performance is evaluated on the test set. This helps to identify if the model is overfitting or underfitting.
Regularization is a technique used to prevent overfitting by adding a penalty term to the model’s loss function. This term discourages the model from assigning too much importance to any one feature.
Early stopping is a technique used to prevent overfitting by stopping the training process when the model’s performance on the validation set stops improving.
L2 regularization, also known as weight decay, is a technique used to prevent overfitting in machine learning models. It works by adding a penalty term to the model’s loss function, which discourages the model from assigning too much importance to any one feature. The regularization term is added to the loss function in the form of the sum of the squares of the model’s weights (also known as the L2 norm of the weights). The regularization term is multiplied by a scalar value called the regularization strength or lambda, which controls the amount of regularization applied.
The L2 regularization term has the effect of shrinking the model’s weights towards zero, which can help to reduce the variance in the model’s predictions. This is because small weights are less likely to fit the noise in the training data, resulting in a more generalizable model.
In practice, L2 regularization is often used in conjunction with other techniques such as cross-validation and early stopping to prevent overfitting. It is also commonly used in neural network models, such as in weight decay for training of deep learning models. One important thing to note is that L2 regularization can also have a computational cost, because the added term in the loss function increases the size of the computation graph and the number of parameters to be updated during the optimization process.
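A minimal sketch of the shrinkage effect using scikit-learn’s Ridge estimator, which implements L2-regularized linear regression; the synthetic data and alpha values are illustrative:

```python
# L2 regularization (ridge): larger alpha shrinks the weight vector
# toward zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(0, 0.5, 50)

norms = []
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.sum(model.coef_ ** 2)))   # squared L2 norm of weights
    print(alpha, round(norms[-1], 3))
```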
A confusion matrix is a table that is often used to evaluate the performance of a machine learning model, particularly in the context of classification problems. It is used to compare the predicted class labels of a model with the true class labels of the data. The confusion matrix shows the number of correct and incorrect predictions for each class, allowing for a more detailed analysis of the model’s performance. A confusion matrix is typically represented as a table with rows representing the true class labels and columns representing the predicted class labels. The cells in the table contain the count of observations for each combination of true and predicted class labels.
For example, consider a binary classification problem where the true class labels are “positive” and “negative” and the predicted class labels are also “positive” and “negative”. A confusion matrix for this problem has four cells. With rows representing the true labels, columns the predicted labels, and the positive class listed first, the top left cell holds the true positives, the top right the false negatives, the bottom left the false positives, and the bottom right the true negatives. (Note that scikit-learn’s confusion_matrix sorts the labels, so for 0/1 labels the top left cell holds the true negatives instead.)
The information contained in the confusion matrix can be used to calculate various performance metrics, such as accuracy, precision, recall, and F1-score.
Accuracy is the proportion of correct predictions made by the model. It is calculated as the number of correct predictions (true positives and true negatives) divided by the total number of predictions.
Precision is the proportion of true positive predictions among all positive predictions made by the model. It is calculated as the number of true positives divided by the sum of true positives and false positives.
Recall, also known as sensitivity or true positive rate, is the proportion of true positive predictions among all actual positive observations. It is calculated as the number of true positives divided by the sum of true positives and false negatives.
F1-score is the harmonic mean of precision and recall. It is a measure of the balance between precision and recall, with a high F1-score indicating a good balance.
In addition to these performance metrics, the confusion matrix also allows for an analysis of the model’s behavior across different classes. For example, if a model is frequently misclassifying a certain class, this information can be used to improve the model’s performance on that class.

A confusion matrix is a useful tool for evaluating the performance of a machine learning model, particularly in the context of classification problems. It shows the number of correct and incorrect predictions for each class, allowing for a more detailed analysis of the model’s performance. The information contained in the confusion matrix can be used to calculate various performance metrics, such as accuracy, precision, recall, and F1-score. It also allows for an analysis of the model’s behavior across different classes and can help identify areas for improvement.
A confusion matrix can be generated using the confusion_matrix function from the sklearn.metrics library. Here is an example of how to use this function to generate a confusion matrix for a binary classification problem:
import numpy as np
from sklearn.metrics import confusion_matrix
# True class labels
y_true = np.array([0, 0, 1, 1, 1, 1])
# Predicted class labels (generated by a model)
y_pred = np.array([0, 1, 0, 1, 1, 1])
# Generate confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(cm)
This will output the confusion matrix in the form of a 2x2 numpy array:
[[1 1]
 [1 4]]
The above matrix can be read as:
1 true negative
1 false positive
1 false negative
4 true positives
It is also possible to use PyTorch's torch.tensor instead of a NumPy array.
Once you have the confusion matrix, you can use it to calculate the performance metrics such as accuracy, precision, recall and F1-score.
In PyTorch, the accuracy of a model can be calculated using the torch.sum() and torch.mean() functions. Here is an example of how to calculate the accuracy of a binary classification model:
import torch
# True class labels
y_true = torch.tensor([0, 0, 1, 1, 1, 1], dtype=torch.long)
# Predicted class labels (generated by a model)
y_pred = torch.tensor([0, 1, 0, 1, 1, 1], dtype=torch.long)
# Calculate accuracy
accuracy = torch.mean((y_true == y_pred).float())
print(accuracy)
In the above code, the torch.tensor() function is used to create the true and predicted class labels. The == operator is used to compare the true labels with the predicted labels and create a tensor of the same shape with True for correct predictions and False for incorrect predictions. The .float() method is used to convert the tensor to float and the torch.mean() function is used to calculate the mean of the tensor which is equivalent to the accuracy.
It’s also possible to use torch.sum() function to calculate the number of correct predictions and divide it by the total number of predictions to get the accuracy:
# Calculate accuracy
accuracy = torch.sum(y_true == y_pred).float() / y_true.size(0)
print(accuracy)
It’s important to note that accuracy is not always the best metric to evaluate a model’s performance, particularly in cases where the classes are imbalanced or when the model’s performance is poor. Other metrics such as precision, recall, and F1-score should also be considered to get a more complete picture of the model’s performance.
Bias in machine learning refers to the difference between the expected predictions of a model and the true values in the data. A model with high bias is one that makes strong assumptions about the data and as a result, is not able to capture the complexity of the underlying patterns in the data. A model with high bias will have a high training error and will also perform poorly on unseen data. This is because the model is not able to generalize well from the training data to new examples. High bias models are also known as underfitting models.
In a supervised learning problem, bias can be thought of as the difference between the average prediction of our model and the true values of the output we are trying to predict. A high bias model will have a large difference between the average prediction and the true values, while a low bias model will have a smaller difference.
Bias is often caused by using a model that is too simple for the data or by not having enough data to properly train the model. To reduce bias, one can use more complex models or increase the amount of training data available. For example, using a deep learning model such as a neural network that can learn a more complex representation of the data can help to reduce bias.