Calculating Adjusted R-Squared

Note that regular R-squared can be found using the following method call, where ‘reg’ is the fitted sklearn linear regression object:

reg.score(x, y)

sklearn does not calculate adjusted R-squared, so we have to compute it ourselves once we have R-squared. The equation is below, where n = the number of samples and p = the number of predictors (features).

\[ \bar{R}^2 = 1 - (1 - R^2) \cdot \frac{n-1}{n-p-1} \]

Below is a Python function that will find the Adjusted R-squared:

def adj_r2(x, y):
    # 'reg' is the LinearRegression object fitted above
    r2 = reg.score(x, y)  # regular R-squared
    n = x.shape[0]        # number of samples
    p = x.shape[1]        # number of predictors
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return adjusted_r2
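
For context, here is a minimal end-to-end sketch using the function above; the data is made up and LinearRegression is assumed to be the model in use:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: 100 samples, 2 predictors (hypothetical).
rng = np.random.default_rng(42)
x = rng.normal(size=(100, 2))
y = 3 * x[:, 0] - 2 * x[:, 1] + rng.normal(size=100)

reg = LinearRegression()
reg.fit(x, y)

print(reg.score(x, y))  # regular R-squared
print(adj_r2(x, y))     # adjusted R-squared (slightly lower)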

Feature Scaling (Standardization)

  • The process of transforming data into a standard scale.
\[ \text{standardized variable} = \frac{x - \mu}{\sigma} \]

Where:

  • x = the original variable
  • µ = mean of the original variable
  • σ = the standard deviation of the original variable

This results in:

  • mean = 0
  • standard deviation = 1
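
In practice, sklearn’s StandardScaler applies exactly this transformation. A minimal sketch (the array below is made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # hypothetical data, one column

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)  # (x - mean) / std, column by column

print(x_scaled.mean(axis=0))  # ~0
print(x_scaled.std(axis=0))   # ~1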

Underfitting and Overfitting

  • Overfitting – the model has focused on the particular training set so much that it has “missed the point”: training accuracy is high, but the model has also fit the noise and outliers, so its predictions on new data are erratic and the fitted curve looks chaotic.
  • Underfitting – the model has not captured the underlying logic of the data, so accuracy is low on both the training data and new data. Usually a different model is needed.

Dealing with Missing Values in a Dataset

Rule of Thumb:

  • If 5% or less of your observations have missing values, those observations can simply be removed.
  • Relatedly, one way to deal with outliers is to simply remove the top 1% of observations (see the sketch below).
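
A minimal pandas sketch of both rules; the DataFrame df and the column name ‘price’ are hypothetical:

import pandas as pd

# df is a hypothetical DataFrame with a numeric 'price' column.
df_no_mv = df.dropna()  # drop every row that has a missing value

# Remove the top 1% of observations by keeping everything
# below the 99th percentile of 'price'.
q = df_no_mv['price'].quantile(0.99)
df_clean = df_no_mv[df_no_mv['price'] < q]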

Variance Inflation Factor (VIF)

  • VIF is a statistical measure used to assess multicollinearity between the predictor variables in a regression model. In machine learning, it is commonly used to check for collinearity between the independent variables of a linear regression.
  • Multicollinearity occurs when two or more predictor variables are highly correlated with each other, making it difficult to determine the true effect of each variable on the dependent variable. This can lead to unstable and unreliable regression coefficients.
  • VIF measures the extent to which the variance of an estimated regression coefficient is increased by multicollinearity. A VIF of 1 indicates no correlation between that predictor and the other variables in the model; values greater than 1 indicate some correlation, with higher values indicating stronger correlation.
  • A commonly used rule of thumb is that a VIF greater than 5 indicates significant multicollinearity. In such cases, it is advisable to remove one or more of the highly correlated variables from the model, or to use techniques like principal component analysis (PCA) to reduce the dimensionality of the predictors.
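
For reference, the VIF of predictor i is computed from the R-squared of regressing that predictor on all of the other predictors:

\[ VIF_i = \frac{1}{1 - R_i^2} \]

A minimal sketch using statsmodels; the DataFrame df and its column names are made up:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical DataFrame of independent variables only (no target column).
X = df[['size', 'year', 'mileage']]

# Add a constant so each auxiliary regression includes an intercept.
X_const = sm.add_constant(X)

vif = pd.DataFrame()
vif['feature'] = X_const.columns
vif['VIF'] = [variance_inflation_factor(X_const.values, i)
              for i in range(X_const.shape[1])]
print(vif)  # ignore the 'const' row; check each feature's VIF against 5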