The Linear Regression Model
- A linear regression is a linear approximation of a causal relationship between two or more variables.
- Regression models are highly valuable, as they are one of the most common ways to make inferences and predictions.
- As with many other statistical techniques, regression models help us make predictions about the population based on sample data.
- The variable we are predicting is the dependent variable, denoted y.
- The predictors are called independent variables and are denoted x1, x2, …, xk.
Linear Regression for Population Models
- y = dependent variable
- x1 = independent variable
- β1 = slope
- β0 = constant
- ε = error
\[ y = \beta_0 + \beta_1x_1 + \epsilon \]
Linear Regression for Sample Models
\[ \hat{y} = b_0 + b_1x_1 \]
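The sample equation above can be sketched with NumPy using the closed-form OLS estimates (b1 is the sample covariance of x and y over the variance of x; b0 follows from the means). The data values are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample data (illustrative values, not from a real dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

# Closed-form OLS estimates of the sample coefficients
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predicted values from the sample regression line: y_hat = b0 + b1 * x1
y_hat = b0 + b1 * x
print(b0, b1)  # 23.1 6.3 for this data
```

Note that the sample equation has no error term: the b's are estimates of the β's, and the residuals are absorbed into the difference between y and ŷ.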
Correlation vs Regression
- Correlation measures the degree to which two variables move together; regression estimates how one variable affects another.
- Correlation is symmetric, while regression assigns distinct roles to the dependent and independent variables.
- Correlation does not imply causation.
Python Packages
These are the Python packages that we will most often be using:
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
- numpy – allows us to work with multi-dimensional arrays.
- pandas – organizes data in tabular form and attaches descriptive labels to rows and columns.
- scipy – a Python ecosystem containing lots of tools for scientific calculations.
- statsmodels – will be used to create regressions.
- matplotlib – a 2D plotting library.
- seaborn – a Python library for drawing statistical graphics.
- sklearn – used to code regressions.
Decomposition of Variability
Sum of Squares Total (SST)
- A measure of the total variability of the dataset.
- The dispersion of the observed values around their mean.
- Also known as Total Sum of Squares (TSS).
\[ SST = \sum_{i=1}^{n}(y_i - \overline{y})^2 \]
Sum of Squares Regression (SSR)
- The dispersion of the predicted values around the mean of the observed values.
- This can be thought of as a measure of how well our regression line fits the data.
- Also known as Explained Sum of Squares (ESS).
\[ SSR = \sum_{i=1}^{n}(\hat{y}_i - \overline{y})^2 \]
Sum of Squares Error (SSE)
- The error (residual) is the difference between the observed value and the predicted value: e_i = y_i − ŷ_i.
- Also known as Residual Sum of Squares (RSS).
\[ SSE = \sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
Connection
The total variability of the data set is equal to the variability explained by the regression line plus the unexplained variability (error).
\[ SST = SSR + SSE \]
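The decomposition above can be verified numerically. This minimal sketch (with made-up data) fits a line by OLS, computes all three sums of squares, and checks that they add up:

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the sample regression line via closed-form OLS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability

# SST = SSR + SSE holds (up to floating-point error) for OLS with an intercept
assert np.isclose(sst, ssr + sse)
print(sst, ssr, sse)  # 6.0 3.6 2.4 for this data
```

The identity relies on the regression line passing through the mean point (x̄, ȳ), which OLS with an intercept guarantees.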
Ordinary Least Squares (OLS)
OLS (ordinary least squares) is one of the most common methods for estimating the linear regression equation.
Other Methods to Find the Regression Line
- Generalized Least Squares
- Maximum Likelihood Estimation
- Bayesian Regression
- Kernel Regression
- Gaussian Process Regression
R-Squared
- R² = variability explained by the regression divided by the total variability of the dataset.
- Measures the goodness of fit of the model.
- Values range from 0 to 1.
\[ R^2 = \frac{SSR}{SST} \]
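Putting the pieces together, R² can be computed directly from the sums of squares. A sketch using the same style of made-up data as above; as a sanity check, for simple linear regression R² equals the squared correlation between x and y:

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the sample regression line via closed-form OLS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sst = np.sum((y - y.mean()) ** 2)      # total variability
r_squared = ssr / sst

# For simple regression, R^2 equals the squared correlation coefficient
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
print(r_squared)  # ≈ 0.6 for this data
```

An R² of 0 means the regression explains none of the variability; 1 means it explains all of it.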