The Linear Regression Model

  • A linear regression is a linear approximation of a causal relationship between two or more variables.
  • Regression models are highly valuable, as they are one of the most common ways to make inferences and predictions.
  • As with many other statistical techniques, regression models help us make predictions about the population based on sample data.
  • The dependent variable, the one being predicted, is denoted y.
  • The predictors are called Independent Variables and are denoted x1, x2, …, xk.

Linear Regression for Population Models

  • y = dependent variable
  • x1 = independent variable
  • β1 = slope
  • β0 = constant
  • ε = error
\[ y = \beta_0 + \beta_1x_1 + \epsilon \]
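For instance (a purely hypothetical model with invented variable names, not taken from any real dataset), income as a function of years of education could be written as:

\[ \text{income} = \beta_0 + \beta_1 \cdot \text{education} + \epsilon \]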

Linear Regression for Sample Models

  • ŷ = estimated (predicted) value of the dependent variable
  • b0 = estimate of the constant β0
  • b1 = estimate of the slope β1
\[ \hat{y} = b_0 + b_1x_1 \]
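As a quick illustration, here is a minimal numpy sketch, using invented data points, that computes b0 and b1 with the standard closed-form formulas for simple regression:

import numpy as np

# Toy data, invented for the example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form estimates for simple linear regression:
# b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
# b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x  # fitted values ŷ
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")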

Correlation vs Regression

Correlation measures the degree of association between two variables; it is a single, symmetric number, so the correlation of x with y equals the correlation of y with x. Regression goes further: it produces an equation describing how the dependent variable changes with the independent variable(s), and it is directional, since regressing y on x is not the same as regressing x on y. In both cases, remember: correlation does not imply causation.
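A small numpy sketch (again with invented data) makes the distinction concrete: the correlation is one symmetric number, while the regression slope depends on which variable is regressed on which:

import numpy as np

# Toy data, invented for the example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]        # correlation: one symmetric number
slope_yx = np.polyfit(x, y, 1)[0]  # slope when y is regressed on x
slope_xy = np.polyfit(y, x, 1)[0]  # slope when x is regressed on y

print(f"r = {r:.4f}")
print(f"slope of y on x = {slope_yx:.4f}")
print(f"slope of x on y = {slope_xy:.4f}")  # not the same: direction matters
# For simple regression, slope of y on x = r * (std(y) / std(x)).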

Python Packages

These are the Python packages that we will most often be using:

import numpy as np 
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
  • numpy – allows us to work with multi-dimensional arrays.
  • pandas – organizes data in tabular form and attaches descriptive labels to rows and columns.
  • scipy – a Python ecosystem containing many tools for scientific calculations.
  • statsmodels – will be used to create regressions.
  • matplotlib – a 2D plotting library (we import its pyplot interface as plt).
  • seaborn – a Python library for drawing statistical graphics.
  • sklearn – scikit-learn, a machine learning library that can also fit regressions.

Decomposition of Variability

Sum of Squares Total (SST)

  • A measure of the total variability of the dataset.
  • The dispersion of the observed variables around the mean.
  • Also known as Total Sum of Squares (TSS).
\[ SST = \sum_{i=1}^{n}(y_i - \overline{y})^2 \]

Sum of Squares Regression (SSR)

  • The dispersion of the predicted variables around the mean.
  • This can be thought of as a measure of how well our regression line fits the data.
  • Also known as Explained Sum of Squares (ESS).
\[ SSR = \sum_{i=1}^{n}(\hat{y}_i - \overline{y})^2 \]

Sum of Squares Error (SSE)

  • The error (residual) e_i is the difference between the observed value y_i and the predicted value ŷ_i.
  • Also known as Residual Sum of Squares (RSS).
\[ SSE = \sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

Connection

The total variability of the data set is equal to the variability explained by the regression line plus the unexplained variability (error).

\[ SST = SSR + SSE \]
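This identity can be checked numerically. The sketch below, with invented data, fits a line using numpy's polyfit and confirms that SST = SSR + SSE (which holds for a least-squares fit that includes an intercept):

import numpy as np

# Toy data, invented for the example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a simple regression and compute the fitted values.
b1, b0 = np.polyfit(x, y, 1)  # polyfit returns [slope, intercept]
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained by the regression
sse = np.sum((y - y_hat) ** 2)         # unexplained (residuals)

print(f"SST = {sst:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")  # matches SST up to rounding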

Ordinary Least Squares (OLS)

OLS (ordinary least squares) is one of the most common methods for estimating the linear regression equation. It picks the coefficients b0 and b1 that minimize the sum of squared errors (SSE), i.e., the line with the smallest total squared vertical distance from the observed points.
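In practice, fitting an OLS regression in Python usually goes through statsmodels. A minimal sketch with invented data:

import numpy as np
import statsmodels.api as sm

# Toy data, invented for the example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = sm.add_constant(x)      # adds the column of 1s for the intercept b0
model = sm.OLS(y, X).fit()  # estimates b0 and b1 by least squares

print(model.params)     # [b0, b1]
print(model.summary())  # full regression table, including R-squared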

Other Methods to Find the Regression Line

  • Generalized Least Squares
  • Maximum Likelihood Estimation
  • Bayesian Regression
  • Kernel Regression
  • Gaussian Process Regression

R-Squared

  • R² = variability explained by the regression divided by the total variability of the dataset.
  • Measures the goodness of fit of the model.
  • Values range from 0 (the regression explains none of the variability) to 1 (it explains all of it).
\[ R^2 = \frac{SSR}{SST} \]
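To connect the formula back to the software output, the sketch below (invented data again) computes SSR/SST by hand and compares it with the R-squared that statsmodels reports; the two values agree:

import numpy as np
import statsmodels.api as sm

# Toy data, invented for the example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = sm.OLS(y, sm.add_constant(x)).fit()
y_hat = model.fittedvalues

ssr = np.sum((y_hat - y.mean()) ** 2)
sst = np.sum((y - y.mean()) ** 2)

print(f"manual R^2      = {ssr / sst:.4f}")
print(f"statsmodels R^2 = {model.rsquared:.4f}")  # same value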
