The Linear Regression Model
- A linear regression is a linear approximation of a causal relationship between two or more variables.
- Regression models are highly valuable, as they are one of the most common ways to make inferences and predictions.
- As with many other statistical techniques, regression models help us make predictions about the population based on sample data.
- The variable we are predicting is the dependent variable, denoted y.
- The predictors are called independent variables and are denoted x1, x2, …, xk.
Linear Regression for Population Models
- y = dependent variable
- x1 = independent variable
- β1 = slope
- β0 = constant
- ε = error
\[ y = \beta_0 + \beta_1x_1 + \epsilon \]
Linear Regression for Sample Models
\[ \hat{y} = b_0 + b_1x_1 \]
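The sample equation above can be sketched with NumPy using the closed-form OLS estimates (b1 is the sample covariance of x and y over the variance of x; b0 follows from the means). The data values are made up purely for illustration:

```python
import numpy as np

# Hypothetical sample data (illustrative values, not from a real dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

# Closed-form OLS estimates of the sample coefficients
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predicted values from the sample regression line: y_hat = b0 + b1 * x1
y_hat = b0 + b1 * x
print(b0, b1)  # 23.1 6.3 for this data
```

Note that the sample equation has no error term: the b's are estimates of the β's, and the residuals are absorbed into the difference between y and ŷ.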
Correlation vs Regression
- Correlation measures the degree to which two variables move together; regression estimates how one variable affects another.
- Correlation is symmetric, while regression assigns distinct roles to the dependent and independent variables.
- Correlation does not imply causation.
Python Packages
These are the Python packages that we will most often be using:
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
- numpy – allows us to work with multi-dimensional arrays.
- pandas – organizes data in tabular form and attaches descriptive labels to rows and columns.
- scipy – a Python ecosystem containing lots of tools for scientific calculations.
- statsmodels – will be used to create regressions.
- matplotlib – a 2D plotting library.
- seaborn – a Python library for drawing statistical graphics.
- sklearn – used to code regressions.
Decomposition of Variability
Sum of Squares Total (SST)
- A measure of the total variability of the dataset.
- The dispersion of the observed values around their mean.
- Also known as Total Sum of Squares (TSS).
\[ SST = \sum_{i=1}^{n}(y_i - \overline{y})^2 \]
Sum of Squares Regression (SSR)
- The dispersion of the predicted values around the mean of the observed values.
- This can be thought of as a measure of how well our regression line fits the data.
- Also known as Explained Sum of Squares (ESS).
\[ SSR = \sum_{i=1}^{n}(\hat{y}_i - \overline{y})^2 \]
Sum of Squares Error (SSE)
- The error (residual) is the difference between the observed value and the predicted value: e_i = y_i − ŷ_i.
- Also known as Residual Sum of Squares (RSS).
\[ SSE = \sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]
Connection
The total variability of the data set is equal to the variability explained by the regression line plus the unexplained variability (error).
\[ SST = SSR + SSE \]
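The decomposition above can be verified numerically. This minimal sketch (with made-up data) fits a line by OLS, computes all three sums of squares, and checks that they add up:

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the sample regression line via closed-form OLS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability

# SST = SSR + SSE holds (up to floating-point error) for OLS with an intercept
assert np.isclose(sst, ssr + sse)
print(sst, ssr, sse)  # 6.0 3.6 2.4 for this data
```

The identity relies on the regression line passing through the mean point (x̄, ȳ), which OLS with an intercept guarantees.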
Ordinary Least Squares (OLS)
OLS (ordinary least squares) is one of the most common methods for estimating the linear regression equation.
Other Methods to Find the Regression Line
- Generalized Least Squares
- Maximum Likelihood Estimation
- Bayesian Regression
- Kernel Regression
- Gaussian Process Regression
R-Squared
- R² = variability explained by the regression divided by the total variability of the dataset.
- Measures the goodness of fit of the model.
- Values range from 0 to 1.
\[ R^2 = \frac{SSR}{SST} \]
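Putting the pieces together, R² can be computed directly from the sums of squares. A sketch using the same style of made-up data as above; as a sanity check, for simple linear regression R² equals the squared correlation between x and y:

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the sample regression line via closed-form OLS
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sst = np.sum((y - y.mean()) ** 2)      # total variability
r_squared = ssr / sst

# For simple regression, R^2 equals the squared correlation coefficient
assert np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2)
print(r_squared)  # ≈ 0.6 for this data
```

An R² of 0 means the regression explains none of the variability; 1 means it explains all of it.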