• Real-world problems usually depend on more than one factor, so good models use multiple regression: several independent variables predicting a single dependent variable.
  • Below is the equation for the Population Model:
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k + \epsilon \]
  • The Multiple Regression equation for Sample Data:
\[ \hat{y} = b_0 + b_1x_1 + b_2x_2 + \cdots + b_kx_k \]
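The sample equation above can be estimated with ordinary least squares. A minimal sketch using NumPy with made-up data (two predictors, known true coefficients, a little noise):

```python
import numpy as np

# Illustrative data: y depends on two predictors plus noise
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(n), x1, x2])

# Solve for b = [b0, b1, b2] by least squares
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b  # fitted values

print(b)  # estimates close to the true [2.0, 3.0, -1.5]
```

With little noise and enough observations, the estimated coefficients land close to the true population values.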

Adjusted R-Squared

R-Squared measures the proportion of the variance in the dependent variable explained by the model, and can be used to compare models built on the same dataset.

The Adjusted R-Squared is less than or equal to R-Squared, and penalizes the excessive use of variables: adding a predictor that does not sufficiently improve the fit lowers it.

\[ \bar{R}^2 \leq R^2 \]
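The penalty is visible in the formula \( \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1} \), where \(n\) is the number of observations and \(k\) the number of predictors. A small sketch with made-up numbers:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R-squared, but more predictors -> larger penalty
r2 = 0.80
print(adjusted_r2(r2, n=50, k=2))   # ~0.791
print(adjusted_r2(r2, n=50, k=10))  # ~0.749
```

Both models explain 80% of the variance, but the one that needs 10 predictors to do so is penalized more heavily.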

Test for the Significance of the Model (F-Test)

The F-test checks the overall significance of the model: the null hypothesis is that all slope coefficients are zero. The higher the F-statistic, the stronger the evidence that the model is significant; the lower the F-statistic, the closer you are to a non-significant model.

This is another tool that allows us to compare models.
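The F-statistic can be computed from R-Squared as \( F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \); its p-value comes from the F-distribution with \((k,\, n-k-1)\) degrees of freedom. A sketch with made-up R-Squared values:

```python
def f_statistic(r2, n, k):
    """F-statistic for overall model significance
    (H0: all slope coefficients are zero)."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Two hypothetical models fitted on the same data
print(f_statistic(0.80, n=50, k=2))  # = 94.0, large F -> significant
print(f_statistic(0.05, n=50, k=2))  # ~1.24, small F -> likely not significant
```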

OLS Assumptions

  1. Linearity – straight line approximation.
  2. No endogeneity – Endogeneity refers to situations in which a predictor in a linear regression model is correlated with the error term, i.e. all independent variables must be uncorrelated with the error term. This is also known as Omitted Variable Bias, since it often occurs when a relevant variable is not included in the model.
  3. Normality and homoscedasticity of the error term – Normality means the error term is normally distributed with a zero mean. Homoscedasticity means the error term has equal variance across observations. Violations can be addressed by checking for OVB (omitted variable bias), removing outliers, or applying a log transformation, usually as log(y) = b0 + b1x1 (semi-log) or log(y) = b0 + b1 * log(x1) (log-log).
  4. No autocorrelation – Autocorrelation (or Serial Correlation) is the phenomenon where adjacent observations are correlated i.e. previous value has an impact on the next value. Another way to think of this is it’s when the values in a set of data are similar to the values that came before them. Autocorrelation is measured with Durbin-Watson with a range of 0 – 4. A value of 2 means no autocorrelation. Values above 3 or below 1 are of concern.
  5. No multicollinearity – the occurrence of high intercorrelations among two or more independent variables in a multiple regression model.
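Assumptions 4 and 5 have standard numeric checks. A sketch with made-up data: the Durbin-Watson statistic computed from residuals, and the variance inflation factor (VIF) for each predictor, where a VIF above roughly 5–10 is commonly taken as a multicollinearity warning:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means no autocorrelation."""
    diff = np.diff(residuals)
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

def vif(X, j):
    """Variance inflation factor of column j of predictor matrix X:
    regress x_j on the remaining predictors and use 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    b, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ b
    r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

rng = np.random.default_rng(1)
n = 200
e = rng.normal(size=n)  # independent errors standing in for residuals
dw = durbin_watson(e)   # near 2: no autocorrelation

x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                      # independent predictor
X = np.column_stack([x1, x2, x3])
print(dw, vif(X, 0), vif(X, 2))  # high VIF for x1, low for x3
```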

Dummy Variables – Handling Categorical Data

Dummy variables incorporate categorical data into a regression model. A dummy variable is a binary variable that takes a value of 0 or 1. Such variables are added to a regression model to represent factors that are binary in nature, i.e. they are either observed or not observed.
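A minimal sketch of building dummy variables by hand (tools like pandas' get_dummies automate this); the categorical feature and its levels here are made up:

```python
# Hypothetical categorical feature: fuel type of a car
fuel = ["petrol", "diesel", "petrol", "electric", "diesel"]

# Use m - 1 dummies for m categories to avoid perfect multicollinearity
# (the "dummy variable trap"); "diesel" is the dropped baseline here
levels = ["electric", "petrol"]
dummies = [[1 if value == level else 0 for level in levels] for value in fuel]

for value, row in zip(fuel, dummies):
    print(value, row)
# petrol   -> [0, 1]
# diesel   -> [0, 0]  (baseline)
# electric -> [1, 0]
```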