Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that predicts the outcome of a response variable by combining several explanatory variables. The aim of multiple linear regression is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable. In essence, multiple regression is an extension of ordinary least-squares (OLS) regression in that it involves more than one explanatory variable.
Key Points in Multiple Linear Regression (MLR)
Assumptions of multiple linear regression
Multiple linear regression makes all of the same assumptions as simple linear regression:
Homogeneity of variance (homoscedasticity): The size of the error in our prediction does not vary significantly across values of the independent variables.
Independence of observations: The dataset's observations were gathered using statistically valid sampling methods, and there are no hidden relationships between variables.
Normality: A normal distribution can be inferred from the data.
Linearity: The line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
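These assumptions can be checked empirically once a model has been fit. The following is a minimal diagnostic sketch, assuming synthetic data and the statsmodels/scipy libraries (neither is prescribed by this section), that tests homoscedasticity with a Breusch-Pagan test and residual normality with a Shapiro-Wilk test.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

# Synthetic data standing in for a real dataset (assumption for illustration).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                  # two explanatory variables
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=1.0, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()

# Homoscedasticity: Breusch-Pagan test on the residuals.
# A small p-value suggests the error variance changes with the predictors.
_, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")

# Normality: Shapiro-Wilk test on the residuals.
# A small p-value suggests the residuals are not normally distributed.
_, shapiro_pvalue = stats.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {shapiro_pvalue:.3f}")
```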
In multiple linear regression, it is possible that some of the independent variables are correlated with one another, so it is important to check for such correlations before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.
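One simple way to perform this check is to look at the squared pairwise correlations between the candidate predictors before fitting anything. The sketch below is illustrative only, with made-up variable names and synthetic data, and assumes pandas/numpy.

```python
import numpy as np
import pandas as pd

# Synthetic predictors; x2 is deliberately built to be highly correlated with x1.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Squared pairwise correlations between the explanatory variables.
r2 = X.corr() ** 2

# Flag any pair with r^2 above roughly 0.6; only one variable from each
# flagged pair should go into the regression model.
cols = list(r2.columns)
flagged = [(a, b, round(r2.loc[a, b], 3))
           for i, a in enumerate(cols)
           for b in cols[i + 1:]
           if r2.loc[a, b] > 0.6]
print(flagged)   # expect something like [('x1', 'x2', 0.98)]
```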
The formula for a multiple linear regression is:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$$

where, for $i = 1, \dots, n$ observations:

$y_i$ = dependent variable
$x_{i1}, \dots, x_{ip}$ = explanatory variables
$\beta_0$ = $y$-intercept (constant term)
$\beta_1, \dots, \beta_p$ = slope coefficients for each explanatory variable
$\epsilon$ = the model's error term (also known as the residuals)
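To connect the formula to practice, here is a minimal fitting sketch, assuming synthetic data and statsmodels OLS (the section itself does not prescribe a library): the fitted parameters line up with $\beta_0, \beta_1, \dots, \beta_p$, and the residuals estimate the error term $\epsilon$.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with known coefficients (assumption for illustration).
rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_slopes = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_slopes + rng.normal(scale=0.3, size=n)

# Fit by ordinary least squares; add_constant supplies the intercept column.
results = sm.OLS(y, sm.add_constant(X)).fit()

print(results.params)     # first entry estimates beta_0, the rest beta_1..beta_p
print(results.resid[:5])  # residuals: the estimated error terms
print(results.summary())  # full coefficient table with standard errors and R^2
```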