Simple linear regression is an important tool for understanding relationships between quantitative variables, but it has its limitations. One obvious deficiency is the constraint of having only one independent variable, which limits models to a single factor, such as the effect of the systematic risk of a stock on its expected returns. Real relationships are often much more complex and involve multiple factors. To allow for multiple independent variables in the model, we can use multiple regression, or multivariate regression.
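Multivariate regression is similar to simple linear regression, except that it accommodates multiple independent variables. The model for a multiple regression can be described by this equation:
y = β0 + β1·x1 + β2·x2 + ... + βn·xn + ε
Here β0 is the constant (intercept) and ε is the error term. The remaining symbols are: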
- y is the dependent variable
- xi is the i-th independent variable
- βi is the coefficient for the i-th independent variable
The coefficients are often different from the coefficients you would get if you ran a univariate regression for each factor. Multiple regression finds the relationship between the dependent variable and each independent variable, while controlling for all other variables. To give a concrete example of this, consider a regression of house price on size, age and number of rooms.
This model would be created from a data set of house prices, with the size, age and number of rooms as independent variables. If the estimated coefficient on size is 20, each extra unit of size is associated with a $20 increase in the price of the house, controlling for the age and the number of rooms. In our simple model, note that we would probably be able to predict house size from the number of rooms with a fair degree of accuracy, a phenomenon known in statistics as ‘multicollinearity’. One consequence of this ‘lack of independence’ between these two independent variables is that the coefficient estimates will tend to be less precise than would otherwise be the case. If we feel confident that both variables quantify the same phenomenon, then one of the variables is redundant and can be dropped from the model.
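The difference between univariate and multiple regression coefficients can be illustrated with a small simulation. This is a minimal sketch on synthetic data, assuming NumPy is available. Only the 20-per-unit size effect comes from the text. The other coefficients, the noise levels, and the helper name `ols` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic houses: rooms is largely determined by size (multicollinearity).
size = rng.uniform(500, 3000, n)             # size in arbitrary units
age = rng.uniform(0, 50, n)                  # age in years
rooms = size / 400 + rng.normal(0, 0.5, n)   # tracks size closely

# Hypothetical "true" relationship; only the 20-per-unit size effect is from the text.
price = 50_000 + 20 * size - 300 * age + 1_000 * rooms + rng.normal(0, 5_000, n)

def ols(y, *columns):
    """Return least-squares coefficients, constant first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

multi = ols(price, size, age, rooms)   # [constant, size, age, rooms]
uni_rooms = ols(price, rooms)          # [constant, rooms]

print("multiple regression, size coefficient:   ", round(multi[1], 2))
print("multiple regression, rooms coefficient:  ", round(multi[3], 2))
print("univariate regression, rooms coefficient:", round(uni_rooms[1], 2))
print("correlation between size and rooms:      ", round(np.corrcoef(size, rooms)[0, 1], 3))
```

In the univariate regression, the rooms coefficient absorbs most of the size effect because larger houses have more rooms. Once size is controlled for, the rooms coefficient should fall back towards the much smaller value used to generate the data.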
Linear regression can be visualized by a line of best fit through a scatter plot, with the dependent variable on the y axis. Multiple regressions with two independent variables can be visualized as a plane of best fit, through a 3-dimensional scatter plot.
Running Multivariate Regressions
Multiple regressions can be run with most stats packages. Running a regression is simple: all you need is a table with each variable in a separate column and each row representing an individual data point. For example, if you were to run a multiple regression for the Fama-French 3-Factor Model on a stock or portfolio, you would prepare a data set of returns over time. Each row would be a period (for example, a month), and the columns would be the excess return, market risk premium, size effect, and value premium for that period.
Your stats package will run the regression on your data and provide a table of results. The results may be reported differently from software to software, but the most important pieces of information will be:
- R Squared
- Adjusted R Squared
- Coefficients for each factor (including the constant)
- P-value for each coefficient
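As a rough illustration of what a stats package does with such a table, the sketch below fits a three-factor regression by least squares and reports the coefficients and the R Squared. It assumes NumPy is available and uses synthetic data. The factor loadings, means, and noise levels are hypothetical and chosen only to generate an example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120  # number of observations (rows in the table)

# Synthetic factor columns; the names mirror the three factors in the text.
market = rng.normal(0.6, 4.5, n)
size = rng.normal(0.2, 3.0, n)
value = rng.normal(0.3, 3.0, n)

# Hypothetical loadings, used only to generate the dependent variable.
excess_return = 0.1 + 1.1 * market + 0.4 * size - 0.2 * value + rng.normal(0, 2.0, n)

# One row per observation, one column per variable,
# plus a column of ones so the model includes a constant.
X = np.column_stack([np.ones(n), market, size, value])
y = excess_return

coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# R Squared = 1 - (unexplained variation / total variation).
residuals = y - X @ coef
ss_res = float(residuals @ residuals)
ss_tot = float(((y - y.mean()) ** 2).sum())
r_squared = 1 - ss_res / ss_tot

for name, b in zip(["constant", "market", "size", "value"], coef):
    print(f"{name:>8}: {b: .3f}")
print(f"R squared: {r_squared:.3f}")
```

A dedicated stats package would add standard errors, p-values, and the adjusted R Squared to this output.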
R Squared
The R Squared is the proportion of variability in the dependent variable that can be explained by the independent variables in the model. A large R Squared value is usually better than a small R Squared value, except when overfitting is present.
For example, an R Squared value of 0.75 in the Fama-French model would mean that the three factors in the model (market risk, size, and value) can explain 75% of the variation in stock returns. The other 25% is unexplained and can be due to factors not in the model or to measurement error. The R Squared value of a Fama-French model can also be used as a proxy for the activeness of a fund: the returns of an active fund should not be fully explained by the Fama-French model (otherwise anyone could just use the model to build a passive portfolio that achieves the same returns).
Adjusted R Squared
The adjusted R Squared is the R Squared value with a penalty for the number of independent variables used in the model. The R Squared value never decreases, and typically increases, when more factors are included in the model, even though the model should really ignore new factors unless they help to explain the dependent variable. This poses a problem: if we select the best model based on R Squared value, we will end up selecting models with more factors, which have a tendency to overfit the data. The adjusted R Squared, by contrast, can become smaller as you include more variables if the new variables do not contain sufficient explanatory power. When choosing between candidate predictive models for your analysis, you should therefore prefer the model with the highest adjusted R Squared.
For example, if we were to add another factor, momentum, to our Fama-French model, we may raise the R Squared by 0.01 to 0.76. However, we cannot conclude that the additional factor helps explain more variability, and that the model is better, until we consider the adjusted R Squared. If the adjusted R Squared decreases by 0.02 with the addition of the momentum factor, we should not include momentum in the model.
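The penalty can be made concrete with the usual formula, adjusted R Squared = 1 - (1 - R Squared) × (n - 1) / (n - k - 1), where n is the number of observations and k is the number of factors. The sketch below applies it to the R Squared values from the example. The sample size of 20 and the function name `adjusted_r_squared` are assumptions made for illustration, not figures from the text.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_factors: int) -> float:
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_obs - n_factors - 1 <= 0:
        raise ValueError("need more observations than factors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_factors - 1)

n_obs = 20  # hypothetical sample size, not from the text

three_factor = adjusted_r_squared(0.75, n_obs, n_factors=3)
four_factor = adjusted_r_squared(0.76, n_obs, n_factors=4)  # momentum added

print(f"three factors: {three_factor:.3f}")
print(f"four factors:  {four_factor:.3f}")
print("keep momentum?", four_factor > three_factor)
```

With this small sample, the adjusted R Squared falls from about 0.703 to 0.696 even though the R Squared rises from 0.75 to 0.76. With a much larger sample, the same gain in R Squared could be enough to raise the adjusted value.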
Coefficients
Each coefficient represents the change in the dependent variable associated with a 1 unit change in the relevant independent variable, controlling for the other independent variables. The coefficients can be used to understand the effect of each factor (its direction and its magnitude).
P-values
The p-value tells us about the statistical significance of each coefficient. By checking whether the p-value is lower than the alpha value (1 minus the confidence level, typically 0.05), we can determine whether the coefficient is significantly different from 0. An independent variable with a statistically insignificant coefficient may not be valuable, and so we might want to delete it from the model.
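The comparison against alpha can be sketched as follows. The example assumes NumPy is available, uses synthetic data, and approximates the p-values with a normal distribution, which is reasonable only for large samples. A stats package would use the exact t distribution. The variable names `x_real` and `x_noise` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n = 200

x_real = rng.normal(0, 1, n)    # factor that drives y
x_noise = rng.normal(0, 1, n)   # factor unrelated to y
y = 1.0 + 2.0 * x_real + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x_real, x_noise])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and the inverse of X'X.
residuals = y - X @ coef
dof = n - X.shape[1]
sigma2 = float(residuals @ residuals) / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Two-sided p-values via a normal approximation to the t distribution.
t_stats = coef / std_err
p_values = [2 * (1 - NormalDist().cdf(abs(float(t)))) for t in t_stats]

alpha = 0.05  # 1 minus the 95% confidence level
for name, b, p in zip(["constant", "x_real", "x_noise"], coef, p_values):
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name:>8}: coef={b: .3f}  p={p:.4f}  {verdict}")
```

The coefficient on `x_real` should come out clearly significant. The coefficient on `x_noise` will usually not, although at an alpha of 0.05 an unrelated variable still appears significant by chance in about 1 sample in 20.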
Interpreting Multivariate Regressions
When we talk about the results of a multivariate regression, it is important to note that:
- The coefficients may or may not be statistically significant
- The coefficients hold true on average
- The coefficients imply association not causation
- The coefficients control for other factors
A good example of an interpretation that accounts for these is:
Controlling for the other variables in the model, each one-unit increase in the size of the company is associated with an average decrease in expected returns of 2%. This relationship is statistically significant at the 95% confidence level.
The most common mistake here is confusing association with causation. No matter how rigorous or complex your regression analysis is, it cannot establish causation on its own. Establishing causation requires experimentation and hypothesis testing.
Jason Oh is a management consultant at Novantas with expertise in scaling profitability for retail banks (consumer / commercial finance) and diversified financial service firms (credit card / wealth management / asset management).