Multiple Regression

Typically, we want to use more than a single predictor (independent variable) to make predictions.
Regression with more than one predictor is called multiple regression:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i$

Motivating example: Sex discrimination in wages
In the 1970s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex.
An analysis of salaries of employees of one type (skilled, entry-level clerical) was presented as evidence by the defense.
Did female employees tend to receive lower starting salaries than similarly qualified and experienced male employees?

Variables collected
93 employees on data file (61 female, 32 male).
bsal:    Annual salary at time of hire.
sal77:   Annual salary in 1977.
educ:    Years of education.
exper:   Months of previous work prior to hire at bank.
fsex:    1 if female, 0 if male.
senior:  Months worked at bank since hired.
age:     Age in months.
So we have six x's and one y (bsal). However, in what follows we won't use sal77.

Comparison for males and females
This shows men started at higher salaries than women (t=6.3, p<.0001). But it doesn't control for other characteristics.
[Figure: Oneway Analysis of bsal By fsex (bsal on the vertical axis; fsex = Female, Male on the horizontal axis)]

Relationships of bsal with other variables
Seniority and education predict bsal well. We want to control for them when judging the gender effect.
[Figure: Fit Y by X Group. Bivariate fits of bsal by senior, by age, by educ, and by exper, each with a linear fit.]

Multiple regression model
For any combination of values of the predictor variables, the average value of the response (bsal) lies on a straight line:
$\text{bsal}_i = \alpha + \beta_1 \text{fsex}_i + \beta_2 \text{senior}_i + \beta_3 \text{age}_i + \beta_4 \text{educ}_i + \beta_5 \text{exper}_i + \epsilon_i$
Just like in simple regression, assume that $\epsilon$ follows a normal curve within any combination of predictors.

Output from regression (fsex = 1 for females, = 0 for males)

Term     Estimate   Std Error   t Ratio   Prob>|t|
Int.     6277.9     652         9.62      <.0001
Fsex     -767.9     128.9       -5.95     <.0001
Senior   -22.6      5.3         -4.26     <.0001
Age      0.63       0.72        0.88      .3837
Educ     92.3       24.8        3.71      .0004
Exper    0.50       1.05        0.47      .6364

Response bsal
Whole Model
[Figure: Actual by Predicted Plot (bsal Actual vs. bsal Predicted; P<.0001, RSq=0.52, RMSE=508.09)]

Summary of Fit
RSquare                        0.515156
RSquare Adj                    0.487291
Root Mean Square Error         508.0906
Mean of Response               5420.323
Observations (or Sum Wgts)     93

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      5    23863715         4772743       18.4878   <.0001
Error      87   22459575         258156
C. Total   92   46323290

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   6277.8934    652.2713    9.62      <.0001
fsex        -767.9127    128.97      -5.95     <.0001
senior      -22.5823     5.295732    -4.26     <.0001
age         0.6309603    0.720654    0.88      0.3837
educ        92.306023    24.86354    3.71      0.0004
exper       0.5006397    1.055262    0.47      0.6364

Effect Tests
Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
fsex     1       1    9152264.3        35.4525   <.0001
senior   1       1    4694256.3        18.1838   <.0001
age      1       1    197894.0         0.7666    0.3837
educ     1       1    3558085.8        13.7827   0.0004
exper    1       1    58104.8          0.2251    0.6364

[Figure: Residual by Predicted Plot (bsal Residual vs. bsal Predicted)]

Predictions
Example: Prediction of beginning wages for a woman with 10 months of seniority, who is 25 years old (300 months), with 12 years of education and two years (24 months) of experience:
$\text{bsal}_i = \alpha + \beta_1 \text{fsex}_i + \beta_2 \text{senior}_i + \beta_3 \text{age}_i + \beta_4 \text{educ}_i + \beta_5 \text{exper}_i + \epsilon_i$
Pred. bsal = 6277.9 - 767.9*1 - 22.6*10 + 0.63*300 + 92.3*12 + 0.50*24 = 6592.6

Interpretation of coefficients in multiple regression
Each estimated coefficient is the amount Y is expected to increase when the value of its corresponding predictor is increased by one, holding constant the values of the other predictors.
Example: The estimated coefficient of education equals 92.3. For each additional year of education of an employee, we expect starting salary to increase by about 92 dollars, holding all other variables constant.
The estimated coefficient of fsex equals -767.9. For employees who started at the same time, had the same education and experience, and were the same age, women earned about $768 less on average than men.

Which variable is the strongest predictor of the outcome?
The predictor that has the strongest linear association with the outcome variable is the one with the largest absolute value of T, which equals the coefficient over its SE.
It is not the size of the coefficient, which is sensitive to the scales of the predictors. The T statistic is not, since it is a standardized measure.
Example: In the wages regression, seniority is a better predictor than education because it has a larger absolute T (4.26 versus 3.71).

Hypothesis tests for coefficients
The reported t-stats (coef. / SE) and p-values are used to test whether a particular coefficient equals 0, given that all other predictors are in the model.
Examples:
1) The test of whether the coefficient of education equals zero has p-value = .0004. Hence, reject the null hypothesis; it appears that education is a useful predictor of bsal when all the other predictors are in the model.
2) The test of whether the coefficient of experience equals zero has p-value = .6364. Hence, we cannot reject the null hypothesis; it appears that experience is not a particularly useful predictor of bsal when all other predictors are in the model.
The test statistics have the usual form (observed - expected)/SE.
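The t ratios in the regression output can be re-derived from this form, with the expected value under the null hypothesis equal to 0. The following is a minimal sketch that uses only the estimates and standard errors reported in the output; the dictionary layout and the names `PARAMS`, `t_ratio`, and `ranked` are illustrative choices, not part of the original analysis.

```python
# Estimates and standard errors copied from the Parameter Estimates output.
# The data structure and function names are illustrative assumptions.
PARAMS = {
    "Intercept": (6277.8934, 652.2713),
    "fsex": (-767.9127, 128.97),
    "senior": (-22.5823, 5.295732),
    "age": (0.6309603, 0.720654),
    "educ": (92.306023, 24.86354),
    "exper": (0.5006397, 1.055262),
}


def t_ratio(estimate, std_error, expected=0.0):
    """Test statistic of the usual form (observed - expected) / SE."""
    return (estimate - expected) / std_error


t_ratios = {term: t_ratio(est, se) for term, (est, se) in PARAMS.items()}

# Rank the predictors (not the intercept) by |t|, the scale-free measure
# of strength of linear association.
ranked = sorted(
    (term for term in t_ratios if term != "Intercept"),
    key=lambda term: abs(t_ratios[term]),
    reverse=True,
)

for term in ranked:
    print(f"{term:8s} t = {t_ratios[term]:6.2f}")
```

Rounded to two decimals, the recomputed ratios agree with the t Ratio column, and the ranking places seniority ahead of education, as in the example.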

For the p-value, use the area under a t-curve with (n-k) degrees of freedom, where k is the number of terms in the model (including the intercept).
In this problem, the degrees of freedom equal 93-6 = 87.

CIs for regression coefficients
A 95% CI for a coefficient is obtained in the usual way: coef. ± (multiplier) × SE
The multiplier is obtained from the t-curve with (n-k) degrees of freedom. (If the degrees of freedom is greater than 26, use the normal table.)
Example: A 95% CI for the population regression coefficient of age equals:
(0.63 - 1.96*0.72, 0.63 + 1.96*0.72), or about (-0.78, 2.04).

Warning about tests and CIs
Hypothesis tests and CIs are meaningful only when the data fits the model well.
Remember, when the sample size is large enough, you will probably reject any null hypothesis of $\beta = 0$.
When the sample size is small, you may not have enough evidence to reject a null hypothesis of $\beta = 0$.
When you fail to reject a null hypothesis, don't be too hasty to say that a predictor has no linear association with the outcome. It is likely that there is some association; it just isn't a very strong one.

Checking assumptions
Plot the residuals versus the predicted values from the regression line. Also plot the residuals versus each of the predictors.
If there are non-random patterns in these plots, the assumptions might be violated.

Plot of residuals versus predicted values
This plot has a fan shape. It suggests nonconstant variance (heteroscedasticity). We need to transform variables.
[Figure: Response sal77, Whole Model, Residual by Predicted Plot (sal77 Residual vs. sal77 Predicted)]
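As a rough numerical companion to eyeballing a residual plot, the sketch below fits a one-predictor least-squares line to synthetic data (not the bank data) whose noise grows with x, then compares the spread of the residuals at low and high predicted values. The data-generating numbers, the seed, and the helper name `fit_line` are all assumptions made for illustration. A markedly larger spread at high predicted values is the numerical footprint of a fan shape.

```python
import random
import statistics

random.seed(0)  # fixed seed so the synthetic example is repeatable

# Synthetic data (an assumption for illustration): the noise SD grows with x,
# which is what produces a fan-shaped residual plot.
xs = [float(x) for x in range(1, 121)]
ys = [10 + 5 * x + random.gauss(0, 0.5 * x) for x in xs]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Order the residuals by predicted value and compare bottom and top thirds.
pairs = sorted(zip(predicted, residuals))
third = len(pairs) // 3
low_spread = statistics.pstdev([r for _, r in pairs[:third]])
high_spread = statistics.pstdev([r for _, r in pairs[-third:]])

print(f"residual SD, low predictions:  {low_spread:.1f}")
print(f"residual SD, high predictions: {high_spread:.1f}")
```

A comparison like this does not replace the plot; it only puts a number on one pattern (spread increasing with the predicted value) that the plot is meant to reveal.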

Plots of residuals vs. predictors
[Figure: Fit Y by X Group. Bivariate fits of Residual bsal 2 by senior, by age, by educ, by exper, and by fsex.]

Summary of residual plots
There appears to be a non-random pattern in the plot of residuals versus experience, and also versus age. This model can be improved.

Modeling categorical predictors
When predictors are categorical and their categories are assigned numbers, regressions using those numbers make no sense. Instead, we make dummy variables to stand in for the categorical variables.

Collinearity
When predictors are highly correlated, standard errors are inflated.
Conceptual example:
Suppose two variables Z and X are exactly the same.
Suppose the population regression line of Y on X is Y = 10 + 5X.
Fit a regression using sample data of Y on both X and Z.
We could plug in any values for the coefficients of X and Z, so long as they add up to 5.
Equivalently, this means that the standard errors for the coefficients are huge.

General warnings for multiple regression
Be even more wary of extrapolation. Because there are several predictors, you can extrapolate in many ways.
Multiple regression shows association. It does not prove causality. Only a carefully designed observational study or randomized experiment can show causality.
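Returning to the collinearity example (Z identical to X, population line Y = 10 + 5X), the claim that any coefficients adding up to 5 fit equally well can be checked numerically. The sketch below uses a small made-up sample, which is an assumption for illustration only, holds the intercept at 10, and computes the residual sum of squares for several coefficient pairs.

```python
# Made-up sample (illustrative assumption): y is roughly 10 + 5x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = list(x)  # Z is an exact copy of X: perfect collinearity
noise = [0.5, -1.0, 0.25, 0.75, -0.5, 0.0]
y = [10 + 5 * xi + e for xi, e in zip(x, noise)]


def rss(intercept, b_x, b_z):
    """Residual sum of squares for the fit intercept + b_x*x + b_z*z."""
    return sum(
        (yi - (intercept + b_x * xi + b_z * zi)) ** 2
        for xi, zi, yi in zip(x, z, y)
    )


# Four different coefficient pairs, each adding up to 5.
candidates = [(5.0, 0.0), (0.0, 5.0), (2.5, 2.5), (105.0, -100.0)]
fits = {pair: rss(10.0, *pair) for pair in candidates}

for (b_x, b_z), value in fits.items():
    print(f"b_x = {b_x:6.1f}, b_z = {b_z:7.1f}, RSS = {value:.4f}")
```

Every pair gives the same residual sum of squares, so the data cannot distinguish between them. That indeterminacy is what shows up as huge standard errors when the predictors are highly, rather than perfectly, correlated.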