Chapter 11: Regression with a Binary Dependent Variable (SW Chapter 11)

So far the dependent variable (Y) has been continuous:
- district-wide average test score
- traffic fatality rate

What if Y is binary?
- Y = get into college, or not; X = years of education
- Y = person smokes, or not; X = income
- Y = mortgage application is accepted, or not; X = income, house characteristics, marital status, race

Example: Mortgage denial and race

The Boston Fed HMDA data set
- Individual applications for single-family mortgages made in 1990 in the greater Boston area
- 2380 observations, collected under the Home Mortgage Disclosure Act (HMDA)

Variables
- Dependent variable: is the mortgage denied or accepted?
- Independent variables:
  - income, wealth, employment status
  - other loan and property characteristics
  - race of applicant

The Linear Probability Model (SW Section 11.1)

A natural starting point is the linear regression model with a single regressor:

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

But:
- What does $\beta_1$ mean when Y is binary? Is $\beta_1 = \Delta Y / \Delta X$?
- What does the line $\beta_0 + \beta_1 X$ mean when Y is binary?
- What does the predicted value $\hat{Y}$ mean when Y is binary? For example, what does $\hat{Y} = 0.26$ mean?

The linear probability model, ctd.

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

Recall assumption #1: $E(u_i \mid X_i) = 0$, so

$$E(Y_i \mid X_i) = E(\beta_0 + \beta_1 X_i + u_i \mid X_i) = \beta_0 + \beta_1 X_i$$

When Y is binary,

$$E(Y) = 1 \times \Pr(Y=1) + 0 \times \Pr(Y=0) = \Pr(Y=1)$$

so

$$E(Y \mid X) = \Pr(Y=1 \mid X)$$

The linear probability model, ctd.

When Y is binary, the linear regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$ is called the linear probability model.
- The predicted value is a probability:
  - $E(Y \mid X=x) = \Pr(Y=1 \mid X=x)$ = probability that Y = 1 given x
  - $\hat{Y}$ = the predicted probability that $Y_i = 1$, given X
- $\beta_1$ = change in the probability that Y = 1 for a given $\Delta x$:

$$\beta_1 = \frac{\Pr(Y=1 \mid X=x+\Delta x) - \Pr(Y=1 \mid X=x)}{\Delta x}$$

Example: linear probability model, HMDA data

[Figure: mortgage denial v. ratio of debt payments to income (P/I ratio) in the HMDA data set (subset)]

Linear probability model: HMDA data, ctd.

$$\widehat{deny} = -.080 + .604 \times \text{P/I ratio} \qquad (n = 2380)$$

Standard errors: (.032) on the intercept, (.098) on P/I ratio.

- What is the predicted value for P/I ratio = .3?
  $\widehat{\Pr}(deny = 1 \mid \text{P/I ratio} = .3) = -.080 + .604 \times .3 = .101$
- Calculating "effects": increase P/I ratio from .3 to .4:
  $\widehat{\Pr}(deny = 1 \mid \text{P/I ratio} = .4) = -.080 + .604 \times .4 = .162$
- The effect on the probability of denial of an increase in P/I ratio from .3 to .4 is to increase the probability by .060, that is, by 6.0 percentage points (what?).

Linear probability model: HMDA data, ctd.

Next include black as a regressor:

$$\widehat{deny} = -.091 + .559 \times \text{P/I ratio} + .177 \times black$$

Standard errors: (.032) on the intercept, (.098) on P/I ratio, (.025) on black.

Predicted probability of denial:
- for a black applicant with P/I ratio = .3:
  $\widehat{\Pr}(deny = 1) = -.091 + .559 \times .3 + .177 \times 1 = .254$
- for a white applicant with P/I ratio = .3:
  $\widehat{\Pr}(deny = 1) = -.091 + .559 \times .3 + .177 \times 0 = .077$
- difference = .177 = 17.7 percentage points
- Coefficient on black is significant at the 5% level
- Still plenty of room for omitted variable bias

The linear probability model: Summary
- Models Pr(Y=1|X) as a linear function of X
- Advantages:
  - simple to estimate and to interpret
  - inference is the same as for multiple regression (need heteroskedasticity-robust standard errors)
- Disadvantages:
  - Does it make sense that the probability should be linear in X?
  - Predicted probabilities can be <0 or >1!
- These disadvantages can be solved by using a nonlinear probability model: probit and logit regression

Probit and Logit Regression (SW Section 11.2)

The problem with the linear probability model is that it models the probability of Y=1 as being linear:

$$\Pr(Y = 1 \mid X) = \beta_0 + \beta_1 X$$

Instead, we want:
- $0 \le \Pr(Y = 1 \mid X) \le 1$ for all X
- $\Pr(Y = 1 \mid X)$ to be increasing in X (for $\beta_1 > 0$)

This requires a nonlinear functional form for the probability. How about an "S-curve"?

The probit model satisfies these conditions:
- $0 \le \Pr(Y = 1 \mid X) \le 1$ for all X
- $\Pr(Y = 1 \mid X)$ to be increasing in X (for $\beta_1 > 0$)
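To make the concern about predicted probabilities concrete, the following minimal Python sketch evaluates the two-regressor linear probability model estimated above (deny = -.091 + .559 P/I ratio + .177 black). The function name `lpm_deny` and the extreme P/I ratios 0.1 and 2.0 are illustrative assumptions, not values discussed in the source.

```python
def lpm_deny(pi_ratio, black):
    """Predicted denial probability from the LPM: -.091 + .559*P/I + .177*black."""
    return -0.091 + 0.559 * pi_ratio + 0.177 * black

# Predictions at P/I ratio = .3, as in the example above
p_black = lpm_deny(0.3, 1)
p_white = lpm_deny(0.3, 0)
print(f"black: {p_black:.3f}, white: {p_white:.3f}, diff: {p_black - p_white:.3f}")

# Hypothetical extreme P/I ratios: a linear model is not confined to [0, 1]
low = lpm_deny(0.1, 0)
high = lpm_deny(2.0, 1)
print(f"P/I=0.1, white: {low:.3f};  P/I=2.0, black: {high:.3f}")
```

Under these assumed inputs the fitted line gives a negative "probability" at the low end and a value above one at the high end, which is the motivation for an S-shaped functional form.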

Probit regression models the probability that Y=1 using the cumulative standard normal distribution function, evaluated at $z = \beta_0 + \beta_1 X$:

$$\Pr(Y = 1 \mid X) = \Phi(\beta_0 + \beta_1 X)$$

- $\Phi$ is the cumulative normal distribution function.
- $z = \beta_0 + \beta_1 X$ is the "z-value" or "z-index" of the probit model.

Example: Suppose $\beta_0 = -2$, $\beta_1 = 3$, X = .4, so

$$\Pr(Y = 1 \mid X = .4) = \Phi(-2 + 3 \times .4) = \Phi(-0.8)$$

Pr(Y = 1|X = .4) = area under the standard normal density to the left of z = -.8, which is

$$\Pr(Z \le -0.8) = .2119$$

Probit regression, ctd.

Why use the cumulative normal probability distribution?
- The "S-shape" gives us what we want:
  - $0 \le \Pr(Y = 1 \mid X) \le 1$ for all X
  - $\Pr(Y = 1 \mid X)$ to be increasing in X (for $\beta_1 > 0$)
- Easy to use: the probabilities are tabulated in the cumulative normal tables
- Relatively straightforward interpretation:
  - z-value = $\beta_0 + \beta_1 X$
  - $\hat\beta_0 + \hat\beta_1 X$ is the predicted z-value, given X
  - $\beta_1$ is the change in the z-value for a unit change in X

STATA Example: HMDA data

    . probit deny p_irat, r;

    Iteration 0:   log likelihood =  -872.0853
    Iteration 1:   log likelihood =  -835.6633
    Iteration 2:   log likelihood = -831.80534
    Iteration 3:   log likelihood = -831.79234

    Probit estimates                                  Number of obs   =       2380
                                                      Wald chi2(1)    =      40.68
                                                      Prob > chi2     =     0.0000
    Log likelihood = -831.79234                       Pseudo R2       =     0.0462

    ------------------------------------------------------------------------------
                 |               Robust
            deny |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          p_irat |   2.967908   .4653114     6.38   0.000     2.055914    3.879901
           _cons |  -2.194159   .1649721   -13.30   0.000    -2.517499    -1.87082
    ------------------------------------------------------------------------------

(We'll discuss the iteration log later.)

$$\widehat{\Pr}(deny = 1 \mid \text{P/I ratio}) = \Phi(-2.19 + 2.97 \times \text{P/I ratio})$$

Standard errors: (.16) on the intercept, (.47) on P/I ratio.

STATA Example: HMDA data, ctd.

- Positive coefficient: does this make sense?
- Standard errors have the usual interpretation
- Predicted probabilities:
  $\widehat{\Pr}(deny = 1 \mid \text{P/I ratio} = .3) = \Phi(-2.19 + 2.97 \times .3) = \Phi(-1.30) = .097$
- Effect of change in P/I ratio from .3 to .4:
  $\widehat{\Pr}(deny = 1 \mid \text{P/I ratio} = .4) = \Phi(-2.19 + 2.97 \times .4) = .159$
- Predicted probability of denial rises from .097 to .159

Probit regression with multiple regressors

$$\Pr(Y = 1 \mid X_1, X_2) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2)$$

- $\Phi$ is the cumulative normal distribution function.
- $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ is the "z-value" or "z-index" of the probit model.
- $\beta_1$ is the effect on the z-score of a unit change in $X_1$, holding constant $X_2$

STATA Example: HMDA data

    . probit deny p_irat black, r;

    Iteration 0:   log likelihood =  -872.0853
    Iteration 1:   log likelihood = -800.88504
    Iteration 2:   log likelihood =  -797.1478
    Iteration 3:   log likelihood = -797.13604

    Probit estimates                                  Number of obs   =       2380
                                                      Wald chi2(2)    =     118.18
                                                      Prob > chi2     =     0.0000
    Log likelihood = -797.13604                       Pseudo R2       =     0.0859

    ------------------------------------------------------------------------------
                 |               Robust
            deny |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          p_irat |   2.741637   .4441633     6.17   0.000     1.871092    3.612181
           black |   .7081579   .0831877     8.51   0.000      .545113    .8712028
           _cons |  -2.258738   .1588168   -14.22   0.000    -2.570013   -1.947463
    ------------------------------------------------------------------------------

We'll go through the estimation details later.

STATA Example, ctd.: predicted probit probabilities

    . probit deny p_irat black, r;
      (output as above)

    . sca z1 = _b[_cons]+_b[p_irat]*.3+_b[black]*0;
    . display "Pred prob, p_irat=.3, white: " normprob(z1);
    Pred prob, p_irat=.3, white: .07546603

NOTE
- _b[_cons] is the estimated intercept (-2.258738)
- _b[p_irat] is the coefficient on p_irat (2.741637)
- sca creates a new scalar which is the result of a calculation
- display prints the indicated information to the screen

STATA Example, ctd.

$$\widehat{\Pr}(deny = 1 \mid \text{P/I}, black) = \Phi(-2.26 + 2.74 \times \text{P/I ratio} + .71 \times black)$$

Standard errors: (.16) on the intercept, (.44) on P/I ratio, (.08) on black.

- Is the coefficient on black statistically significant?
- Estimated effect of race for P/I ratio = .3:
  $\widehat{\Pr}(deny = 1 \mid .3, 1) = \Phi(-2.26 + 2.74 \times .3 + .71 \times 1) = .233$
  $\widehat{\Pr}(deny = 1 \mid .3, 0) = \Phi(-2.26 + 2.74 \times .3 + .71 \times 0) = .075$
- Difference in rejection probabilities = .158 (15.8 percentage points)
- Still plenty of room for omitted variable bias

Logit Regression

Logit regression models the probability of Y=1 as the cumulative standard logistic distribution function, evaluated at $z = \beta_0 + \beta_1 X$:

$$\Pr(Y = 1 \mid X) = F(\beta_0 + \beta_1 X)$$

F is the cumulative logistic distribution function:

$$F(\beta_0 + \beta_1 X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
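The probit calculations above, and the logistic function just defined, can be sketched in a few lines of Python. This is only an illustration: the helper names `norm_cdf` and `logistic_cdf` are assumptions, and the normal cdf is computed from the standard identity $\Phi(z) = \tfrac{1}{2}[1 + \mathrm{erf}(z/\sqrt{2})]$ rather than from tables.

```python
import math

def norm_cdf(z):
    """Cumulative standard normal distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Cumulative standard logistic distribution function, F(z) = 1/(1+e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

# Probit example from the text: beta0 = -2, beta1 = 3, X = .4
p_example = norm_cdf(-2 + 3 * 0.4)
print(f"Phi(-0.8) = {p_example:.4f}")

# HMDA probit with P/I ratio and black (coefficients from the Stata output)
b_cons, b_pirat, b_black = -2.258738, 2.741637, 0.7081579
p_white = norm_cdf(b_cons + b_pirat * 0.3 + b_black * 0)
p_black = norm_cdf(b_cons + b_pirat * 0.3 + b_black * 1)
print(f"white: {p_white:.3f}, black: {p_black:.3f}, diff: {p_black - p_white:.3f}")

# Both S-curves stay strictly between 0 and 1 and rise with z
for z in (-3, -1, 0, 1, 3):
    print(z, round(norm_cdf(z), 4), round(logistic_cdf(z), 4))
```

The white-applicant value should agree with the Stata `normprob` result up to the rounding of the displayed coefficients.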

Example: $\beta_0 = -3$, $\beta_1 = 2$, X = .4, so $\beta_0 + \beta_1 X = -3 + 2 \times .4 = -2.2$, so

$$\Pr(Y = 1 \mid X = .4) = \frac{1}{1 + e^{-(-2.2)}} = .0998$$

Why bother with logit if we have probit?
- Historically, logit is more convenient computationally
- In practice, logit and probit are very similar

STATA Example: HMDA data

    . logit deny p_irat black, r;

    Iteration 0:   log likelihood =  -872.0853
    Iteration 1:   log likelihood =  -806.3571
    Iteration 2:   log likelihood = -795.74477
    Iteration 3:   log likelihood = -795.69521
    Iteration 4:   log likelihood = -795.69521

    Logit estimates                                   Number of obs   =       2380
                                                      Wald chi2(2)    =     117.75
                                                      Prob > chi2     =     0.0000
    Log likelihood = -795.69521                       Pseudo R2       =     0.0876

    ------------------------------------------------------------------------------
                 |               Robust
            deny |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          p_irat |   5.370362   .9633435     5.57   0.000     3.482244    7.258481
           black |   1.272782   .1460986     8.71   0.000     .9864339     1.55913
           _cons |  -4.125558    .345825   -11.93   0.000    -4.803362   -3.447753
    ------------------------------------------------------------------------------

    . dis "Pred prob, p_irat=.3, white: "
    > 1/(1+exp(-(_b[_cons]+_b[p_irat]*.3+_b[black]*0)));
    Pred prob, p_irat=.3, white: .07485143

(The iteration log is discussed later.)

NOTE: the probit predicted probability is .07546603

The predicted probabilities from the estimated probit and logit models are (as usual) very close in this application.

Example for class discussion: Characterizing the Background of Hezbollah Militants

Source: Alan Krueger and Jitka Maleckova, "Education, Poverty and Terrorism: Is There a Causal Connection?" Journal of Economic Perspectives, Fall 2003, 119-144.

Logit regression: 1 = died in Hezbollah military event

Table of logit results:

[Tables of logit results not reproduced]

Hezbollah militants example, ctd.

Compute the effect of schooling by comparing predicted probabilities using the logit regression in column (3):

$$\Pr(Y=1 \mid secondary = 1, poverty = 0, age = 20) - \Pr(Y=1 \mid secondary = 0, poverty = 0, age = 20)$$

$$\Pr(Y=1 \mid secondary = 1, poverty = 0, age = 20) = \frac{1}{1 + e^{-(-5.965 + .281 \times 1 - .335 \times 0 - .083 \times 20)}} = \frac{1}{1 + e^{7.344}} = .000646$$

Does this make sense?

$$\Pr(Y=1 \mid secondary = 0, poverty = 0, age = 20) = \frac{1}{1 + e^{-(-5.965 + .281 \times 0 - .335 \times 0 - .083 \times 20)}} = \frac{1}{1 + e^{7.625}} = .000488$$

Does this make sense?
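The two predicted probabilities can be reproduced with a short Python sketch. The coefficient signs used here are an assumption inferred from the exponents 7.344 and 7.625 quoted above; the sign on poverty in particular does not affect these results because poverty = 0. The function name `logit_prob` is hypothetical.

```python
import math

def logit_prob(secondary, poverty, age):
    """Logit probability from the column (3) coefficients quoted above (signs assumed)."""
    z = -5.965 + 0.281 * secondary - 0.335 * poverty - 0.083 * age
    return 1.0 / (1.0 + math.exp(-z))

p_secondary = logit_prob(secondary=1, poverty=0, age=20)
p_no_secondary = logit_prob(secondary=0, poverty=0, age=20)
print(f"{p_secondary:.6f}  {p_no_secondary:.6f}")
```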

Predicted change in probabilities:

$$\Pr(Y=1 \mid secondary = 1, poverty = 0, age = 20) - \Pr(Y=1 \mid secondary = 0, poverty = 0, age = 20) = .000646 - .000488 = .000158$$

Both these statements are true:
- The probability of being a Hezbollah militant increases by 0.0158 percentage points, if secondary school is attended.
- The probability of being a Hezbollah militant increases by 32%, if secondary school is attended (.000158/.000488 = .32).
- These sound so different! What is going on?

Estimation and Inference in Probit (and Logit) Models (SW Section 11.3)

Probit model:

$$\Pr(Y = 1 \mid X) = \Phi(\beta_0 + \beta_1 X)$$

Estimation and inference

- How can we estimate $\beta_0$ and $\beta_1$?
- What is the sampling distribution of the estimators?
- Why can we use the usual methods of inference?
- First motivate via nonlinear least squares
- Then discuss maximum likelihood estimation (what is actually done in practice)

Probit estimation by nonlinear least squares

Recall OLS:

$$\min_{b_0, b_1} \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

- The result is the OLS estimators $\hat\beta_0$ and $\hat\beta_1$

Nonlinear least squares estimator of probit coefficients:

$$\min_{b_0, b_1} \sum_{i=1}^{n} [Y_i - \Phi(b_0 + b_1 X_i)]^2$$

How to solve this minimization problem?
- Calculus doesn't give an explicit solution.
- Solved numerically using the computer (specialized minimization algorithms)
- In practice, nonlinear least squares isn't used because it isn't efficient: an estimator with a smaller variance is available, namely the maximum likelihood estimator

Probit estimation by maximum likelihood

The likelihood function is the conditional density of $Y_1, \ldots, Y_n$ given $X_1, \ldots, X_n$, treated as a function of the unknown parameters $\beta_0$ and $\beta_1$.
- The maximum likelihood estimator (MLE) is the value of $(\beta_0, \beta_1)$ that maximizes the likelihood function.

- The MLE is the value of $(\beta_0, \beta_1)$ that best describes the full distribution of the data.
- In large samples, the MLE is:
  - consistent
  - normally distributed
  - efficient (has the smallest variance of all estimators)

Special case: the probit MLE with no X

$$Y = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases} \qquad \text{(Bernoulli distribution)}$$

Data: $Y_1, \ldots, Y_n$, i.i.d.

Derivation of the likelihood starts with the density of $Y_1$:

$$\Pr(Y_1 = 1) = p \quad \text{and} \quad \Pr(Y_1 = 0) = 1 - p$$

so

$$\Pr(Y_1 = y_1) = p^{y_1} (1 - p)^{1 - y_1} \qquad \text{(verify this for } y_1 = 0, 1\text{!)}$$

Joint density of $(Y_1, Y_2)$: Because $Y_1$ and $Y_2$ are independent,

$$\Pr(Y_1 = y_1, Y_2 = y_2) = \Pr(Y_1 = y_1) \times \Pr(Y_2 = y_2) = [p^{y_1} (1 - p)^{1 - y_1}] \times [p^{y_2} (1 - p)^{1 - y_2}] = p^{(y_1 + y_2)} (1 - p)^{[2 - (y_1 + y_2)]}$$

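Joint density of $(Y_1, \ldots, Y_n)$:

$$\Pr(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = [p^{y_1} (1 - p)^{1 - y_1}] \times [p^{y_2} (1 - p)^{1 - y_2}] \times \cdots \times [p^{y_n} (1 - p)^{1 - y_n}] = p^{\sum_{i=1}^{n} y_i} (1 - p)^{\left(n - \sum_{i=1}^{n} y_i\right)}$$

The likelihood is the joint density, treated as a function of the unknown parameters, which here is p:

$$f(p; Y_1, \ldots, Y_n) = p^{\sum_{i=1}^{n} Y_i} (1 - p)^{\left(n - \sum_{i=1}^{n} Y_i\right)}$$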

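The MLE maximizes the likelihood. It's easier to work with the logarithm of the likelihood, $\ln[f(p; Y_1, \ldots, Y_n)]$:

$$\ln[f(p; Y_1, \ldots, Y_n)] = \left(\sum_{i=1}^{n} Y_i\right) \ln(p) + \left(n - \sum_{i=1}^{n} Y_i\right) \ln(1 - p)$$

Maximize the likelihood by setting the derivative = 0:

$$\frac{d \ln f(p; Y_1, \ldots, Y_n)}{dp} = \left(\sum_{i=1}^{n} Y_i\right) \frac{1}{p} + \left(n - \sum_{i=1}^{n} Y_i\right) \left(\frac{-1}{1 - p}\right) = 0$$

Solving for p yields the MLE; that is, $\hat{p}^{MLE}$ satisfies

$$\left(\sum_{i=1}^{n} Y_i\right) \frac{1}{\hat{p}^{MLE}} + \left(n - \sum_{i=1}^{n} Y_i\right) \left(\frac{-1}{1 - \hat{p}^{MLE}}\right) = 0$$

or

$$\left(\sum_{i=1}^{n} Y_i\right) \frac{1}{\hat{p}^{MLE}} = \left(n - \sum_{i=1}^{n} Y_i\right) \frac{1}{1 - \hat{p}^{MLE}}$$

or

$$\frac{\bar{Y}}{1 - \bar{Y}} = \frac{\hat{p}^{MLE}}{1 - \hat{p}^{MLE}}$$

or

$$\hat{p}^{MLE} = \bar{Y} = \text{fraction of 1's}$$

Whew... a lot of work to get back to the first thing you would think of using... but the nice thing is that this whole approach generalizes to more complicated models...

The MLE in the no-X case (Bernoulli distribution), ctd.: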

$$\hat{p}^{MLE} = \bar{Y} = \text{fraction of 1's}$$

- For $Y_i$ i.i.d. Bernoulli, the MLE is the "natural" estimator of p, the fraction of 1's, which is $\bar{Y}$
- We already know the essentials of inference:
  - In large n, the sampling distribution of $\hat{p}^{MLE} = \bar{Y}$ is normally distributed
  - Thus inference is "as usual": hypothesis testing via t-statistic, confidence interval as $\pm 1.96\,SE$

The MLE in the no-X case (Bernoulli distribution), ctd:

- The theory of maximum likelihood estimation says that $\hat{p}^{MLE}$ is the most efficient estimator of p, of all possible estimators, at least for large n. (Much stronger than the Gauss-Markov theorem.) This is why people use the MLE.
- STATA note: to emphasize the requirement of large n, the printout calls the t-statistic the z-statistic; instead of the F-statistic, the chi-squared statistic (= q × F).
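The Bernoulli result can be checked numerically. The sketch below is illustrative only: the ten data points are hypothetical, the maximization is a crude grid search rather than calculus, and the standard error uses the usual large-sample formula for a sample proportion, $\sqrt{\bar{Y}(1-\bar{Y})/n}$, even though n = 10 is far from large.

```python
import math

def bernoulli_loglik(p, ys):
    """ln f(p; Y_1..Y_n) = (sum Y_i) ln p + (n - sum Y_i) ln(1 - p)."""
    s, n = sum(ys), len(ys)
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

# Hypothetical data: 10 draws with three 1's
ys = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
n = len(ys)
y_bar = sum(ys) / n

# Grid search over p in (0, 1): the maximizer should coincide with Y-bar
grid = [k / 1000 for k in range(1, 1000)]
p_mle = max(grid, key=lambda p: bernoulli_loglik(p, ys))

# Large-n inference: SE = sqrt(p(1-p)/n), 95% CI = estimate +/- 1.96 SE
se = math.sqrt(y_bar * (1 - y_bar) / n)
ci = (y_bar - 1.96 * se, y_bar + 1.96 * se)
print(p_mle, y_bar, round(se, 4), ci)
```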

Now we extend this to probit, in which the probability is conditional on X: the MLE of the probit coefficients.

The probit likelihood with one X

The derivation starts with the density of $Y_1$, given $X_1$:

$$\Pr(Y_1 = 1 \mid X_1) = \Phi(\beta_0 + \beta_1 X_1)$$

$$\Pr(Y_1 = 0 \mid X_1) = 1 - \Phi(\beta_0 + \beta_1 X_1)$$

so

$$\Pr(Y_1 = y_1 \mid X_1) = \Phi(\beta_0 + \beta_1 X_1)^{y_1} [1 - \Phi(\beta_0 + \beta_1 X_1)]^{1 - y_1}$$

The probit likelihood function is the joint density of $Y_1, \ldots, Y_n$ given $X_1, \ldots, X_n$, treated as a function of $\beta_0$, $\beta_1$:

$$f(\beta_0, \beta_1; Y_1, \ldots, Y_n \mid X_1, \ldots, X_n) = \{\Phi(\beta_0 + \beta_1 X_1)^{Y_1} [1 - \Phi(\beta_0 + \beta_1 X_1)]^{1 - Y_1}\} \times \cdots \times \{\Phi(\beta_0 + \beta_1 X_n)^{Y_n} [1 - \Phi(\beta_0 + \beta_1 X_n)]^{1 - Y_n}\}$$

- Can't solve for the maximum explicitly
- Must maximize using numerical methods
- As in the case of no X, in large samples:
  - $\hat\beta_0^{MLE}$, $\hat\beta_1^{MLE}$ are consistent
  - $\hat\beta_0^{MLE}$, $\hat\beta_1^{MLE}$ are normally distributed
  - $\hat\beta_0^{MLE}$, $\hat\beta_1^{MLE}$ are asymptotically efficient among all estimators (assuming the probit model is the correct model)

The Probit MLE, ctd.
- Standard errors of $\hat\beta_0^{MLE}$, $\hat\beta_1^{MLE}$ are computed automatically
- Testing and confidence intervals proceed as usual
- For multiple X's, see SW App. 11.2
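Since the probit likelihood has to be maximized numerically, here is a minimal sketch of one way to do it. It assumes Fisher scoring (Newton-type steps that use the expected information matrix) with step halving; the source does not say which algorithm Stata uses. The toy data `xs`, `ys` and all function names are hypothetical.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik(b0, b1, xs, ys):
    """Probit log likelihood: sum of Y ln Phi(z) + (1-Y) ln[1 - Phi(z)]."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = Phi(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def score_and_info(b0, b1, xs, ys):
    """Gradient of the log likelihood and the 2x2 expected information matrix."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        p, d = Phi(z), phi(z)
        w = d / (p * (1.0 - p))
        g0 += w * (y - p)
        g1 += w * (y - p) * x
        i00 += w * d
        i01 += w * d * x
        i11 += w * d * x * x
    return (g0, g1), (i00, i01, i11)

def probit_mle(xs, ys, tol=1e-9, max_iter=200):
    b0 = b1 = 0.0
    for _ in range(max_iter):
        (g0, g1), (i00, i01, i11) = score_and_info(b0, b1, xs, ys)
        det = i00 * i11 - i01 * i01
        s0 = (i11 * g0 - i01 * g1) / det   # step = information^{-1} * score
        s1 = (i00 * g1 - i01 * g0) / det
        step, ll_old = 1.0, loglik(b0, b1, xs, ys)
        # Halve the step if the log likelihood fell by more than rounding noise
        while step > 1e-8 and loglik(b0 + step * s0, b1 + step * s1, xs, ys) < ll_old - 1e-12:
            step /= 2.0
        b0, b1 = b0 + step * s0, b1 + step * s1
        if max(abs(s0), abs(s1)) < tol:
            break
    return b0, b1

# Hypothetical toy data (not the HMDA data); the 0's and 1's overlap in x
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = probit_mle(xs, ys)
print(round(b0_hat, 3), round(b1_hat, 3), round(loglik(b0_hat, b1_hat, xs, ys), 4))
```

At the returned estimates the score (the derivative of the log likelihood) should be essentially zero, which is the numerical counterpart of "setting the derivative = 0" in the no-X case.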

The logit likelihood with one X
- The only difference between probit and logit is the functional form used for the probability: $\Phi$ is replaced by the cumulative logistic function.
- Otherwise, the likelihood is similar; for details see SW App. 11.2
- As with probit,
  - $\hat\beta_0^{MLE}$, $\hat\beta_1^{MLE}$ are consistent
  - $\hat\beta_0^{MLE}$, $\hat\beta_1^{MLE}$ are normally distributed
  - Their standard errors can be computed
  - Testing and confidence intervals proceed as usual

Measures of fit for logit and probit

The $R^2$ and $\bar{R}^2$ don't make sense here (why?). So, two other specialized measures are used:

1. The fraction correctly predicted = fraction of Y's for which the predicted probability is >50% (if $Y_i = 1$) or is <50% (if $Y_i = 0$).
2. The pseudo-$R^2$ measures the fit using the likelihood function: it measures the improvement in the value of the log likelihood, relative to having no X's (see SW App. 11.2). This simplifies to the $R^2$ in the linear model with normally distributed errors.

Application to the Boston HMDA Data (SW Section 11.4)
- Mortgages (home loans) are an essential part of buying a home.
- Is there differential access to home loans by race?
- If two otherwise identical individuals, one white and one black, applied for a home loan, is there a difference in the probability of denial?

The HMDA Data Set
- Data on individual characteristics, property characteristics, and loan denial/acceptance
- The mortgage application process circa 1990-1991:
  - Go to a bank or mortgage company
  - Fill out an application (personal + financial info)
  - Meet with the loan officer
- Then the loan officer decides: by law, in a race-blind way. Presumably, the bank wants to make profitable loans, and the loan officer doesn't want to originate defaults.

The loan officer's decision
- Loan officer uses key financial variables:
  - P/I ratio
  - housing expense-to-income ratio
  - loan-to-value ratio
  - personal credit history
- The decision rule is nonlinear:
  - loan-to-value ratio > 80%
  - loan-to-value ratio > 95% (what happens in default?)
  - credit score

Regression specifications

Pr(deny=1|black, other X's) = ...
- linear probability model
- probit

Main problem with the regressions so far: potential omitted variable bias. All these (i) enter the loan officer decision function, and (ii) are or could be correlated with race:
- wealth, type of employment
- credit history
- family status

The HMDA data set is very rich...

[Table 11.2 and its continuations not reproduced]

Summary of Empirical Results
- Coefficients on the financial variables make sense.
- Black is statistically significant in all specifications
- Race-financial variable interactions aren't significant.
- Including the covariates sharply reduces the effect of race on denial probability.
- LPM, probit, logit: similar estimates of effect of race on the probability of denial.
- Estimated effects are large in a "real world" sense.

Remaining threats to internal, external validity

Internal validity
1. omitted variable bias: what else is learned in the in-person interviews?
2. functional form misspecification (no...)
3. measurement error (originally, yes; now, no...)
4. selection: random sample of loan applications; define population to be loan applicants
5. simultaneous causality (no)

External validity
- This is for Boston in 1990-91. What about today?

Summary (SW Section 11.5)
- If $Y_i$ is binary, then E(Y|X) = Pr(Y=1|X)
- Three models:
  - linear probability model (linear multiple regression)
  - probit (cumulative standard normal distribution)
  - logit (cumulative standard logistic distribution)
- LPM, probit, logit all produce predicted probabilities
- Effect of $\Delta X$ is change in conditional probability that Y=1. For logit and probit, this depends on the initial X
- Probit and logit are estimated via maximum likelihood
  - Coefficients are normally distributed for large n
  - Large-n hypothesis testing, conf. intervals are as usual
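Two of the summary points can be illustrated with the estimates reported earlier for the specifications that include only P/I ratio and black. The Python sketch below is an illustration under those coefficients; the function names and the second starting value (P/I ratio = .6) are assumptions chosen for the comparison.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

# Estimates reported earlier: regressors are P/I ratio and black
def lpm(pi, black):
    return -0.091 + 0.559 * pi + 0.177 * black

def probit(pi, black):
    return norm_cdf(-2.258738 + 2.741637 * pi + 0.7081579 * black)

def logit(pi, black):
    return logistic_cdf(-4.125558 + 5.370362 * pi + 1.272782 * black)

# Estimated effect of race at P/I ratio = .3 under each model
race_effect = {name: f(0.3, 1) - f(0.3, 0)
               for name, f in (("LPM", lpm), ("probit", probit), ("logit", logit))}
print({k: round(v, 3) for k, v in race_effect.items()})

# Effect of raising P/I ratio by .1 for a white applicant, from two starting points:
# constant for the LPM, but dependent on the initial X for probit and logit
for start in (0.3, 0.6):
    print(start, [round(f(start + 0.1, 0) - f(start, 0), 3) for f in (lpm, probit, logit)])
```

The three race effects come out close to one another, while the probit and logit effects of a .1 increase in P/I ratio differ between the two starting points.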
