The Fundamentals of Political Science Research, 2nd Edition
Chapter 10: Multiple Regression Model Specification

Chapter 10 Outline

Being Smart with Dummy Independent Variables in OLS
Testing Interactive Hypotheses with Dummy Variables
Outliers and Influential Cases in OLS
Multicollinearity

Being Smart with Dummy Independent Variables in OLS

In this section, we consider a series of scenarios involving independent variables that are not continuous:

Using Dummy Variables to Test Hypotheses about a Categorical Independent Variable with Only Two Values
Using Dummy Variables to Test Hypotheses about a Categorical Independent Variable with More Than Two Values
Using Dummy Variables to Test Hypotheses about Multiple Independent Variables

Using Dummy Variables to Test Hypotheses about a Categorical Independent Variable with Only Two Values

We begin with a relatively simple case in which we have a categorical independent variable that takes on one of two possible values for all cases. Categorical variables like this are commonly referred to as dummy variables. The most common form of dummy variable is one that takes on values of one or zero.

These variables are also sometimes referred to as indicator variables when a value of one indicates the presence of a particular characteristic and a value of zero indicates the absence of that characteristic.

Hillary Clinton Thermometer Scores Example

Data from 1996 NES.
Dependent variable: Hillary Clinton Thermometer Rating
Independent variables: Income and Gender

Each respondent's gender was coded as equaling either 1 for male or 2 for female. Although we could leave this gender variable as it is and run our analyses, we chose to use this variable to create two new dummy variables, male equaling 1 for yes and 0 for no, and female equaling 1 for yes and 0 for no.

Our first inclination is to estimate an OLS model in which the specification is the following:

$$\text{Hillary Thermometer}_i = \alpha + \beta_1 \text{Income}_i + \beta_2 \text{Male}_i + \beta_3 \text{Female}_i + u_i$$

Stata output when we include both gender dummy variables in our model
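A minimal sketch of the recoding described above, using made-up data and hypothetical variable names (nothing here comes from the NES file itself). It also checks the rank of the design matrix when both dummies sit next to a constant, which is one way to see why software cannot estimate all of the requested parameters.

```python
import numpy as np

# Hypothetical data: gender coded 1 = male, 2 = female, as in the example.
gender = np.array([1, 2, 2, 1, 2, 1, 2, 2])
income = np.array([3.0, 7.0, 12.0, 5.0, 9.0, 15.0, 1.0, 20.0])  # made-up values

# Recode the 1/2 variable into two 0/1 dummy variables.
male = (gender == 1).astype(float)
female = (gender == 2).astype(float)

# Design matrix with a constant and BOTH dummies (4 columns).
constant = np.ones_like(income)
X_both = np.column_stack([constant, income, male, female])

# Design matrix with a constant and only ONE dummy (3 columns).
X_one = np.column_stack([constant, income, female])

# The rank counts the linearly independent columns.
rank_both = np.linalg.matrix_rank(X_both)
rank_one = np.linalg.matrix_rank(X_one)

print("columns:", X_both.shape[1], "rank:", rank_both)
print("columns:", X_one.shape[1], "rank:", rank_one)
```

When the rank is smaller than the number of columns, at least one column is an exact linear function of the others, and OLS has no unique solution for the full set of parameters.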

The dummy trap

We can see that Stata has reported the results from a model that omits one of the two gender dummy variables instead of what we asked for. This is the case because we have failed to meet the additional minimal mathematical criterion that we introduced when we moved from two-variable OLS to multiple OLS in Chapter 9: no perfect multicollinearity. The reason that we have failed to meet this is that, for two of the independent variables in our model, Male and Female, it is the case that

$$\text{Male}_i + \text{Female}_i = 1$$

In other words, our variables Male and Female are perfectly (negatively) correlated: each is an exact linear function of the other. This situation is known as the dummy trap.

Avoiding the dummy trap

To avoid the dummy-variable trap, we have to omit one of our dummy variables. But we want to be able to compare the effects of being male with the effects of being female to test our hypothesis.

How can we do this if we have to omit one of our two variables that measure gender? Before we answer this question, let's look at the results in a table from the two different models in which we omit one of these two variables. We can learn a lot by looking at what is and what is not the same across these two models.

Two models of the effects of gender and income on Hillary Clinton Thermometer scores
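What is and is not the same across two such models can be previewed with a small simulation. The sketch below uses made-up data and an arbitrary data-generating process (the coefficients 60, -0.8, and 8 are assumptions for illustration, not estimates from the NES), and fits the same model twice, once omitting Male and once omitting Female.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
income = rng.uniform(1, 24, size=n)                 # hypothetical income scale
female = rng.integers(0, 2, size=n).astype(float)   # 1 = female, 0 = male
male = 1.0 - female

# Made-up data-generating process, for illustration only.
thermometer = 60.0 - 0.8 * income + 8.0 * female + rng.normal(0, 10, size=n)

def ols(y, *columns):
    """Return OLS coefficients [constant, columns...] via least squares."""
    X = np.column_stack([np.ones_like(y), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

b_with_female = ols(thermometer, income, female)  # Male is the omitted category
b_with_male = ols(thermometer, income, male)      # Female is the omitted category

print("income slope:      ", b_with_female[1], b_with_male[1])
print("gender coefficient:", b_with_female[2], b_with_male[2])
print("constant:          ", b_with_female[0], b_with_male[0])
```

In this sketch the income slope is identical in both fits, the gender coefficient has the same magnitude with the opposite sign, and each constant is the intercept for whichever group was omitted.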

Regression lines from the model with a dummy variable for gender

Using Dummy Variables to Test Hypotheses about a Categorical Independent Variable with More Than Two Values

When we have a categorical variable with more than two categories and we want to include it in an OLS model, things get more complicated. The best strategy for modeling the effects of such an independent variable is to include a dummy variable for all values of that independent variable except one.

The value of the independent variable for which we do not include a dummy variable is known as the reference category. This is the case because the parameter estimates for all of the dummy variables representing the other values of the independent variable are estimated in reference to that value of the independent variable. So let's say that we choose to estimate a model that includes income and a dummy variable for every category of religious identification except None. For this model we would be using None as our reference category for religious identification. This would mean that the parameter estimate for the Protestant dummy variable would be the estimated effect of being Protestant relative to being nonreligious.

The same model of religion and income on Hillary Clinton Thermometer scores with different reference categories

Using Dummy Variables to Test Hypotheses about Multiple Independent Variables

It is often the case that we will want to use multiple dummy independent variables in the same model.
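Looking back at the reference-category idea for a variable with more than two values, the sketch below shows how the choice of reference category changes what each dummy coefficient means. The three categories, the scores, and the helper `dummy_regression` are all hypothetical, and income is left out to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical categories and made-up thermometer scores, for illustration only.
religion = np.array(["None", "Protestant", "Catholic", "Protestant", "None",
                     "Catholic", "Protestant", "None", "Catholic", "Protestant"])
score = np.array([70.0, 40.0, 55.0, 50.0, 60.0, 65.0, 45.0, 80.0, 60.0, 45.0])

def dummy_regression(y, categories, reference):
    """Regress y on a constant plus a dummy for every category except `reference`."""
    levels = [c for c in sorted(set(categories.tolist())) if c != reference]
    dummies = [(categories == c).astype(float) for c in levels]
    X = np.column_stack([np.ones_like(y)] + dummies)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["constant"] + levels, coefs))

vs_none = dummy_regression(score, religion, reference="None")
vs_protestant = dummy_regression(score, religion, reference="Protestant")

print("reference = None:      ", vs_none)
print("reference = Protestant:", vs_protestant)
```

With no other independent variables, the constant is the mean score of the reference category and each dummy coefficient is the difference between that category's mean and the reference mean, so the Protestant coefficient in the first fit and the None coefficient in the second are equal in size with opposite signs.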

Remember from Chapter 9 that, when we moved from a bivariate regression model to a multiple regression model, we had to interpret each parameter estimate as the estimated effect of a one-point increase in that particular independent variable on the dependent variable, while controlling for the effects of all other independent variables in the model. When we interpret the estimated effect of each dummy independent variable, we are interpreting the parameter estimate as the estimated effect of that variable having a value of one versus zero on the dependent variable, while controlling for the effects of all other independent variables in the model, including the other dummy variables.

Model of Bargaining Duration

Two Overlapping Dummy Variables in Models by Martin and Vanberg

Testing Interactive Hypotheses with Dummy Variables

All of the OLS models that we have examined so far have been what we could call additive models. To calculate the predicted value of the dependent variable for a particular case from an additive model, we simply multiply each independent variable value for that case by the appropriate parameter estimate and add these values together. Interactive models contain at least one independent variable that we create by multiplying together two or more independent variables. When we specify interactive models, we are testing theories about how the effects of one independent variable on our dependent variable may be contingent on the value of another independent variable.

We begin with an additive model in which the independent variables are Women's Movement Thermometer and a dummy variable for the respondent's gender. In this model we are testing the theory that a respondent's feelings toward Hillary Clinton are a function of their feelings toward the women's movement and their own gender. This specification seems pretty reasonable, but we also want to test an additional theory that feelings toward the women's movement have a stronger effect on feelings toward Hillary Clinton among women than they do among men. In essence, we want to test the hypothesis that the slope of the line representing the relationship between Women's Movement Thermometer and Hillary Clinton Thermometer is steeper for women than it is for men.

To test this hypothesis, we need to create a new variable that is the product of the two independent variables in our model and include this new variable in our model:

By specifying our model as such, we have created two different models for women and men. So we can rewrite our model as

And we can rewrite the formula for women as

The effects of gender and feelings toward the women's movement on Hillary Clinton Thermometer scores
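The algebra behind the interactive specification can be sketched as follows. This is a reconstruction under stated assumptions rather than the slide's own equations: the gender dummy is assumed to be coded Female (1 for women, 0 for men), WMT is shorthand for Women's Movement Thermometer, and the symbols $\alpha$, $\beta_1$, $\beta_2$, $\beta_3$, and $u_i$ are chosen for illustration.

```latex
% Interactive model: the product term lets the slope on WMT differ by gender.
\text{Hillary Thermometer}_i = \alpha + \beta_1 \text{WMT}_i + \beta_2 \text{Female}_i
    + \beta_3 (\text{WMT}_i \times \text{Female}_i) + u_i

% Men (Female_i = 0): the terms involving Female drop out.
\text{Hillary Thermometer}_i = \alpha + \beta_1 \text{WMT}_i + u_i

% Women (Female_i = 1): collect the intercept terms and the slope terms.
\text{Hillary Thermometer}_i = (\alpha + \beta_2) + (\beta_1 + \beta_3)\,\text{WMT}_i + u_i
```

Under this coding, $\beta_3$ is the difference between the slope for women and the slope for men, so the hypothesis of a steeper line for women corresponds to $\beta_3 > 0$.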

Regression lines from the interactive model

Outliers and Influential Cases in OLS

In the regression setting, individual cases can be outliers in several different ways:

They can have unusual independent variable values. This is known as a case having large leverage.
They can have large residual values (usually we look at squared residuals to identify outliers of this variety).
They can have both large leverage and large residual values.

The relationship among these different concepts of outliers for a single case in OLS is often summarized as separate contributions to influence in the following formula:
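As a rough illustration of these three concepts (this is not the formula referred to above), the sketch below computes leverage, squared residuals, and one common influence measure, DFBETA, which this chapter uses for the Florida example. The data are simulated and the single unusual case is planted by hand. DFBETA here follows the chapter's verbal definition (change in the estimate when a case is dropped, divided by the original standard error); statistical packages may scale it slightly differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=n)   # made-up relationship

# Plant one hypothetical case with an unusual x value and an unusual y value.
x = np.append(x, 30.0)
y = np.append(y, 100.0)

X = np.column_stack([np.ones_like(x), x])

def fit(X, y):
    """OLS coefficients [constant, slope]."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

beta = fit(X, y)
residuals = y - X @ beta
squared_residuals = residuals ** 2

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Standard error of the slope from the full-sample fit.
dof = len(y) - X.shape[1]
sigma2 = residuals @ residuals / dof
se_slope = np.sqrt(sigma2 * XtX_inv[1, 1])

# DFBETA for the slope: refit without each case in turn.
dfbeta = np.empty(len(y))
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    dfbeta[i] = (beta[1] - fit(X[keep], y[keep])[1]) / se_slope

print("highest-leverage case:", int(np.argmax(leverage)))
print("largest |DFBETA| case:", int(np.argmax(np.abs(dfbeta))))
```

Sorting cases by the absolute value of `dfbeta` gives the kind of "largest DFBETA scores" listing that is used to flag influential cases.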

Identifying Influential Cases

One of the most famous cases of outliers/influential cases in political data comes from the 2000 U.S. presidential election in Florida. In an attempt to measure the extent to which ballot irregularities may have influenced election results, a variety of models were estimated in which the raw vote numbers for candidates across different counties were the dependent variables of interest. As an example of such a model, we will work with the following:

$$\text{Buchanan}_i = \alpha + \beta\,\text{Gore}_i + u_i$$

In this model the cases are individual counties in Florida, the dependent variable ($\text{Buchanan}_i$) is the number of votes in each Florida county for the Reform Party candidate Patrick Buchanan, and the independent variable is the number of votes in each Florida county for the Democratic Party's nominee Al Gore ($\text{Gore}_i$).

Votes for Gore and Buchanan in Florida counties in the 2000 U.S. presidential election

Stata lvr2plot for the model presented in the previous table

OLS line with scatter plot for Florida 2000

The five largest (absolute-value) DFBETA scores for $\beta$ from the initial model

DFBETA scores are calculated as the difference between the original parameter estimate and the parameter estimate obtained without each case, divided by the standard error of the original parameter estimate.

Votes for Gore and Buchanan in Florida counties in the 2000 U.S. presidential election

Multicollinearity

We know from Chapter 9 that a minimal mathematical property for estimating a multiple OLS model is that there is no perfect multicollinearity.

Perfect multicollinearity, you will recall, occurs when one independent variable is an exact linear function of one or more other independent variables in a model. In practice, perfect multicollinearity is usually the result of a small number of cases relative to the number of parameters we are estimating, limited independent variable values, or model misspecification. A much more common and vexing issue is high multicollinearity. As a result, when people refer to multicollinearity, they almost always mean high multicollinearity. From here on, when we refer to multicollinearity, we will mean high, but less-than-perfect, multicollinearity.

Multicollinearity is induced by a small number of degrees of freedom and/or high correlation between independent variables.

Venn diagram with multicollinearity

Detecting Multicollinearity

It is very important to know when you have multicollinearity. If we have a high $R^2$ statistic, but none (or very few) of our parameter estimates is statistically significant, we should be suspicious of multicollinearity. We should also be suspicious of multicollinearity if we see that, when we add and remove independent variables from our model, the parameter estimates for other independent variables (and especially their standard errors) change substantially. A more formal way to diagnose multicollinearity is to calculate the variance inflation factor (VIF) for each of our independent variables. This calculation is based on an auxiliary regression model in which one independent variable, which we will call $X_j$, is the dependent variable and all of the other independent variables are independent variables.
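The auxiliary-regression procedure can be sketched in a few lines. The function `vif` and the simulated variables are hypothetical; the calculation uses the standard definition, VIF = 1 / (1 - R²), where R² comes from regressing one independent variable on all of the others.

```python
import numpy as np

def vif(X):
    """Return the VIF for each column of X (cases x independent variables, no constant)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]                               # X_j plays the dependent variable
        others = np.delete(X, j, axis=1)               # all other independent variables
        Z = np.column_stack([np.ones(n), others])      # auxiliary model includes a constant
        coefs, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ coefs
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated example: x1 and x2 are correlated at roughly 0.9, x3 is unrelated.
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

A VIF near 1 means the variable shares almost no variance with the other independent variables; larger values mean the variance of its parameter estimate is inflated by that factor relative to the uncorrelated case.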

The $R^2$ statistic from this auxiliary model, $R_j^2$, is then used to calculate the VIF for variable $j$ as follows:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

Multicollinearity: A Simulated Example

To simulate multicollinearity, we are going to create a population with the following characteristics:

Two variables $X_{1i}$ and $X_{2i}$ such that the correlation between them is 0.9.
A variable $u_i$ randomly drawn from a normal distribution, centered around 0 with variance equal to 1.
A variable $Y_i$ such that

We can see from the description of our simulated population that we have met all of the OLS assumptions, but that we have a high correlation between our two independent variables. Now we will conduct a series of random draws (samples) from this population and look at the results from the following regression models:
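A simulation in this spirit can be sketched as follows. The correlation of 0.9 and the error variance of 1 come from the description above; everything else is an assumption made for illustration, including the coefficients used to build Y (0.5, 1, and 1), the sample sizes, and the single specification with both independent variables.

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.9
cov = np.array([[1.0, rho], [rho, 1.0]])   # X1 and X2 correlated at 0.9

def draw_and_fit(n):
    """Draw n cases from the simulated population and fit Y on X1 and X2."""
    X12 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    u = rng.normal(0.0, 1.0, size=n)                       # mean 0, variance 1
    y = 0.5 + 1.0 * X12[:, 0] + 1.0 * X12[:, 1] + u        # assumed coefficients
    X = np.column_stack([np.ones(n), X12])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coefs, se

sample_sizes = (25, 100, 1000)   # hypothetical sizes
results = {n: draw_and_fit(n) for n in sample_sizes}

for n, (coefs, se) in results.items():
    print(n, "coefficients:", coefs.round(2), "standard errors:", se.round(2))
```

The pattern to look for is that the standard errors on the two correlated variables are large in the small draws and shrink as the sample grows, even though the correlation between the variables stays the same.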

Random draws of increasing size from a population with substantial multicollinearity

Multicollinearity: A Real-World Example

We estimate a model of the thermometer scores for U.S. voters for George W. Bush in 2004. Our model specification is the following:

Although we have distinct theories about the causal impact of each independent variable on people's feelings toward Bush, the table on the next slide indicates that some of these independent variables are substantially correlated with each other.

Pairwise correlations between independent variables

Model results from random draws of increasing size from the 2004 NES

Multicollinearity: What Should I Do?

The reason why multicollinearity is vexing is that there is no magical statistical cure for it. What is the best thing to do when you have multicollinearity? Easy (in theory): Collect more data. But data are expensive to collect. If we had more data, we would use them and we wouldn't have hit this problem in the first place. So, if you do not have an easy way to increase your sample size, then multicollinearity ends up being something that you just have to live with. It is important to know that you have multicollinearity and to present your multicollinearity by reporting the results of VIF statistics or what happens to your model when you add and drop the guilty variables.