Using and Applying Multiple Regression Analysis: OLS Hierarchical / Sequential Modeling in SPSS
Faculty Research Workshop, February 19, 2014
Tom Lehman, Ph.D.
Professor, Department of Economics
Indiana Wesleyan University

Overview
Introduction to Correlation Analysis and OLS
OLS Multiple Regression Analysis Basics
OLS Hierarchical Multiple Regression
Definition, Purposes and Technique
Examples of OLS Hierarchical Modeling
Q&A
References

Correlation Analysis: It's All About Relationships
Correlation analysis hypothesis testing is designed to investigate whether two or more variables are correlated (co-vary), and whether this covariance is statistically significant.

Examples:
What is the relationship between housing costs and monthly rental prices in urban housing markets?
What is the relationship between GDP growth and growth in capital investment expenditures in the macroeconomy?
What is the relationship between education levels and the unemployment rate in an urban economy?
Does advertising expenditure increase sales revenues? If we spend $100,000 on ad costs, what are predicted gross sales?
What is the relationship between county levels of educational attainment and county household income?

Bivariate Correlation Analysis
Bivariate = two variables
Exploring the relationship between a dependent variable (DV) and a single independent variable (IV)
Ideally both variables should be interval or ratio level (continuous)
Categorical (nominal or ordinal) variables do not work as well in regression analysis (exception: dummy variables in multiple regression)

Dependent Variable (DV):
The variable (measurable data used to operationalize a concept) that is thought to depend upon or be influenced by another
The variable whose value is being predicted or estimated
Independent Variable (IV):
The variable that is hypothesized to influence the behavior of the DV
The IV is sometimes referred to as a predictor variable; it may predict the behavior of the DV
We use the values of an IV to predict or estimate the value of a DV in a regression equation: Y = a + bX

The Scatter Plot Diagram
[Figure: scatter plot of values of the DV (Y) against values of the IV (X)]

Linear Relationship: The Best Fit (Regression) Line
The best-fit line (a.k.a. the predicted regression line) assumes a linear relationship; it traces the path through the scatter plot that lies, on average, as close as possible to the data points.
OLS regression minimizes the sum of the squared distances between the observed data points and the regression line of predicted points.
[Figure: scatter plot of values of the DV (Y) against values of the IV (X), with the best-fit line]

Coefficient of Correlation: Pearson's r
Pearson's product-moment correlation coefficient
A computed value between -1.00 and +1.00 that measures the strength of linear association between X (IV) and Y (DV)
The closer the absolute value of Pearson's r is to 1.00, the stronger the association
A value of -1.00 is a perfect negative correlation
A value of +1.00 is a perfect positive correlation
A value of 0 indicates no linear correlation
A positive Pearson's r = positive or direct relationship
A negative Pearson's r = negative or inverse relationship
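The definition above can be made concrete with a short Python sketch. It computes Pearson's r from deviations about the X and Y means; the function name `pearson_r` and the five data points are invented purely for illustration and are not taken from the workshop.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Cross-products of deviations: positive when X and Y are on the
    # same side of their means, negative when they are on opposite sides.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Invented illustration data (assumption): X = ad spending, Y = sales
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(ad_spend, sales)
print(f"Pearson's r = {r:.3f}")

# A perfectly inverse relationship gives r = -1
r_inverse = pearson_r([1, 2, 3], [3, 2, 1])
print(f"Perfect negative correlation: r = {r_inverse:.2f}")
```

The sign of r comes entirely from the cross-product term `sxy`, while the denominator only rescales the result into the range from -1.00 to +1.00.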

Positive Pearson's r and the X and Y Means Quadrants
[Figure: scatter plot of Y against X, divided into four quadrants by the mean of the X variable (X mean) and the mean of the Y variable (Y mean)]
An X value above the X mean is paired with a Y value above the Y mean in the upper right quadrant (leads to + coefficient)
An X value below the X mean is paired with a Y value below the Y mean in the lower left quadrant (leads to + coefficient)
Very few outliers in the opposing two quadrants

Negative Pearson's r and the X and Y Means Quadrants
[Figure: scatter plot of Y against X, divided into four quadrants by the mean of the X variable (X mean) and the mean of the Y variable (Y mean)]
An X value below the X mean is paired with a Y value above the Y mean in the upper left quadrant (leads to - coefficient)
An X value above the X mean is paired with a Y value below the Y mean in the lower right quadrant (leads to - coefficient)
Very few outliers in the opposing two quadrants

Pearson's $R^2$: Coefficient of Determination
The coefficient of determination is the squared value of Pearson's r, expressed as a positive (+) percentage
Pearson's $R^2$ is a measure of the percent of variation in the DV explained (or accounted for) by the variation in the IV
Example: If r = +0.849, then $R^2$ = 0.721
Interpretation: Roughly 72.1% of the variation in the DV can be explained by the variation in the IV
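The "percent of variation explained" reading can be checked numerically. The Python sketch below squares r and also computes the share of the DV's variation that a fitted line accounts for. In the bivariate case the two numbers agree. The data and the helper name `r_and_r_squared` are invented for illustration; only the value r = +0.849 comes from the example above.

```python
def r_and_r_squared(x, y):
    """Return (r, r squared, share of DV variation explained by the fitted line)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)  # total variation in the DV

    r = sxy / (sxx * syy) ** 0.5

    # Best-fit line: predicted Y = a + b * X
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Variation in the DV left unexplained by the line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    return r, r ** 2, 1 - sse / syy

# Invented illustration data (assumption, not from the workshop)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, r2_from_r, r2_from_fit = r_and_r_squared(x, y)
print(f"r = {r:.3f}, r squared = {r2_from_r:.3f}, explained share = {r2_from_fit:.3f}")

# The example above: 0.849 squared is 0.720801, which rounds to 0.721
slide_r2 = round(0.849 ** 2, 3)
print(f"R2 for r = 0.849: {slide_r2}")
```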

The Regression Line and the Least Squares Principle
The regression equation: Y = a + bX
Regression analysis and the regression equation are used to estimate the best-fit regression line from the X-Y data
Simply hand-drawing a best-fit line through a scatter plot is subjective and unreliable
We need a precise statistical method to estimate the true best-fit regression line
Estimated Y value ($\hat{Y}$) = Y-intercept + slope × (given value of X), that is, $\hat{Y} = a + bX$

Least-Squares Principle
The best-fit regression line is statistically estimated by minimizing the sum of the squared vertical differences between the actual Y values ($Y$) and the predicted Y values ($\hat{Y}$)
Minimizing the distances between the best-fit line ($\hat{Y}$) and the actual values of Y
Minimizing $\sum (Y - \hat{Y})^2$

Multiple Correlation Analysis
CLR: Classic Linear Regression
Multiple = more than two variables (more precise, more thorough than simple bivariate regression analysis)
Exploring the relationship between a dependent variable (DV) and two or more independent variables (IVs)
Variables must be interval or ratio level (continuous)
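The least-squares principle carries over unchanged when there are two or more IVs: the intercept and slopes are still chosen to minimize the sum of squared residuals. The following is a minimal Python sketch using NumPy's `numpy.linalg.lstsq`. The workshop itself uses SPSS, so this is a stand-in for illustration only, and the six observations and the names `x1`, `x2`, `sse` are invented.

```python
import numpy as np

# Invented illustration data (assumption): two IVs and one DV
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
y = np.array([5.0, 4.5, 8.0, 11.5, 14.0, 15.5])

# Design matrix: a column of ones for the intercept, then one column per IV
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of the intercept a and the slopes b1, b2
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs

def sse(c):
    """Sum of squared residuals (actual Y minus predicted Y) for coefficients c."""
    return float(np.sum((y - X @ c) ** 2))

best_sse = sse(coefs)
# Moving any coefficient away from the least-squares estimate increases the SSE
nudged_sse = sse(coefs + np.array([0.0, 0.1, 0.0]))

print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
print(f"SSE at the estimates: {best_sse:.2f}; after nudging b1: {nudged_sse:.2f}")
```

Nudging one slope and watching the sum of squared residuals rise is a quick way to see that the fitted coefficients really are the minimizing values.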

Dependent Variable (DV):
The variable (measurable data used to operationalize a concept) that is thought to depend upon or be influenced by others
The variable whose value is being predicted or estimated
Independent Variables (IVs):
The variables that are, together, hypothesized to influence the behavior of the DV
The IVs are sometimes referred to as predictor variables; together, they may predict the behavior of the DV
We use the values of the IVs to predict or estimate the value of the DV in a regression equation: $Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n$

Why Multiple Regression? Controlling for Other Factors
Multiple regression analysis allows us to investigate the relationship between several IVs and a continuous DV while controlling for the effects of all the other IVs in the regression equation
In other words, we can observe the impact of a single IV on a DV while controlling for the effects of several other IVs simultaneously
Multiple regression allows us to hold constant the other IVs in the equation so that we can analyze the impact of each IV on the DV net of the disturbances of other factors
See Grimm and Yarnold, 1995; Gujarati, 1995; Kennedy, 2008; Tabachnick and Fidell, 2012

Bivariate and Multiple Linear Regression Assumptions (Berry, 1993)
For each value of X (IV) there is a group of Y (DV) values, and these values must be normally distributed
The means of these Y values lie on the predicted regression line
The DV must be a continuous variable (ratio or interval), not categorical
The relationship between the DV and all IVs must be linear, not curvilinear
The mean of the residuals ($Y - \hat{Y}$) must equal 0
The DV observations are statistically independent, with no autocorrelation (i.e., the DV cannot be autocorrelated with successive observations of itself; one DV value cannot have influenced another, as often occurs in time-series data)
Homoscedasticity: the spread (variance) of the residuals ($Y - \hat{Y}$) must be equal over the entire range of the predicted regression line; it must be the same for all values of X and cannot be heteroscedastic (Kennedy, 2008)
Multiple IVs included in the regression model cannot suffer from multicollinearity with each other

An Example of Heteroscedasticity
The error terms or residuals ($Y - \hat{Y}$) do not have equal spread along the entire regression line. As the value of the IV increases, the residuals get larger and larger, and the data points fan out wider about the regression line.
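The fanning-out pattern can be reproduced with a small Python sketch. It builds invented data whose noise grows in proportion to X, fits a line by least squares, and compares the average squared residual in the lower and upper halves of the X range. This is a rough diagnostic under stated assumptions, not a formal test; the data and the names `fit_line` and `spread_ratio` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for predicted Y = a + b * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Invented data (assumption): the noise alternates in sign and its size
# grows in proportion to X, so the points fan out as X increases.
x = list(range(1, 21))
noise = [0.2 * xi * (1 if xi % 2 == 0 else -1) for xi in x]
y = [2 + 3 * xi + e for xi, e in zip(x, noise)]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# With an intercept in the model, the residuals average to zero
mean_residual = sum(residuals) / len(residuals)

# x is already sorted, so the two halves are the low-X and high-X observations
half = len(x) // 2
low_spread = sum(e ** 2 for e in residuals[:half]) / half
high_spread = sum(e ** 2 for e in residuals[half:]) / half
spread_ratio = high_spread / low_spread

print(f"mean residual = {mean_residual:.6f}")
print(f"average squared residual, low X: {low_spread:.2f}, high X: {high_spread:.2f}")
print(f"ratio (high / low) = {spread_ratio:.2f}")
```

A ratio near 1 would be consistent with homoscedasticity; a ratio well above 1, as here, reflects residuals that widen along the regression line.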

[Figure: scatter plot for the heteroscedasticity example, values of the DV (Y) against values of the IV (X)]

Types of OLS Multiple Regression (Tabachnick and Fidell, 2012)
Standard or Simultaneous Multiple Regression Technique
All IVs are entered into the model simultaneously; reveals only the unique effects of each IV on the DV
A single model is constructed with all IVs included at the same time
Hierarchical or Sequential Multiple Regression Technique
Sets of IVs are entered into each regression model systematically, perhaps one by one
Allows the analyst to determine how much additional variance in the DV ($R^2$) is explained by adding consecutive IVs in a systematic pattern
Multiple regression models are generated, with each successive model containing more IVs than the previous models (Grimm and Yarnold, 1995)

Overview of Hierarchical MRC (Tabachnick and Fidell, 2012; Grimm and Yarnold, 1995)
The first regression estimation (model) contains one or more predictors
The next estimation (model) adds one or more new predictors to those used in the first analysis
The change in $R^2$ between consecutive models represents the proportion of variance in the DV shared exclusively with the newly entered variables
Caution: the partial coefficients in consecutive models are not directly comparable to one another. The impact of the variables entered in earlier steps is partialed from correlations involving variables entered in later steps (Grimm and Yarnold, 1995)

Example #1
Standard or simultaneous multiple regression estimation on a set of IVs
SPSS (Sweet and Grace-Martin, 2011)
Variables:
DV: county median household income 2011
IVs:
Geographic border to metro county
Educational attainment measures
Interstate highway density
Population density
Labor force participation rate

Example #2
Hierarchical multiple regression estimation using the same IVs, entered consecutively or sequentially
Variables:
DV: county median household income 2011
IVs:
Geographic border to metro county
Educational attainment measures
Interstate highway density
Population density
Labor force participation rate

Example #3
Hierarchical multiple regression estimation using a different model and variables
Variables:
DV: county per capita income 2011
IVs:
County workforce composition: farming and professional
Water amenities: square miles of county water area
Taxation: aggregate R/E taxes per capita
Immigration: county share of foreign-born population

Q&A

References

Berry, W.D. (1993). Understanding Regression Assumptions. Newbury Park, CA: Sage Publications.
Grimm, L.G. and Yarnold, P.R. (1995). Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association.
Gujarati, D.N. (1995). Basic Econometrics, 3rd edition. New York, NY: McGraw-Hill/Irwin.
Kennedy, P. (2008). A Guide to Econometrics, 6th edition. New York, NY: Wiley-Blackwell.
Lind, D.A., Marchal, W.G. and Wathen, S. (2011). Statistical Techniques in Business and Economics, 15th edition. Boston, MA: McGraw-Hill/Irwin.
Sweet, S.A. and Grace-Martin, K. (2011). Data Analysis With SPSS: A First Course in Applied Statistics, 3rd edition. Boston, MA: Allyn and Bacon.
Tabachnick, B.G. and Fidell, L.S. (2012). Using Multivariate Statistics, 6th edition. Boston, MA: Allyn and Bacon.

Examples of Recent Research Using Hierarchical Regression Techniques

Cimasi, R.J., Sharamitaro, A.P., and Seiler, R.L. (2013). The association between health literacy and preventable hospitalizations in Missouri: Implications in an era of reform. Journal of Health Care Finance, 40(2), 1-16.
Ruiz-Palomino, P., Saez-Martinez, F.J., and Martinez-Canas, R. (2013). Understanding pay satisfaction: Effects of supervisor ethical leadership on job motivating potential influence. Journal of Business Ethics, 118, 31-43.
Tartaglia, S. (2013). Different predictors of quality of life in urban environment. Social Indicators Research, 113, 1045-1053.
Zhoutao, C., Chen, J., and Song, Y. (2013). Does total rewards reduce core employees' turnover intention? International Journal of Business and Management, 8(20), 62-75.
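
As a rough illustration of the hierarchical (sequential) technique covered in this workshop, the Python sketch below enters predictors one block at a time and reports $R^2$ and the change in $R^2$ at each step. The workshop's examples are run in SPSS; Python with NumPy is swapped in here purely for illustration. All values are invented for eight hypothetical counties, and the variable names (`income`, `college`, `highway`, `lfpr`) and the helper `r_squared` are assumptions that only loosely mirror the variables listed in Example #2.

```python
import numpy as np

# Invented values for 8 hypothetical counties (NOT the workshop's data)
income = np.array([42.0, 45.5, 39.0, 51.0, 47.5, 55.0, 44.0, 60.0])   # DV, $1000s
college = np.array([18.0, 22.0, 15.0, 27.0, 24.0, 30.0, 20.0, 35.0])  # % with a degree
highway = np.array([0.5, 0.8, 0.3, 1.1, 0.6, 1.4, 0.9, 1.2])          # highway density
lfpr = np.array([60.0, 63.0, 58.0, 66.0, 61.0, 67.0, 64.0, 70.0])     # participation rate, %

def r_squared(y, predictors):
    """R squared of an OLS model of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Order of entry is chosen by the analyst; here one IV per step
blocks = [("college", college), ("highway", highway), ("lfpr", lfpr)]

entered = []       # predictors included so far
results = []       # (name of block added, R squared, change in R squared)
previous_r2 = 0.0
for name, values in blocks:
    entered.append(values)
    r2 = r_squared(income, entered)
    results.append((name, r2, r2 - previous_r2))
    previous_r2 = r2

for step, (name, r2, delta) in enumerate(results, start=1):
    print(f"Model {step}: + {name:<8} R2 = {r2:.3f}  change in R2 = {delta:.3f}")
```

Each model is refit from scratch with all predictors entered so far, so the change in $R^2$ at a step is the additional variance in the DV accounted for by the newly entered block. When predictors are correlated with one another, that increment depends on the order of entry, which is why the sequence should be chosen on substantive grounds.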