Week 10, Nov 3-7
Two Mini-Lectures
QMM 510, Fall 2014

Chapter 15 Chi-Square Tests
ML 10.1

Chapter Contents
15.1 Chi-Square Test for Independence
15.2 Chi-Square Tests for Goodness-of-Fit
15.3 Uniform Goodness-of-Fit Test
15.4 Poisson Goodness-of-Fit Test
15.5 Normal Chi-Square Goodness-of-Fit Test

15.6 ECDF Tests (Optional)

So many topics, so little time.

Chapter 15 Chi-Square Test for Independence

Contingency Tables
A contingency table is a cross-tabulation of n paired observations into categories.

Each cell shows the count of observations that fall into the category defined by its row (r) and column (c) heading.

For example: [example contingency table not reproduced]

Chi-Square Test

In a test of independence for an r × c contingency table, the hypotheses are:

H0: Variable A is independent of variable B
H1: Variable A is not independent of variable B

Use the chi-square test for independence to test these hypotheses. This nonparametric test is based on frequencies. The n data pairs are classified into c columns and r rows, and then the observed frequency f_jk is compared with the expected frequency e_jk.

Chi-Square Distribution
The critical value comes from the chi-square probability distribution with d.f. degrees of freedom, where

d.f. = degrees of freedom = (r − 1)(c − 1)
r = number of rows in the table
c = number of columns in the table

Appendix E contains critical values for right-tail areas of the chi-square distribution, or use Excel's =CHISQ.INV.RT(α, d.f.) to obtain a critical value. (Excel's =CHISQ.DIST.RT(χ², d.f.) goes the other way and returns the right-tail area for a given χ² value.) The mean of a chi-square distribution is d.f., and its variance is 2 d.f.

Consider the shape of the chi-square distribution: [figure not reproduced]
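The stated mean and variance can be checked by simulation. This sketch relies on a standard fact that the slides do not state: a chi-square variable with d.f. degrees of freedom behaves like the sum of d.f. squared independent standard normal values. The helper name, seed, number of draws, and d.f. = 6 are arbitrary choices made for illustration.

```python
import random
import statistics

def simulate_chi_square(df, draws, seed=0):
    """Draw chi-square values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            for _ in range(draws)]

df = 6                                   # illustrative choice
sample = simulate_chi_square(df, draws=20000)

sample_mean = statistics.mean(sample)
sample_var = statistics.variance(sample)

print(f"sample mean     = {sample_mean:.2f} (theory: d.f. = {df})")
print(f"sample variance = {sample_var:.2f} (theory: 2 d.f. = {2 * df})")
```

The simulated values should land near d.f. and 2 d.f., though not exactly on them, because the draws are random.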

Expected Frequencies
Assuming that H0 is true, the expected frequency of row j and column k is:

e_jk = R_j C_k / n

where
R_j = total for row j (j = 1, 2, ..., r)
C_k = total for column k (k = 1, 2, ..., c)
n = sample size

Steps in Testing the Hypotheses

Step 1: State the Hypotheses
H0: Variable A is independent of variable B
H1: Variable A is not independent of variable B

Step 2: Specify the Decision Rule
Calculate d.f. = (r − 1)(c − 1). For a given α, look up the right-tail critical value (χ²_R) from Appendix E or by using Excel's =CHISQ.INV.RT(α, d.f.).

Reject H0 if the test statistic > χ²_R.

For example, for d.f. = 6 and α = .05, χ²_.05 = 12.59.
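The value 12.59 can be checked without a table. For even d.f. the right-tail area of the chi-square distribution has a closed form (a standard result, not taken from the slides): P(χ² > x) = exp(−x/2) times the sum of (x/2)^i / i! for i = 0 to d.f./2 − 1. The function names are hypothetical helpers, and bisection is just one simple way to invert the tail area.

```python
import math

def right_tail_even_df(x, df):
    """Right-tail area P(chi-square > x); closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even d.f.")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

def critical_value_even_df(alpha, df, tol=1e-8):
    """Find the x whose right-tail area equals alpha, by bisection."""
    lo, hi = 0.0, 1.0
    while right_tail_even_df(hi, df) > alpha:   # widen until the tail is small enough
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if right_tail_even_df(mid, df) > alpha:
            lo = mid                            # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2.0

tail_area = right_tail_even_df(12.59, 6)
crit = critical_value_even_df(0.05, 6)
print(f"right-tail area beyond 12.59 with d.f. = 6: {tail_area:.4f}")
print(f"critical value for alpha = .05, d.f. = 6:   {crit:.2f}")
```

For odd d.f. this shortcut does not apply, so Appendix E or Excel is still needed.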

Here is the rejection region. [figure not reproduced]

Step 3: Calculate the Expected Frequencies
e_jk = R_j C_k / n

For example, [worked example not reproduced]
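A minimal sketch of Step 3, computing e_jk = R_j C_k / n for every cell. The 2 × 3 table of counts is made up for this illustration and does not come from the slides; `expected_frequencies` is a hypothetical helper name.

```python
def expected_frequencies(observed):
    """Return e_jk = R_j * C_k / n for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]          # R_j
    col_totals = [sum(col) for col in zip(*observed)]    # C_k
    n = sum(row_totals)                                  # sample size
    return [[r * c / n for c in col_totals] for r in row_totals]

# Made-up counts: 2 rows, 3 columns, n = 200
observed = [[20, 30, 50],
            [30, 30, 40]]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
```

Here both row totals are 100 and the column totals are 50, 60, and 90, so each row of expected frequencies is 25, 30, 45.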

Step 4: Calculate the Test Statistic
The chi-square test statistic is

\[ \chi^2_{\text{calc}} = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(f_{jk} - e_{jk})^2}{e_{jk}} \]

Step 5: Make the Decision
Reject H0 if the test statistic χ²_calc > χ²_R or if the p-value ≤ α.
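Steps 4 and 5 can be sketched as follows, again with made-up counts. The table has d.f. = (2 − 1)(3 − 1) = 2, chosen because for d.f. = 2 the right-tail area is exactly exp(−x/2) (a standard result, not from the slides), so neither Appendix E nor Excel is needed. For other d.f. values the critical value and p-value would have to come from a table or software.

```python
import math

def chi_square_independence(observed):
    """Return (chi-square statistic, d.f.) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for j, row in enumerate(observed):
        for k, f_jk in enumerate(row):
            e_jk = row_totals[j] * col_totals[k] / n     # expected frequency
            stat += (f_jk - e_jk) ** 2 / e_jk
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

observed = [[20, 30, 50],
            [30, 30, 40]]        # made-up counts
alpha = 0.05

chi2_calc, df = chi_square_independence(observed)
if df != 2:
    raise ValueError("the shortcut below is valid only for d.f. = 2")

p_value = math.exp(-chi2_calc / 2)     # right-tail area, exact for d.f. = 2
chi2_crit = -2 * math.log(alpha)       # solves exp(-x/2) = alpha

reject = chi2_calc > chi2_crit
print(f"chi2_calc = {chi2_calc:.3f}, critical value = {chi2_crit:.3f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject}")
```

The two forms of the decision rule always agree: the statistic exceeds the critical value exactly when the p-value falls below α.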

Example: MegaStat
All cells have e_jk ≥ 5, so Cochran's Rule is met.
Caution: Don't highlight row or column totals.
The p-value = 0.2154 is not small enough to reject the hypothesis of independence at α = .05.

Test of Two Proportions

For a 2 × 2 contingency table, the chi-square test is equivalent to a two-tailed z test for two proportions. The hypotheses are:

H0: π1 = π2
H1: π1 ≠ π2

(Figure 14.6)

Small Expected Frequencies
The chi-square test is unreliable if the expected frequencies are too small.

Rules of thumb:
Cochran's Rule requires that e_jk ≥ 5 for all cells.
Up to 20% of the cells may have e_jk < 5.
Most agree that a chi-square test is infeasible if e_jk < 1 in any cell.

If this happens, try combining adjacent rows or columns to enlarge the expected frequencies.
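These rules of thumb are easy to check before running the test. The sketch below is one way to do it; the counts are made up, with a deliberately sparse last column, and both function names are hypothetical.

```python
def check_expected_frequencies(observed):
    """Summarize a table against the small-expected-frequency rules of thumb."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    return {
        "min_expected": min(expected),
        "all_at_least_5": min(expected) >= 5,                      # Cochran's Rule
        "share_below_5": sum(e < 5 for e in expected) / len(expected),
        "any_below_1": min(expected) < 1,                          # test infeasible
    }

def combine_columns(observed, k):
    """Merge column k with the adjacent column k + 1."""
    return [row[:k] + [row[k] + row[k + 1]] + row[k + 2:] for row in observed]

observed = [[12, 20, 3],
            [18, 25, 2]]         # made-up counts; the last column is sparse

before = check_expected_frequencies(observed)
after = check_expected_frequencies(combine_columns(observed, 1))
print("before combining:", before)
print("after combining: ", after)
```

Combining the last two columns lifts every expected frequency above 5, at the cost of one column and therefore fewer degrees of freedom.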

Cross-Tabulating Raw Data
Chi-square tests for independence can also be used to analyze quantitative variables by coding them into categories. For example, the variables Infant Deaths per 1,000 and Doctors per 100,000 can each be coded into various categories: [table not reproduced]

Why Do a Chi-Square Test on Numerical Data?

The researcher may believe there's a relationship between X and Y, but doesn't want to use regression.
There are outliers or anomalies that prevent us from assuming that the data came from a normal population.
The researcher has numerical data for one variable but not the other.

3-Way Tables and Higher
More than two variables can be compared using contingency tables. However, it is difficult to visualize a higher-order table. For example, you could visualize a cube as a stack of tiled 2-way contingency tables. Major computer packages permit three-way tables.

Purpose of the Test
The goodness-of-fit (GOF) test helps you decide whether your sample resembles a particular kind of population. The chi-square test is versatile and easy to understand.

Hypotheses for GOF tests
The hypotheses are:
H0: The population follows a _____ distribution
H1: The population does not follow a _____ distribution

The blank may contain the name of any theoretical distribution (e.g., uniform, Poisson, normal).

Chapter 15 Chi-Square Tests for Goodness-of-Fit
ML 10.2

Test Statistic and Degrees of Freedom for GOF
Assuming n observations, the observations are grouped into c classes and then the chi-square test statistic is found using:

\[ \chi^2_{\text{calc}} = \sum_{j=1}^{c} \frac{(f_j - e_j)^2}{e_j} \]

where
f_j = the observed frequency of observations in class j
e_j = the expected frequency in class j if the sample came from the hypothesized population

If the proposed distribution gives a good fit to the sample, the test statistic will be near zero.
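A minimal sketch of this statistic for a uniform goodness-of-fit test, where every class has the same expected frequency e_j = n/c. Both sets of counts are invented for illustration, and the function names are hypothetical.

```python
def gof_statistic(observed, expected):
    """Chi-square GOF statistic: sum over classes of (f_j - e_j)^2 / e_j."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of classes")
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

def uniform_expected(observed):
    """Under a uniform hypothesis each of the c classes expects n / c observations."""
    n, c = sum(observed), len(observed)
    return [n / c] * c

close_fit = [19, 21, 20, 22, 18, 20]     # made-up counts, nearly uniform
poor_fit = [5, 10, 15, 25, 30, 35]       # made-up counts, clearly not uniform

stat_close = gof_statistic(close_fit, uniform_expected(close_fit))
stat_poor = gof_statistic(poor_fit, uniform_expected(poor_fit))
print(f"nearly uniform counts: chi-square = {stat_close:.2f}")
print(f"lopsided counts:       chi-square = {stat_poor:.2f}")
```

With n = 120 and e_j = 20 in each class, the nearly uniform counts give a statistic of 0.5 while the lopsided counts give 35, which illustrates why a good fit produces a value near zero.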

The test statistic follows the chi-square distribution with degrees of freedom d.f. = c − m − 1, where c is the number of classes used in the test and m is the number of parameters estimated.

Chapter 15 Normal Chi-Square GOF Test

Is the Sample from a Normal Population?

Many statistical tests assume a normal population, so this is the most common GOF test. Two parameters, the mean μ and the standard deviation σ, fully describe a normal distribution. Unless μ and σ are known a priori, they must be estimated from a sample in order to perform a GOF test for normality.

Method 1: Standardize the Data
Transform sample observations x1, x2, ..., xn into standardized z-values.

Count the sample observations within each interval on the z-scale and compare them with expected normal frequencies e_j.

Problem: Frequencies will be small in the end bins yet large in the middle bins (this may violate Cochran's Rule and seems inefficient).

Method 2: Equal Bin Widths

Step 1: Divide the exact data range into c groups of equal width, and count the sample observations in each bin to get observed bin frequencies f_j.
Step 2: Convert the bin limits into standardized z-values: z = (x − x̄)/s, where x̄ and s are the sample mean and standard deviation.
Step 3: Find the normal area within each bin assuming a normal distribution.
Step 4: Find expected frequencies e_j by multiplying each normal area by the sample size n.

Problem: Frequencies will be small in the end bins yet large in the middle bins (this may violate Cochran's Rule and seems inefficient).

Method 3: Equal Expected Frequencies
Define histogram bins in such a way that an equal number of observations would be expected under the hypothesis of a normal population, i.e., so that e_j = n/c. A normal area of 1/c is expected in each bin. The first and last classes must be open-ended, so to define c bins we need c − 1 cut points. Count the observations f_j within each bin. Compare the f_j with the expected frequencies e_j = n/c.

Advantage: Makes efficient use of the sample.
Disadvantage: Cut points on the z-scale may seem strange.
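Method 3 can be sketched with Python's standard library, whose `statistics.NormalDist.inv_cdf` returns the z-value below which a given normal area lies. The sample of n = 20 values is made up, c = 4 is an arbitrary choice, and the helper names are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def equal_area_cut_points(c):
    """Return the c - 1 standard normal cut points that give c bins of area 1/c."""
    z = NormalDist()                       # standard normal: mean 0, sd 1
    return [z.inv_cdf(i / c) for i in range(1, c)]

def observed_counts(data, cuts):
    """Standardize with the sample mean and s, then count observations per bin."""
    x_bar, s = mean(data), stdev(data)
    counts = [0] * (len(cuts) + 1)
    for x in data:
        z_value = (x - x_bar) / s
        bin_index = sum(z_value > cut for cut in cuts)   # cut points lying below z
        counts[bin_index] += 1
    return counts

# Made-up sample of n = 20 values, for illustration only
data = [61, 64, 66, 67, 68, 69, 70, 70, 71, 72,
        72, 73, 74, 75, 76, 77, 78, 80, 83, 86]
c = 4
cuts = equal_area_cut_points(c)
f = observed_counts(data, cuts)            # observed frequencies f_j
e = [len(data) / c] * c                    # e_j = n / c in every bin
chi2_calc = sum((fj - ej) ** 2 / ej for fj, ej in zip(f, e))

print("cut points:", [round(cut, 4) for cut in cuts])
print("observed:", f, "expected:", e)
print(f"chi2_calc = {chi2_calc:.2f}")
```

For c = 4 the cut points are the quartiles of the standard normal, roughly −0.67, 0, and 0.67. With two estimated parameters, d.f. = c − m − 1 = 1 for this example.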

Standard normal cut points for equal area bins (Table 15.16).

Critical Values for Normal GOF Test

Two parameters, μ and σ, are estimated from the sample, so m = 2 and the degrees of freedom are d.f. = c − m − 1 = c − 3. We need at least four bins to ensure at least one degree of freedom.

Small Expected Frequencies
Cochran's Rule suggests e_j ≥ 5 in each bin (e.g., with 4 bins we would want n ≥ 20, and so on).

Visual Tests

The fitted normal superimposed on a histogram gives visual clues as to the likely outcome of the GOF test. A simple eyeball inspection of the histogram may suffice to rule out a normal population by revealing outliers or other nonnormality issues.

Chapter 15 ECDF Tests
ML 10.3

ECDF Tests for Normality

There are alternatives to the chi-square test for normality based on the empirical cumulative distribution function (ECDF). ECDF tests are done by computer. Details are omitted here. A small p-value casts doubt on normality of the population. The Kolmogorov-Smirnov (K-S) test uses the largest absolute difference between the actual and expected cumulative relative frequency of the n data values.
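The slides omit the details, so the following is only a sketch of the usual textbook definition of the K-S statistic: the largest gap between the sample's ECDF and the hypothesized cumulative distribution. The data are made up, `ks_statistic` is a hypothetical helper name, and converting the statistic to a p-value is left to software.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(data, dist):
    """Largest absolute gap between the ECDF of data and dist's CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        expected = dist.cdf(x)
        # The ECDF steps from (i - 1)/n up to i/n at x, so check both sides.
        d = max(d, abs(i / n - expected), abs((i - 1) / n - expected))
    return d

# Made-up sample, for illustration only
data = [61, 64, 66, 67, 68, 69, 70, 70, 71, 72,
        72, 73, 74, 75, 76, 77, 78, 80, 83, 86]
fitted = NormalDist(mean(data), stdev(data))   # parameters estimated from the sample
d_stat = ks_statistic(data, fitted)
print(f"K-S statistic D = {d_stat:.4f}")
```

A small D means the ECDF tracks the fitted normal closely. Because μ and σ are estimated from the same sample here, the p-value should come from software that accounts for that estimation.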

The Anderson-Darling (A-D) test is based on a probability plot. When the data fit the hypothesized distribution closely, the probability plot will be close to a straight line. The A-D test is widely used because of its power and attractive visual.

Example: Minitab's Anderson-Darling Test for Normality
Data: weights of 80 babies (in ounces)
Near-linear probability plot suggests good fit to normal distribution.
The p-value = 0.122 is not small enough to reject normal population at α = .05.

Example: MegaStat's Normality Tests
Data: weights of 80 babies (in ounces)
The p-value = 0.2487 is not small enough to reject normal population at α = .05 in this chi-square test.
Near-linear probability plot suggests good fit to normal distribution.
Note: MegaStat's chi-square test is not as powerful as the A-D test, so we would prefer the A-D test if software is available. The MegaStat probability plot is good, but shows no p-value.