# Starter questions

1. Find an equation of the circle with centre $(1, -2)$, which passes through the point
2. Work out
3. Express in the form :

Answer to question 1: $(x - 1)^2 + (y + 2)^2 = 25$

Chapter 15.1 Presenting Different Types of Data

In God we trust. All others must bring data. - W. Edwards Deming

Important Vocabulary

Variable, random variable
The score you get when you throw two dice is one of the values 2, 3, . . . up to 12. Rather than repeatedly using the phrase "the score you get when you throw two dice", it is usual to use an upper case letter, like X, to denote it. Because its value varies, X is

called a variable. Because it varies at random it is called a random variable. Particular values of the variable X are often denoted by the equivalent lower case letter, x. So in this case x can be 2, 3, . . . up to 12.

Frequency
The number of times a particular value of a variable occurs in a data set is called its frequency.

Categorical (or qualitative) data
Categorical (or qualitative) data come in classes or categories, like types of bird or makes of car. Categorical data are also called qualitative, particularly if they can be described without using numbers.

Numerical (or quantitative) data
Numerical (or quantitative) data are defined in some way by numbers. Examples include the times people take to run a race, the numbers of trucks in freight trains and the birth weights of babies.

Ranked data
Ranked data are given by their position within a group rather than by measurements or score. For example, the competitors in a competition could be given their positions as 1st, 2nd, 3rd, . . . .

Discrete variables
The number of children in families (0, 1, 2, 3, . . .), the number of goals a football team scores in a match (0, 1, 2, 3, . . .) and shoe sizes in the UK (1, 1½, 2, 2½, . . .) are all examples of discrete variables. They can take certain particular values but not those in between.

Continuous variables
Distance, mass, temperature and speed are all continuous variables. A continuous variable, if measured accurately enough, can take any appropriate value. You cannot list

all the possible values.

Distribution, unimodal and bimodal distributions, positive and negative skew
The pattern in which the values of a variable occur is called its distribution. This is often displayed in a diagram with the variable on the horizontal scale and the frequency (or probability) on the vertical scale. If the diagram has one peak the distribution is unimodal; if there are two distinct peaks it is bimodal. If the peak is to the left of the middle the distribution has positive skew and if it is to the right there is negative skew.

When a diagram showing a distribution is based on unbiased sample data, the larger the sample size, the more representative it will be of the underlying distribution.
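As a small illustration of the vocabulary above, the sketch below lists the frequency of each value of X, the score when two dice are thrown, over all 36 equally likely outcomes. It is only an illustrative sketch (the function name `two_dice_distribution` is made up here), and it shows theoretical frequencies rather than sample data. The text bar chart has a single peak at 7 and is symmetric, so the distribution is unimodal with no skew.

```python
from collections import Counter
from itertools import product


def two_dice_distribution():
    """Frequency of each total X over the 36 outcomes of throwing two fair dice."""
    return Counter(a + b for a, b in product(range(1, 7), repeat=2))


distribution = two_dice_distribution()

for x in sorted(distribution):
    # One '#' per outcome gives a rough text version of a vertical line chart.
    print(f"{x:2d} {'#' * distribution[x]}")
```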

Grouped data
When there are many possible values of the variable, it is convenient to allocate the data to groups. An example of the use of grouped data is the way people are allocated to age groups.

Bivariate and multivariate data
In bivariate data two variables are assigned to each item, for example the age and mileage of second hand cars. The data are described as multivariate when more than two variables are involved. Multivariate data may be stored on a spreadsheet with different fields used for the different variables associated with each item.

Categorical Data
Common ways of displaying categorical data are pictograms, bar charts, dot plots and pie charts. These are illustrated for the following data collected from a survey of bats at

a particular place one evening.

Chapter 15.2 Ranked Data

In statistics, data are sometimes ranked in order of size and the ranks are used in preference to the original values. Table 15.3 gives the times in minutes and the ranks of 15 athletes doing a half-marathon.

Ranked data give rise to some useful summary measures. The median is the value of the middle item. If there are n items, the median is that with rank $\frac{n+1}{2}$. In the example of the 15 athletes n = 15, so the median is the time of number $\frac{15+1}{2} = 8$. The athlete ranked number 8 is Susmita and her time is 101 minutes, so that is the median.

You have to be aware when working out the median as to whether n is odd or even. If it

is odd, as in the example on the previous slide, $\frac{n+1}{2}$ works out to be a whole number but if n is even that is not the case. For example, if n = 20, $\frac{n+1}{2} = 10.5$. In that case the data set does not have a single middle value; those ranked 10 and 11 are equally spaced either side of the middle and so the median is half way between their values.

The median is a typical middle value and so is sometimes called an average. More formally it is a measure of central tendency. It usually provides a good representative value.

The median is easy to work out if the data are stored on a spreadsheet since that will do the ranking for you. Notice that extreme values have little, if any, effect on the median. It is described as resistant to outliers. It is often useful when some data values are missing but can be estimated.

The median divides the data into two groups, those with high ranks and those with low ranks. The lower quartile and the upper quartile are the middle values for these two groups so between them the two quartiles and the median divide the data into four

equal sized groups according to their ranks. These three measures are sometimes denoted by Q1, Q2 and Q3. Q2 is the median.

Quartiles are used mainly with large data sets and their values are found by looking at the $\frac{1}{4}$, $\frac{1}{2}$ and $\frac{3}{4}$ points. So for a data set of, say, 1000 you would often take Q1 to be the value of the 250th data item, Q2 that of the 500th item and Q3 the 750th item.

There is no standard method or formula for finding Q1 and Q3 and you may meet different strategies. The method used here is consistent with the output from some calculators that display the quartiles of a data set; it depends on whether the number of items, n, is even or odd.

If n is even then there will be an equal number of items in the lower half and upper half of the data set. To calculate the lower quartile, Q1, find the median of the lower half of the data set. To calculate the upper quartile, Q3, find the median of the upper half of the data set.

For example, for the data set {1, 3, 6, 10, 15, 21, 28, 36, 45, 55}:

The median, Q2, is $\frac{15 + 21}{2} = 18$.
The lower quartile, Q1, is the median of {1, 3, 6, 10, 15}, i.e. 6.
The upper quartile, Q3, is the median of {21, 28, 36, 45, 55}, i.e. 36.

If n is odd then define the lower half to be all data items below the median. Similarly define the upper half to be all data items above the median. Then proceed as if n were even.

For example, for the data set {1, 3, 6, 10, 15, 21, 28, 36, 45}:
The median, Q2, is 15.
The lower quartile, Q1, is the median of {1, 3, 6, 10}, i.e. $\frac{3 + 6}{2} = 4.5$.
The upper quartile, Q3, is the median of {21, 28, 36, 45}, i.e. $\frac{28 + 36}{2} = 32$.

Catherine is a junior reporter at the Avonford Star. As part of an investigation into consumer affairs she purchases 0.5 kg of lean mince from 12 shops and supermarkets in the town. The resulting data, with the prices in rank order, are as follows: 1.39, 1.39, 1.46, 1.48, 1.48, 1.50, 1.52, 1.54, 1.60,

1.66, 1.68, 1.72

Find Q1, Q2 and Q3.

The simplest measure of spread for ranked data is the range, the difference between the values of the highest and lowest ranked items. The range is often not a very good measure as it is unduly affected by extreme values or outliers.

A better measure is the interquartile range (also sometimes called the quartile spread), which is the difference between the two quartiles. This tells you the difference between typically high and typically low values.

Another measure of spread that is used with ranked data is the semi-interquartile range. This is just half of the interquartile range and is a measure of how far above or below the middle a typical large or small data item lies. Semi-interquartile range is comparable to standard deviation, which you will meet later in the chapter.

Interquartile range is sometimes used to identify possible outliers. A standard procedure is to investigate items that are at least 1.5 × interquartile range above or below the nearer quartile.
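The quartile method described above (median of the lower half and median of the upper half, leaving out the median itself when n is odd) can be sketched in a few lines. This is one possible sketch rather than a standard routine: the function names are made up for illustration, and other software may use a different quartile convention and give slightly different values. The prices are Catherine's mince data from the question above.

```python
def median(sorted_values):
    """Median of a list that is already in rank order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the 'median of each half' method."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two data items")
    half = len(data) // 2
    lower = data[:half]               # items below the middle
    upper = data[len(data) - half:]   # items above the middle
    return median(lower), median(data), median(upper)


def possible_outliers(values):
    """Items at least 1.5 x IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v <= low or v >= high]


prices = [1.39, 1.39, 1.46, 1.48, 1.48, 1.50,
          1.52, 1.54, 1.60, 1.66, 1.68, 1.72]

q1, q2, q3 = quartiles(prices)
print("Q1, Q2, Q3:", round(q1, 2), round(q2, 2), round(q3, 2))
print("Interquartile range:", round(q3 - q1, 2))
print("Possible outliers:", possible_outliers(prices))
```

The same functions reproduce the two worked examples above: the even-sized set gives 6, 18 and 36, and the odd-sized set gives 4.5, 15 and 32.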

What are the range and interquartile range of the times of the athletes? Which is the more representative measure of spread?

Range = 182 − 74 = 108 minutes.
IQR = 119 − 77 = 42 minutes.
IQR is better as the range was unduly influenced by Sally, who took much longer than everyone else.

The median and quartiles are not the only common divisions used with ranked data. For example, the percentiles lie at the percentage points and are widely used with large data sets. Q1 is the 25th percentile, Q2 is the 50th percentile and Q3 is the 75th percentile.

So the 90th percentile is the $\frac{90}{100}n$th data value when the data are in rank order. If this value is not an integer, then choose the next value up. For example, when n = 25, $\frac{90}{100} \times 25 = 22.5$, so the 90th percentile is the 23rd data value. 90% of the data have a value less than the 90th percentile.

A box-and-whisker plot, or boxplot, is often used to display ranked data. This one shows the times in minutes of the half-marathon athletes.

Box-and-whisker plots are sometimes drawn horizontally, like the one above, and sometimes vertically. They give an easy-to-read representation of the location and spread of a distribution. The box represents the middle 50% of the distribution and the whiskers stretch out to the extreme values.

You may be asked to use Q1 − 1.5 × IQR and Q3 + 1.5 × IQR to identify outliers. You will be given these formulae. In this case the interquartile range is 108 − 77 = 31 and 1.5 × 31 = 46.5. At the lower end, 74 is only 3 below 77 so nowhere near being an outlier. However at the upper end 182 is 74 above 108 and so needs to be investigated as an outlier. The question why did Sally

take so long? clearly needs to be looked into. Some software identifies outliers for you.

Examples
The diameters of 11 different Roman coins are measured in centimetres:

2.2 2.5 2.7 2.7 2.8 3.0 3.1 3.2 3.6 4.0 4.7

Determine the quartiles and hence any outliers.

The lower quartile is the 3rd item and the upper quartile is the 9th item. These are used to work out the lower outlier boundary and the upper outlier boundary. Therefore the outlier is 4.7 cm.

Test Your Understanding

The ages of 15 Lib Dem MPs are given:

11 18 20 27 30 31 32 32 35 36 37 58 63 78 104

a) If an outlier is considered to be 1.5 interquartile ranges below the lower quartile or above the upper quartile, determine any outliers.

Therefore only the 11-year-old is an outlier.

Box Plot Example

| Smallest values | Largest values | Lower quartile | Median | Upper quartile |
|---|---|---|---|---|
| 0, 3 | 21, 27 | 8 | 10 | 14 |

Draw a box plot to represent the above data.

Exam Tip: You MUST show your outlier boundary calculations.

Outlier boundaries: 8 − 1.5 × 6 = −1 and 14 + 1.5 × 6 = 23.

Use a cross for each outlier. The maximum value that is not an outlier is 21.

[Figure: box plot of the data above, drawn on an axis from 0 to 30.]

Test Your Understanding
(b) The company claims that for 75% of the months, the amount received per month is greater than 10 000. Comment on this claim, giving a reason for your answer. (2)

Comparing Box Plots
Box plot comparing house prices of Croydon and Kingston-upon-Thames:

[Figure: box plots for Croydon and Kingston, drawn on an axis from 400k to 750k.]

Compare the prices of houses in Croydon with those in Kingston. (2 marks)

For 1 mark, one of:
The interquartile range of house prices in Kingston is greater than in Croydon.
The range of house prices in Kingston is greater than in Croydon.
Include some measure of spread.

For 1 mark: The median house price in Kingston was greater than that in Croydon. Include some measure of location (median is best).

1. In a school of 1000 pupils, 50 were asked what their favourite type of music was. There are 180 pupils in Year 7, 200 in Year 8, 240 in Year 9, 220 in Year 10 and 160 in Year 11. How would you create an accurate sample for this survey?
2. Complete the following statements using the symbols ⇒, ⇐ and ⇔
a) The shape is a triangle ___ the shape has 3 sides
b) x is 12 ___ x has a factor of 3
c) y = x² − 3x + 7 ___ y has no real roots

3. Simplify without using a calculator

1. Allocate a number from 1 to n for each child within each year group. Use a random number generator to select 9 pupils from Year 7, 10 pupils from Year 8, 12 pupils from Year 9, 11 pupils from Year 10 and 8 pupils from Year 11.
2. a) The shape is a triangle ⇔ the shape has 3 sides
b) x is 12 ⇒ x has a factor of 3
c) y = x² − 3x + 7 ⇒ y has no real roots

Chapter 15.3 Discrete Numerical Data

You will sometimes collect or find yourself working with discrete numerical data. Your

first step will often be to sort the data in a frequency table. A tally can be helpful when you are doing this. Here, for example, are the scores of the teams in the various matches in the 2014 Football World Cup.

When would you use a tally chart and when would you use a stem-and-leaf diagram to help you sort raw data?

Several measures of central tendency are commonly used for discrete numerical data: the mode, the mean, and also the median.

The mode is the value that occurs most frequently. If two non-adjacent values occur more frequently than the rest, the distribution is said to be bimodal, even if the frequencies are not the same for both modes.

Bimodal data may indicate that the sample has been taken from two populations. For example, the heights of a sample of students (male and female) would probably be bimodal, reflecting differences in heights depending on the sex of the person.

For a small set of discrete data the mode can often be misleading, especially if there are

many values that the data can take. Several items of data can happen to fall on a particular value.

The mode is used when the most probable or most frequently occurring value is of interest. For example, a dress shop manager who is considering stocking a new style might first buy dresses of the new style in the modal size as she would be most likely to sell those.

The mean is found by adding the individual values together and dividing by the number of individual values. This is often just called the average although strictly it is the arithmetic mean.

Find the mean number of goals scored in this competition.

Notice the notation that has been used.
The value of the variable, in this case the number of goals, is denoted by x.
The frequency is denoted by f.
Summation is indicated by the use of the symbol Σ (sigma).
The total number of data items is n so Σf = n.
The mean value of x is denoted by $\bar{x}$ and $\bar{x} = \frac{\sum xf}{n}$.

So the mean of the number of goals per team per match in the 2014 World Cup finals was 1.37.

Another measure of central tendency is the median. As you have already seen, this is used for ranked data. The football scores are not ranked but it is easy enough to use the frequency table to locate the median. There are 128 scores and since $\frac{128 + 1}{2} = 64.5$, the median is half way between number 64 and number 65.

This table gives the ranks of the various scores; it has been turned upside down so that the highest score comes first at the top of the table.
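The notation above translates directly into a short calculation. The sketch below is illustrative only: the frequency table is made up (it is not the World Cup data, which is not reproduced here), and the function names are hypothetical. It computes the mean as Σxf ÷ n and locates the median by running through the cumulative frequencies.

```python
# Hypothetical frequency table: value x -> frequency f (NOT the World Cup data).
goals_table = {0: 5, 1: 7, 2: 4, 3: 3, 4: 1}


def mean_from_table(table):
    """Mean = (sum of x * f) / n, where n = sum of f."""
    n = sum(table.values())
    return sum(x * f for x, f in table.items()) / n


def value_at_rank(table, rank):
    """Value of the item with the given rank (1 = lowest) via cumulative frequency."""
    cumulative = 0
    for x in sorted(table):
        cumulative += table[x]
        if rank <= cumulative:
            return x
    raise ValueError("rank is larger than the number of data items")


def median_from_table(table):
    """Median is the item with rank (n + 1) / 2, averaging two items if n is even."""
    n = sum(table.values())
    if n % 2 == 1:
        return value_at_rank(table, (n + 1) // 2)
    return (value_at_rank(table, n // 2) + value_at_rank(table, n // 2 + 1)) / 2


print("n =", sum(goals_table.values()))
print("mean =", mean_from_table(goals_table))
print("median =", median_from_table(goals_table))
```

With 20 items the median rank is 10.5, so the sketch averages the 10th and 11th items; both fall in the row for 1 goal, mirroring the reasoning used for ranks 64 and 65 in the slides.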

Clearly numbers 64 and 65 are both in the row representing 1 goal, so the median is 1. You could also write the last column as cumulative frequency.

The simplest measure of spread for this type of data is the range, the difference between the highest value and the lowest. If you rank the data you can find the quartiles and so the interquartile range. If the data are in a frequency table you can find the quartiles using a similar method to that used above for the median. However, the most widely used measure of spread is standard deviation and this is covered later in the chapter.

In the case of the goals scored by teams in the World Cup finals the variable has only a small number of possible values and they are close together. In a case like this, the most appropriate diagram for displaying the data is a vertical line chart.

Sometimes, however, your data will be spread out over too wide a range for a vertical line chart to be a helpful way of displaying the information. In that case a stem-and-leaf

diagram may be better. The diagram below shows the results for 80 people taking an aptitude test for becoming long-distance astronauts; the maximum possible score is 100.

A stem-and-leaf diagram provides a form of grouping for discrete data. The data in the previous diagram could instead be presented in the grouped frequency table given below.

How would you estimate the mean for this grouped data?

Grouping means putting the data into a number of classes. The number of data items falling into any class is called the frequency for that class. When numerical data are grouped, each item of data falls within a class interval lying between class boundaries.

The main advantage of grouping is that it makes it easier to display the data and to estimate some of the summary measures. However, this comes at a cost; you have lost information, in the form of the individual values, and so any measures you work out can only be estimates.

The easiest way to display grouped discrete data is to use a bar chart with the different classes as categories and gaps between them, as shown below.

Every item of data must belong to exactly one class. There must be no ambiguity. Always check your class intervals to make sure that no classes overlap and that there are no gaps in

between them.

Different types of distribution are described in terms of the position of their modes or modal groups.

[Figure: three example distributions: unimodal and symmetric, uniform, and bimodal.]

When the mode is off to one side the distribution is said to be skewed. If the mode is to the left with a long tail to the right the distribution has positive skew.

If the mode is to the right with a long tail to the left the distribution has negative skew.

When would you want to group discrete data?

There may be too much data for it to be practical to enter each item on a stem-and-leaf diagram and the variable may take too many values for a vertical line chart to be suitable. For example, the marks obtained by candidates on a public examination with an entry of several thousand people.

The data may cover too large a number of possible values of the variable. For example, the cost in £ of second hand cars.

The data may be spread out over too wide a range. For example, people's winnings in £ in the National Lottery one Saturday.

Discuss whether you would group your data in these cases.

You have conducted a survey of the number of passengers in cars using a busy road at commuter time.

You have carried out a survey of the salaries of 200 people five years after leaving university.

It is July and you have trapped and weighed a sample of 50 small mammals, recording their weights to the nearest 10 grams.

It would be easier to use a tally chart to collect data on passenger numbers using 0, 1, 2, 3, 4, 5 and 6+ passengers, so a frequency table is appropriate.

It would be sensible to group the data on salaries, so a grouped frequency table is appropriate.

The data set on small mammals is fairly small so you could work with the ungrouped data.

Chapter 15.4 Continuous Numerical Data

When your data are continuous you will almost always need to group them. This includes two special cases.

The variable is actually discrete but the intervals between values are very small. For example, cost in £ is a discrete variable with steps of £0.01 (that is, one penny) but this is so small that the variable may be regarded as continuous.

The underlying variable is continuous but the measurements of it are rounded (for example, to the nearest mm), making your data discrete. All measurements of continuous variables are rounded and, providing the rounding is not too coarse, the data should normally be treated as continuous. A particular case of rounding occurs with people's ages; age is a continuous variable but is usually rounded down to the nearest completed year.

There are two ways of displaying continuous grouped data using vertical bars. They are frequency charts and histograms. In both cases there are no gaps between the bars.

Frequency Charts
A frequency chart is used to display data that are grouped into classes of equal width. This chart illustrates the lengths of the reigns of English kings and queens from 827 to 1952 in completed years.

As you can see, the vertical axis represents the frequency.

The horizontal scale is marked 0, 10, 20, . . . , 70 and labelled Years. There are no gaps between the bars.

Frequency Charts
You can also use a frequency polygon to display grouped continuous data. To draw a frequency polygon:
Plot the frequencies against the midpoint of each class.
Join the points up with straight lines.

Although frequency polygons are quite often used, you should treat them with caution. In this diagram the point (50, 2) is on a line segment, but it is not true to say that 2 monarchs had a 50 year reign. The point (55, 3) marked with a red cross does not imply that 3 monarchs had a 55 year reign. It is actually the case that 3 monarchs reigned for between 50 and 60 years.

What is the shape of the distribution shown in the last two diagrams? Positive skew.

Histograms
You can only use a frequency chart if all the classes are of equal width. If the classes are not of equal width you must use a histogram. There are two key differences between histograms and frequency charts.

In a histogram, frequency is represented by the area of a bar and not by its height.
The vertical scale of a histogram represents frequency density, not frequency.

You can see these points in the following example. The heights of 80 broad bean plants were measured, correct to the nearest centimetre, 10 weeks after planting. Draw a histogram to display these data.
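Because area represents frequency, the height of each bar is the frequency divided by the class width. The sketch below works this out for a small table. Only the class 15.5 ≤ h < 17.5 with frequency 11 is taken from the slides; the other classes and frequencies are made-up values for illustration, and the function names are hypothetical.

```python
# Each class is (lower boundary, upper boundary, frequency).
# Only (15.5, 17.5, 11) comes from the slides; the rest are hypothetical.
classes = [
    (7.5, 11.5, 6),
    (11.5, 15.5, 18),
    (15.5, 17.5, 11),
    (17.5, 23.5, 9),
]


def frequency_densities(table):
    """Frequency density = frequency / class width, one value per class."""
    return [f / (upper - lower) for lower, upper, f in table]


def modal_class(table):
    """Class with the highest frequency density, i.e. the tallest bar."""
    lower, upper, _ = max(table, key=lambda c: c[2] / (c[1] - c[0]))
    return lower, upper


densities = frequency_densities(classes)
for (lower, upper, f), density in zip(classes, densities):
    print(f"{lower} <= h < {upper}: frequency {f}, density {density}")

print("Modal class:", modal_class(classes))
```

In this made-up table the class with the largest frequency (18) is not the modal class, because it is twice as wide as the class with frequency 11. The tallest bar, not the largest frequency, decides the modal class.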

Histograms
The first step is to add two more columns to the frequency table, one giving the class width and the other giving the frequency density (frequency ÷ class width).

Now draw the histogram with height on the horizontal axis and frequency density on the vertical axis.

There are two things to notice about this histogram.

As an example, look at the bar for 15.5 ≤ h < 17.5. Its width is 2 and its height is 5.5 so its area is 2 × 5.5 = 11 and this is the frequency it represents.

The horizontal scale is in centimetres so the vertical scale is frequency per centimetre. The

horizontal and vertical scales of a histogram are linked in such a way that multiplying them gives frequency. In this case $\frac{\text{frequency}}{\text{centimetres}} \times \text{centimetres} = \text{frequency}$.

Estimating summary measures from grouped continuous data
As part of an on-going dispute with her parents, Emily times all calls during August on her mobile phone.
(i) Estimate the mean length of her calls.
(ii) Estimate the median.
(iii) Find the modal class.

To complete all parts of this question, four more columns need to be added.

(i) To complete the estimation of the mean you need the new third and fourth columns, mid-value t and t × f.

(ii) The median is the $\frac{n+1}{2}$th data value. There are 79 data values so the median is the 40th value. To find the median it is helpful to find the cumulative frequency.

(iii) To find the modal class you need a histogram. It is the class with the highest bar. The highest bar is the second one. So the modal class is from 30 to 60 seconds. Notice that the class with the highest frequency is 60 to 120 but that has a greater class width.

Did you actually need to draw the histogram?

No, just work out the frequency densities and choose whichever is highest.

Estimating summary measures from grouped continuous data
Robert is a market gardener and sells his produce to a supermarket. He collects sample data about the weight of the tomatoes he plans to sell. He wants to know the mean value, but he also needs to know about the distribution as the supermarket will not accept any that are too small or too large.

(i) Robert's first thought is to describe the intervals, in grams, as 57–59, 59–61, and so on. What is wrong with these intervals?
(ii) Robert then says, "I am going to record my measurements to the nearest 1 g and then denote them by m g. My classes will be 57 < m ≤ 59, 59 < m ≤ 61, and so on." What is the mid-value, x g, of the first interval?
(iii) Robert has six intervals. The last one is 67 < m ≤ 69. Their frequencies, in order, are 4, 11, 19, 8, 5, 3. Estimate the mean weight of the tomatoes.
(iv) Estimate the 10th percentile to the 90th percentile range.

(i) It is not clear which group some tomatoes should be in, for example one of mass 59 g, as the intervals overlap.

(ii) The lower bound of the weight for the group 57 < m ≤ 59 is 57.5 g. The upper bound of the weight for the group 57 < m ≤ 59 is 59.5 g. So the mid-value of the first interval is 58.5 g.

(iii) Estimated mean = $\frac{3141}{50} = 62.82$, so 63 g to the nearest whole number.

(iv) The 10th percentile is the $\frac{10}{100} \times 50 = 5$th value; this is the 1st value in the class 59 < m ≤ 61. You need the first of the 11 values, which you assume are distributed evenly throughout the class width of 2. So the 10th percentile is $59.5 + \frac{1}{11} \times 2 \approx 59.7$ g.

The 90th percentile is the $\frac{90}{100} \times 50 = 45$th value; this is the 3rd value in the class 65 < m ≤ 67. You need the third of the 5 values, which you assume are distributed evenly throughout the class width of 2. So the 90th percentile is $65.5 + \frac{3}{5} \times 2 = 66.7$ g.

So the interpercentile range is 66.7 g − 59.7 g = 7 g.
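Robert's calculations can be checked with a short script. This is a sketch under the same assumptions as the worked answer: class boundaries at 57.5, 59.5, . . . , 69.5 g, mid-values standing in for every item in a class, and values spread evenly within a class for the percentiles. The function names are made up for illustration.

```python
# Class boundaries for 57 < m <= 59, ..., 67 < m <= 69 when m is recorded
# to the nearest gram: 57.5, 59.5, ..., 69.5.
boundaries = [57.5 + 2 * i for i in range(7)]
frequencies = [4, 11, 19, 8, 5, 3]


def estimated_mean(bounds, freqs):
    """Estimate the mean using the mid-value of each class."""
    mids = [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)


def estimated_percentile(bounds, freqs, p):
    """Estimate the p-th percentile, assuming values are spread evenly in each class."""
    rank = p * sum(freqs) / 100          # e.g. 10% of 50 items is the 5th value
    cumulative = 0
    for lo, hi, f in zip(bounds, bounds[1:], freqs):
        if f > 0 and rank <= cumulative + f:
            # Move the right fraction of the way through this class.
            return lo + (rank - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("percentile must be between 0 and 100")


mean = estimated_mean(boundaries, frequencies)
p10 = estimated_percentile(boundaries, frequencies, 10)
p90 = estimated_percentile(boundaries, frequencies, 90)

print("Estimated mean:", round(mean, 2))
print("10th percentile:", round(p10, 1))
print("90th percentile:", round(p90, 1))
print("10th to 90th percentile range:", round(p90 - p10, 1))
```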

Cumulative frequency curves
To estimate the median and quartiles and related measures it can be helpful to use a cumulative frequency curve. Cumulative frequency curves may also be used with other types of data but they are at their most useful with grouped continuous data.

After receiving this letter the editor wondered if there was a story in it. She asked a student reporter to carry out a survey of the prices of textbooks in a big bookshop. The student reporter took a large sample of 470 textbooks and the results are summarised on the next slide.

The student reporter decided to estimate the median, the upper quartile and the lower quartile of the prices. The first steps were to make a cumulative frequency table and to use it to draw a cumulative frequency curve.

To draw the cumulative frequency curve you plot the cumulative frequency (vertical axis) against the upper boundary of each class interval (horizontal axis). Then you join the points with a smooth curve.

It is usual in a case like this, based on grouped data, to estimate the median as the value of the $\frac{n}{2}$th term and the quartiles as the values of terms $\frac{n}{4}$ and $\frac{3n}{4}$. In this case there are 470 items, so the median is at number 235 and the quartiles at 117.5 and 352.5.

As you can see from the graph, these give the following values.
Lower quartile, Q1: £17.70
Median, Q2: £22.50
Upper quartile, Q3: £27.80

The cumulative frequency scale on the graph goes from 0 to 470 and the middle is at $\frac{470}{2} = 235$, so it is this value that is used to estimate the median.

1. Define the following
a) Discrete variable
b) Qualitative data
c) Ranked data
2. Prove that the difference between the squares of any two consecutive integers is equal to the sum of those integers.
3. Rationalise

1. a) They can take certain values but not those in between
b) Categorical data that can be described without numbers
c) Ranked data are given by their position within a group rather than their actual measurement

Chapter 15.5 Bivariate Data

The table below shows the final results for the Premiership football teams in the 2013–14 season. The data in the table

are multivariate. There are nine columns; each row represents one data item covering nine variables.

What are the types of the nine variables?
Place is ranked, Team is categorical and the other 7 variables are all discrete numerical data.

Tom is investigating the relationship between Goals For (GF) and Points (PTS). So his data items can be written (102, 86), (101, 84) and so on. These items cover just two variables and so the data are described as bivariate.

Bivariate data are often displayed on a scatter diagram like Figure 15.35.

Looking at the spread of the data points, it is clear that on the whole teams scoring many goals tend to have high points totals. There is a high level of association between the two variables. If, as in this case, high values of both variables occur together, and the same for low values, the association is positive. If, on the other hand, high values of one variable are associated with low values of the other, the association is negative.

Does this scatter diagram show any of the teams to be outliers?
Man U and Liverpool are quite a long way from the rest of the teams so could be considered outliers. To be safe you should check they are correctly shown on the scatter diagram.

A line of best fit is often drawn through the points on a scatter diagram. If the points lie close to a straight line the association is described as correlation. So correlation is linear association.

You can use the statistical functions on your calculator to find the equation of the regression line; this is the name of the calculated line of best fit. You need to be able to find regression lines using your calculator.

The equation of the regression line in the graph is y = 0.8376x + 9.04. The equation shows that, on average, each extra goal scored is associated with an extra 0.8376 points. You can use the regression line to make predictions, but it can be unreliable outside the range of the data set.

Use the regression line to estimate:
1. The goals for a team awarded 60 points.
2. The points earned by a team that scores 5 goals.
Comment on the reliability of your estimate in each case.

1. y = 0.8376x + 9.04
60 = 0.8376x + 9.04
x = 60.8, so the team scored 61 goals.

This answer is reliable as it is within the range of the data.

2. The model is less reliable when used to estimate the points earned by a team that has scored only 5 goals.
y = 0.8376x + 9.04
y = 0.8376 × 5 + 9.04
y = 13.2
The model predicts that the team would have 13 points, but this might not be accurate. It might be that a team that does not score many goals also has lots of goals scored against them and so loses more games and scores even fewer points than the model suggests. Or it might be that such a team is involved in more equally matched games and there are more 0-0 draws and so it scores more points than the model suggests.

The scatter diagram was drawn with the goals scored on the horizontal axis and the points total on the vertical axis. It was done that way to emphasise that the number of points is dependent on the number of goals scored. (A

team gains points as a result of scoring goals. It does not score goals as a result of gaining points.) It is normal practice to plot the dependent variable on the vertical (y) axis and the independent variable on the horizontal (x) axis. The table below shows some more examples of dependent and independent variables. Chapter 15.5 Key words MODE 6: Statistics Select a mode: Two Variables (X, Y) Use when you have a scatter diagram, e.g. hours revised against test score. 1 3

2 6 3 5 4 8 At A Level, when there are two variables, we measure linear correlation or use linear regression. Thus choose . Enter the left table in a similar manner paying attention to which variable is on the x and y axis.
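The calculator's regression output can be cross-checked by hand. The sketch below is one possible way to do that, not the calculator's own routine: it fits a least squares line $y = a + bx$ to the four-point table above and then uses the quoted league model $y = 0.8376x + 9.04$ in both directions. The function names (`least_squares_line`, `predict_points`, `goals_for_points`) are hypothetical helpers introduced for illustration only.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx              # gradient
    a = mean_y - b * mean_x      # the line passes through the mean point
    return a, b


# Example table from the calculator slide (x = hours revised, y = test score)
hours = [1, 2, 3, 4]
scores = [3, 6, 5, 8]
a, b = least_squares_line(hours, scores)
print(f"y = {a:.2f} + {b:.2f}x")


# The league model quoted in the text: points = 0.8376 * goals + 9.04
def predict_points(goals):
    return 0.8376 * goals + 9.04


def goals_for_points(points):
    # Rearranged form of the same equation
    return (points - 9.04) / 0.8376


print(f"5 goals   -> {predict_points(5):.1f} points")
print(f"60 points -> {goals_for_points(60):.1f} goals")
```

Neither helper knows the range of the original data, so judging whether a prediction is an interpolation or an extrapolation still has to be done by looking at the scatter diagram.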

Once the data is entered, press OPTN, then choose 2-Variable Calc to obtain a list of all statistics such as $\bar{x}$, $\sigma_x$, etc., or Regression Calc to obtain $a$, $b$ and $r$ (i.e. the coefficients of your line of best fit and the PMCC). At this stage you do not need the $r$ value.

Using the table above, enter the data for goals for and pts into your calculator and write down the equation of the line of best fit.

Random and non-random variables

In the examples on the last slide, both the variables have unpredictable values and so are random. The same is true for the example about the goals scored and the points totals in the Premier League. Both variables are random variables, free to assume any of a particular set of values in a given range. All that follows about correlation will assume that both variables are random.

Sometimes one or both variables are controlled, so that the variable only takes a set of predetermined values; for example, the times at which temperature measurements are taken at a meteorological station. Controlled variables are independent and so are usually plotted on the horizontal axis of scatter diagrams.

Interpreting scatter diagrams

Assuming both variables are random, you can often judge whether they are correlated by looking at the scatter diagram. This will also show up some common situations where you might think there is correlation but would be incorrect. When correlation is present you expect most of the points on the scatter diagram to lie roughly within an ellipse.

Summary measures for bivariate data

Association and correlation

There are two summary measures that you may meet when using statistical software or in other subjects.

Spearman's rank correlation coefficient is a measure of association that is used when both variables are ranked. Its interpretation depends on the sample size. It is used as a test statistic.

Pearson's product moment correlation coefficient is a measure of correlation. It is given by most statistical software. It is used as a measure in its own right and as a test statistic. In both situations its interpretation depends on the sample size.

Lines of best fit

When bivariate data are displayed on a scatter diagram you often need to draw a line of best fit through the points. You can do this roughly by eye, trying to ensure that about the same numbers of points lie above and below the line. If you know the mean values of the two variables, you know the coordinates of the mean point and should draw the line of best fit so that it passes through it. However, drawing a line of best fit by eye is a somewhat haphazard process, and in slightly more advanced use it is common to calculate its equation as the least squares regression line. This is a standard output from statistical software and can also be found from many spreadsheets.

Always be careful when using a line of best fit or regression line; be certain that it makes sense, given the data you are dealing with.

Chapter 15.6 Standard Deviation

In God we trust. All others must bring data. - W Edwards Deming

You have already met range and interquartile range as measures of the spread of a data set.

Sometimes neither of these will quite meet your requirements. The range does not use all the available information, only the extreme values, which may well be outliers. In most situations this is unsatisfactory, although in quality control it can be an advantage as it is very sensitive to something going wrong, for example on a production line. The interquartile range requires that either the data are ranked, or else that an estimate is made, for example using a cumulative frequency curve.

Very often you want to know how much a typical value is above or below a central value such as the mean.

A new measure of spread

Kim and Jo play as strikers for local hockey teams. The selectors for the county team want to choose one of them. These are their goal-scoring records over the last 10 matches.

| Match | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|-------|---|---|---|---|---|---|---|---|---|----|
| Kim   | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 1 | 5  |
| Jo    | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | 2  |

Who should be selected?

The means of their numbers of goals per match should give an indication of their overall performance. They have both scored 10 goals, so they have the same mean of 1.0 goals per match. That does not help the selectors.

The spread of their numbers of goals could give an indication of their reliability. Start by finding how far each score is from the mean. This is the deviation, $(x - \bar{x})$. Here is Kim's data:

| Goals, $x$               | 0  | 0  | 0  | 3 | 0  | 1 | 0  | 0  | 1 | 5 | Total |
|--------------------------|----|----|----|---|----|---|----|----|---|---|-------|
| Deviation, $x - \bar{x}$ | -1 | -1 | -1 | 2 | -1 | 0 | -1 | -1 | 0 | 4 | 0     |
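The deviations described above can be computed for both players at once. This is a minimal sketch, assuming the two goal lists as read from the records table (Kim's list is reconstructed from the match-by-match layout); `deviations` is a hypothetical helper name.

```python
# Goals per match over the last 10 matches (assumed reading of the table)
kim = [0, 0, 0, 3, 0, 1, 0, 0, 1, 5]
jo = [2, 1, 1, 0, 0, 2, 1, 0, 1, 2]


def deviations(data):
    """Return the deviation (x - mean) of every value in data."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]


for name, goals in (("Kim", kim), ("Jo", jo)):
    devs = deviations(goals)
    # Positive and negative deviations cancel, whatever the data set
    print(name, devs, "total:", sum(devs))
```

Both totals come out as zero even though the two records look very different, which is why the raw total of the deviations cannot be used to compare spread.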

The total of the deviations is 0, because the positive and negative deviations cancel out, so it tells you nothing about the spread. Instead you want the absolute value of the deviations. Whether it is positive or negative, it is counted as positive. It is denoted by $\lvert x - \bar{x} \rvert$. You can see this in the table below. Remember in this case $\bar{x} = 1$.

| Goals, $x$                                      | 0  | 0  | 0  | 3 | 0  | 1 | 0  | 0  | 1 | 5 | Total |
|-------------------------------------------------|----|----|----|---|----|---|----|----|---|---|-------|
| Deviation, $x - \bar{x}$                        | -1 | -1 | -1 | 2 | -1 | 0 | -1 | -1 | 0 | 4 | 0     |
| Absolute deviation, $\lvert x - \bar{x} \rvert$ | 1  | 1  | 1  | 2 | 1  | 0 | 1  | 1  | 0 | 4 | 12    |

The next step is to find the means of their absolute deviations. For Kim it is $\frac{12}{10} = 1.2$.

Calculate the mean absolute deviation for Jo; remember the mean for Jo was also 1 goal per game.

Jo: 2 1 1 0 0 2 1 0 1 2

For Jo the mean absolute deviation works out to be $\frac{6}{10} = 0.6$. So there is a greater spread in the numbers of goals Kim scored.

Which of the two players should be picked for the county team? If the selectors want a player with a steady, reliable performance, they should probably select Jo.

Mean absolute deviation is given by $\frac{\sum \lvert x - \bar{x} \rvert}{n}$. It is an acceptable measure of spread but it is not widely used because it is difficult to work with. Instead the thinking behind it is taken further with standard deviation, which is more important mathematically and consequently is very widely used.

To work out the mean absolute deviation, you had to treat all deviations as if they were positive. A different way to get rid of the unwanted negative signs is to square the deviations. For Kim's data this is shown in the table below.

| Goals, $x$                             | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 1 | 5  | Total |
|----------------------------------------|---|---|---|---|---|---|---|---|---|----|-------|
| Squared deviation, $(x - \bar{x})^2$   | 1 | 1 | 1 | 4 | 1 | 0 | 1 | 1 | 0 | 16 | 26    |

So the mean of the squared deviations is $\frac{26}{10} = 2.6$. This measure of spread is called the variance, which when square-rooted gives the standard deviation.

The standard deviation is a measure of spread and has the same units as the original data. In this example the standard deviation for Kim's data is $\sqrt{2.6} = 1.61$ goals to 3 s.f. It is a measure of how much the value of a typical item of data might be above or below the mean.

You should use the standard notation and formulae given below:

Variance: $\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$

Standard deviation: $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

Standard deviation is by far the most important measure of spread in statistics. It is used for both discrete and continuous data and with ungrouped and grouped data.

Extending to frequency/grouped frequency tables

We can just mull over our mnemonic again:

Variance: "The mean of the squares minus the square of the mean" (msmsm)

$\sigma^2 = \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2$

Tip: It's better to try and memorise the mnemonic than the formula itself; you'll understand what's going on better.

Calculate the standard deviation for Jo.

Jo: 2 1 1 0 0 2 1 0 1 2

Examples

3, 11
Mean of the squares $= \frac{9 + 121}{2} = 65$ and square of the mean $= 7^2 = 49$, so Variance $= 65 - 49 = 16$ and Standard Deviation $= 4$.
So note that in the case of two items, the standard deviation is indeed the average distance of the values from the mean.

2cm, 3cm, 3cm, 5cm, 7cm
Mean of the squares $= \frac{96}{5} = 19.2$ and square of the mean $= 4^2 = 16$, so Variance $= 3.2$ cm² and Standard Deviation $= 1.79$ cm (3 s.f.).

Practice

Find the variance and standard deviation of the following sets of data.

2, 4, 6
1, 2, 3, 4, 5

As part of her job as a quality control inspector, Stella commissioned these data relating to the lifetime, h hours, of 80 light bulbs produced by her company.

(i) Estimate the mean and standard deviation of the light bulb lifetimes.
(ii) What do you think Stella should tell her company?

MODE 6: Statistics

Select a mode: Single Variable (X). Use when you have just one variable, e.g. height, weight, shoe size.

Example data: 2, 3, 5, 5

To enter your data, enter each value and press = after each. If you want a frequency column, press [SHIFT] [SETUP], scroll down to Statistics, then turn Frequency on. This setting will be saved for future use.

Once data is entered, press OPTN then choose 1-Variable Calc. This will give you all key statistics ($\bar{x}$, $\sigma_x$, etc.) at the same time.
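The key statistics from 1-Variable Calc can be cross-checked in software. The sketch below is an illustration under stated assumptions, not the calculator's own algorithm: it treats the example entries 2, 3, 5, 5 as a plain list of values, uses the divisor-$n$ variance (the "mean of the squares minus the square of the mean" form), and compares it with `statistics.pvariance` from the Python standard library. The name `one_variable_stats` is a hypothetical helper.

```python
from math import sqrt, isclose
from statistics import pvariance


def one_variable_stats(data):
    """Return (mean, variance, standard deviation) using divisor n."""
    n = len(data)
    mean = sum(data) / n
    mean_of_squares = sum(x * x for x in data) / n
    # "msmsm": mean of the squares minus the square of the mean
    variance = mean_of_squares - mean ** 2
    # Guard against a tiny negative value caused by floating point rounding
    return mean, variance, sqrt(max(variance, 0.0))


sample = [2, 3, 5, 5]   # assumed to be four separate data values
mean, variance, sd = one_variable_stats(sample)
print(f"mean = {mean}, variance = {variance:.4f}, sd = {sd:.4f}")

# Cross-check against the standard library's population variance
print("agrees with statistics.pvariance:", isclose(variance, pvariance(sample)))
```

The divisor-$n$ form matches the "mean of the squared deviations" definition used in this chapter; `statistics.variance` and `statistics.stdev` use the divisor $n - 1$ instead and would give slightly larger values.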

Using the same data for the light bulbs, can you calculate the mean ($\bar{x}$), variance ($\sigma^2$) and standard deviation ($\sigma$) of the data set?

Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are very close to the average. A high standard deviation means that the numbers are spread out.

Can standard deviation have a negative value? No, it cannot be negative. If you work out a standard deviation and get a negative value then you have made a mistake.

You have already seen how interquartile range can be used to identify possible outliers in a data set. Standard deviation can be used in a similar way.
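One such check flags any value that lies more than a chosen number of standard deviations from the mean. This is a minimal sketch, assuming a threshold of 2 standard deviations, the divisor-$n$ standard deviation, and the goal records of the two hockey players from earlier in the chapter as test data (Kim's list is an assumed reading of the records table). `possible_outliers` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def possible_outliers(data, k=2):
    """Return the values lying more than k standard deviations from the mean."""
    m = mean(data)
    s = pstdev(data)   # population standard deviation (divisor n)
    return [x for x in data if abs(x - m) > k * s]


kim = [0, 0, 0, 3, 0, 1, 0, 0, 1, 5]
jo = [2, 1, 1, 0, 0, 2, 1, 0, 1, 2]

print("Kim:", possible_outliers(kim))
print("Jo: ", possible_outliers(jo))
```

A flagged value is only a candidate for investigation; whether it stays in the analysis is still a judgement about the data.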

A commonly used check is to investigate all items which lie more than 2 standard deviations from the mean and to decide whether they should be included in your analysis or not. The distributions of many populations are approximately Normal, and in such cases this test can be expected to highlight the most extreme 5% of data values.

Let's check this for the light bulb question: are there any items of data likely to be outliers?

Example: May 2013 Q4

We can use our STATS mode to work out the various summations needed (and 1-Variable Calc will contain this amongst its list). Just input the table as normal. Note that, as per the discussion before, $\sum x$ on a calculator actually gives you $\sum fx$ because it's already taking the frequencies into account.

You ABSOLUTELY must check this with the $\sigma_x$ you can get on the calculator directly.

Test Your Understanding: May 2013 (R) Q3

Coding

What do you reckon is the mean height of people in this room? Now, stand on your chair (see the instructional video).

Is there an easy way to recalculate the mean based on your new heights? And the variance of your heights?

The mean would increase by the height of the chairs. The spread, however, is unaffected, thus the variance would remain the same.

Suppose now that, after a bout of stretching you to your limits, you're all 3 times your original height. What do you think happens to the standard deviation of your heights?

It becomes 3 times larger (i.e. your heights are 3 times as spread out!).

What do you think happens to the variance of your heights?

It becomes 9 times larger. We use the scale factor of the standard deviation, squared.

Extension Question: Can you prove the latter using the formula for variance?

$\sum x = 48.1$, $\sum x^2 = 486.44$ and $n = 7$. Can you calculate the mean, variance and standard deviation of this ungrouped data?

Representation and Summary of Data - Location

Coding

You need to understand why data is coded, how to code it and how to un-code it. Coding is done before any average is calculated, and is usually used with large values of data in order to simplify calculations.

Once data has been coded, averages are calculated. Then, after the average is worked out, the code is reversed in order to give the actual average.

Use the following coding to calculate the mean of the data below.

110, 120, 130, 140, 150

Coding: $y = x - 100$, where $x$ represents the original value and $y$ is the coded value.

So this code is telling us to subtract 100 from all the numbers before calculating the mean:

10, 20, 30, 40, 50

The mean of these numbers is 30. However, as 100 was subtracted, you must now undo this to get the correct mean: $30 + 100 = 130$. So the mean of the original set of data is 130.

Use the following coding to calculate the mean of the data below.

110, 120, 130, 140, 150

Coding: $y = \frac{x - 100}{10}$, where $x$ represents the original value and $y$ is the coded value.

So this code is telling us to subtract 100 from all the numbers, and then divide by 10, before calculating the mean:

1, 2, 3, 4, 5

The mean of these numbers is 3. We subtracted 100 then divided by 10, so to undo this we must multiply by 10 then add 100: $3 \times 10 + 100 = 130$. So the mean of the original set of data is 130.

Use the following code to estimate the mean of this set of grouped data on the lengths of phone calls.

Code: $y = \frac{x - 7.5}{5}$

First the midpoints ($x$) must be turned into new values ($y$) using the code.

| Time (mins) | Calls, $f$ | Midpoint, $x$ | $y$  | $fy$ |
|-------------|------------|---------------|------|------|
| 0-5         | 4          | 2.5           | -1   | -4   |
| 5-10        | 15         | 7.5           | 0    | 0    |
| 10-15       | 5          | 12.5          | 1    | 5    |
| 15-20       | 2          | 17.5          | 2    | 4    |
| 20-60       | 0          | 40            | 6.5  | 0    |
| 60-70       | 1          | 65            | 11.5 | 11.5 |
| Total       | 27         |               |      | 16.5 |

We are now working out the mean, so use the formula for this:

$\bar{y} = \frac{\sum fy}{\sum f} = \frac{16.5}{27} = 0.61111$

We calculated a mean of 0.61111 using the code $y = \frac{x - 7.5}{5}$, so we subtracted 7.5 and then divided by 5. We therefore need to multiply by 5 and then add 7.5:

$(0.61111 \times 5) + 7.5$

The mean for the original data ($\bar{x}$) is 10.5555 (10.56 to 2dp).
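The grouped-data coding steps above can be replayed in a few lines. This is a sketch under stated assumptions rather than a prescribed method: the midpoints and frequencies are copied from the phone call table, and `code` and `uncode_mean` are hypothetical helper names for the coding $y = \frac{x - 7.5}{5}$ and its reverse.

```python
# Midpoints and frequencies from the phone call table
midpoints = [2.5, 7.5, 12.5, 17.5, 40, 65]
freqs = [4, 15, 5, 2, 0, 1]


def code(x):
    """Coding used in the example: subtract 7.5, then divide by 5."""
    return (x - 7.5) / 5


def uncode_mean(y_bar):
    """Reverse the coding for a mean: multiply by 5, then add 7.5."""
    return y_bar * 5 + 7.5


coded = [code(x) for x in midpoints]
sum_f = sum(freqs)
sum_fy = sum(f * y for f, y in zip(freqs, coded))

y_bar = sum_fy / sum_f          # mean of the coded values
x_bar = uncode_mean(y_bar)      # estimated mean call length in minutes

# Cross-check: the same mean calculated directly from the uncoded midpoints
x_bar_direct = sum(f * x for f, x in zip(freqs, midpoints)) / sum_f

print(f"sum f = {sum_f}, sum fy = {sum_fy}")
print(f"coded mean = {y_bar:.5f}, uncoded mean = {x_bar:.4f}")
print(f"direct mean = {x_bar_direct:.4f}")
```

The coded and direct routes agree, which is the point of coding: it changes the arithmetic, not the answer.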

Representation and Summary of Data - Dispersion

Coding

As with averages, coding can be used to make data easier to work with. However, there is something extra to remember.

If you have a set of data with a range of 15, and reduce every number by 2, what will happen to the range? Nothing! Range measures the spread of data, and if all the numbers are 2 less, the spread will not have changed.

It is exactly the same for standard deviation. Because it measures the spread of data, any addition/subtraction in the coding will not need to be undone. Any division or multiplication will have to be uncoded as normal.

Use the following code to calculate the standard deviation of this set of data:

150, 160, 170, 180, 190

Code: $y = \frac{x}{10}$

Coded values: 15, 16, 17, 18, 19

| $y$       | $y^2$ |
|-----------|-------|
| 15        | 225   |
| 16        | 256   |
| 17        | 289   |
| 18        | 324   |
| 19        | 361   |
| Total: 85 | 1455  |

$\sum y = 85$, $\sum y^2 = 1455$, $n = 5$

$\sigma_y^2 = \frac{\sum y^2}{n} - \left(\frac{\sum y}{n}\right)^2 = \frac{1455}{5} - \left(\frac{85}{5}\right)^2 = 2$

$\sigma_y = \sqrt{2} = 1.41$ (2dp)

But we had divided by 10, so we must undo this:

$\sigma_x = \sqrt{2} \times 10 = 14.14$ (2dp)

Use the following code to calculate the standard deviation of this set of data:

150, 160, 170, 180, 190

Code: $y = x - 100$

Coded values: 50, 60, 70, 80, 90

| $y$        | $y^2$ |
|------------|-------|
| 50         | 2500  |
| 60         | 3600  |
| 70         | 4900  |
| 80         | 6400  |
| 90         | 8100  |
| Total: 350 | 25500 |

$\sum y = 350$, $\sum y^2 = 25500$, $n = 5$

$\sigma_y^2 = \frac{25500}{5} - \left(\frac{350}{5}\right)^2 = 200$

$\sigma_y = \sqrt{200} = 14.14$ (2dp)

We do not need to undo anything, as we only subtracted.

Use the following code to calculate the standard deviation of this set of data:

150, 160, 170, 180, 190

Code: $y = \frac{x - 100}{10}$

Coded values: 5, 6, 7, 8, 9

| $y$       | $y^2$ |
|-----------|-------|
| 5         | 25    |
| 6         | 36    |
| 7         | 49    |
| 8         | 64    |
| 9         | 81    |
| Total: 35 | 255   |

$\sum y = 35$, $\sum y^2 = 255$, $n = 5$

$\sigma_y^2 = \frac{255}{5} - \left(\frac{35}{5}\right)^2 = 2$

$\sigma_y = \sqrt{2} = 1.41$ (2dp)

We only need to undo the divide by 10:

$\sigma_x = \sqrt{2} \times 10 = 14.14$ (2dp)

Use the code below to calculate the standard deviation of this table of data.

Code: $y = \frac{x - 7.5}{5}$

| Call length | Calls, $f$ | Midpoint, $x$ | $y$ | $fy$ | $fy^2$ |
|-------------|------------|---------------|-----|------|--------|
| 0-5         | 4          | 2.5           | -1  | -4   | 4      |
| 5-10        | 12         | 7.5           | 0   | 0    | 0      |
| 10-15       | 6          | 12.5          | 1   | 6    | 6      |
| 15-20       | 3          | 17.5          | 2   | 6    | 12     |
| 20-30       | 1          | 25            | 3.5 | 3.5  | 12.25  |
| Total       | 26         |               |     | 11.5 | 34.25  |

$\sum f = 26$, $\sum fy = 11.5$, $\sum fy^2 = 34.25$

$\sigma_y^2 = \frac{\sum fy^2}{\sum f} - \left(\frac{\sum fy}{\sum f}\right)^2 = \frac{34.25}{26} - \left(\frac{11.5}{26}\right)^2 = 1.12$ (2dp)

$\sigma_y = 1.06$ (2dp)

Undo the divide by 5 only, working from the unrounded value:

$\sigma_x = 1.0591 \times 5 = 5.30$ (2dp)
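The grouped standard deviation example can be checked in the same way as the coded mean. The sketch below assumes the midpoints and frequencies from the call length table and the coding $y = \frac{x - 7.5}{5}$; `grouped_sd` is a hypothetical helper that applies the "mean of the squares minus the square of the mean" formula with frequencies.

```python
from math import sqrt

# Midpoints and frequencies from the call length table
midpoints = [2.5, 7.5, 12.5, 17.5, 25]
freqs = [4, 12, 6, 3, 1]


def grouped_sd(values, freqs):
    """Standard deviation from a frequency table (divisor = sum of f)."""
    sum_f = sum(freqs)
    mean = sum(f * v for f, v in zip(freqs, values)) / sum_f
    mean_of_squares = sum(f * v * v for f, v in zip(freqs, values)) / sum_f
    return sqrt(mean_of_squares - mean ** 2)


# Code the midpoints: subtract 7.5, then divide by 5
coded = [(x - 7.5) / 5 for x in midpoints]
sum_fy = sum(f * y for f, y in zip(freqs, coded))
sum_fy2 = sum(f * y * y for f, y in zip(freqs, coded))

sd_y = grouped_sd(coded, freqs)
# Only the division by 5 affects spread, so only that step is undone
sd_x = sd_y * 5

# Cross-check: standard deviation straight from the uncoded midpoints
sd_x_direct = grouped_sd(midpoints, freqs)

print(f"sum fy = {sum_fy}, sum fy^2 = {sum_fy2}")
print(f"coded sd = {sd_y:.4f}, uncoded sd = {sd_x:.4f}")
print(f"direct sd = {sd_x_direct:.4f}")
```

Subtracting 7.5 shifts every midpoint by the same amount and leaves the spread alone, so the uncoded result matches the direct calculation after multiplying by 5 only.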