Scatterplots and Correlation
So far we have been examining one variable at a time. In practice, we often want to look at several variables at once. In this chapter, we will specifically consider how to analyze two quantitative variables.
A response variable measures an outcome of a study.
An explanatory variable may explain or influence changes in a response variable.
Ex. Suppose that individuals are given different amounts of alcohol, and then reaction times for a particular activity are measured. Here the amount of alcohol is the explanatory variable and reaction time is the response variable.
Often explanatory variables are called independent variables, and response variables are called dependent variables.
Note that a cause-and-effect relationship may or may not exist; an association between the two variables does not by itself establish causality.
Two variables measured on the same individual are associated if some values of one variable tend to occur with some values of the second variable more than with other values of that variable.
Displaying Relationships: Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal or x axis, and the values of the other variable appear on the vertical or y axis. Each individual in the data appears as a point in the plot.
If there is an explanatory variable and a response variable, the explanatory variable goes on the horizontal axis and the response variable on the vertical axis. If such a distinction cannot be made, then either variable can go on either axis.
To interpret a scatterplot, look for the overall pattern and for striking deviations from that pattern. To describe the overall pattern, look at the (1) form, (2) direction, and (3) strength of the relationship. Also look for any outliers.
Two variables are positively associated when above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together.
Ex. In a large group of people, there will be a positive association between height and weight.
Two variables are negatively associated when above-average values of one tend to accompany below-average values of the other, and vice versa.
Ex. In a large group of people, there will be a negative association between packs of cigarettes smoked and length of life.
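The definitions of positive and negative association can be checked directly by counting how many individuals fall on the same side of both averages. The sketch below is only an illustration: the function name association_counts is made up for these notes, and the height and weight values are hypothetical numbers invented for the example, not real measurements.

```python
def association_counts(x, y):
    """Count individuals on the same side of both means vs. opposite sides."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    same_side = 0      # above average on both, or below average on both
    opposite_side = 0  # above average on one, below average on the other
    for xi, yi in zip(x, y):
        product = (xi - mean_x) * (yi - mean_y)
        if product > 0:
            same_side += 1
        elif product < 0:
            opposite_side += 1
        # A value exactly equal to its mean counts toward neither group.
    return same_side, opposite_side


# Hypothetical values, invented for illustration only.
height_in = [62, 64, 66, 68, 70, 72]
weight_lb = [120, 135, 150, 155, 170, 185]

same, opposite = association_counts(height_in, weight_lb)
print("same side:", same, "opposite sides:", opposite)
```

When most individuals land in the "same side" group, the association is positive; when most land in the "opposite sides" group, it is negative. This count is a rough check on direction only and says nothing about form or strength.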
Ex. Create scatterplots to show the relationship between (a) yearly average temperature and number of fires and (b) yearly average temperature and area burned.
Year   Average temperature (°F)   Number of fires   Acres burned (millions)
2000   54.52                      92250             7.39
2001   52.19                      84079             3.57
2002   53.74                      73457             7.18
2003   53.1                       63629             3.96
2004   53.6                       65461             8.1
2005   53.08                      66753             8.69
2006   54.38                      96385             9.87
2007   53.43                      85705             9.33
2008   53.04                      78979             5.29
2009   52.83                      78792             5.92
2010   52.06                      71971             3.42
2011   52.82                      74126             8.71
Source: fire data from http://wildland-fires.sciencedaily.com/#
temperature data from http://www.ncdc.noaa.gov/temp-and-precip/time-series/index.php?parameter=tmp&month=5&year=2000&filter=12&state=110&div=0
In Excel, highlight the two variables of interest. Click Insert -> Scatter and select the appropriate chart type.
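For readers working outside Excel, the same two scatterplots can be sketched in Python. This is one possible approach, not part of the original exercise: it assumes the third-party matplotlib library is installed, treats temperature as the explanatory variable, and uses a made-up function name (save_scatterplots) and output file name.

```python
# Data copied from the table above (years 2000-2011, in order).
temperature = [54.52, 52.19, 53.74, 53.1, 53.6, 53.08,
               54.38, 53.43, 53.04, 52.83, 52.06, 52.82]  # degrees F
fires = [92250, 84079, 73457, 63629, 65461, 66753,
         96385, 85705, 78979, 78792, 71971, 74126]
acres_burned = [7.39, 3.57, 7.18, 3.96, 8.1, 8.69,
                9.87, 9.33, 5.29, 5.92, 3.42, 8.71]       # millions of acres


def save_scatterplots(filename="fire_scatterplots.png"):
    """Draw both scatterplots side by side and save them to an image file."""
    # Imported inside the function so the data lists above remain usable
    # even where matplotlib is not installed.
    import matplotlib
    matplotlib.use("Agg")  # draw to a file; no display window is needed
    import matplotlib.pyplot as plt

    fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

    # Explanatory variable (temperature) goes on the x axis in both plots.
    left.scatter(temperature, fires)
    left.set_xlabel("Average temperature (°F)")
    left.set_ylabel("Number of fires")

    right.scatter(temperature, acres_burned)
    right.set_xlabel("Average temperature (°F)")
    right.set_ylabel("Acres burned (millions)")

    fig.tight_layout()
    fig.savefig(filename)
    plt.close(fig)
    return filename
```

Calling save_scatterplots() writes the image file to the current folder. Each point is one year, so the plots can then be read for form, direction, strength, and outliers in the usual way.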
Ex. Fuel used vs. Speed
How does the fuel consumption of a car change as its speed increases?
Speed vs. fuel consumption per 100 km travelled for a British Ford Escort
Describe the form of the relationship. Explain why the form makes sense.
Does it make sense to describe the variables as either positively or negatively associated? Why?
Measuring Linear Association: Correlation
We will look at one numerical measure of association, the correlation coefficient. Technically, correlation only makes sense when both variables are quantitative.
The correlation describes the direction and strength of a linear relationship between two quantitative variables. The correlation coefficient, usually written as r, is more formally called the Pearson product-moment correlation coefficient.
Now let's learn how to calculate r. We will compute r based upon n observations on variables x and y: $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$. We denote this $r_{XY}$, the correlation between X and Y.
Each observation is an ordered pair $(x_i, y_i)$. For example, $x_1$ and $y_1$ might be my age and my number of college hours earned.
Calculating the correlation coefficient
List the two values for each individual.
Compute the sum of the X values, and compute the sum of the Y values.
Square each X value, and compute the sum of the squares.
Square each Y value, and compute the sum of the squares.
Multiply X by Y for each individual, and find the sum of the XY products.
Plug these values into the formula.
Ex. Calculate $r_{XY}$ by hand.
X    Y    X²    Y²    XY
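The formula referred to in the last step is not reproduced in these notes. Assuming it is the usual computational form of the Pearson correlation coefficient, which uses exactly the sums listed in the steps above, it would read:

```latex
r_{XY} = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}
              {\sqrt{\left[\,n\sum X^{2} - \left(\sum X\right)^{2}\right]
                     \left[\,n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
```

Here n is the number of individuals and each sum runs over all n individuals. The result always lies between -1 and 1.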
Ex. Using Excel, find r for (a) yearly average temperature and number of fires and (b) yearly average temperature and area burned.
Note: The columns for the variables of interest must be next to each other.
Using the CORREL function: In the cell where you want to display the correlation coefficient, type =CORREL(array1, array2).
array1 contains data for the X variable
array2 contains data for the Y variable
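As a cross-check on the Excel result, the hand-calculation steps can be carried out in a few lines of Python. This is a minimal sketch rather than a definitive implementation: it assumes the formula meant in those steps is the standard computational form of Pearson's r, built from the sums of X, Y, X², Y², and XY, and the function name correlation is made up for this example.

```python
from math import sqrt

# Data copied from the table of yearly values (2000-2011, in order).
temperature = [54.52, 52.19, 53.74, 53.1, 53.6, 53.08,
               54.38, 53.43, 53.04, 52.83, 52.06, 52.82]  # degrees F
fires = [92250, 84079, 73457, 63629, 65461, 66753,
         96385, 85705, 78979, 78792, 71971, 74126]
acres_burned = [7.39, 3.57, 7.18, 3.96, 8.1, 8.69,
                9.87, 9.33, 5.29, 5.92, 3.42, 8.71]       # millions of acres


def correlation(x, y):
    """Pearson r computed from the column sums of the hand calculation."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    sum_x = sum(x)                             # sum of the X values
    sum_y = sum(y)                             # sum of the Y values
    sum_x2 = sum(v * v for v in x)             # sum of the squared X values
    sum_y2 = sum(v * v for v in y)             # sum of the squared Y values
    sum_xy = sum(a * b for a, b in zip(x, y))  # sum of the XY products
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero if either variable never changes;
    # r is undefined in that case and the division below raises an error.
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_fires = correlation(temperature, fires)
r_acres = correlation(temperature, acres_burned)
print(f"r for temperature and number of fires: {r_fires:.3f}")
print(f"r for temperature and acres burned:    {r_acres:.3f}")
```

The order of the two arguments does not matter, just as swapping array1 and array2 in CORREL gives the same value.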