Why you should graph data

Amended 10 March 2016 (corrections and update made)

Figure: Anscombe's Quartet

I came across Anscombe’s Quartet on Wikipedia recently. I must confess to not having seen it before and don’t recall seeing it in any introductory statistics books.

Anscombe’s Quartet is a conceptually and graphically clear way of showing the importance of graphs in statistical analysis. It consists of four datasets, each with 11 pairs of observations. All four share the same x mean, y mean, x variance, y variance, correlation coefficient and regression equation (to two or three decimal places), yet they look very different when plotted. Together they demonstrate the impact of outliers and show how a plot reveals a non-linear relationship that the summary statistics hide.

Citation:

F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, Vol. 27, No. 1 (Feb. 1973), pp. 17-21.

Article Stable URL: http://www.jstor.org/stable/2682899 (Not open access)

 

LaTeX code below.

\documentclass{article}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\pgfplotsset{compat=1.7}
\usepackage{amssymb, amsmath}
\usepackage{subcaption}
\begin{document}
\begin{figure}
\caption{Anscombe's quartet demonstrates why a scatterplot is so valuable before calculating regression equations and correlation coefficients. In all four cases the $x$ values have a mean of 9 and a variance of 11, and the $y$ values have a mean of 7.5 and a variance of about 4.125. The correlation coefficient of each is 0.816 and the linear regression line is $y = 3 + 0.5x$.}
\begin{subfigure}{.45 \textwidth}
\centering
\caption{Normal linear relationship}
\begin{tikzpicture}
\begin{axis} [width=5cm, height=5cm, xlabel=X1, ylabel=Y1]
\addplot[scatter, only marks, mark=x, mark size=4pt]
coordinates
{
(10, 8.04)
(8.0, 6.95)
(13, 7.58)
(9, 8.81)
(11, 8.33)
(14, 9.96)
(6, 7.24)
(4, 4.26)
(12, 10.84)
(7, 4.82)
(5, 5.68)
};
% Regression line y = 3 + 0.5x, drawn from x = 0 to x = 20
\addplot[no marks]
coordinates
{
(0, 3)
(20, 13)
};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45 \textwidth}
\centering
\caption{Relationship clear, but not linear}
\begin{tikzpicture}
\begin{axis}[width=5cm, height=5cm, xlabel=X2, ylabel=Y2]
\addplot[scatter, only marks, mark=x, mark size=4pt]
coordinates
{
(10, 9.14)
(8.0, 8.14)
(13, 8.74)
(9, 8.77)
(11, 9.26)
(14, 8.10)
(6, 6.13)
(4, 3.1)
(12, 9.13)
(7, 7.26)
(5, 4.74)
};
% Regression line y = 3 + 0.5x, drawn from x = 0 to x = 20
\addplot[no marks]
coordinates
{
(0, 3)
(20, 13)
};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\\
\begin{subfigure}{.45 \textwidth}
\centering
\caption{Clear linear relationship, but one outlier offsets the regression line}
\begin{tikzpicture}
\begin{axis} [width=5cm, height=5cm, xlabel=X3, ylabel=Y3]
\addplot[scatter, only marks, mark=x, mark size=4pt]
coordinates
{
(10, 7.46)
(8.0, 6.77)
(13, 12.74)
(9, 7.11)
(11, 7.81)
(14, 8.84)
(6, 6.08)
(4, 5.39)
(12, 8.15)
(7, 6.42)
(5, 5.73)
};
% Regression line y = 3 + 0.5x, drawn from x = 0 to x = 20
\addplot[no marks]
coordinates
{
(0, 3)
(20, 13)
};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\begin{subfigure}{.45 \textwidth}
\centering
\caption{Ten observations share the same $x$ value and show no relationship; a single high-leverage point alone determines the regression line}
\begin{tikzpicture}
\begin{axis} [width=5cm, height=5cm, xlabel=X4, ylabel=Y4]
\addplot[scatter, only marks, mark=x, mark size=4pt]
coordinates
{
(8, 6.58)
(8, 5.76)
(8, 7.71)
(8, 8.84)
(8, 8.47)
(8, 7.04)
(8, 5.25)
(19, 12.5)
(8, 5.56)
(8, 7.91)
(8, 6.89)
};
% Regression line y = 3 + 0.5x, drawn from x = 0 to x = 20
\addplot[no marks]
coordinates
{
(0, 3)
(20, 13)
};
\end{axis}
\end{tikzpicture}
\end{subfigure}
\end{figure}
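% Worked check (an added sketch, not taken from Anscombe's paper): the shared
% regression line can be recovered from the summary statistics quoted in the
% figure caption. The symbols r, s_x, s_y, \bar{x}, \bar{y}, b_0 and b_1 are
% notation assumed here for illustration; the figures are rounded.
\paragraph{Why the fitted lines coincide.}
In simple linear regression the slope and intercept depend only on the
means, variances and correlation, so four datasets with equal summary
statistics must produce the same line:
\begin{align*}
b_1 &= r\,\frac{s_y}{s_x} = 0.816\sqrt{\frac{4.125}{11}} \approx 0.50, \\
b_0 &= \bar{y} - b_1\bar{x} = 7.5 - 0.5 \times 9 = 3.
\end{align*}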
\end{document}
