Amended 10 March 2016 (corrections and updates made)

I came across Anscombe’s Quartet on Wikipedia recently. I must confess to not having seen it before and don’t recall seeing it in any introductory statistics books.

Anscombe’s Quartet is a conceptually and graphically clear way of showing the importance of graphs in statistical analysis. It consists of four data sets, each containing 11 pairs of observations. All four have (almost exactly) the same x mean, y mean, x variance, y variance, correlation coefficient and regression equation, yet their distributions are very different. Together they demonstrate the impact of outliers and show how plotting the data reveals non-linear relationships.
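As a quick consistency check, the shared regression equation follows directly from the other summary statistics. The sketch below uses the rounded values quoted in the figure caption further down (means 9 and 7.5, variances 11 and 4.125, correlation 0.816); the symbols $a$, $b$, $r$, $s_x$ and $s_y$ are notation chosen here for illustration, not taken from Anscombe's paper.

```latex
% Slope b and intercept a of the least-squares line y = a + b x,
% computed from the rounded summary statistics (requires amsmath).
\begin{align*}
b &= r\,\frac{s_y}{s_x}
   = 0.816\,\sqrt{\frac{4.125}{11}}
   \approx 0.816 \times 0.612
   \approx 0.50, \\
a &= \bar{y} - b\,\bar{x}
   \approx 7.5 - 0.5 \times 9
   = 3.
\end{align*}
```

So any data set with these five summary statistics yields the same fitted line, $y = 3 + 0.5x$, whatever its shape.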

Citation:

F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, Vol. 27, No. 1 (Feb., 1973), pp. 17-21

Article Stable URL: http://www.jstor.org/stable/2682899 (Not open access)

LaTeX code below.

\documentclass{article}

\usepackage{pgfplots}

\usepackage{pgfplotstable}

\pgfplotsset{compat=1.7}

\usepackage{amssymb, amsmath}

\usepackage{subcaption}

\begin{document}

\begin{figure}

\caption{Anscombe's quartet demonstrates why a scatterplot is so valuable before calculating regression equations and correlation coefficients. In all four cases the $x$ values have a mean of 9 and a variance of 11, and the $y$ values have a mean of 7.5 and a variance of approximately 4.125. The correlation coefficient in each case is 0.816 and the linear regression line is $y = 3 + 0.5x$.}

\begin{subfigure}{.45 \textwidth}

\centering

\caption{Normal linear relationship}

\begin{tikzpicture}

\begin{axis} [width=5cm, height=5cm, xlabel=X1, ylabel=Y1]

\addplot[scatter, only marks, mark=x, mark size=4pt]

coordinates

{

(10, 8.04)

(8.0, 6.95)

(13, 7.58)

(9, 8.81)

(11, 8.33)

(14, 9.96)

(6, 7.24)

(4, 4.26)

(12, 10.84)

(7, 4.82)

(5, 5.68)

};

% Regression line y = 3 + 0.5x, drawn between x = 0 and x = 20
\addplot[mark=none] coordinates {(0, 3) (20, 13)};

\end{axis}

\end{tikzpicture}

\end{subfigure}\hfill
\begin{subfigure}{.45 \textwidth}

\centering

\caption{Relationship clear, but not linear}

\begin{tikzpicture}

\begin{axis}[width=5cm, height=5cm, xlabel=X2, ylabel=Y2]

\addplot[scatter, only marks, mark=x, mark size=4pt]

coordinates

{

(10, 9.14)

(8.0, 8.14)

(13, 8.74)

(9, 8.77)

(11, 9.26)

(14, 8.10)

(6, 6.13)

(4, 3.1)

(12, 9.13)

(7, 7.26)

(5, 4.74)

};

% Regression line y = 3 + 0.5x, drawn between x = 0 and x = 20
\addplot[mark=none] coordinates {(0, 3) (20, 13)};

\end{axis}

\end{tikzpicture}

\end{subfigure}

\medskip

\begin{subfigure}{.45 \textwidth}

\centering

\caption{Clear linear relationship, but one outlier offsets the regression line}

\begin{tikzpicture}

\begin{axis} [width=5cm, height=5cm, xlabel=X3, ylabel=Y3]

\addplot[scatter, only marks, mark=x, mark size=4pt]

coordinates

{

(10, 7.46)

(8.0, 6.77)

(13, 12.74)

(9, 7.11)

(11, 7.81)

(14, 8.84)

(6, 6.08)

(4, 5.39)

(12, 8.15)

(7, 6.42)

(5, 5.73)

};

% Regression line y = 3 + 0.5x, drawn between x = 0 and x = 20
\addplot[mark=none] coordinates {(0, 3) (20, 13)};

\end{axis}

\end{tikzpicture}

\end{subfigure}\hfill
\begin{subfigure}{.45 \textwidth}

\centering

\caption{No relationship among ten of the observations, which all share the same $x$ value; a single high-leverage outlier alone determines the regression line}

\begin{tikzpicture}

\begin{axis} [width=5cm, height=5cm, xlabel=X4, ylabel=Y4]

\addplot[scatter, only marks, mark=x, mark size=4pt]

coordinates

{

(8, 6.58)

(8.0, 5.76)

(8, 7.71)

(8, 8.84)

(8, 7.04)

(8, 5.25)

(19, 12.5)

(8, 5.56)

(8, 7.91)

(8, 6.89)

(8, 8.47)

};

% Regression line y = 3 + 0.5x, drawn between x = 0 and x = 20
\addplot[mark=none] coordinates {(0, 3) (20, 13)};

\end{axis}

\end{tikzpicture}

\end{subfigure}

\end{figure}

\end{document}
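The listing above draws the regression line from hand-entered end points. As an alternative, the pgfplotstable package (already loaded in the preamble) can fit the line from the data itself. The following is a minimal sketch for the first data set only, assuming the `create col/linear regression` key and the `\pgfplotstableregressiona` (slope) and `\pgfplotstableregressionb` (intercept) macros behave as described in the pgfplotstable manual. The macro name `\anscombeone` and the legend placement are choices made here for illustration and are not part of the original code.

```latex
\documentclass{article}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\pgfplotsset{compat=1.7}

\begin{document}

% Anscombe's first data set, stored once as an inline table.
% \anscombeone is a hypothetical macro name chosen for this sketch.
\pgfplotstableread{
x   y
10  8.04
8   6.95
13  7.58
9   8.81
11  8.33
14  9.96
6   7.24
4   4.26
12  10.84
7   4.82
5   5.68
}\anscombeone

\begin{tikzpicture}
\begin{axis}[width=5cm, height=5cm, xlabel=X1, ylabel=Y1,
             legend pos=outer north east]
% Scatter of the raw observations.
\addplot[only marks, mark=x, mark size=4pt]
  table[x=x, y=y] {\anscombeone};
\addlegendentry{data}
% Least-squares line fitted by pgfplotstable from the same table.
\addplot[mark=none]
  table[x=x, y={create col/linear regression={y=y}}] {\anscombeone};
% Print the fitted slope and intercept in the legend.
\addlegendentry{$\pgfmathprintnumber{\pgfplotstableregressiona}\,x
  \pgfmathprintnumber[print sign]{\pgfplotstableregressionb}$}
\end{axis}
\end{tikzpicture}

\end{document}
```

Repeating the same pattern for the other three data sets should give (to two decimal places) the same slope and intercept in each legend, which is the point of the quartet.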