# Why you should graph data

Amended 10 March 2016 (corrections and updates made)

I came across Anscombe’s Quartet on Wikipedia recently. I must confess that I had not seen it before, and I don’t recall seeing it in any introductory statistics book.

Anscombe’s Quartet is a conceptually and graphically clear way of showing the importance of graphs in statistical analysis. It consists of four datasets, each made up of 11 (x, y) pairs. All four datasets have the same x mean, y mean, x variance, y variance, correlation coefficient and regression equation (to two or three significant figures), yet their distributions are very different. Once plotted, they clearly demonstrate the impact of outliers and how non-linear relationships can be identified.
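A short worked calculation shows why identical summary statistics force an identical fitted line. For a least-squares regression of y on x, the slope and intercept depend only on the two means, the two variances and the correlation coefficient. The notation below ($\bar{x}$, $s_x^2$, $r$, $a$, $b$) is assumed here for illustration and is not taken from Anscombe's paper; the numbers are the rounded values quoted in the figure caption further down.

```latex
% Slope b and intercept a of the least-squares line y = a + b x,
% written in terms of the summary statistics shared by all four datasets.
\begin{align*}
  b &= r \, \frac{s_y}{s_x}
     = 0.816 \sqrt{\frac{4.125}{11}}
     \approx 0.816 \times 0.612
     \approx 0.50, \\
  a &= \bar{y} - b \, \bar{x}
     \approx 7.5 - 0.50 \times 9
     = 3.0 .
\end{align*}
```

So any dataset with these five summary values gives roughly y = 3 + 0.5x, whatever its shape, which is exactly why the scatterplot is needed to tell the four apart.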

Citation:

F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, Vol. 27, No. 1 (Feb. 1973), pp. 17-21.

Article Stable URL: http://www.jstor.org/stable/2682899 (Not open access)

LaTeX code below.
    \documentclass{article}
    \usepackage{pgfplots}
    \usepackage{pgfplotstable}
    \pgfplotsset{compat=1.7}
    \usepackage{amssymb, amsmath}
    \usepackage{subcaption}

    \begin{document}

    \begin{figure}
      \caption{Anscombe's quartet is a good demonstration of why a scatterplot is so valuable prior to calculating regression equations and correlation coefficients. In all four cases the $x$ values have a mean of 9 and a variance of 11. The $y$ values have a mean of 7.5 and a variance of about 4.125. The correlation coefficient of each is 0.816 and the linear regression line is $y = 3 + 0.5x$.}
      % Dataset I
      \begin{subfigure}{.45\textwidth}
        \centering
        \caption{Normal linear relationship}
        \begin{tikzpicture}
          \begin{axis}[width=5cm, height=5cm, xlabel=X1, ylabel=Y1]
            \addplot[scatter, only marks, mark=x, mark size=4pt] coordinates {
              (10, 8.04) (8, 6.95) (13, 7.58) (9, 8.81) (11, 8.33) (14, 9.96)
              (6, 7.24) (4, 4.26) (12, 10.84) (7, 4.82) (5, 5.68)
            };
            % Regression line y = 3 + 0.5x, drawn from x = 0 to x = 20
            \addplot[mark=none] coordinates { (0, 3) (20, 13) };
          \end{axis}
        \end{tikzpicture}
      \end{subfigure}
      \hfill
      % Dataset II
      \begin{subfigure}{.45\textwidth}
        \centering
        \caption{Relationship clear, but not linear}
        \begin{tikzpicture}
          \begin{axis}[width=5cm, height=5cm, xlabel=X2, ylabel=Y2]
            \addplot[scatter, only marks, mark=x, mark size=4pt] coordinates {
              (10, 9.14) (8, 8.14) (13, 8.74) (9, 8.77) (11, 9.26) (14, 8.10)
              (6, 6.13) (4, 3.10) (12, 9.13) (7, 7.26) (5, 4.74)
            };
            \addplot[mark=none] coordinates { (0, 3) (20, 13) };
          \end{axis}
        \end{tikzpicture}
      \end{subfigure}
      \par\medskip % start the second row of panels
      % Dataset III
      \begin{subfigure}{.45\textwidth}
        \centering
        \caption{Clear linear relationship, but one outlier offsets the regression line}
        \begin{tikzpicture}
          \begin{axis}[width=5cm, height=5cm, xlabel=X3, ylabel=Y3]
            \addplot[scatter, only marks, mark=x, mark size=4pt] coordinates {
              (10, 7.46) (8, 6.77) (13, 12.74) (9, 7.11) (11, 7.81) (14, 8.84)
              (6, 6.08) (4, 5.39) (12, 8.15) (7, 6.42) (5, 5.73)
            };
            \addplot[mark=none] coordinates { (0, 3) (20, 13) };
          \end{axis}
        \end{tikzpicture}
      \end{subfigure}
      \hfill
      % Dataset IV
      \begin{subfigure}{.45\textwidth}
        \centering
        \caption{Ten observations share the same $x$ value; a single outlier alone determines the slope of the regression line}
        \begin{tikzpicture}
          \begin{axis}[width=5cm, height=5cm, xlabel=X4, ylabel=Y4]
            \addplot[scatter, only marks, mark=x, mark size=4pt] coordinates {
              (8, 6.58) (8, 5.76) (8, 7.71) (8, 8.84) (8, 8.47) (8, 7.04)
              (8, 5.25) (19, 12.50) (8, 5.56) (8, 7.91) (8, 6.89)
            };
            \addplot[mark=none] coordinates { (0, 3) (20, 13) };
          \end{axis}
        \end{tikzpicture}
      \end{subfigure}
    \end{figure}

    \end{document}
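The regression line in the figure code is drawn from hard-coded end points. As a possible alternative, the sketch below lets pgfplotstable fit the line from the data, using its `create col/linear regression` key. It shows the first dataset only. The table macro `\datasetone`, the column names `x` and `y`, and the macros `\fitslope` and `\fitintercept` are names made up for this illustration.

```latex
\documentclass{article}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\pgfplotsset{compat=1.7}

\begin{document}

% First dataset of the quartet, stored as an inline table
% (\datasetone is a hypothetical name chosen for this sketch).
\pgfplotstableread{
x   y
10  8.04
8   6.95
13  7.58
9   8.81
11  8.33
14  9.96
6   7.24
4   4.26
12  10.84
7   4.82
5   5.68
}\datasetone

\begin{tikzpicture}
  \begin{axis}[width=5cm, height=5cm, xlabel=X1, ylabel=Y1]
    % Raw observations
    \addplot[only marks, mark=x, mark size=4pt]
      table[x=x, y=y] {\datasetone};
    % Least-squares line computed by pgfplotstable from the same table
    \addplot[mark=none]
      table[x=x, y={create col/linear regression={y=y}}] {\datasetone};
    % Save the fitted coefficients globally for use after the plot
    \xdef\fitslope{\pgfplotstableregressiona}
    \xdef\fitintercept{\pgfplotstableregressionb}
  \end{axis}
\end{tikzpicture}

Fitted line: $y = \pgfmathprintnumber{\fitintercept}
  + \pgfmathprintnumber{\fitslope}\,x$

\end{document}
```

The printed coefficients should come out close to the intercept of 3 and slope of 0.5 quoted in the caption. Because the fitted values are only evaluated at the observed x values, the line spans the data range rather than running from 0 to 20; swapping in another dataset only requires changing the table.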