Thursday, January 8, 2009

Charts and Graphs 1: Missing Data and Irregular Intervals

Stephen Few wrote about Line Graphs and Irregular Intervals, and the debate rages on.

I think the original graph of postage stamp prices is fine. The x-axis uses regular intervals, and the points mark the actual known values. I agree that a step/bar graph might be preferable for some purposes, but if you want to see whether the rise in stamp prices is in line with inflation, the line is better. If a step graph is used, the trend line should connect the midpoints of the bars (see my version, which includes a CPI line). In effect this spreads out the changes as though they were more continuous, and the total area under the line would be about the same as the area under the steps.
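To make the midpoint idea concrete, here is a minimal sketch of how the trend-line anchors could be computed for a step graph. The function name `step_midpoints` and the sample years and prices are hypothetical, chosen only for illustration; they are not the actual stamp data.

```python
def step_midpoints(change_years, prices, end_year):
    """Return one (x, price) point per step, with x at the middle of the step.

    change_years: years in which each price took effect, in ascending order.
    prices: the price in force from each change year until the next one.
    end_year: where the last step is assumed to end (e.g. the chart's right edge).
    """
    if len(change_years) != len(prices):
        raise ValueError("change_years and prices must have the same length")
    # Each step runs from its own change year to the next one (or to end_year).
    edges = list(change_years) + [end_year]
    return [((edges[i] + edges[i + 1]) / 2, prices[i]) for i in range(len(prices))]


# Hypothetical values, for illustration only.
years = [2000, 2002, 2006]
prices = [30, 35, 40]
print(step_midpoints(years, prices, end_year=2008))
```

Connecting these points gives a line that rises partway through each step instead of jumping at the change dates.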

However, the example of households with computers and internet access has several problems. The original is missing data for some years, but the x-axis it uses is categorical, so the known years are spaced evenly and the gaps are hidden. Instead, it should use a continuous scale so that the gaps are apparent. Like the first one, it uses both points and lines. The points indicate the known data, and the lines help you to interpolate what the missing values might be.

Another problem is that the chart's design does not match its apparent purpose. The title of this chart says, "In 2003, more than 88% of households owning a computer were online, up 40% from 1997." To arrive at this fact, the reader has to divide the Internet Access number by the Presence of Computer number. Instead, why not just graph this ratio on the chart? That's what I've done in the second chart below. Notice how this also lets you see that a greater percentage of computer households had internet access in 2001 than in 2003.
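The two fixes described above (a continuous year axis and a plotted ratio) can be sketched in a few lines. This is only an illustration: the function name `online_share` and the sample percentages are made up and are not the survey figures from the original chart.

```python
def online_share(computer_pct, internet_pct):
    """Percent of computer-owning households that are online, for every year.

    Both arguments map year -> percent of all households. The result covers
    every year from the first to the last year in computer_pct, with None
    for years where either value is missing, so gaps stay visible.
    """
    first, last = min(computer_pct), max(computer_pct)
    share = {}
    for year in range(first, last + 1):
        computers = computer_pct.get(year)
        internet = internet_pct.get(year)
        if computers is None or internet is None or computers == 0:
            share[year] = None  # missing year: leave a gap, do not guess
        else:
            share[year] = round(100 * internet / computers, 1)
    return share


# Hypothetical percentages, for illustration only.
computer = {1997: 40.0, 2000: 50.0, 2001: 56.0}
internet = {1997: 20.0, 2000: 40.0, 2001: 49.0}
for year, pct in online_share(computer, internet).items():
    print(year, pct)
```

Keeping the missing years in the result, rather than dropping them, is what lets a charting tool place the known years at their true positions on the axis.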

Here's a corrected version that I created using Style Chart. Notice the missing bars and points.

To handle missing data points, there are a few different options:

  • Drop the lines altogether. However, it is easier to judge the slope of a drawn line than the slope of a line you have to imagine between two points.

  • Don't draw a line where values are missing. This is fine with a single line, but hard to read when there are several.

  • Draw a different connector, e.g. a dotted line. This helps with slope analysis, but be sure to make it very clear that data is missing and has been interpolated.

  • Draw both points and lines. This is the most common option and, in my opinion, the most intuitive. Line graphs always do some amount of interpolation; otherwise they are really point graphs.
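All four options come down to how the connectors between known points are treated. As a rough sketch (the function name `split_segments` and the sample values are hypothetical), a series with missing values can be split into ordinary segments and gap-bridging segments, which a charting tool could then style differently:

```python
def split_segments(points):
    """Split a series into solid and bridging line segments.

    points: list of (x, y) pairs sorted by x, with y set to None where
    the value is missing.
    Returns (solid, bridging): solid segments join neighbouring known
    points; bridging segments span one or more missing values.
    """
    solid, bridging = [], []
    prev_index = None  # index of the most recent known point
    for index, (x, y) in enumerate(points):
        if y is None:
            continue
        if prev_index is not None:
            segment = (points[prev_index], (x, y))
            if index - prev_index == 1:
                solid.append(segment)
            else:
                bridging.append(segment)  # at least one value was skipped
        prev_index = index
    return solid, bridging


# Hypothetical series with one missing value at x = 3.
series = [(1, 10), (2, 12), (3, None), (4, 18), (5, 20)]
solid, bridging = split_segments(series)
print("solid:", solid)
print("bridging:", bridging)
```

Drawing neither list gives the first option, drawing only the solid segments gives the second, drawing the bridging segments dotted gives the third, and drawing both lists in the same style along with the points gives the fourth.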