More questions than participants

This post is a companion to my Junk Charts post on why we can't trust the research that purportedly showed USA Today-style chartjunk to be "more useful" than plain Tuftian graphics. Here is an example of the two chart types they compared:

[Figure "Useful_junk1": an example pairing of a USA Today-style chartjunk graphic and its plain Tuftian counterpart]


In this post, I discuss how to read a paper such as this that describes a statistical experiment, and evaluate its validity.

***

First, note the sample size. They interviewed only 20 participants. This is the first big sign of trouble. Daniel Kahneman calls this the "law of small numbers": the fallacy of drawing general conclusions from the limited information in small samples. For a "painless" experiment of this sort, in which subjects are just asked to read a bunch of charts, there is no excuse for using such a small sample.

***

Next, tally up the research questions. At a minimum, the researchers claimed to have answered the following questions:

  1. Which chart type led to a better description of subject?
  2. Which chart type led to a better description of categories?
  3. Which chart type led to a better description of trend?
  4. Which chart type led to a better description of value message?
  5. Did chart type affect the total completion time of the description tasks?
  6. Which chart type led to a better immediate recall of subject?
  7. Which chart type led to a better immediate recall of categories?
  8. Which chart type led to a better immediate recall of trend?
  9. Which chart type led to a better immediate recall of value message?
  10. Which chart type led to a better long-term recall of subject?
  11. Which chart type led to a better long-term recall of categories?
  12. Which chart type led to a better long-term recall of trend?
  13. Which chart type led to a better long-term recall of value message?
  14. Which chart type led to more prompting during immediate recall of subject?
  15. Which chart type led to more prompting during immediate recall of categories?
  16. Which chart type led to more prompting during immediate recall of trend?
  17. Which chart type led to more prompting during immediate recall of value message?
  18. Which chart type led to more prompting during long-term recall of subject?
  19. Which chart type led to more prompting during long-term recall of categories?
  20. Which chart type led to more prompting during long-term recall of trend?
  21. Which chart type led to more prompting during long-term recall of value message?
  22. Which chart type did subjects prefer?
  23. Which chart type did subjects most enjoy?
  24. Which chart type did subjects find most attractive?
  25. Which chart type did subjects find easiest to describe?
  26. Which chart type did subjects find easiest to remember?
  27. Which chart type did subjects find easiest to remember details from?
  28. Which chart type did subjects find most accurate to describe?
  29. Which chart type did subjects find most accurate to remember?
  30. Which chart type did subjects find fastest to describe?
  31. Which chart type did subjects find fastest to remember?

I think I have made my point: there were more research questions than participants. Why is this bad?

Let's do a back-of-the-envelope calculation. First, think about any one of these research questions. For a statistically significant result, we would need roughly 15 of the 20 participants to pick one chart type over the other. Now, if the subjects had no preference for one chart type over the other, what is the chance that at least one of the 31 questions above will yield a statistically significant difference? The answer is about 50%! Ouch. In other words, the probability of one or more false positive results in this experiment is roughly 50%.

For those wanting to see some math: suppose I give you a fair coin for each of the 31 questions and ask you to flip each coin 20 times. What is the chance that at least one of these coins shows heads in at least 15 of its 20 flips? For any one fair coin, the chance of at least 15 heads in 20 flips is small (about 2%). But repeat this with 31 coins, and there is a 47% chance that at least one of them crosses that threshold! The probability of at least one such 2% event is 1 minus the probability that none of them occurs; and the probability of none is the chance that any given coin shows fewer than 15 heads (about 98%), multiplied across all 31 coins, which comes to about 53%.
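
For readers who would rather check these numbers by computer, here is a minimal Python sketch (the setup is mine, not the researchers'), assuming the 31 questions are independent and using the one-sided 15-out-of-20 cutoff described above:

```python
# Back-of-the-envelope check of the coin-flip argument above.
# Assumes 20 participants per question, 31 independent questions,
# and "at least 15 of 20 agree" as the significance threshold.
from math import comb

n_flips = 20       # participants
cutoff = 15        # threshold for a "significant" preference
n_questions = 31   # research questions

# Chance that a single fair coin shows at least 15 heads in 20 flips
p_single = sum(comb(n_flips, k) for k in range(cutoff, n_flips + 1)) / 2**n_flips

# Chance that at least one of the 31 coins crosses the threshold
p_any = 1 - (1 - p_single) ** n_questions

print(f"per-question false positive rate: {p_single:.1%}")    # roughly 2%
print(f"chance of at least one false positive: {p_any:.1%}")  # roughly 47%
```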

Technically, this is known as the "multiple comparisons" problem, and it is particularly acute when a small sample size is combined with a large number of hypotheses.
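
To see how quickly the problem compounds, the same arithmetic can be repeated for different numbers of questions (again assuming independence and the roughly 2% per-question rate from above):

```python
# How the chance of at least one false positive grows with the number
# of questions, holding the per-question rate (~2%) from above fixed.
p_single = 0.021  # chance of at least 15 of 20 "heads" on a single question

for n_questions in (1, 5, 10, 20, 31):
    p_any = 1 - (1 - p_single) ** n_questions
    print(f"{n_questions:>2} questions: {p_any:.0%} chance of a false positive")
```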

***

Another check is needed on the nature of the significance testing, which I defer to a future post.