Sunday, November 15, 2009

The base rate fallacy

I had reason to look up the base rate fallacy a few minutes ago. Here's a very nice example of what it is and why it is relevant to our real lives today. This is a total copy and paste job from Wikipedia, with some corrections:

In a city with 100 terrorists and 1,000,000 non-terrorists (total population: 1,000,100), there is a surveillance camera with automatic face recognition software. If the camera sees a known terrorist, it will ring a bell with 99% probability. If the camera sees a non-terrorist, it will trigger the alarm 1% of the time. So, the failure rate of the camera is always 1%.

Suppose somebody triggers the alarm. What is the chance he/she is really a terrorist?

Someone making the base rate fallacy would incorrectly claim that the false alarm rate must be 1 in 100 because the failure rate of the device is 1 in 100, and so he/she is 99% sure to be a terrorist if the device rings. The fallacy arises from the assumption that the device failure rate and the false alarm rate are equal.

This assumption is incorrect because the camera is far more likely to encounter non-terrorists than terrorists. (Paca: This is the key sentence in the whole thing. While the machine only falsely tags 1% of the people it sees, it sees non-terrorists relentlessly and only sees an actual terrorist once in a blue moon, giving it a chance to make an error on non-terrorists a lot more often than it has a chance to make an error on a real terrorist.) The higher frequency of non-terrorists increases the false alarm rate.

Imagine that all 1,000,100 people in the city pass in front of the camera. About 99 of the 100 terrorists will trigger a ring — and so will about 10,000 of the one million non-terrorists. Therefore the camera will claim that 10,099 people are terrorists, and only 99 of them are in fact terrorists -- despite the fact that the camera only fails 1% of the time. So, the probability that a person who triggers the alarm is actually a terrorist is 99 in 10,099 (about 1/102). (Paca: So, practically, the camera is almost always wrong.)
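If you'd rather check the counting than take it on faith, here is a minimal Python sketch of the same arithmetic. The variable names are mine, not Wikipedia's, and it assumes (as the example does) that every resident walks past the camera exactly once.

```python
# Numbers taken from the city example above.
terrorists = 100
non_terrorists = 1_000_000
hit_rate = 0.99          # P(alarm | terrorist)
false_alarm_rate = 0.01  # P(alarm | non-terrorist)

true_alarms = terrorists * hit_rate               # terrorists who ring the bell
false_alarms = non_terrorists * false_alarm_rate  # innocents who ring the bell
total_alarms = true_alarms + false_alarms

# Of everyone who rings the bell, what fraction is really a terrorist?
p_terrorist_given_alarm = true_alarms / total_alarms

print(f"true alarms:  {true_alarms:,.0f}")
print(f"false alarms: {false_alarms:,.0f}")
print(f"P(terrorist | alarm) = {p_terrorist_given_alarm:.4f}")
```

The last line comes out just under 0.01, i.e. roughly 1 alarm in 102 is a real terrorist.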

Ignoring the base rate only leads you badly astray when one group vastly outnumbers the other (non-terrorists over terrorists, or the reverse). In a city with about 50% terrorists and about 50% non-terrorists, the real probability of misidentification won't be far from the failure rate of the device.

Paca: The same is true for something like a medical test as well. Imagine a test which examines a bit of tissue and gives a breast cancer diagnosis. The machine is well made and only makes a mistake 1% of the time. However, breast cancer only occurs in about 13% of American women. This means that when the machine says "cancer," it will be wrong noticeably more often than 1 time in 100, because some of those positive results come from the much larger group of women who do NOT have breast cancer. Not because it's a poorly made machine, but because 87% of the women it sees do not have breast cancer.

If you are into the math (because I know you are), what we want to know is the probability of cancer given a positive diagnosis. This is P(Cancer | Diagnosis = Yes). What the machine's error rate actually tells us is (simplifying) the probability of a positive diagnosis given cancer, P(Diagnosis = Yes | Cancer), i.e., when cancer is present in the tissue sample, it correctly says yes 99% of the time. These are not the same thing.

P(Cancer | Diagnosis = Yes) does NOT = P(Diagnosis = Yes | Cancer)

P(Cancer | Diagnosis = Yes) DOES EQUAL** P(Diagnosis = Yes | Cancer) * P(Cancer)

That last term, the P(Cancer) is the probability that someone has cancer regardless of any medical diagnosis or other evidence. It's called the "base rate" or "prior probability".
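For anyone who wants the denominator too, the full statement is Bayes' rule. The denominator is just the total probability of getting a positive diagnosis at all, whether it comes from a woman with cancer or from a woman without. The "No cancer" terms are my notation for the false positive case, and I'm assuming the machine's 1% error rate applies there as well.

```latex
P(\text{Cancer} \mid \text{Diagnosis} = \text{Yes})
  = \frac{P(\text{Diagnosis} = \text{Yes} \mid \text{Cancer}) \, P(\text{Cancer})}
         {P(\text{Diagnosis} = \text{Yes} \mid \text{Cancer}) \, P(\text{Cancer})
          + P(\text{Diagnosis} = \text{Yes} \mid \text{No cancer}) \, P(\text{No cancer})}
```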

The base rate fallacy is also inherent in funny statistical claims such as: Did you know that 99% of murderers eat bread?! It may be true, but the base rate of any person eating bread is probably also around 99%.

Returning to the cancer diagnosis machine, let's say we have a machine that tests 1,000 American women for breast cancer. Because the overall base rate of breast cancer is approximately 13%, we know, before any tests are done, that 130 of these women (13% * 1,000 women) will have breast cancer while 870 will not. How will the machine do? Let's assume it's still got an error rate of only 1%. So it will falsely say that 1% of women who do not have breast cancer do in fact have breast cancer, which is 1% * 870 or about 9 women. It will also get 99% of the women who really do have breast cancer, or 99% * 130 = about 129 women. So, it will claim that about 138 women have breast cancer when only 129 actually do. Roughly 6% of its claims of breast cancer will be false positives, about six times the machine's error rate, despite only making the wrong actual diagnosis 1% of the time.

This is the case with a disease that's fairly common. Over 1 in 10 women are likely to get breast cancer in their lives. The problem only compounds as the disease gets more rare.
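To see how quickly things get worse as the condition gets rarer, here is a small Python sketch. The function name and its parameters are made up for illustration, and it assumes the same 1% error rate in both directions (99% of real cases flagged, 1% of healthy cases flagged).

```python
def p_condition_given_positive(base_rate, sensitivity=0.99, false_positive_rate=0.01):
    """Bayes' rule: probability the condition is really there, given a positive test."""
    true_positives = sensitivity * base_rate
    false_positives = false_positive_rate * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# From a coin-flip base rate down to something as rare as the terrorists in the city.
for base_rate in (0.5, 0.13, 0.01, 0.001, 0.0001):
    posterior = p_condition_given_positive(base_rate)
    print(f"base rate {base_rate:>7.2%}: "
          f"P(condition | positive) = {posterior:6.1%}, "
          f"false share of positives = {1 - posterior:6.1%}")
```

With a 50% base rate the machine's positives are right 99% of the time, which matches its advertised accuracy. Once the base rate drops to 1%, a positive result is only a coin flip.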

**Technically, proportional to, since I'm ignoring the denominator.

1 comment:

pjd said...

Once again, you've written something that I couldn't put down. (OK, I wasn't really holding it since it's on my PC screen, but you get the point.) I'm not sure how this affects my daily life except in a general media literacy sense (i.e. if I can detect BS statistics in the average FOX News report with 99% accuracy, then I only get false positives... um, there would be no false positives in that example, I guess) but it is fascinating, well written, and a highly accessible discussion of a reasonably mind-twisting topic.

Anyway, I hope your NaNo novel is as gripping as most of your blog posts. If so, you are headed for much success.