I’ve been learning a lot about logistic and linear regression in biostats over the last few weeks. For my MPH degree, I had to learn about these concepts as well, but being away for six years and not having to use them in epidemiological investigations kind of eroded my knowledge of how to do them. The concept of regression is easy. Based on a bunch of data, you estimate what the value of Y will be for a given X. The easiest example is the relationship between Fahrenheit and Celsius temperatures. You know that water freezes at 0 degrees C, which is equal to 32 degrees F. You also know that water boils at 100 degrees C, which is equal to 212 degrees F. And you know that the relationship is “linear” because you can draw a straight line between 32 and 212 (or 0 and 100).
As you can see, the relationship is linear. It’s neat. All the points fall along the line. If I told you a value in Celsius, you could give me the Fahrenheit value based on the chart without knowing the mathematical formula for it. That is, you can use the predictor X to give me an estimate of Y.
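For anyone who would rather see the line as numbers than as a chart, here is a minimal sketch in Python. It builds the line from nothing but the two points mentioned above (freezing and boiling); the variable and function names are my own, made up for illustration.

```python
# Two known points on the line, as (Celsius, Fahrenheit) pairs.
freezing = (0.0, 32.0)
boiling = (100.0, 212.0)

# Slope is rise over run; the intercept is the Fahrenheit value at 0 degrees C.
slope = (boiling[1] - freezing[1]) / (boiling[0] - freezing[0])
intercept = freezing[1] - slope * freezing[0]


def celsius_to_fahrenheit(c):
    """Estimate Y (Fahrenheit) from the predictor X (Celsius) using the line."""
    return slope * c + intercept


# Body temperature, 37 degrees C, should land at about 98.6 degrees F.
print(slope, intercept, celsius_to_fahrenheit(37.0))
```

Because every point sits exactly on the line, there is no error here at all, which is what makes this the easy case.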
Not everything has such a perfect linear relationship, however. Although you can still use linear regression to get those estimated (or expected) Ys, there is some “error”, some variability that will make Y only an estimate, a best guess. Consider this group of data describing the linear relationship between life satisfaction and gross domestic product:
If I told you a country’s gross domestic product per person, in dollars, you could use the linear regression and estimate the life satisfaction score for that country. (The line comes out straight because GDP is plotted on a log scale. That is not the same thing as logistic regression, which is for yes-or-no outcomes.) However, as you can see, you would be off a bit if you used it for, say, Mexico or Brazil. The points for those two countries and several others are off from the line. That distance from the line is error. But the line was drawn to be the best fitting line: the one that makes the squared distances between the points and the line add up to the smallest possible total, so that the misses above the line and the misses below it cancel out on average.
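Here is a rough sketch of how a best fitting line and its errors can be calculated, assuming ordinary least squares on the log of GDP. The numbers are made up by me purely for illustration. They are not the real country data, and the variable names are hypothetical too.

```python
import math

# MADE-UP numbers for illustration only; these are not real country data.
gdp_per_person = [1000, 3000, 8000, 15000, 30000, 50000]
life_satisfaction = [4.1, 4.6, 6.3, 5.9, 6.8, 7.4]

# Put GDP on a log scale first, then fit an ordinary straight line.
x = [math.log10(g) for g in gdp_per_person]
y = life_satisfaction
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# The "error" for each point: observed Y minus the Y the line predicts.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sum_squared_error(b0, b1):
    """Total squared distance between the points and the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


print("slope:", slope, "intercept:", intercept)
print("residuals:", residuals)
print("sum of residuals (about zero):", sum(residuals))
print("squared error, fitted line:", sum_squared_error(intercept, slope))
print("squared error, nudged line:", sum_squared_error(intercept, slope + 0.1))
```

The residuals add up to about zero, and nudging the slope away from the fitted value makes the total squared error go up. That is all “best fitting” means.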
We humans think the same way. We see a cluster of data and think that there is a pattern there. Our brains then draw a line and we think we can estimate Y based on X, even when X and Y have absolutely nothing to do with each other. Like this:
If you believe in your heart that organic food contributes to autism, you’ll look at that graph up there and be convinced that this is incontrovertible evidence of your belief. As you know, anti-vaccine people think that there is a causative relationship when they see that the number of vaccines has increased across time and so have the diagnoses of autism. They won’t listen to evidence that shows that we get fewer antigens per vaccine and that modifications in the diagnosis requirements of autism have also made it easier to detect. The kid who was once “weird” or “touched” is now correctly diagnosed as autistic, opening the door to services and treatments.
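To see how easy it is to get a strong correlation out of two things that have nothing to do with each other, here is a small sketch in Python. Both series are invented by me for illustration (they are not the organic food or autism figures), and the only thing they share is that they go up over time.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# MADE-UP series for illustration only: two unrelated quantities that both
# happen to rise steadily over ten years, each with a little wobble.
series_a = [10 + 2.0 * i + (0.5 if i % 2 else -0.5) for i in range(10)]
series_b = [300 + 15.0 * i + (4 if i % 3 == 0 else -2) for i in range(10)]

r = pearson_r(series_a, series_b)
print("correlation:", r)
```

The correlation comes out very close to 1, even though neither series was built from the other. Time is driving both, and the correlation by itself cannot tell you that.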
As the post about organic food sales and autism states: correlation does not equal causation. Nevertheless, people keep pushing the issue. Heck, I’ve even done it sometimes when I can’t find an explanation for something. I see two things that correlate and go, “Sure, I guess that’s it!” But I try very hard to understand what is going on and come to a logical conclusion. Not everyone does this, and it can be scary.