21 things I learned from my biostats midterm exam

If you’ll indulge me, I’d like to write down some things I learned from my biostats exam so that I can learn from my errors, memorize some concepts (and actually learn them, too), and get ready for next week’s final exam.

#21 – Read the question, twice if you need to. More than half of my wrong answers came from not reading the question thoroughly. It took me about an hour to finish the exam when two hours were allowed to do it. That means I could have easily gone back and reviewed my answers. So that’s what I’m doing from here on out, reading the question, understanding it, writing out the linear model, and then reading the question again. It also doesn’t hurt to read the answers thoroughly. I do not want to be these guys:

http://youtu.be/0Urjr2RCwis

#20 – In a simple linear regression model Y = ß0 + ß1(X-30) + e, the interpretation of ß0 is the average Y for persons whose X = 30. That is, setting X = 30 makes ß1 times (X-30) equal zero, leaving ß0 as the expected value of Y.
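To convince myself of this, here is a minimal Python sketch. Python is my own choice here, the numbers are made up purely for illustration, and `fit_line` is a hypothetical hand-rolled least squares helper rather than anything from the course. It fits the line twice, once on raw X and once on X - 30, and compares the centered intercept with the fitted Y at X = 30.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data: X could be age in years, Y some measurement.
ages = [20, 25, 30, 35, 40, 45]
ys = [110.0, 114.0, 121.0, 123.0, 130.0, 133.0]

b0_raw, b1_raw = fit_line(ages, ys)
b0_centered, b1_centered = fit_line([x - 30 for x in ages], ys)

# Fitted Y at X = 30 from the uncentered model.
predicted_at_30 = b0_raw + b1_raw * 30

print("slopes:", b1_raw, b1_centered)
print("centered intercept vs. fitted Y at X = 30:", b0_centered, predicted_at_30)
```

Centering leaves the slope alone and only changes what the intercept means.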

#19 – “Odds” equals the probability of an event divided by the probability of a non-event. For example, if I have ten people and six come down with an illness, the odds of the illness are 6/4, or 1.5. An “odds ratio”, on the other hand, is the odds of an event in one group divided by the odds of an event in another group. If I have another group of 12 people, and six come down with an illness, then their odds are 6/6, or 1.0, and the odds ratio comparing the ten-person group to the twelve-person group then becomes 1.5 divided by 1.0, which is 1.5. That is, the people in the ten-person group have 50% greater odds of coming down with the illness. So “odds” and “odds ratio” are different things, and they are represented differently when you do math.
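The arithmetic above is small enough to check in a few lines of Python. This is just a sketch using the two groups from the example, and the helper name `odds` is my own.

```python
def odds(events, total):
    """Odds = P(event) / P(no event)."""
    p = events / total
    return p / (1 - p)


odds_ten = odds(6, 10)      # 0.6 / 0.4
odds_twelve = odds(6, 12)   # 0.5 / 0.5
odds_ratio = odds_ten / odds_twelve

print(round(odds_ten, 2), round(odds_twelve, 2), round(odds_ratio, 2))
```

Writing `odds` as its own function keeps the two ideas separate: one function call per group gives odds, and dividing two of those gives the odds ratio.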

#18 – The GLM “family” used in Stata for a model where the outcome is binary is, oddly enough, “binomial”. (I’m just kidding about it being “odd”.)

#17.1 – “The sums of squares within groups (SSW) is based on a weighted average (pooling) of the within-group sample variance across the p groups.” Yeah, I knew that.

#16.1 – The aim of multiple comparisons correction for pairwise comparisons of group means is to “control for the probability of finding a statistically significant difference between any of the pairs of means when no true differences exist.”

#17.2 – A correlation coefficient of -0.3 indicates that the slope is negative. That is, the regression line will go down and to the right… Down and to the right. If you square that -0.3, you get 0.09, which tells you that only about 9% of the variability in Y is accounted for by the predictor, so the points on the scatterplot are not going to be that close to the regression line.
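Here is a rough Python sketch of that relationship, using made-up data (not the exam's, and not tuned to give exactly -0.3). It computes the correlation coefficient and the least squares slope by hand, then compares the squared correlation with the share of variability in Y that the line explains.

```python
import math

# Made-up data with a weak downward trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.0, 9.0, 3.0, 8.0, 4.0, 7.0, 2.0, 6.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)       # sample correlation coefficient
slope = sxy / sxx                    # least squares slope
intercept = y_bar - slope * x_bar

# Share of the variability in Y explained by the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

print("r:", round(r, 3), "slope:", round(slope, 3))
print("r squared:", round(r ** 2, 3), "explained share:", round(r_squared, 3))
```

The correlation and the slope share the same numerator (`sxy`), which is why they always have the same sign.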

#16.2 – Again, odds is the probability that something happens divided by the probability that it doesn’t happen. Now, if you’re given proportions, then it’s the proportion of it happening divided by the proportion of it not happening. If you think it’s silly that I repeat this, it’s not. It seems to be a continuing issue with me that I get all jumbled up in the language. I could blame it on not being a native English speaker, but I’ve been speaking English far too long to not know better.

#15 – In the evaluation of an odds ratio, a 95% confidence interval that does not include 1.0 points to statistical significance at the 0.05 level, because 1.0 is the odds ratio you’d get if the odds were the same in both groups. The interval itself means that you are 95% confident that the true odds ratio in the population you’re evaluating lies within it.
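As an illustration, here is a Python sketch that puts a 95% confidence interval around the odds ratio from the two groups in #19. It assumes the common log-based (Woolf) approximation, which is my choice and may not be the method used in class. The function name `odds_ratio_ci` is hypothetical, and with groups this small the approximation is rough.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b = events and non-events in group 1; c, d = the same for group 2.

    Returns (odds ratio, lower limit, upper limit) using the log method.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Ten-person group: 6 ill, 4 not. Twelve-person group: 6 ill, 6 not.
or_hat, lower, upper = odds_ratio_ci(6, 4, 6, 6)
significant = not (lower <= 1.0 <= upper)

print("OR:", round(or_hat, 2), "95% CI:", round(lower, 2), "to", round(upper, 2))
print("statistically significant at 0.05:", significant)
```

With so few people the interval is wide and straddles 1.0, so an odds ratio of 1.5 here would not count as statistically significant.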

#14 – Unlike the standard error for hypothesis testing, where you use a pooled proportion, the standard error for a 95% confidence interval for the difference in proportions between one group and the other is built from each group separately. Take the proportion of events in the first group, multiply it by the proportion of non-events in that same group, and divide by the number in the first group. Do the same for the second group, add the two results together, and then take the square root of that sum. Trust me when I tell you that it makes sense mathematically.
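Since that is a mouthful in words, here is a small Python sketch of both standard errors side by side. The function names are mine, and the counts are borrowed from the #19 example (6 of 10 versus 6 of 12) only for illustration. The 1.96 multiplier assumes the usual normal approximation, which is shaky for groups this small.

```python
import math


def se_unpooled(x1, n1, x2, n2):
    """Standard error for the confidence interval of p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def se_pooled(x1, n1, x2, n2):
    """Standard error under the null hypothesis that p1 = p2."""
    p = (x1 + x2) / (n1 + n2)
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


diff = 6 / 10 - 6 / 12
se_ci = se_unpooled(6, 10, 6, 12)
se_test = se_pooled(6, 10, 6, 12)
lower = diff - 1.96 * se_ci
upper = diff + 1.96 * se_ci

print("difference:", round(diff, 3))
print("unpooled SE:", round(se_ci, 4), "pooled SE:", round(se_test, 4))
print("95% CI:", round(lower, 3), "to", round(upper, 3))
```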

#13 – Simple linear regression is NOT appropriate for binary outcomes. For binary outcomes (Yes/No, Male/Female), you use logistic regression.

#12 – If Bartlett’s test for equal variances leads you to reject the null hypothesis, meaning that the variances among the groups are different, DO NOT do a one-way analysis of variance (ANOVA).

#11 – The square root of the mean square within groups (MSW) is the estimate of the residual population standard deviation in a one-way ANOVA, assuming equal group variances.
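A quick Python sketch, with three made-up groups, shows how MSW comes out of the within-group sums of squares and why it is the same thing as pooling the group variances (each weighted by its group size minus one). The data and variable names are my own assumptions.

```python
import math
import statistics

# Three made-up groups of observations.
groups = [
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 9.0, 8.0],
    [3.0, 2.0, 4.0, 3.0, 3.0],
]

n_total = sum(len(g) for g in groups)
p = len(groups)

# Sum of squares within groups: squared deviations from each group's own mean.
ssw = sum(sum((y - statistics.mean(g)) ** 2 for y in g) for g in groups)
msw = ssw / (n_total - p)

# Same quantity as a weighted average of the group sample variances.
pooled_variance = sum(
    (len(g) - 1) * statistics.variance(g) for g in groups
) / (n_total - p)

residual_sd = math.sqrt(msw)

print("MSW:", round(msw, 4), "pooled variance:", round(pooled_variance, 4))
print("estimated residual SD:", round(residual_sd, 4))
```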

#10 – If you have four cats, you need three dummies. (This only makes sense to me, but, trust me, it makes sense.)
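In case the joke needs unpacking: a predictor with four categories needs three dummy (0/1) variables, with the fourth category serving as the reference. Here is a minimal Python sketch. The `dummy_code` helper and the blood type example are hypothetical, chosen only to show the pattern.

```python
def dummy_code(values, reference):
    """Return (dummy column names, rows of 0/1 indicators).

    The reference category gets no column of its own; it is the row of all zeros.
    """
    levels = [lvl for lvl in sorted(set(values)) if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows


blood_types = ["A", "B", "AB", "O", "A", "O"]
levels, rows = dummy_code(blood_types, reference="O")

print("dummy columns:", levels)
for value, row in zip(blood_types, rows):
    print(value, row)
```

Four categories, three columns; anyone in the reference category shows up as zeros across the board.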

#9 – For every one-unit change in X, you get a ß change in the expected Y. So, change X by 5 units, and your difference in expected Y is 5*ß.

#8 – The sample correlation coefficient tells you the direction of the line (upward or downward) and how closely the predictor predicts Y.

#7 – If you multiply the coefficient by any number (A) to get the change in Y for an A-unit change in X, you can also multiply the confidence interval’s upper and lower limits by A to get the confidence interval for that change. (If A is negative, the two limits swap places.)
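In Python, the scaling looks something like the sketch below. The function name `scale_estimate` and the numbers (a slope of 2.0 with a 95% confidence interval of 0.5 to 3.5) are invented for illustration.

```python
def scale_estimate(beta, ci_lower, ci_upper, a):
    """Scale a coefficient and its confidence limits for an a-unit change in X."""
    scaled_limits = (a * ci_lower, a * ci_upper)
    # min/max keeps the limits in order even when a is negative.
    return a * beta, min(scaled_limits), max(scaled_limits)


five_units = scale_estimate(2.0, 0.5, 3.5, 5)
minus_two_units = scale_estimate(2.0, 0.5, 3.5, -2)

print("5-unit change:", five_units)
print("-2-unit change:", minus_two_units)
```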

#6 – In simple linear regression, the sample correlation coefficient is the square root of R-square (carrying the same sign as the slope), and it tells us how strong the linear correlation is between Y and X.

#5 – The acronym L.I.N.E. can be used to examine a regression’s assumptions: Linearity, Independence, Normality, and Equal variance. The N in “LINE” stands for normal. That is, you have to assume that there is a normal distribution of Y for every value of X.

#4 – I should really use graphical representations of the formulae to help me. I’m a visual person and these word problems are better for me if I can visualize them:

[Image: hand-made graph. Caption: “I did this!”]

I should use graphs like the one above to look at residuals and calculate which value of Y will have the greatest/smallest residual against the fitted line.

#3 – In the estimation of the least squares slope, the value of X furthest from the mean will have the most weight.
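One way to see this is to write the slope as a weighted sum of the Y values, where each point's weight is its X distance from the mean divided by the total sum of squares in X. Here is a Python sketch with made-up data; the variable names are my own.

```python
# Made-up data where the last X sits far from the others.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [2.0, 3.0, 5.0, 4.0, 9.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)

# Each Y's weight in the slope estimate.
weights = [(x - x_bar) / sxx for x in xs]
slope_from_weights = sum(w * y for w, y in zip(weights, ys))

# The usual least squares formula, for comparison.
slope_usual = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx

# Index of the point carrying the largest weight in absolute value.
heaviest = max(range(n), key=lambda i: abs(weights[i]))

print("weights:", [round(w, 3) for w in weights])
print("slope:", round(slope_from_weights, 4), round(slope_usual, 4))
print("X with the most weight:", xs[heaviest])
```

A point sitting right at the mean of X gets a weight of zero, so it has no say in the slope at all.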

#2 – If you have a dichotomized X (with values 1 or 0), then the sum of ß0 and ß1 will equal the expected Y among 1’s based on the model Y = ß0 + ß1X. (In other words, write out the damn model!)
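Writing out the model in Python makes the same point. This is a sketch with made-up data and a hypothetical hand-rolled `fit_line` helper. With X coded 0 or 1, the fitted ß0 is the mean of Y among the 0's and ß0 + ß1 is the mean of Y among the 1's.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data: X = 0 for one group, X = 1 for the other.
xs = [0, 0, 0, 0, 1, 1, 1]
ys = [10.0, 12.0, 11.0, 13.0, 15.0, 17.0, 16.0]

b0, b1 = fit_line(xs, ys)

mean_zeros = sum(y for x, y in zip(xs, ys) if x == 0) / xs.count(0)
mean_ones = sum(y for x, y in zip(xs, ys) if x == 1) / xs.count(1)

print("b0:", round(b0, 4), "mean of Y among 0's:", mean_zeros)
print("b0 + b1:", round(b0 + b1, 4), "mean of Y among 1's:", mean_ones)
```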

#1 – Reject the null hypothesis that the slopes in the regression are all equal to 0 if and only if the F-statistic has a p-value of less than 0.05 (the usual significance level). Otherwise, fail to reject it. That doesn’t prove that your X’s have 0 slopes, but it does mean you have no evidence that any of them have an influence on the prediction of Y.
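For reference, here is a Python sketch of where the F-statistic comes from, using made-up data and a single predictor (k = 1). It stops at the statistic itself, because the standard library has no F distribution. Getting the p-value would need a statistics package, for example `scipy.stats.f.sf(f_stat, k, n - k - 1)` if SciPy happens to be installed, which I am not assuming here.

```python
# Made-up data with one predictor.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(xs)
k = 1  # number of predictors (slopes being tested)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total
ssr = sum((f - y_bar) ** 2 for f in fitted)             # explained by the model
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # left over (residual)

# F = (explained variation per slope) / (residual variation per degree of freedom)
f_stat = (ssr / k) / (sse / (n - k - 1))

print("SST:", round(sst, 4), "SSR:", round(ssr, 4), "SSE:", round(sse, 4))
print("F statistic:", round(f_stat, 2), "on", k, "and", n - k - 1, "degrees of freedom")
```

A large F means the model explains far more variation than is left in the residuals; whether it is large enough is what the p-value decides.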

Final exam is next week. I’m not at all ready, but I will be… And I will give it my all.

I'm a doctoral candidate in the Doctor of Public Health program at the Johns Hopkins University Bloomberg School of Public Health. All opinions posted here are my own, of course, and they do not necessarily reflect the opinions of my school, employers, friends, family, etc. Feel free to follow me on Twitter: @EpiRen