The Pandemics You Can Try to Stop

In mid-April 2009, I noticed that the influenza season in Mexico had not ended the way it should. I contacted friends and acquaintances in Mexico City and other parts of the country to ask them what they were seeing. Epidemiologist colleagues told me that they had noticed the same thing. A season that should have been trending down was not doing so, and local hospitals and clinics were being overrun by people with respiratory symptoms.

That night, I emailed my fiancee (now my wife) and told her that I had a bad feeling about what was going on. I told her that it made no sense for Mexico to still be seeing their season being so prolonged. The US season was winding down, though. I remember writing to her that I hoped I was wrong.

After that whole thing was done and over with in 2010 (because it swung around and hit us twice in 2009), I wondered if there was something that we could have done to stop it. By “we,” I mean us in public health. At the time, I was just a freshman epidemiologist with a couple of years of experience. In ten years time, I would be a doctor of public health, and we would be facing another pandemic… The next pandemic.

Can pandemics be stopped? In one word: Yes. But it is very complicated. For example, the HIV/AIDS pandemic could have been stopped if enough resources had been put into place the minute it was identified as a sexually-transmitted infection. HIV was infecting the “undesirables,” though, and enough leaders (religious and political) were calling it a godsend to get rid of said undesirables. It was a punishment from above, not the continuation of a zoonosis that had started decades earlier.

Respiratory infections are a whole other animal, though, especially the ones with relatively (RELATIVELY) low fatality rates. Those with higher rates, like Ebola, kill the hosts before the infection is spread too widely. (Global travel is challenging that, however.) Those that infect you, incubate, and then attack but leave you well enough to have you go to school or work have the ability to really cause disruptions.

I mean, look out the window right now if you’re March 2020.

Bacterial pandemics, like the ones that cholera has caused, are mostly under control by our ability to deploy control measures (clean water and vaccines) and antibiotics. But we’re also entering a sort of “post-antibiotic” era where bacteria are evolving faster than we can make antibiotics against them. So future bacterial pandemics will also require control measures that are not pharmaceutical in nature.

What we can do — and we should do at the end of this pandemic — is have a foolproof, well-researched, practiced yearly, top-of-the-line pandemic preparedness plan that spans the entire spectrum of everything we know has happened and could happen. From what a person will do in their everyday life to what small businesses will do when left with no workers and no employees, to what big groups and organizations will do to keep the disease from spreading. We can’t go blind into the next one — like we did into this one — because the next one could be the big one.

Then there are the epidemics and pandemics of non-communicable diseases, like obesity, diabetes and opioid use/abuse. Those are going to be super-difficult to figure out, perhaps more difficult than infectious disease. This is because we are social animals who’ve managed to separate into tribes and social strata. If something is happening to “them” and not “us,” and it will stay “over there” and not come “here,” we kind of look the other way.

Here’s an example… In Philadelphia, like in other cities in the United States, there is an epidemic of opioid use and opioid overdoses going on. Many of the people using and abusing opioids are using heroin, an injected opioid. (You can also smoke it, by the way.) When people inject heroin and other drugs, their risk for blood borne infections skyrockets. They share needles or trade drugs for sex (that is performed unsafely), and they get infections with Hepatitis B, Hepatitis C and HIV.

Look at what is happening in Minnesota.

Look at what happened in Indiana, that bastion of public health.

No doubt, Philadelphia is lining up to be the next epicenter of both overdose and blood born infections… If it isn’t already. To counter this, city health officials and health leaders have proposed a safe injection site. In a safe injection site, the user goes in, gets a clean needle and a place to rest. They get medical supervision while they use their drug of choice. Should something go sideways, they get immediate medical attention.

But drug addiction — against all evidence — is thought by many to be something that happens to “them,” the “others,” the undesirables. It doesn’t happen to “us,” the clean people, the God-fearing people. And if something is to be done for “those poor people,” it better not be done in our back yard, or my neighborhood, or anywhere that could possibly make me think that help is happening at all.

On Monday, March 16, 2020, supporters and detractors of a safe injection site in Philadelphia came together to give their opinion on a bill that would ban such help for “those people.” As you can imagine, the discussion was lively, including some gems like:

“Why would we want to be the first to experiment on this?… “It makes no sense whatsoever. I’m full of compassion for [people suffering from drug addiction], but I’m more full of compassion for my residents and all the residents symbolized by these civic associations.”

There are no safe injection sites in the United States, but there are plenty in other parts of the world. Those other sites have shown success in reducing overdoses and in guiding users into recovery programs. On top of receiving clean needles and medical supervision, they also are referred to care, and many of them take it. Some place in the United States, a place where these programs are needed, is going to be the first place, the “experiment.”

But the comments did not stop there.

Capozzi’s sentiment was echoed in the testimony of South Philly resident Anthony Giordano, who represented a community group called Stand Up South Philly and Take Our Streets Back.
“Safe injection sites are not safe,” he said. “Allowing people to consume illegal drugs of unknown composition in a so-called medical facility is beyond my comprehension. How is this safe? Helping people further harm themselves under the guise of a legitimate medical intervention just doesn’t make any sense.”

Some people tried to use science and reason:

“It amazes me that we’re sitting here talking about making a medical decision and we’re listening to public opinion,” she said. “We need to make this based on information like Dr. Farley suggested: medical consensus, meta-analyses and a medical opinion.”
Milas was unique among the four medical professionals because the opioid crisis had affected her a bit more personally. She had two sons die of opioid overdoses – one was 27 and other 31, she said.
“At the 100 legal supervised injection sites worldwide, there are no recorded deaths,” she testified. “Had my sons overdosed at a Safehouse-type facility, they would have had a 100 percent chance of survival.”
Roth piled on.
“The scientific evidence from peer-reviewed journals on these sites is clear,” she testified. “They reduce overdose mortality rates, HIV, environmental hepatitis risk, they improve access to health and social services, they help reduce substance use and help people enroll in treatment. Furthermore, they’ve helped improve community health and safety. In neighborhoods where a [safe injection site] exists, there are actually reductions in public injection and improperly discarded syringes, reductions in drug-related crime, and the demand for ambulance services for opioid-related overdoses goes down.”

You can read the rest of the South Philly Review article to see how one of the legislators used a very flawed non-scientific “study” to support his claims that safe injection sites are absolute evil. That’s where scientific discourse in public policy has gone… To unsubstantiated and flawed opinion surveys.

As I’ve stated before, several times, public health in the United States and in much of the world is all about politics. You better pray that the right political party is in power, or the right people are in power, so that the decisions that need to be made are informed by evidence and science more than the “what ifs” of public opinion. This is Democracy getting in the way of things, unfortunately.

As we saw in Wuhan, China, when authorities there felt the need to shut down cities, they did so without any apparent issues. (There might have been issues, but we’ll be darned if we ever find out.) That’s an authoritarian government for you in a very collectivist society. Can you imagine trying to shut down even a small town here in the United States? With people with guns? And SUVs?

Good luck.

So, yeah, we might not be able to stop this pandemic, or the next one. After all this, I’m going to focus on having what I call “premier” surveillance systems and response plans. We’re going to learn a lot from this pandemic, and I plan to make it my life’s work (on top of all of my other work) to make sure we don’t forget about this time, next time.

Until next time… Thanks for your time.

Local Indicators of Spatial Association in Homicides in Baltimore, 2019

Last year was yet another record year for homicides in Baltimore with 348 reported homicides for the year. While this number was the second-highest in terms of total homicides, it was probably the highest in terms of rate per 100,000 residents since Baltimore has been seeing a decrease in population. (The previous record was in 2015 and 2017.) The epidemic of violence and homicides that began in 2015 continues today.

When we epidemiologists analyze data, we need to be mindful of the spatial associations we see. The first law of geography states that “everything is related to everything else, but near things are more related than distant things.” As a result, clusters of cases of a disease or condition — or an outcome such as homicides — may be more a factor of similar things happening in close proximity to each other than an actual cluster.

To study this phenomenon and account for it in our analyses, we use different techniques to assess spatial autocorrelation and spatial dependence. In this blog post, I will guide you through doing a LISA (Local Indicators of Spatial Association) analysis of homicides in Baltimore City in 2019 using R programming.

First, The Data

Data on homicides was extracted from the Baltimore City Open Data portal. Under the “Public Safety” category, we find a data set titled “BPD Part 1 Victim Based Crime Data.” In those data, I filtered the observations to be only homicides and only those occurring/reported in 2019. This yielded 348 records, corresponding to the number reported for the year.

Data on Community Statistical Areas (agglomerations of neighborhoods based on similar neighborhood characteristics came from the Baltimore Neighborhood Indicators Alliance (BNIA). They are a very good source of data on Baltimore with regards to social and demographic indicators. From their open data portal, I extracted the shapefile of CSAs in the city.

Next, The R Code

First, I load the libraries that I’m going to use for this project. (I’ve tried to annotate the code as much as possible.)


Next, I load the data.

csas <-
          "Vital_Signs_17_Census_Demographics") # Baltimore Community Statistical Areas
jail <-
          "community_statistical_area") # For the central jail
pro <-
  sp::CRS("+proj=longlat +datum=WGS84 +unit=us-ft") # Projection
csas <-
                  pro) # Setting the projection for the CSAs shapefile
homicides <- read.csv("baltimore_homicides.csv",
                      stringsAsFactors = F) # Homicides in Baltimore
homicides <-, function(x)
    x[!$Latitude)])) # Remove crimes without spatial data (i.e. addresses)

Note that I had to do some wrangling here. First, there is an area on the Baltimore Map that corresponds to a large jail complex in the center of the city. So I added a second map layer called jail to the data. I also corrected the projection on the layers. (This helps make sure that the points and polygons line up correctly since you’re overlaying a flat piece of data on a spherical part of the planet.) Finally, I took out homicides for which there was no spatial data. This sometimes happens when the true location of a homicide cannot be determined… Or when there is a lag/gap in data entry.

Next, I’m going to convert the homicides data frame into a spatial points file.

coordinates(homicides) <-
  ~ Longitude + Latitude # Makes the homicides CSV file into a spatial points file
proj4string(homicides) <-
  CRS("+proj=longlat +datum=WGS84 +unit=us-ft") # Give the right projection to the homicides spatial points

Note that I am telling R which variables in homicides corresponds to the longitude and latitude variables. Next, as I did before, I corrected the projection to match that of the CSAs layer.

Now, let’s join the points (homicides) to the polygons (csas).

homicides_csas <-,
                                csas) # Gives each homicide the attributes of the CSA it happened in
homicide.counts <- # Creates the table to see how many homicides in each CSA
counts <-
  homicide.counts %>%
  group_by(CSA2010, tpop10) %>%
  summarise(homicides = n()) # Determines the number of homicides in each CSA.
counts$tpop10 <-
  as.numeric(levels(counts$tpop10))[counts$tpop10] # Fixes tpop10 not being numeric
counts$rate <-
  (counts$homicides / counts$tpop10) * 100000 # Calculates the homicide rate per 100,000
map.counts <- geo_join(csas,
                       how = "left") # Joins the counts to the CSA shapefile
map.counts$homicides[$homicides)] <-
  0 # Turns all the NA's homicides into zeroes
map.counts$rate[$rate)] <-
  0 # Turns all the NA's rates into zeroes
jail <-
  jail[jail$NEIG55ID == 93, ] # Subsets only the jail from the CSA shapefile

Here, I also created a table to see how many points fell inside the polygons. I allowed me to look at how many homicides were in each of the CSAs. Here are the top five:

Community Statistical AreaHomicide Count
Southwest Baltimore29
Sandtown-Winchester/Harlem Park25
Greater Rosemont21
Greenmount East15

Of course, case counts alone are not giving us the whole information. We need to account for the differences in population numbers. For that, you’ll see that I created a rate variable. Here are the top five CSAs by homicide rate per 100,000 residents:

Community Statistical AreaHomicide Rate per 100,000 Residents
Greenmount East191
Sandtown-Winchester/Harlem Park189
Southwest Baltimore172
Madison/East End167

As you can see, the ranking based on homicide rates is different, but it is more informative. It accounts for the fact that you may see more homicides in a CSA simply because that CSA has more people in it. Now we have the rates. I then joined the counts and rates to the csas shapefile. This is what we’ll use to map. I also made sure that all the NA’s in the rates and counts are zeroes. Finally, I used only the CSA labeled “93” for the jail. This, again, is to make sure that the map represents that area and we know that it is not a place where people live.

The Maps

Look at the following simple plot of homicide locations in Baltimore:

Notice Something?

This is not a map. This is just a plot of the homicide locations. The latitude is on the Y axis and the longitude on the X axis. If you notice that there is some clustering of cases, you would be correct. Point-based cluster analysis can also be done on these data alone, but that is for a different blog post at a later time. (And you have to take other things into consideration.)

Next, I’m going to create a map of the points above (using the tmap package) but also show the boundaries of the CSAs. This will look more like a map and show you the spatial distribution of the homicide cases in context. You’ll see why there are blank spaces in some parts.

Here is the code:

tmap_mode("plot") # Set tmap to plot. To make interactive, use "view" <-
  tm_shape(map.counts) +
  tm_borders(col = "black",
             lwd = 0.5) +
          size = 1) +
  tm_shape(homicides) +
  tm_dots(col = "red",
          title = "2019 Homicides",
          size = 0.2, ) +
  tm_compass() +
    main.title = "Map of Homicides and Homicide Rate in Baltimore City, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    legend.outside = T
  ) +
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
    type = "symbol",
    col = "red",
    shape = 21,
    title = "Homicide",
    size = 0.5
  ) +
  tmap_options(unit = "mi") +
    type = "text",
    col = "black",
    title = "CSA ID#",
    text = "00"

And here is the map:

Click to enlarge.

Again, we see the clustering of homicides in some areas. Let’s look at homicide rates. First, the code: <-
  tm_shape(map.counts) +
    title = "Homicide Rate per 100,000",
    style = "fisher",
    palette = "OrRd",
    colorNA = "white",
    textNA = "No Homicides"
  ) +
  tm_borders(col = "black",
             lwd = 0.5) +
          size = 1) +
  tm_shape(homicides) +
  tm_dots(col = "red",
          title = "2019 Homicides") +
  tm_shape(jail) +
  tm_fill(col = "#d0e2ec",
          title = "Jail") +
  tm_compass() +
    main.title = "Map of Homicides and Homicide Rate in Baltimore City, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    bg.color = "#d0e2ec",
    legend.outside = T
  ) +
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
    type = "symbol",
    col = "red",
    shape = 21,
    title = "Homicide",
    size = 0.5
  ) +
    type = "text",
    col = "black",
    title = "CSA ID#",
    text = "00"
  ) +
  tmap_options(unit = "mi")

It stands to reason that the CSAs with the most points would also have the highest rates, so here is the map:

Click to enlarge.

From here, we see that the CSAs with the highest rates are close together. We also see that those with the lowest rates are close together. And others are somewhere in between. If there was no spatial autocorrelation, we would expect these CSAs with high or low rates to be equally distributed all over the map. If there was perfect autocorrelations, all highs would be on one side and all lows would be on the other. What we have here is something in the middle, as is the case in most real-life examples.

LISA Analysis

In the next few lines of code, I’m going to do several things. First, I’m going to create a list of the neighbors to each CSA. One CSA, #4 at the far southern tip of the city, does not touch other CSAs. It looks like an island. So I’m going to manually add its neighbors to the list.

Next, I’m going to create a list of neighbors and weights. By weights, I mean that each neighbor will “weigh” the same. If a CSA has one neighbor, then the weight of that neighbor is 1. If they have four, each neighbor — to that CSA — weighs 0.25.

Next, I’m going to calculate the Local Moran’s I value. For a full discussion of Moran’s I, I recommend you visit this page:
There are a lot of nuances to that value. But, basically, that value tells us how different from no spatial autocorrelation each CSAs is. Once you see it on the map, you’ll understand better.

Here’s the code:

map_nbq <-
  poly2nb(map.counts) # Creates list of neighbors to each CSA
added <-
  as.integer(c(7, 50)) # Adds neighbors 7 and 50 to CSA 4 (it looks like an island)
map_nbq[[4]] <- added # Fixes the region (4) without neighbors
View(map_nbq) # View to make sure CSA 4 has neighbors 7 and 50.

map_nbq_w <-
  nb2listw(map_nbq) # Creates list of neighbors and weights. Weight = 1/number of neighbors.
View(map_nbq_w) # View the list

local.moran <-
  localmoran(map.counts$rate, # Which variable you'll use for the autocorrelation
             map_nbq_w, # The list of weighted neighbors
             zero.policy = T) # Calculate local Moran's I
local.moran <-
summary(local.moran) # Look at the summary of Moran's I

map.counts$srate <-
  scale(map.counts$rate) # save to a new column the standardized rate

map.counts$lag_srate <- lag.listw(map_nbq_w,
                                  map.counts$srate) # Spatially lagged standardized rate in new column

summary(map.counts$srate) # Summary of the srate variable
summary(map.counts$lag_srate) # Summary of the lag_srate variable

Now, we’ll take a look at the Moran’s I plot created with this code:

x <- map.counts$srate %>% as.vector()
y <- map.counts$lag_srate %>% as.vector()
xx <- data.frame(x, y) # Matrix of the variables we just created

moran.plot(x, map_nbq_w) # One way to make a Moran Plot

And here is the plot:

Click to enlarge.

This plot tells us which CSAs are in the High-High (right upper quadrant), High-Low (right lower quadrant), Low-Low (left lower quadrant) and Low-High (left upper quadrant) designations for Moran’s I values. The High-High CSAs mean that their value was high and the values of the CSAs (their neighbors) were high also. That is expected since everything is like everything else close to it, right?

The Low-Low CSAs are the same. The ones that are outlier in this regard are the High-Low and Low-High, because these are CSAs that are high values (by values, we are looking at a calculation derived from standardized homicide rates) in a sea of low values, or low values in a sea of high values. The plot above has highlighted the ones that are not spatial outliers and have a significant p-value. (More on the p-value later.)

There is another way to make the same plot, if you’re so inclined:

  x = map.counts$srate,
  y = map.counts$lag_srate,
  main = "Moran Scatterplot Homicide Rate",
  xlab = "Homicide Rate (Scaled)",
  ylab = "Lagged Homicides Rate (Scaled)"
) # Another way to make a Moran Plot
abline(h = 0,
       v = 0) # Adds the crosshairs at (0,0)
  lm(map.counts$lag_srate ~ map.counts$srate),
  lty = 3,
  lwd = 4,
  col = "red"
) # Adds a red dotted line to show the line of best fit

And this is the resulting plot from that:

Click to enlarge.

So those are the plots of the Moran’s I value. What about a map of these values? For that, we need to prepare our CSAs with the appropriate quadrant designations.

map.counts$quad <-
  NA # Create variable for where the pair falls on the quadrants of the Moran plot
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate >= 0), "quad"] <-
  "High-High" # High-High
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate <= 0), "quad"] <-
  "Low-Low" # Low-Low
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate <= 0), "quad"] <-
  "High-Low" # High-Low
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate >= 0), "quad"] <-
  "Low-High" # Low-High

This gives us this map:

Click to enlarge.

Here is the code for that map:

local.moran$OBJECTID <- 0 # Creating a new variable
local.moran$OBJECTID <- 1:nrow(local.moran) # Adding an object ID
local.moran$pvalue <-
  round(local.moran$`Pr(z > 0)`, 3) # Rounding the p value to three decimal places

map.counts <- geo_join(map.counts,

colors <-
  c("red", "blue", "lightpink", "skyblue2", "white") # Color Palette <-
  tm_shape(map.counts) +
          title = "Local Moran's I",
          palette = colors,
          colorNA = "white") +
  tm_borders(col = "black",
             lwd = 0.5) +
          size = 0.5) +
  tm_shape(jail) +
  tm_fill(col = "gray",
          title = "Jail") +
  tm_compass() +
    main.title = "Map of Local Moran's I for Homicide Rates in Baltimore, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    legend.outside = T,
    bg.color = "#d0e2ec"
  ) +
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
    type = "text",
    col = "black",
    title = "Moran's I p-value",
    text = "0.000"
  ) +
  tmap_options(unit = "mi")

The CSAs in red are those that have high Moran’s I values and are surrounded by other CSAs with high values. The light blues are where there are low values and are are surrounded by lows. The dark blue are CSAs with high values but with low neighbors, and the light pink are CSAs with low values but high neighbors. The numbers inside each CSA is the p-value.

The Thing About Significance

If you’ve done your coursework on biostatistics, you’ll see that there is a lot of attention paid to p-values that are less than 0.05. In this case, only a few CSAs have such low p-values. But that may not be enough. In this explainer on spatial association, the author states the following:

An important methodological issue associated with the local spatial autocorrelation statistics is the selection of the p-value cut-off to properly reflect the desired Type I error. Not only are the pseudo p-values not analytical, since they are the result of a computational permutation process, but they also suffer from the problem of multiple comparisons (for a detailed discussion, see de Castro and Singer 2006). The bottom line is that a traditional choice of 0.05 is likely to lead to many false positives, i.e., rejections of the null when in fact it holds.


For this exercise, we are going to keep the p-value significance at 0.05 and create a map that uses the resulting CSAs. The code:

map.counts$quad_sig <-
  NA # Creates a variable for where the significant pairs fall on the Moran plot
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate >= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "High-High, p <0.05" # High-High
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate <= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "Low-Low, p <0.05" # Low-Low
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate <= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "High-Low, p <0.05" # High-Low
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate >= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "Low-High, p <0.05" # Low-High
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate >= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "Not Significant. p>0.05" # Non-significant

Note that I am using the p-value in the local.moran calculation (position #5) in addition to where on the moran plot the CSAs fall. This really eliminates the non-significant (at p = 0.05) CSAs and leaves us with only eight CSAs. Here is the code for the map: <-
  tm_shape(map.counts) +
    title = "Local Moran's I",
    palette = colors,
    colorNA = "white",
    textNA = "Not Significant"
  ) +
  tm_borders(col = "black",
             lwd = 0.5) +
          size = 0.5) +
  tm_shape(jail) +
  tm_fill(col = "gray",
          title = "Jail") +
  tm_compass() +
    main.title = "Map of Local Moran's I for Homicide Rates in Baltimore, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    legend.outside = T,
    bg.color = "#d0e2ec"
  ) +
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
    type = "text",
    col = "black",
    title = "Moran's I p-value",
    text = "0.000"
  ) +
  tmap_options(unit = "mi")

And here is the map itself:

Click to enlarge.

What this is telling us is that only those eight CSAs had a statistically significant Moran’s I value, and they were all in the “High-High” classification. This means that they have high standardized values (which are themselves derived from homicide rates) and are surrounded by CSAs with high values. There were no significant CSAs from the other quadrants on the Moran plot. (Remember that significance here comes with a big caveat.)

In Closing

What does this all mean? It means that the “butterfly pattern” seen in so many maps of social and health indicators in Baltimore still holds. The two regions in red in that last map make up the butterfly’s wings. Those CSAs are also the ones with the highest homicide rates.

Click to enlarge.

You could say that it is in those eight CSAs where homicides are the most concentrated. These are known as spatial clusters. If we had significant “Low-High” or “High-Low” CSAs, they would be known as spatial outliers. Significant “Low-Low” CSAs would also be spatial clusters.

As I mentioned above, we can also do other analyses, some of them looking at the individual locations of the homicides, the dots in the first plot and the first map. You could generate statistics that tell us whether those dots are closer together to each other than what you would normally see if they were just tossed on the map randomly.

A full geostatistical analysis includes these analyses, and you can see an example of the whole thing if you look at my dissertation. You may want to include the Local Moran’s I value or some other term (I used average number of homicides in neighbors) in a regression analysis. Or you could use composite index to account for different factors.

Calculating TB Exposure Time Using R

Suppose that you have two cases of active tuberculosis (TB), and you want to determine the number of hours ten people were exposed to them. For this, you need to know the times that your TB cases were infectious in the location where your contacts, uh, contacted them.

In this scenario, I have two cases: Case A was infectious from January 1, 2019, at 12:35 PM to January 31, 2019, at 9:30 AM. Case B was infectious from December 15, 2018, at 3:50 AM to March 14, 2019, at 10:50 AM. In real life, we don’t really know the exact time that someone became infectious. Heck, most of the time we use the average times we have in the literature to determine infectious periods, but just bare with me on this one.

Now, suppose that we’ve done some interviews and found out that ten people (Contacts “One” through “Ten”) may have come into contact with our two infectious people. Under TB investigation guidelines:

“The likelihood of infection depends on the intensity, frequency, and duration of exposure. For example, airline passengers who are seated for >8 hours in the same or adjoining row as a person who is contagious are much more likely to be infected than other passengers. One set of criteria for estimating risk after exposure to a person with pulmonary TB without lung cavities includes a cut-off of 120 hours of exposure per month. However, for any specific setting, index patient, and contacts, the optimal cut-off duration is undetermined. Administratively determined durations derived from local experience are recommended, with frequent reassessments on the basis of results.”

You could just put it all on paper and determine the amount of time the contacts spent with the cases, but that would get out of control quickly if you start dealing with hundreds of contacts and dozens of cases. I tried doing this in Excel, but the number of IFTHEN statements that I had to write just got to be too much: “If the starting time for CASE ONE is between Date 1 and Date 2, then this and that…” It got to be too much.

So I decided to try it out in R. In very few lines of code, I was able to calculate the number of hours each contact spent with the cases.

First, the data… For this, I created an Excel workbook. On one spreadsheet, named “cases,” I put the cases. On another, named “contacts,” I put the contacts. Pretty self-explanatory, right. For each case, I logged the exposure time and end (infectious period). For each contact, I logged the contact times with the cases. In this scenario, case A and case B were in a building and spent time with the ten contacts at the times and dates that the ten contacts reported. Here’s what it looks like:


Again, you could probably do this by hand by comparing case A and contact One, then A and Two and so forth. But let’s do it in R. First, the libraries:


Tidyverse will help us handle dates and data manipulation. Readxl will help read the Excel files. And DescTools will help with descriptive statistics. Next, we import the data:

cases <- read_excel("data.xlsx", sheet = "cases")
contacts <- read_excel("data.xlsx", sheet = "contacts")

Now, I’m interested in the exposure to each contact, instead of the overall contact to both cases. So I’m going to split up the contact times for each contact. (You can see where this can get tedious if done for each case when the case count is high. For that, I would create a loop.)

case_A <- cases %>% filter(case == "A")
case_B <- cases %>% filter(case == "B")

Now, I’m going to create time intervals. One for each contact, and the two for case A and case B.

contact_time <- interval(as.POSIXct(contacts$start_time), as.POSIXct(contacts$end_time), tzone="UTC")
exposure_time_A <- interval(as.POSIXct(case_A$exposure_start), as.POSIXct(case_A$exposure_end), tzone="UTC")
exposure_time_B <- interval(as.POSIXct(case_B$exposure_start), as.POSIXct(case_B$exposure_end), tzone="UTC")

Next, we need to know how many hours of overlap there are between each contact time and each exposure_time_A and exposure_time_B. The following code asks R to use lubridate (part of the tidyverse package) to calculate the intersection of each contact_time and the respective exposure times. From that calculation, we create the variables time_exposed_A and time_exposed_B that we will append to the contacts dataset.

contacts$time_exposed_A <- as.period(lubridate::intersect(contact_time, exposure_time_A), unit = "hours")
contacts$time_exposed_B <- as.period(lubridate::intersect(contact_time, exposure_time_B), unit = "hours")

In the end, we can look at our data in contacts and see the number of hours each contact was exposed to each case, if they were exposed at all:

As you can see, contacts One and Nine were exposed to case A. All of the contacts except Eight and Ten were exposed to case B. If I wanted to know just the number of days, I’d go to the code above and change “hours” to “days” to get this:

And that’s it. Now you know exactly how much time each contact spent with each case. Simple, right? One more line of code and you can save this data in another spreadsheet that you can use for whatever it is you kids use spreadsheets for nowadays.

Some Things I Learned This Week

I’ve been at the annual conference of the Council of State and Territorial Epidemiologists all this week. The first conference I attended was back in 2008, when I was working at the Maryland Department of Health as an epidemiologist doing influenza surveillance. I remember it being a lot of fun because I got to learn a lot from people who had the same interests as I did. This has not changed much since then, but things around us in the world have. Let me explain…

The opening plenary speaker was Dr. Rachel Levine, the Secretary of Health of the Commonwealth of Pennsylvania. Dr. Levine is a pediatrician by training. She is also transgender. She told us a brief history of her transition to identifying openly as a woman, and how she was lucky to have been in an institution that was very supportive of her finding her identity. Her talk was part of a bigger discussion on transgender health, which in itself was an even bigger discussion on disparities in health outcomes in the LGBT+ community.

One thing that stood out to me was her mentioning that SOGI (sexual orientation, gender identity) measures are not included in the US Census or on the American Community Survey, two huge sources of information on how the country is doing on all sorts of measures, from economics to health. That kind of made me realize that, yeah, we don’t often ask people about their sexual orientation or gender identity when surveying the public’s health. (Beyond the traditional male/female checkboxes in your average survey, of course.)

Asking these questions is not without some risk. We live in a society where people who are non-conforming to the gender standards of society are considered “weird,” and the sad-but-usual response form society toward weirdness is some form of neglect or abuse. Just look at how marginalized people who are homosexual can be in so many parts of our country, and the world. So adding an indicator to someone that singles them out as something other the norm can be used against them if someone so chooses… And, as we’ve seen from even the current presidential administration, people do choose to do harm to people based on their sexual preferences.

The worst form of harm comes in the shape of bullying. As Dr. Levine explained, there is nothing biologically special about LGBT+ people that makes them more prone to mental health issues. It all falls on the bully, and that bully takes different forms. It’s the dude in high school who thinks that there is a standard definition of what a man is supposed to sound and look like. It’s the boss who fires someone for being gay or for choosing to identify as something other than the gender assigned at birth. And it’s the government — like that of North Carolina — that chooses to pass legislature to make discrimination more de jure than de facto.

That bullying many times leads to the person being bullied to cause self-harm, and the worst outcome of that is suicide. This Monday, I start working at the Fairfax County Health Department in Virginia, and the person who is going to be my supervisor there delivered a quick talk about using emergency department visits to better inform public health on the prevalence of suicidal behaviors in the county. (Hers was part of a set of presentations on suicide and self-harm in different communities.) At the end of the presentations, someone asked about the data on LGBT+ people. Again, that little lightbulb in my head went off…

How do you ascertain the incidence and prevalence of suicidal ideation, self-harm and suicide attempts/completions in a segment of the population that is largely hidden away? It’s not like there is a clear indicator in a medical record that someone is non-heterosexual (or, if heterosexual, that they are transgender). It is probably mentioned in the medical notes, but do we really know how often that’s the case? For example, is it medically necessary to mention in the medical record that someone is transgender if they’re not being treated for something that is directly related to their gender identity? Then again, what is related to one’s gender identity or sexual orientation when it comes to health?

In the early days of the HIV/AIDS epidemic, most of the cases that were being identified were in homosexual men. However, we know that the virus does not discriminate. If anything, linking HIV exclusively to homosexual men made heterosexuals of all kinds feel perfectly safe from contracting it, exponentially intensifying the epidemic. In that sense, it was a mistake to only associate the epidemic with gay men. On the other hand, as we saw, it was a mistake to not be completely honest about the risk factors for contracting HIV.

So there is a lot of work to be done on this and many other fronts when it comes to public health in general and epidemiology in particular. We are going to have to figure out how to count those who are not counted, and listen to those without a voice. So let’s get to it…

Looking at Unmet Health Needs in Chicago, 2013

Look at this map of Chicago.

Click to enlarge

This map shows the infant mortality rate in the 77 different communities (except O’Hare, which is the airport area) in Chicago. As you can see, there are some very marked disparities. Armour Square, on the city’s east side, has the lowest infant mortality rate at 1.5. Right next to it is Douglas, and it’s infant mortality rate is 13.4. Interesting, right?

(All the data for this blog post come from Chicago’s Open Data portal.)

You would think that maybe socioeconomics has something to do with this, and you’d be partially right. However, the per capita income in Armour Square is $16,942. In Douglas, it’s $23,098. The percent of people living under the poverty line in Armour Square is 35.8%. In Douglas, it’s 26.1%.

So, if I were to show you only the map of infant mortality, I would not be informing you completely on the disparities between Armour Square and its neighboring community, Douglas. So maybe I’ll show you a second map, that of per capita income:

Click to enlarge.

Ah, now we see that Armour Square and Douglas are on the lower end of the ladder when it comes to per capita income. The higher-income communities show clearly in green. However, if I show you only this map, or even these two maps, I’m still not giving you the full picture of Chicago communities’ health, am I?

To fully do this, we in applied epidemiology use a technique to come up with a “Healthy Condition Index.” The HCI takes health indicators from five domains and standardizes them. (More on that in a second.) The five domains are: Demographics, like percent of an underserved minority living in a community or the proportion of people without a high school diploma; Socioeconomics, like per capita income; Mortality, a measure of how much death is happening in a community, like infant mortality rate; Morbidity and risk factors, like a disease or condition happening in the communities; and Coverage and Health Services, like the number of women getting timely prenatal care or the proportion of children being screened on time for lead poisoning.

We take one indicator from these five domains and make sure that they’re not too collinear, meaning that the level to which one rises or falls given a predictive variable is not exactly the same level as which another one rises or falls. If they’re too collinear, they’re giving us the same information, so we wouldn’t need both. (For this example, I’m assuming the variables I picked are not collinear enough to cause a problem, but you should definitely do this check if you’re doing an actual evaluation of a city’s unmet health needs.)

Next, because the different variables are not on the same scale (some are percentages from 0% to 100% and others are money from $0 to infinity), we need to standardize them. It’s a bit complicated to explain in a blog post how we do standardization — and exactly why — so I’ll let the good folks at Crash Course – Statistics give you the full explanation:

Finally, we add up the Z-scores for each variable and use that as the HCI. (Disclaimer: My thesis advisor and doctoral school mentor, Carlos Castillo-Salgado, MD, MPH, DrPH, is one of the developers of the index.) That HCI number is more indicative of the true condition of the neighborhood/community in relation to another. It gives more information all in one time instead of having to compare two or more communities based on one variable at a time.

Let’s Look at It in R

So let’s dive in and do this analysis in R. You’ll need your favorite R programming environment, the public health statistics data from Chicago that can be found here, and the shapefiles for Chicago’s 77 communities that can be found here. I’ve created a function in R that will do several things as you’ll see in the code:

  • Extracts the names (characters) of the communities and saves them to use later
  • Extracts the variables (numbers) from the data and saves them to do the next set of operations
  • Calculates the Z-score for each variable (just like the video above describes)
  • Multiplies variables that you choose by negative one (-1) in order to make sure the directionality of the vector is correct. (More on that in a second.)
  • Sums all the Z-scores into the HCI
  • Creates a data frame by reincorporating the community names into the file along with the Z-scores and the index values

At the end, you’re left with a dataframe that you can match to your shapefile and throw on a map.

Why negative one? Suppose that you’re looking at per capita income and infant mortality. The scale for per capita income is basically “larger is better,” right? The more per capita income, the better the community is in that respect. The scale for infant mortality rate is the opposite. It is “smaller is better.” The lower the infant mortality rate is, the better the community is in that respect. To make the two be heading in the right direction, once the Z-score is calculated, we multiply any variable whose direction is “lower is better” by negative one to flip its directionality. That way, negative Z-scores become positive, or “higher is better.”

Alright, so let’s do some programming. Fire up a new script, and pull in your data and do some clean-up:

# Import the data

chicago <- read.csv("data/chicago_public_health_stats_2013.csv",
                    stringsAsFactors = T)

# Get rid of community area number

drops <- c("Community.Area")
chicago <- chicago[,!(names(chicago) %in% drops)]

# Choose your five indicators and keep them along with the Community Area Name
# For this example, I will choose the following from their respective domains...
# Demographics: No.High.School.Diploma, lower is better*
# Socioeconomics: Per.Capita.Income, higher is better
# Mortality: Infant.Mortality.Rate, lower is better*
# Morbidity and Risk Factors: Tuberculosis, lower is better*
# Coverage and Health Services:, higher is better

keeps <- c(
chicago <- chicago[keeps]

Note that I marked in the comments which variables are “lower is better” in their directionality. Next, we’re going to apply the function I created, “uhn.index” (UHN meaning “unmet health needs”):

# Now, we create a function "uhn.index" that will do several things
# noted in the code:

lower.better <- c("No.High.School.Diploma",

uhn.index <-
  function(x, y) {
    # x is the dataframe we're using, y are the "lower is better" variables
    chars <-
      x %>% select_if(negate(is.numeric)) # Extracts the names of the community areas
    x <- x[, sapply(x, is.numeric)] # Extracts the numeric variables
    x = scale(x) # Calculates the Z scores of teh numeric variables
    myvars <- y # Incorporates the variables in y above as the ones
    x[, myvars] <-
      x[, myvars] * -1 # Multiplies the "lower is better" variables by negative one
    index <- rowSums(x) # Sums all the Z scores into the index score
    x <-
      data.frame(chars, x, index) # Creates the dataframe from the names, the Z scores and the index
    x <-
      x[order(index), ] # Sorts the index variable from smallest to largest
    return(x) # Returns the resulting dataframe

# Now we apply that function to the Chicago dataframe

chicago.uhn <- uhn.index(chicago, lower.better)

# Let's see our final product


At the end, you will have a dataframe named chicago.uhn that will have the names of the communities, the variables you selected from the five domains (noting that the original file had 29 variables).

Next, we import the shapefile:

# Import the shapefile for the community areas of Chicago <-
  readOGR("data/chicago_community_areas", "chicago_community_areas")
How the shapefile overlays the City of Chicago

Next, join the data you created and the original numbers. There was a little problem here because the original data had the community names in upper and lowercase. The shapefile had the names all in uppercase. So I had to do some manipulation before the join:

# Joining our chicago.uhn data with the shapefile

# First, we sort by Community.Area.Name

chicago.uhn <- chicago.uhn[with(chicago.uhn, order(Community.Area.Name)),]

# Next, because the community names in the shapefile are all in capitals...

chicago.uhn <- mutate_at(chicago.uhn, "Community.Area.Name", toupper)
chicago <- mutate_at(chicago, "Community.Area.Name", toupper)

# Now, the join... <-
           how = "left")

# And another join for the original data <-
  geo_join(, chicago, "community", "Community.Area.Name", how = "left")

That’s it. You can now create your maps. For sake of simplicity, I will give you the code below for mapping the index:

# Let's map the Index of Unmet Health Needs

index <- tm_shape(, unit = "mi") +
  tm_basemap("OpenTopoMap") +
    title = "Index of Unmet Health Needs",
    midpoint = 0,
    style = "quantile",
    palette = "RdYlGn" # The minus sign before "RdYlGn" takes care of the direction of colors
  ) +
  tmap_options(legend.text.color = "black") +
             lwd = .5) +
  tm_compass(position = c("left", "center")) +
    compass.type = "4star",
    legend.text.size = 0.9,
    legend.title.size = 0.9
  ) +
    position = c("left", "bottom"),
    frame = F,
    bg.color = "white"
  ) +
          size = 0.3) +
    size = 0.5,
    color.dark = "black",
    color.light = "white",
    lwd = 1,
    position = c("left", "center")

index # Look at the map

So let’s look at the maps of the variables first. Side-by-side, we’ll have the original data on the left and the standardized (Z-scores) values on the right:

Click to enlarge

They look basically the same, and that’s okay. What Z-scores do is that they tell us which communities are above average (positive Z-scores) or below average (negative Z-scores) and by how much. So that’s “No High School Diploma.”

Per capita income:

Click to enlarge

Note here that some of the communities in green on the left map become yellows on the right. This is because, although they are wealthy, they are not that far above the average for wealth. Remember that averages are very much influenced by extremes, and there are several communities in Chicago that are very low in per capita income. (Riverdale at the most south end of Chicago is an astounding $8,535 while Near North Side is $87,163.)

Infant mortality rate:

Click to enlarge


Click to enlarge

Prenatal care:

Click to enlarge

This one was interesting. Note how some of the communities that are relatively wealthy fare below the average when it comes to prenatal care. This is why giving only one map as information on the health status of a city can be misleading. Here, we have many places where prenatal care is around 60%, and very few where it is above 90%. Even when standardized, the number of communities below average (Z-score below zero) are distributed into even the more wealthier communities in the northeast part of the city.

Finally, the index…

Click to enlarge

This index is aimed at giving a clearer picture of the gradation of health needs in the city of Chicago in 2013. Remember that it is the sum of the Z-scores for the other indicators, so it kind of evens out the differences in the indicators’ distributions from one community to the other. Now you can identify the communities that have the most needs across different domains.

Of course, there are other ways to standardize and account for the differences in the scales of the variables and the differences in the variables between communities. The main takeaway is that one variable, presented on one map, may not be giving the full picture of what is going on. Even two variables on two maps may not do the trick. The best thing to do is to get variables from different domains and standardize them so you’re comparing apples to apples.

Epi 101: Indirect Age Adjustment by Hand and in R

Last time, we talked about direct age adjustment. In direct age adjustment, you know what the age-specific death counts and rates are. Because you’re comparing two populations, you need to account for the differences in the populations’ age distributions. For that, we use a reference population to which the two (or more) populations in question are compared to.

In indirect age adjustment, you have a population for which you have a total number of events (e.g. deaths), but you don’t have the age-specific counts. Something like this:

Age GroupPopulationDeaths
Over 7550,000Unknown
Total Deaths:2,550

This is rarely the case, right? We usually have the ages of the people who die. But this could be the case if, for example, the age-specific counts are unreliable or too small. Or if the total number is very much an estimate than an actual headcount.

Like with direct age adjustment, we get a standard population and add its age-specific population and age-specific deaths to our table:

Age GroupPopulationDeathsStandard PopulationStd. Pop.
Over 7550,000Unknown28,0004,200
Total Deaths:2,550Total Deaths:12,050

Next, we get the age-specific death rate from the population, and multiply that by the age-specific population counts of the population in question:

Age GroupPopulationDeathsStandard PopulationStd. Pop.
Std. Pop.
Expected Deaths
Over 7550,000Unknown28,0004,2000.157,500
Total Deaths:2,550Total Deaths:12,050Total Expected Deaths:13,120

As you can see, the expected deaths (13,120) is much higher than the observed deaths (2,550). The Standardized Mortality Ratio (SMR) is 2,550 divided by 13,120, which equals 0.194. This tells us that there are about 20% the number of expected deaths. This could be a good thing, meaning that your population is exhibiting less deaths than the standard population. Or it could mean that, just like you don’t have age-specific death counts, you might be missing out on other deaths. It’s up to you to interpret this.

One final bit of math is the indirectly standardized rate, where you use the SMR above and multiply it by the standard population’s crude death rate (12,050/150,000, which is 0.080). Multiplying 0.194 times 0.080 gives you 0.0156, or about 15.6 deaths per 1,000 people. This is the death rate (also called the Adjusted Mortality Rate, AMR) that you would see in the population in question once you remove the effect of age. The crude rate of the population in question is 2,550/136,00, which equals 0.01875, or 18.8 deaths per 1,000 people.

What About R?

Just like we did last time with direct age standardization, let’s fire up that RStudio and start by writing which libraries we want to use:


Next, we create our data:

age_groups <- c("0-15", "16-35", "36-55", "56-75", ">75") # Labels for Age Groups
population <- c(15000,20000,21000,30000,50000) # Population by age group of population of interest
observed <- 2550 # Observed deaths in the population of interest
std_pop <- c(20000,30000,35000,37000, 28000) # Population by age group of Standard Population
std_pop_sum <- sum(std_pop) # Total population in population of interest
std_count <- c(200,1500,2450,3700,4200) # Observed deaths in standard population
std_sum_count <- sum(std_count) # Total observed deaths in standard population
std_rate <- std_count/std_pop # Age specific death rates in standard population
std_crude_rate <- std_sum_count/std_pop_sum # Crude death rate in the standard population

Now, we create a data frame “df” and put the data we created above into it. We also create an additional variable “expected” and fill that in by multiplying the standard population’s age-specific rates by the population in question:

df <- data.frame(age_groups,population,std_pop,std_count, std_rate)
df$expected <- std_rate*population

Finally, two quick operations to calculate the SMR and the AMR:

smr <- observed/sum(df$expected) # Standardized mortality ratio
amr <- smr*std_crude_rate*1000 # Adjusted mortality rate per 1,000

Finally, a quick way to do all of the above and get some 95% confidence intervals with them:

indirect <- ageadjust.indirect(observed, # The number of events observed in the population of interest
                         population, # The population by age group in the population of interest
                         std_count, # The number of events observed in the standard population
                         std_pop, # The population by age group in the standard population
                         stdrate = std_rate, # The age specific death rates in the standard population
                         conf.level = 0.95) # The confidence level for your confidence intervals

round(indirect$sir,3) # Standardized Mortality Ratio (sir)
round(indirect$rate*1000,2) # Adjusted Mortality Rate per 1,000 (adj.rate)

From this, you get the same values as we did above (both by hand and with R), but you also get the 95% confidence intervals:

> round(indirect$sir,3) # Standardized Mortality Ratio (sir)
 observed       exp       sir       lci       uci 
 2550.000 13120.000     0.194     0.187     0.202 
> round(indirect$rate*1000,2) # Adjusted Mortality Rate per 1,000 (adj.rate)
crude.rate   adj.rate        lci        uci 
     18.75      15.61      15.02      16.23 

And there you have it. Two ways to do this in R. You can now save the R file (or download the file I used by clicking here). If you have a lot of data, and you don’t want to write out the individual vectors at the beginning, then I suggest you learn the basics of importing data into R and then use that with the packages mentioned above.

Big thanks to these folks and these folks for their examples on how to use dsr and epitools, respectively.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

The Canaries in the Coalmine

The old adage in epidemiology and biostatistics that “the plural of anecdotes is not data.” What we mean by this is that the unverified experiences of a group of people may not always be the truth of what is happening in the population at large. Certainly, I use this point of view when it comes to the anti-vaccine crowd and their claims that vaccines turn you into the Hulk or whatever.

“Doveryai, no proveryai” – Old Russian Proverb

When Andrew Wakefield released his paper in 1998, he claimed that it was his gut feeling that the MMR vaccine caused autism. (The paper, however, found no evidence of this.) HIs statements made researchers jump to action and look into vaccines as a cause for autism because, at the time, the idea that vaccines could do this was a novel idea. (Even if the plausibility was, at best, questionable.) Twenty-plus years later, we know that vaccines do not cause autism, and neither did the preservative called thimerosal that used to be in most childhood vaccines (not the MMR) and is now only found in some multi-dose vials of the influenza vaccine.

In essence, we verified Wakefield’s fraudulent claims, and we’re a little better for it in terms of our knowledge (and acceptance) of autism. We’re not doing so well on the anti-vaccine front, but there are some signs that the tide is turning against their misleading ways. Not only that, but vaccine safety and effectiveness surveillance is better now than it ever has been because we went through the trouble of verifying the fraud perpetrated by Wakefield and amplified by his followers.

The Woman in the Hospital Gown, and Not Much More

About a year ago, a bystander at a bus stop across from a hospital in Baltimore recorded a woman in a hospital gown, and not much else, as the woman muttered words to herself and seemed in some sort of general disorientation. She was taken back to the hospital after 911 was called and help arrived. As it turns out, someone released her from the hospital and put her at the bus stop although she was in some sort of a mental health crisis.

The question in my mind at the time (and in the mind of others) was how often this happens, to whom, and why it happens. Is she the canary in the coal mine when it comes to how people with mental health issues are treated by healthcare providers and facilities? Or was she just the unfortunate recipient of a clerical error that landed her outside in a gown in the middle of winter?

The Brother Pearlie, Sheamon and Damon

The first homicide of 2017 was that of 20 year old Sheamon Pearlie. A year and a half later, his brother Damon was killed just a few blocks away from where Sheamon was killed.

Just a short walk from one place to the other.

By themselves, these two homicides are just a couple of data points in what is a very big and very sad dataset coming out of Baltimore City. However, put them together, and you get the idea of what is happening in the city with regards to where homicides are happening and who is being affected the most by them: African American men between 15 and 35 years of age living/working/existing in poor and disordered neighborhoods in the city. That is, they tell a bigger story.

The Little Lors

One of the first versions of my doctoral thesis work was going to look at the social network of victims of homicide in Baltimore. To do this, I was planning on going to social media and seeing if one victim was associated with another through their social media interactions. (For example, were they friends on Facebook?) So I worked on a proof of concept before proposing the work to my advisor.

For a few weeks in 2015, I tracked homicides happening in Baltimore almost as they happened. As soon as I saw that there had been a homicide, I searched social media to see who was talking about it. Almost inevitably, someone would mention the name of the victim and link to the victim’s social media profile on Facebook or Twitter, or even SnapChat. Later, when another victim was identified and their social media identity revealed, I would compare the profiles and see if they knew each other.

Little by little, over the course of about six months, I discovered that a handful of young men knew each other on Facebook. They all identified each other with the moniker of “Lor” followed by a nickname based on their real name or how they were known in their circle(s) of friends. There were only a few of them, but they all were called “Lor” Something by their friends and on their social media accounts. I nicknamed them “The Little Lors.” Little, because they were all so young.

Again, questions started to be asked in my head. What were the Little Lors a symptom of in the city? Most were still school-aged, so why were they not in school? And those who were out of school, had they graduated high school and were they employed? I wouldn’t get the answers to these questions because that aspect of my research was not approved to go forward. (It was deemed too intrusive to go look at their social media profiles, even if the profiles were open to the public to see.)

Those Nagging Questions

Many of those questions nag me. While I’m not directly involved in solving the problems that led to the woman at the bus stop or the siblings and friends being killed, those “canaries in the coal mine” are relevant to me as someone involved in public health. They’re valuable pieces of an interesting puzzle that, when completely elucidated, will give some idea of the level of disparity and institutionalized discrimination and racism toward entire groups of people.

People with mental health issues are abandoned to their own fortune because the fallible systems created by fallible humans are not able to serve them for some reason. When one sibling is killed in a neighborhood, the other was destined for the same fate perhaps because there was no intervention available to the second sibling to protect them from the contagion that is violence. What intervention could have worked?

And so on and so forth…

Early Warning or More of the Same?

When I told the story of the brothers getting killed, one of my colleagues asked if they were an early warning of something yet to come or just another set of statistics in what we already know about Baltimore. My response was that they were both. On the one hand, yes, there are a lot of homicides in Baltimore, and they have shown little signs of slowing down of the rate at which they’re happening.

On the other hand, when the one Pearlie was killed, there should have been a mobilization of all available resources to make sure that his sibling had the necessary resources to not end up the same way. Because, if violence is contagious (as it has been shown), then who could be at most risk of more violence than the next-of-kin of someone killed via homicide? At the same time, the fact that this was not done is an indicator of where we (all of us) could do better.

The same is true with the woman at the bus stop. She would be the harbinger of things to come if she were the exception, and more attention should/could be placed on how people with mental health problems are discharged from medical facilities. And, if she is not the exception, then we can use her experience to begin to understand what is happening… And do something about it.

Turning Anecdotes Into Data

So now comes the $64,000 question: How do we turn these anecdotes into data that is actionable? In the case of the woman at the bus stop, I start looking at medical records of people with mental health issues and try to understand the habits and procedures of the health care providers when it comes to where those patients are discharged, when, and if any follow-up is done. In the case of the siblings, I collect information on other homicides and what kind of follow-up and services were offered and delivered to the victims’ next-of-kin. Same with the Little Lors.

Then I step back and analyze the information, consult with experts, and put forward proposals to stop the next woman from being sent on her own into the winter’s cold in a hospital gown, or to deliver services to the next-of-kin and friends of a homicide victim to stop — or at least slow — that contagion right then and there. Then, of course, we double-back and analyze the interventions and re-assess the approach.

Better Said Than Done

This is all better said than done, right? There are a lot of moving parts to catching these anecdotes, verifying them and then going out into the world of data that is out there to measure and intervene to prevent what these anecdotes are forecasting. That doesn’t mean we don’t try.

The minute Wakefield put forth his fraudulent “study,” it scared parents away from vaccinating. Not all parents, but enough parents to cause vaccine-preventable diseases to make a comeback. The minute someone heard him say what he said in that press conference twenty years ago, someone else should have jumped higher and shouted louder with data in hand, both verifying the plausibility and probability of vaccines causing autism and then doing something to prevent parents from stopping vaccines for their children.

Unfortunately, too many were afraid to be divisive, so they chose to be indecisive. And here we are today… This can’t continue to be the case with the social ailments and public health threats in a city like Baltimore, or anywhere else. If we see a problem, we point it out, we science the heck out of it, and we make sure that the anecdote remains an anecdote. And, if it isn’t an anecdote but a widespread problem, we take the decisive action needed to stop and reverse whatever perverse trend threatens others.

Don’t Get Rid of Your Outliers Just Because

I was tutoring a student last week on how to conduct the analysis of some data they’re using for their final project. Their project is all about analysing a health intervention in women at risk for gestational diabetes. The measurement of the success of the intervention was the number of women who had gestational diabetes and the birth weight of the children born to all the women at the place where the intervention was done.

We were doing well in the planning for the analysis until the student mentioned something very interesting… They had gotten rid of all their outliers when it came to birth weight. “You what?” I asked.
“I looked for all the outliers in the birth weight data and got rid of them.”
“How did you do that?”
“I made a box plot and just got rid of all the outliers.”

If you’re wondering why this concerned me so much, it’s because women who experience gestational diabetes are at higher risk of having a large (>9 lbs) baby. In the student’s data, the number and proportion of women with gestational diabetes at a clinic did change from before the program to after, and so had the average weight of babies born. There were still some women who had gestational diabetes, and there were still some macrosomatic babies.

“Tell me, if high birth weight is one of your indicators, why did you get rid of the very high birth weights?” I asked the student. It was then that a lightbulb went off in their head.
“Oh… I shouldn’t have.”
“No,” I replied, “because you’re biasing your results.”

What made things worse is that the student had done their analysis on Excel, without keeping notes on what they had done. As a result, they didn’t know which records they deleted. To figure that out and undo it, they had to go back to the original dataset and compare it to what they had already, or just start all over. “I can’t start all over,” they said. “I don’t have the time. This is due in two weeks.”

Oh, boy.

One of the advantages of a statistics package like R, STATA or SAS is that your original data remains intact in the background while you play around with it. You also keep track of what you’re doing by writing a “program” that you can go back to and edit as necessary. It’s like writing a story for your data as you bring it from pure raw data form to something you are analyzing, to the tables and graphs you’re going to publish.

Yes, Excel does some great graphs, and it’s easy to build them, but you lose track of what you’re doing if you just point and click. It’s better to have a running record of what you’re doing. So, for the sake of your research, keep accurate notes if you’re going to use something that is not based on a program written by you.

One More Thing

One thing that I can’t figure out is whether or not someone told this student what would happen if they deleted outliers without thinking about it. Just one poorly deleted outlier can change the entire nature of the association between the dependent variable and the independent variable(s). You can’t just see a variable that is an outlier and delete it, especially if it is a variable that is associated so strongly with your outcome.

This leads me to believe that one of two things happened. Either the student was never taught this aspect of data analysis and interpretation, or they outright refused to do it… Or they were not paying attention. At any rate, someone other than an online tutor needs to be paying close attention. It doesn’t serve the student to let them go so far astray.

Poopooing the P-Value

Suppose you are comparing a new medication against an old medication, both designed to lower cholesterol. The old medication lowered cholesterol by an average of 50 points, and the new medication lowers cholesterol by an average of 75 points. You think the new medication is good until you see that the p-value for the comparison of both groups is 0.051, which is larger than the “statistically significant” cutoff of 0.05.

You throw away your conclusion and stay with the old medication right?

Yeah, Maybe

When I was working on my dissertation, I was trying to build a statistical model where the expected number of homicides per resident in any given neighborhood in Baltimore was predicted by one or more variables. I included variables one at a time and saw if they were statistically significant at a p-value of 0.05. Those who were significant would move on to form the bigger model, while those who were not would be placed aside.

If I had done this exactly this way, I would have been left with a model with only two variables: Poverty and Disorder. However, I decided to leave in one variable that was not statistically significant by itself or in combination with other variables: the average number of homicides in neighborhoods bordering the neighborhood in question.

The Black Box of Biostatistics

As I told you before, biostatistics can be somewhat of a black box to those who don’t understand the underlying theory of why numbers come out the way they do. You put something in and you get something out, but you don’t really understand what happened in the process. That’s the definition of a black box.

Not to be confused with 90’s dance group “Black Box,” by the way:

Anyway, not a lot of healthcare providers are trained fully in the theory and practice of biostatistics. They go to one conference after another where papers and studies are presented but only if the studies found some difference between their study groups that was statistically significant. Anything less than that and the study probably doesn’t even see the light of day.

If the study findings do see the light of day, they’re presented without much of the underlying data. (Good luck trying to get any researcher to share “their” data.) So all you’re left with to make a clinical (or public health) decision is what the study says. If it kind of jives with what you know — or believe — to be true, then you go with it. If the findings are revolutionary, you kind of want to see more evidence. All the while, you have no real clue on how or why or if the data said what the researchers said that it did.

You kind of go for it on faith, right?

You’re in Luck!

Lucky for you, biostatisticians alone don’t do all the research, analysis, interpretation and policy decisions. A good research project has a multidisciplinary team. A good paper writes out the origins of the idea for the study, the background information you need, what others have done and the source of the data. And a good clinician/practitioner uses their scientific knowledge, the entirety of the evidence and their own best judgment to decide on whether or not to go with a study’s findings.

What About P?

There’s a nascent movement to change our approach to p-values and similar statistical measurements because they’ve become too difficult to deal with when deciding on a study’s findings. From an article in Nature:

Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.

One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30. Whether a P value is small or large, caution is warranted.

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.”

It’s true. People see a p-value of 0.051 and discard the entirety of the study’s findings, as I’ve written above. But I particularly like the part about embracing uncertainty. Although I don’t think that they mean uncertainty as in not knowing something. I think they mean uncertainty as in “change one parameter and everything changes… We can’t measure everything, or everyone, after all.”

Take, for example, what happened with a rotavirus vaccine back in the late 1990s. After the vaccine was licensed, physicians began reporting cases of intussusception, a condition in which the intestine folds in on itself. They reported enough cases to warrant further investigation. (Intussusception is life-threatening if left untreated.) Post-licensure surveillance and case-control studies found that an increased risk of intussusception was associated with getting the vaccine, so the vaccine was withdrawn from the market.

A different vaccine manufacturer was ready to pull the plug on its rotavirus vaccine, but a scientist within the company convinced the bosses to go ahead. He was that confident of his vaccine. However, in order to avoid not seeing an adverse event when there is one, they were going to need thousands more children in the clinical trial… Tens of thousands, actually.

This was because the first clinical trial saw no difference in the two groups: vaccinated and unvaccinated. It saw no difference partly because the size of the groups was not big enough to see cases of intussusception that occur so rarely to begin with. To see just one case of naturally occurring intussusception, you’d have to observe between 17,000 and 100,000 children. So the next trial had to have a lot of children in the control and intervention groups in order to see if there was a difference.

In that second vaccine trial, there was no detected increase in the risk of intussusception between the groups, and the intervention group had less incidence of rotavirus. Still, post-marketing surveillance was very active to try and detect any issues that were similar to the previous vaccine. They haven’t, but… But there is still an increased risk of intussusception if you get rotavirus or the vaccine, compared to not getting either. (But that’s a discussion for a later time.)

See What I Mean?

On the one hand, giving a dichotomous (yes/no) value to a p-value or a confidence interval makes it easy to make a decision on whether or not the numbers you just crunched make sense. On the other hand, medical and public health decisions are much more complex than just a dichotomous decision. We’re all similar, but we’re not. We all react the same way to medications, for the most part. And it’s in dissecting those dissimilarities where get in trouble over the dichotomous nature of how we interpret p-values.

Maybe in the future there will be a better way to make decisions on statistical analyses than to just reject or fail-to-reject the null and alternative hypotheses… Until then, decision makers need to be better equipped to understand how the data on which they’re basing their decisions was collected, analyzed and why/how he statistics that were computed came out the way they did.