Local Indicators of Spatial Association in Homicides in Baltimore, 2019

Last year was yet another grim year for homicides in Baltimore, with 348 reported for the year. While that total was only the second-highest on record, it was probably the highest in terms of rate per 100,000 residents, since Baltimore's population has been declining. (The previous peaks were in 2015 and 2017.) The epidemic of violence and homicides that began in 2015 continues today.

When we epidemiologists analyze data, we need to be mindful of the spatial associations we see. The first law of geography states that “everything is related to everything else, but near things are more related than distant things.” As a result, clusters of cases of a disease or condition — or an outcome such as homicides — may be more a factor of similar things happening in close proximity to each other than an actual cluster.

To study this phenomenon and account for it in our analyses, we use different techniques to assess spatial autocorrelation and spatial dependence. In this blog post, I will guide you through doing a LISA (Local Indicators of Spatial Association) analysis of homicides in Baltimore City in 2019 using R programming.

First, The Data

Data on homicides was extracted from the Baltimore City Open Data portal. Under the “Public Safety” category, we find a data set titled “BPD Part 1 Victim Based Crime Data.” In those data, I filtered the observations to be only homicides and only those occurring/reported in 2019. This yielded 348 records, corresponding to the number reported for the year.

Data on Community Statistical Areas (agglomerations of neighborhoods grouped by similar characteristics) came from the Baltimore Neighborhood Indicators Alliance (BNIA). BNIA is a very good source of social and demographic data on Baltimore. From their open data portal, I extracted the shapefile of CSAs in the city.

Next, The R Code

First, I load the libraries that I’m going to use for this project. (I’ve tried to annotate the code as much as possible.)

library(tidyverse)  # Data wrangling and plotting (dplyr, ggplot2, etc.)
library(rgdal)      # readOGR() for reading shapefiles
library(ggmap)      # Basemaps and mapping helpers
library(spatialEco) # point.in.poly() for point-in-polygon joins
library(tmap)       # Thematic maps
library(tigris)     # geo_join() for joining data to shapefiles
library(spdep)      # Neighbors, spatial weights, and Moran's I
library(classInt)   # Class intervals for choropleth maps
library(gstat)      # Geostatistical tools
library(maptools)   # Helper utilities for sp objects

Next, I load the data.

csas <-
  readOGR("2017_csas",
          "Vital_Signs_17_Census_Demographics") # Baltimore Community Statistical Areas
jail <-
  readOGR("Community_Statistical_Area",
          "community_statistical_area") # For the central jail
pro <-
  sp::CRS("+proj=longlat +datum=WGS84 +unit=us-ft") # Projection
csas <-
  sp::spTransform(csas,
                  pro) # Setting the projection for the CSAs shapefile
homicides <- read.csv("baltimore_homicides.csv",
                      stringsAsFactors = F) # Homicides in Baltimore
homicides <-
  homicides[!is.na(homicides$Latitude), ] # Remove crimes without spatial data (address only, no coordinates)

Note that I had to do some wrangling here. First, there is an area on the Baltimore map that corresponds to a large jail complex in the center of the city, so I added a second map layer called jail to the data. I also corrected the projection on the layers. (This helps make sure that the points and polygons line up correctly, since you're overlaying flat data on a spherical part of the planet.) Finally, I took out homicides for which there was no spatial data. This sometimes happens when the true location of a homicide cannot be determined, or when there is a lag or gap in data entry.

Next, I’m going to convert the homicides data frame into a spatial points file.

coordinates(homicides) <-
  ~ Longitude + Latitude # Makes the homicides CSV file into a spatial points file
proj4string(homicides) <-
  CRS("+proj=longlat +datum=WGS84 +unit=us-ft") # Give the right projection to the homicides spatial points

Note that I am telling R which variables in homicides correspond to the longitude and latitude. Then, as before, I set the projection to match that of the CSAs layer.

Now, let’s join the points (homicides) to the polygons (csas).

homicides_csas <- point.in.poly(homicides,
                                csas) # Gives each homicide the attributes of the CSA it happened in
homicide.counts <-
  as.data.frame(homicides_csas) # Creates the table to see how many homicides in each CSA
counts <-
  homicide.counts %>%
  group_by(CSA2010, tpop10) %>%
  summarise(homicides = n()) # Determines the number of homicides in each CSA.
counts$tpop10 <-
  as.numeric(levels(counts$tpop10))[counts$tpop10] # Fixes tpop10 not being numeric
counts$rate <-
  (counts$homicides / counts$tpop10) * 100000 # Calculates the homicide rate per 100,000
map.counts <- geo_join(csas,
                       counts,
                       "CSA2010",
                       "CSA2010",
                       how = "left") # Joins the counts to the CSA shapefile
map.counts$homicides[is.na(map.counts$homicides)] <-
  0 # Turns all the NA's homicides into zeroes
map.counts$rate[is.na(map.counts$rate)] <-
  0 # Turns all the NA's rates into zeroes
jail <-
  jail[jail$NEIG55ID == 93, ] # Keeps only the jail area (NEIG55ID 93) from the jail layer

Here, I also created a table to see how many points fell inside the polygons. It allowed me to look at how many homicides were in each of the CSAs. Here are the top five:

Community Statistical Area          Homicide Count
Southwest Baltimore                 29
Sandtown-Winchester/Harlem Park     25
Greater Rosemont                    21
Clifton-Berea                       16
Greenmount East                     15

Of course, case counts alone do not give us the whole picture. We need to account for differences in population. For that, you'll see that I created a rate variable. Here are the top five CSAs by homicide rate per 100,000 residents:

Community Statistical Area          Homicide Rate per 100,000 Residents
Greenmount East                     191
Sandtown-Winchester/Harlem Park     189
Clifton-Berea                       176
Southwest Baltimore                 172
Madison/East End                    167

As you can see, the ranking based on homicide rates is different, and it is more informative: it accounts for the fact that you may see more homicides in a CSA simply because that CSA has more people in it. With the rates in hand, I joined the counts and rates to the csas shapefile, which is what we'll use for mapping. I also made sure that all the NAs in the rates and counts became zeroes. Finally, I kept only the area labeled "93" from the jail layer. This, again, is so the map represents that area while making clear that it is not a place where people live.
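For reference, here is one way to reproduce those two rankings straight from the counts data frame built above (a quick sketch; the column names are the ones created in the code):

counts %>%
  ungroup() %>%
  arrange(desc(homicides)) %>%
  head(5) # Top five CSAs by homicide count

counts %>%
  ungroup() %>%
  arrange(desc(rate)) %>%
  head(5) # Top five CSAs by homicide rate per 100,000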

The Maps

Look at the following simple plot of homicide locations in Baltimore:

Notice Something?

This is not a map. This is just a plot of the homicide locations. The latitude is on the Y axis and the longitude on the X axis. If you notice that there is some clustering of cases, you would be correct. Point-based cluster analysis can also be done on these data alone, but that is for a different blog post at a later time. (And you have to take other things into consideration.)
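The plot itself was not produced by any code shown in this post; a minimal sketch using the homicides spatial points object created above would look something like this:

plot(coordinates(homicides), # Matrix of longitude/latitude pairs
     pch = 16,
     cex = 0.5,
     col = "red",
     xlab = "Longitude",
     ylab = "Latitude",
     main = "Homicide Locations, Baltimore, 2019")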

Next, I’m going to create a map of the points above (using the tmap package) but also show the boundaries of the CSAs. This will look more like a map and show you the spatial distribution of the homicide cases in context. You’ll see why there are blank spaces in some parts.

Here is the code:

tmap_mode("plot") # Set tmap to plot. To make interactive, use "view"

homicide.dots.map <-
  tm_shape(map.counts) +
  tm_borders(col = "black",
             lwd = 0.5) +
  tm_text("OBJECTID",
          size = 1) +
  tm_shape(homicides) +
  tm_dots(col = "red",
          title = "2019 Homicides",
          size = 0.2) +
  tm_compass() +
  tm_layout(
    main.title = "Map of Homicides and Homicide Rate in Baltimore City, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    legend.outside = T
  ) +
  tm_scale_bar(
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
  tm_add_legend(
    type = "symbol",
    col = "red",
    shape = 21,
    title = "Homicide",
    size = 0.5
  ) +
  tmap_options(unit = "mi") +
  tm_add_legend(
    type = "text",
    col = "black",
    title = "CSA ID#",
    text = "00"
  )
homicide.dots.map

And here is the map:

[Map: 2019 homicide locations over the CSA boundaries, with CSA ID numbers]

Again, we see the clustering of homicides in some areas. Let’s look at homicide rates. First, the code:

homicide.rate.map <-
  tm_shape(map.counts) +
  tm_fill(
    "rate",
    title = "Homicide Rate per 100,000",
    style = "fisher",
    palette = "OrRd",
    colorNA = "white",
    textNA = "No Homicides"
  ) +
  tm_borders(col = "black",
             lwd = 0.5) +
  tm_text("OBJECTID",
          size = 1) +
  tm_shape(homicides) +
  tm_dots(col = "red",
          title = "2019 Homicides") +
  tm_shape(jail) +
  tm_fill(col = "#d0e2ec",
          title = "Jail") +
  tm_compass() +
  tm_layout(
    main.title = "Map of Homicides and Homicide Rate in Baltimore City, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    bg.color = "#d0e2ec",
    legend.outside = T
  ) +
  tm_scale_bar(
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
  tm_add_legend(
    type = "symbol",
    col = "red",
    shape = 21,
    title = "Homicide",
    size = 0.5
  ) +
  tm_add_legend(
    type = "text",
    col = "black",
    title = "CSA ID#",
    text = "00"
  ) +
  tmap_options(unit = "mi")
homicide.rate.map

It stands to reason that the CSAs with the most points would also have the highest rates, so here is the map:

[Map: homicide rate per 100,000 by CSA, with homicide locations overlaid]

From here, we see that the CSAs with the highest rates are close together, and so are those with the lowest rates; the rest are somewhere in between. If there were no spatial autocorrelation, we would expect CSAs with high or low rates to be distributed evenly all over the map. If there were perfect autocorrelation, all the highs would be on one side and all the lows on the other. What we have here is something in the middle, as is the case in most real-life examples.
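One way to put a single number on that overall, city-wide tendency is a global Moran's I test. It is not part of this post's workflow, but as a sketch, once the weighted neighbor list (map_nbq_w) is built in the next section, you could run:

moran.test(map.counts$rate,  # Homicide rate per 100,000 in each CSA
           map_nbq_w,        # Weighted neighbor list built in the LISA section below
           zero.policy = TRUE) # Global Moran's I for the city as a whole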

LISA Analysis

In the next few lines of code, I’m going to do several things. First, I’m going to create a list of the neighbors to each CSA. One CSA, #4 at the far southern tip of the city, does not touch other CSAs. It looks like an island. So I’m going to manually add its neighbors to the list.

Next, I’m going to create a list of neighbors and weights. By weights, I mean that each neighbor will “weigh” the same. If a CSA has one neighbor, then the weight of that neighbor is 1. If they have four, each neighbor — to that CSA — weighs 0.25.

Next, I’m going to calculate the Local Moran’s I value. For a full discussion of Moran’s I, I recommend you visit this page: http://ceadserv1.nku.edu/longa//geomed/ppa/doc/LocalI/LocalI.htm
There are a lot of nuances to that value, but basically it tells us how far each CSA departs from what we would expect under no spatial autocorrelation. Once you see it on the map, you'll understand better.
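Roughly speaking (this is the standard formulation of the statistic, not anything specific to this post), the local Moran's I for CSA i is the product of its own standardized value and the weighted sum of its neighbors' standardized values:

I_i = z_i × Σ_j w_ij z_j

where z_i is the standardized homicide rate in CSA i and w_ij are the neighbor weights from the previous step. Up to the details of the standardization, those two factors correspond to the srate and lag_srate columns computed in the code below.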

Here’s the code:

map_nbq <-
  poly2nb(map.counts) # Creates list of neighbors to each CSA
added <-
  as.integer(c(7, 50)) # Adds neighbors 7 and 50 to CSA 4 (it looks like an island)
map_nbq[[4]] <- added # Fixes the region (4) without neighbors
View(map_nbq) # View to make sure CSA 4 has neighbors 7 and 50.

map_nbq_w <-
  nb2listw(map_nbq) # Creates list of neighbors and weights. Weight = 1/number of neighbors.
View(map_nbq_w) # View the list

local.moran <-
  localmoran(map.counts$rate, # Which variable you'll use for the autocorrelation
             map_nbq_w, # The list of weighted neighbors
             zero.policy = T) # Calculate local Moran's I
local.moran <- as.data.frame(local.moran)
summary(local.moran) # Look at the summary of Moran's I

map.counts$srate <-
  scale(map.counts$rate) # save to a new column the standardized rate

map.counts$lag_srate <- lag.listw(map_nbq_w,
                                  map.counts$srate) # Spatially lagged standardized rate in new column

summary(map.counts$srate) # Summary of the srate variable
summary(map.counts$lag_srate) # Summary of the lag_srate variable

Now, we’ll take a look at the Moran’s I plot created with this code:

x <- map.counts$srate %>% as.vector()
y <- map.counts$lag_srate %>% as.vector()
xx <- data.frame(x, y) # Data frame of the two variables (handy for inspection; not used below)

moran.plot(x, map_nbq_w) # One way to make a Moran Plot

And here is the plot:

[Plot: Moran scatterplot produced by moran.plot()]

This plot tells us which CSAs fall into the High-High (upper right quadrant), High-Low (lower right quadrant), Low-Low (lower left quadrant) and Low-High (upper left quadrant) designations for Moran's I values. High-High CSAs are those whose own value is high and whose neighbors' values are also high. That is expected, since everything is like everything else close to it, right?

The Low-Low CSAs are the same story. The outliers in this regard are the High-Low and Low-High CSAs, because these are CSAs with high values (here, values derived from standardized homicide rates) in a sea of low values, or low values in a sea of high values. The plot above highlights the ones that are not spatial outliers and have a significant p-value. (More on the p-value later.)

There is another way to make the same plot, if you’re so inclined:

plot(
  x = map.counts$srate,
  y = map.counts$lag_srate,
  main = "Moran Scatterplot Homicide Rate",
  xlab = "Homicide Rate (Scaled)",
  ylab = "Lagged Homicides Rate (Scaled)"
) # Another way to make a Moran Plot
abline(h = 0,
       v = 0) # Adds the crosshairs at (0,0)
abline(
  lm(map.counts$lag_srate ~ map.counts$srate),
  lty = 3,
  lwd = 4,
  col = "red"
) # Adds a red dotted line to show the line of best fit

And this is the resulting plot from that:

[Plot: Moran scatterplot of the standardized rate vs. the spatially lagged rate, with the line of best fit]

So those are the plots of the Moran’s I value. What about a map of these values? For that, we need to prepare our CSAs with the appropriate quadrant designations.

map.counts$quad <-
  NA # Create variable for where the pair falls on the quadrants of the Moran plot
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate >= 0), "quad"] <-
  "High-High" # High-High
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate <= 0), "quad"] <-
  "Low-Low" # Low-Low
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate <= 0), "quad"] <-
  "High-Low" # High-Low
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate >= 0), "quad"] <-
  "Low-High" # Low-High

This gives us this map:

[Map: CSAs colored by Moran plot quadrant, with p-values labeled]

Here is the code for that map:

local.moran$OBJECTID <- 0 # Creating a new variable
local.moran$OBJECTID <- 1:nrow(local.moran) # Adding an object ID
local.moran$pvalue <-
  round(local.moran$`Pr(z > 0)`, 3) # Rounding the p value to three decimal places

map.counts <- geo_join(map.counts,
                       local.moran,
                       "OBJECTID",
                       "OBJECTID")

colors <-
  c("red", "blue", "lightpink", "skyblue2", "white") # Color Palette

local.moran.map <-
  tm_shape(map.counts) +
  tm_fill("quad",
          title = "Local Moran's I",
          palette = colors,
          colorNA = "white") +
  tm_borders(col = "black",
             lwd = 0.5) +
  tm_text("pvalue",
          size = 0.5) +
  tm_shape(jail) +
  tm_fill(col = "gray",
          title = "Jail") +
  tm_compass() +
  tm_layout(
    main.title = "Map of Local Moran's I for Homicide Rates in Baltimore, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    legend.outside = T,
    bg.color = "#d0e2ec"
  ) +
  tm_scale_bar(
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
  tm_add_legend(
    type = "text",
    col = "black",
    title = "Moran's I p-value",
    text = "0.000"
  ) +
  tmap_options(unit = "mi")
local.moran.map

The CSAs in red are those with high values that are surrounded by other CSAs with high values. The light blue ones have low values and are surrounded by lows. The dark blue ones are CSAs with high values but low-value neighbors, and the light pink ones are CSAs with low values but high-value neighbors. The number inside each CSA is the p-value.

The Thing About Significance

If you've done coursework in biostatistics, you know there is a lot of attention paid to p-values below 0.05. In this case, only a few CSAs have such low p-values. But even that cutoff may be too lenient here. In this explainer on spatial association, the author states the following:

An important methodological issue associated with the local spatial autocorrelation statistics is the selection of the p-value cut-off to properly reflect the desired Type I error. Not only are the pseudo p-values not analytical, since they are the result of a computational permutation process, but they also suffer from the problem of multiple comparisons (for a detailed discussion, see de Castro and Singer 2006). The bottom line is that a traditional choice of 0.05 is likely to lead to many false positives, i.e., rejections of the null when in fact it holds.

Source: https://geodacenter.github.io/workbook/6a_local_auto/lab6a.html
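One common partial remedy for that multiple-comparisons problem is to adjust the p-values, for example with a false discovery rate (FDR) correction. That adjustment is not done in this post; the lines below are only a sketch of what it might look like using base R's p.adjust(), with p_fdr as a new column name used only in this sketch:

local.moran$p_fdr <-
  p.adjust(local.moran[, 5], # The Pr(z > 0) p-values from localmoran()
           method = "fdr")   # Benjamini-Hochberg false discovery rate adjustment
sum(local.moran$p_fdr <= 0.05) # How many CSAs remain significant after the adjustment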

For this exercise, we are going to keep the p-value significance at 0.05 and create a map that uses the resulting CSAs. The code:

map.counts$quad_sig <-
  NA # Creates a variable for where the significant pairs fall on the Moran plot
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate >= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "High-High, p <0.05" # High-High
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate <= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "Low-Low, p <0.05" # Low-Low
map.counts@data[(map.counts$srate >= 0 &
                   map.counts$lag_srate <= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "High-Low, p <0.05" # High-Low
map.counts@data[(map.counts$srate <= 0 &
                   map.counts$lag_srate >= 0) &
                  (local.moran[, 5] <= 0.05), "quad_sig"] <-
  "Low-High, p <0.05" # Low-High
# CSAs that do not meet p <= 0.05 are left as NA here; the map below renders
# them in white and labels them "Not Significant" via textNA

Note that I am using the p-value from the local.moran output (column #5) in addition to where the CSAs fall on the Moran plot. This eliminates the CSAs that are not significant at p = 0.05 and leaves us with only eight CSAs. Here is the code for the map:

local.moran.map.sig <-
  tm_shape(map.counts) +
  tm_fill(
    "quad_sig",
    title = "Local Moran's I",
    palette = colors,
    colorNA = "white",
    textNA = "Not Significant"
  ) +
  tm_borders(col = "black",
             lwd = 0.5) +
  tm_text("pvalue",
          size = 0.5) +
  tm_shape(jail) +
  tm_fill(col = "gray",
          title = "Jail") +
  tm_compass() +
  tm_layout(
    main.title = "Map of Local Moran's I for Homicide Rates in Baltimore, 2019",
    main.title.size = 0.8,
    legend.position = c("left", "bottom"),
    compass.type = "4star",
    legend.text.size = 1,
    legend.title.size = 1,
    legend.outside = T,
    bg.color = "#d0e2ec"
  ) +
  tm_scale_bar(
    text.size = 0.5,
    color.dark = "black",
    color.light = "yellow",
    lwd = 1
  ) +
  tm_add_legend(
    type = "text",
    col = "black",
    title = "Moran's I p-value",
    text = "0.000"
  ) +
  tmap_options(unit = "mi")
local.moran.map.sig

And here is the map itself:

[Map: CSAs with statistically significant local Moran's I values]

What this is telling us is that only those eight CSAs had a statistically significant Moran’s I value, and they were all in the “High-High” classification. This means that they have high standardized values (which are themselves derived from homicide rates) and are surrounded by CSAs with high values. There were no significant CSAs from the other quadrants on the Moran plot. (Remember that significance here comes with a big caveat.)

In Closing

What does this all mean? It means that the “butterfly pattern” seen in so many maps of social and health indicators in Baltimore still holds. The two regions in red in that last map make up the butterfly’s wings. Those CSAs are also the ones with the highest homicide rates.

[Map: the "butterfly" pattern formed by the significant High-High CSAs]

You could say that it is in those eight CSAs where homicides are the most concentrated. These are known as spatial clusters. If we had significant “Low-High” or “High-Low” CSAs, they would be known as spatial outliers. Significant “Low-Low” CSAs would also be spatial clusters.

As I mentioned above, we can also do other analyses, some of them looking at the individual locations of the homicides, i.e. the dots in the first plot and the first map. You could generate statistics that tell us whether those dots are closer together than you would expect if they were just tossed onto the map at random.

A full geostatistical analysis includes these analyses, and you can see an example of the whole thing if you look at my dissertation. You may want to include the local Moran's I value or some other term (I used the average number of homicides in neighboring areas) in a regression analysis. Or you could use a composite index to account for different factors.
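As a sketch of what that might look like (this is just an illustration, not the model from the dissertation), the Ii column that the geo_join above attached to map.counts could go straight into an ordinary regression:

# 'outcome' is a placeholder for whatever dependent variable you are modeling;
# it is not a column in these data
fit <- lm(outcome ~ Ii,           # Ii = local Moran's I, joined onto map.counts above
          data = map.counts@data) # Add whatever covariates you actually care about
summary(fit)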

Looking at Unmet Health Needs in Chicago, 2013

Look at this map of Chicago.

[Map: infant mortality rate by Chicago community area]

This map shows the infant mortality rate in the 77 different communities (except O'Hare, which is the airport area) in Chicago. As you can see, there are some very marked disparities. Armour Square, on the city's east side, has the lowest infant mortality rate at 1.5 per 1,000 live births. Right next to it is Douglas, and its infant mortality rate is 13.4. Interesting, right?

(All the data for this blog post come from Chicago’s Open Data portal.)

You would think that maybe socioeconomics has something to do with this, and you’d be partially right. However, the per capita income in Armour Square is $16,942. In Douglas, it’s $23,098. The percent of people living under the poverty line in Armour Square is 35.8%. In Douglas, it’s 26.1%.

If I were to show you only the map of infant mortality, I would not be fully informing you about the disparities between Armour Square and its neighboring community, Douglas. So maybe I'll show you a second map, that of per capita income:

[Map: per capita income by Chicago community area]

Ah, now we see that Armour Square and Douglas are on the lower end of the ladder when it comes to per capita income. The higher-income communities show clearly in green. However, if I show you only this map, or even these two maps, I’m still not giving you the full picture of Chicago communities’ health, am I?

To fully do this, we in applied epidemiology use a technique to come up with a "Healthy Condition Index." The HCI takes one health indicator from each of five domains and standardizes them. (More on that in a second.) The five domains are:

  • Demographics, like the percent of an underserved minority living in a community or the proportion of people without a high school diploma
  • Socioeconomics, like per capita income
  • Mortality, a measure of how much death is happening in a community, like the infant mortality rate
  • Morbidity and risk factors, like a disease or condition happening in the communities
  • Coverage and health services, like the number of women getting timely prenatal care or the proportion of children being screened on time for lead poisoning

We take one indicator from each of these five domains and make sure they're not too collinear, meaning they don't rise and fall together so closely that they are essentially giving us the same information. If two indicators are that strongly related, we don't need both. (For this example, I'm assuming the variables I picked are not collinear enough to cause a problem, but you should definitely do this check if you're doing an actual evaluation of a city's unmet health needs.)
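That check isn't shown in this post, but once the chicago data frame is loaded and trimmed in the code below, a quick way to eyeball it is a pairwise correlation matrix (a sketch):

round(cor(chicago[sapply(chicago, is.numeric)], # Numeric indicators only
          use = "pairwise.complete.obs"),       # Ignore missing values pair by pair
      2)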

Next, because the different variables are not on the same scale (some are percentages from 0% to 100%, others are dollar amounts from $0 on up), we need to standardize them. It's a bit complicated to explain in a blog post exactly how and why we do standardization, so I'll let the good folks at Crash Course Statistics give you the full explanation:
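In short, standardizing a variable means subtracting its mean and dividing by its standard deviation, which is exactly what R's scale() function does. A tiny illustration with made-up numbers:

x <- c(2, 4, 6, 8, 10)
(x - mean(x)) / sd(x) # Z-scores by hand
as.vector(scale(x))   # The same thing with scale()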

Finally, we add up the Z-scores for each variable and use that sum as the HCI. (Disclaimer: My thesis advisor and doctoral school mentor, Carlos Castillo-Salgado, MD, MPH, DrPH, is one of the developers of the index.) That HCI number is more indicative of the true condition of a neighborhood or community relative to the others. It gives more information all at once, instead of having to compare two or more communities one variable at a time.

Let’s Look at It in R

So let’s dive in and do this analysis in R. You’ll need your favorite R programming environment, the public health statistics data from Chicago that can be found here, and the shapefiles for Chicago’s 77 communities that can be found here. I’ve created a function in R that will do several things as you’ll see in the code:

  • Extracts the names (characters) of the communities and saves them to use later
  • Extracts the variables (numbers) from the data and saves them to do the next set of operations
  • Calculates the Z-score for each variable (just like the video above describes)
  • Multiplies variables that you choose by negative one (-1) in order to make sure the directionality of the vector is correct. (More on that in a second.)
  • Sums all the Z-scores into the HCI
  • Creates a data frame by reincorporating the community names into the file along with the Z-scores and the index values

At the end, you’re left with a dataframe that you can match to your shapefile and throw on a map.

Why negative one? Suppose that you're looking at per capita income and infant mortality. The scale for per capita income is basically "larger is better," right? The more per capita income, the better off the community is in that respect. The scale for the infant mortality rate is the opposite: "smaller is better." The lower the infant mortality rate, the better off the community is in that respect. To make the two variables point in the same direction, once the Z-scores are calculated we multiply any variable whose direction is "lower is better" by negative one to flip it. That way, a good (low) value ends up with a positive Z-score, and "higher is better" holds for every variable.

Alright, so let’s do some programming. Fire up a new script, and pull in your data and do some clean-up:

# Import the data

chicago <- read.csv("data/chicago_public_health_stats_2013.csv",
                    stringsAsFactors = T)

# Get rid of community area number

drops <- c("Community.Area")
chicago <- chicago[,!(names(chicago) %in% drops)]

# Choose your five indicators and keep them along with the Community Area Name
# For this example, I will choose the following from their respective domains...
# Demographics: No.High.School.Diploma, lower is better*
# Socioeconomics: Per.Capita.Income, higher is better
# Mortality: Infant.Mortality.Rate, lower is better*
# Morbidity and Risk Factors: Tuberculosis, lower is better*
# Coverage and Health Services: Prenatal.Care.Beginning.in.First.Trimester, higher is better

keeps <- c(
  "Community.Area.Name",
  "No.High.School.Diploma",
  "Per.Capita.Income",
  "Infant.Mortality.Rate",
  "Tuberculosis",
  "Prenatal.Care.Beginning.in.First.Trimester"
)
chicago <- chicago[keeps]

Note that I marked in the comments which variables are “lower is better” in their directionality. Next, we’re going to apply the function I created, “uhn.index” (UHN meaning “unmet health needs”):

# Now, we create a function "uhn.index" that will do several things
# noted in the code:

lower.better <- c("No.High.School.Diploma",
                  "Infant.Mortality.Rate",
                  "Tuberculosis")

uhn.index <-
  function(x, y) {
    # x is the dataframe we're using, y are the "lower is better" variables
    chars <-
      x %>% select_if(negate(is.numeric)) # Extracts the names of the community areas
    x <- x[, sapply(x, is.numeric)] # Extracts the numeric variables
    x <- scale(x) # Calculates the Z scores of the numeric variables
    myvars <- y # The "lower is better" variables passed in as y
    x[, myvars] <-
      x[, myvars] * -1 # Multiplies the "lower is better" variables by negative one
    index <- rowSums(x) # Sums all the Z scores into the index score
    x <-
      data.frame(chars, x, index) # Creates the dataframe from the names, the Z scores and the index
    x <-
      x[order(index), ] # Sorts the rows by index value, smallest to largest
    return(x) # Returns the resulting dataframe
  }

# Now we apply that function to the Chicago dataframe

chicago.uhn <- uhn.index(chicago, lower.better)

# Let's see our final product

View(chicago.uhn)

At the end, you will have a dataframe named chicago.uhn that contains the names of the communities, the Z-scores for the five variables you selected (the original file had 29 variables), and the index itself.

Next, we import the shapefile:

# Import the shapefile for the community areas of Chicago

chicago.map <-
  readOGR("data/chicago_community_areas", "chicago_community_areas")
[Figure: how the shapefile overlays the City of Chicago]

Next, join the data you created and the original numbers. There was a little problem here because the original data had the community names in upper and lowercase. The shapefile had the names all in uppercase. So I had to do some manipulation before the join:

# Joining our chicago.uhn data with the shapefile

# First, we sort by Community.Area.Name

chicago.uhn <- chicago.uhn[with(chicago.uhn, order(Community.Area.Name)),]

# Next, because the community names in the shapefile are all in capitals...

chicago.uhn <- mutate_at(chicago.uhn, "Community.Area.Name", toupper)
chicago <- mutate_at(chicago, "Community.Area.Name", toupper)

# Now, the join...

chicago.data <-
  geo_join(chicago.map,
           chicago.uhn,
           "community",
           "Community.Area.Name",
           how = "left")

# And another join for the original data

chicago.data.o <-
  geo_join(chicago.map, chicago, "community", "Community.Area.Name", how = "left")

That's it. You can now create your maps. For the sake of simplicity, I will give you the code below for mapping the index:

# Let's map the Index of Unmet Health Needs

index <- tm_shape(chicago.data, unit = "mi") +
  tm_basemap("OpenTopoMap") +
  tm_fill(
    "index",
    title = "Index of Unmet Health Needs",
    midpoint = 0,
    style = "quantile",
    palette = "RdYlGn" # Prefix the palette with a minus sign ("-RdYlGn") to reverse the color direction if needed
  ) +
  tmap_options(legend.text.color = "black") +
  tm_borders("black",
             lwd = .5) +
  tm_compass(position = c("left", "center")) +
  tm_layout(
    compass.type = "4star",
    legend.text.size = 0.9,
    legend.title.size = 0.9
  ) +
  tm_legend(
    position = c("left", "bottom"),
    frame = F,
    bg.color = "white"
  ) +
  tm_text("community",
          size = 0.3) +
  tm_scale_bar(
    text.size = 0.5,
    color.dark = "black",
    color.light = "white",
    lwd = 1,
    position = c("left", "center")
  )

index # Look at the map

So let’s look at the maps of the variables first. Side-by-side, we’ll have the original data on the left and the standardized (Z-scores) values on the right:

[Maps: No High School Diploma, raw values (left) and Z-scores (right)]

They look basically the same, and that's okay. What Z-scores do is tell us which communities are above average (positive Z-scores) or below average (negative Z-scores) and by how much. So that's "No High School Diploma."
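The side-by-side panels weren't produced by the code shown in this post; one way to get that layout is tmap_arrange(), sketched here for the No High School Diploma variable. It assumes chicago.data.o carries the raw values and chicago.data the Z-scores, as built above (keep in mind that the "lower is better" variables in chicago.uhn have already been flipped by uhn.index):

raw.map <- tm_shape(chicago.data.o) +
  tm_fill("No.High.School.Diploma",
          title = "No High School Diploma (raw)") +
  tm_borders("black", lwd = 0.5)

z.map <- tm_shape(chicago.data) +
  tm_fill("No.High.School.Diploma",
          title = "No High School Diploma (Z-score)",
          midpoint = 0) +
  tm_borders("black", lwd = 0.5)

tmap_arrange(raw.map, z.map) # Raw values on the left, Z-scores on the right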

Per capita income:

[Maps: per capita income, raw values (left) and Z-scores (right)]

Note here that some of the communities in green on the left map become yellow on the right. This is because, although they are wealthy, they are not that far above the average for wealth. Remember that averages are very much influenced by extremes, and there are several communities in Chicago with very low per capita income. (Riverdale, at the far south end of Chicago, has an astoundingly low $8,535, while Near North Side has $87,163.)

Infant mortality rate:

[Maps: infant mortality rate, raw values (left) and Z-scores (right)]

Tuberculosis:

[Maps: tuberculosis, raw values (left) and Z-scores (right)]

Prenatal care:

[Maps: prenatal care in the first trimester, raw values (left) and Z-scores (right)]

This one was interesting. Note how some of the communities that are relatively wealthy fare below average when it comes to prenatal care. This is why presenting a single map as information on the health status of a city can be misleading. Here, we have many places where prenatal care is around 60%, and very few where it is above 90%. Even after standardizing, the communities below average (Z-score below zero) extend into even the wealthier communities in the northeast part of the city.

Finally, the index…

[Map: Index of Unmet Health Needs by Chicago community area]

This index is aimed at giving a clearer picture of the gradation of health needs in the city of Chicago in 2013. Remember that it is the sum of the Z-scores for the other indicators, so it kind of evens out the differences in the indicators’ distributions from one community to the other. Now you can identify the communities that have the most needs across different domains.

Of course, there are other ways to standardize and account for the differences in the scales of the variables and the differences in the variables between communities. The main takeaway is that one variable, presented on one map, may not be giving the full picture of what is going on. Even two variables on two maps may not do the trick. The best thing to do is to get variables from different domains and standardize them so you’re comparing apples to apples.