American Community Survey (ACS) data provides demographic information such as population, age, gender, race, and ethnicity at the level of a geographic unit. These units can be as big as states or as fine-grained as tracts (I can hear you asking what a tract is…).
And actually, by describing the ACS this way I am doing it a disservice, because the survey collects far more information than what I described here. Take a look at their webpage for more information: https://www.census.gov/programs-surveys/acs/data.html
My point is: it is a great resource for looking at geographical dynamics!
Below is the list of topics you can find variables for in the ACS:
Ok, why are we using R again? Because the Census Bureau has an API you can use to pull this data straight into your R working environment!
We will use the tidycensus package. The package page has the full documentation and several tutorials; here I will show some tricks for organizing the data efficiently when you are collecting multiple variables at a time.
Getting an API key is a super simple process. Simply visit http://api.census.gov/data/key_signup.html and request one.
Then, load the package and register your key. I saved mine to a csv file, so I will load my key from there. (Note: do not share your key with others. It is good practice to load such information from a file rather than typing it into your code when you are creating a tutorial like this one.)
#load libraries
library(tidycensus)
library(tidyverse)
FALSE -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
FALSE v ggplot2 3.3.3 v purrr 0.3.4
FALSE v tibble 3.0.4 v dplyr 1.0.2
FALSE v tidyr 1.1.2 v stringr 1.4.0
FALSE v readr 1.4.0 v forcats 0.5.0
FALSE -- Conflicts ------------------------------------------ tidyverse_conflicts() --
FALSE x dplyr::filter() masks stats::filter()
FALSE x dplyr::lag() masks stats::lag()
#load census key (I saved it as a csv file so that I do not have to type it in.)
#`bucket` holds the path to the folder that contains the file
key = read_csv(file.path(bucket,'census_api_key.csv'))$value[1]
FALSE
FALSE -- Column specification --------------------------------------------------------
FALSE cols(
FALSE value = col_character()
FALSE )
census_api_key(key, install=TRUE,overwrite=TRUE) #register the key and store it in .Renviron
FALSE Your original .Renviron will be backed up and stored in your R HOME directory if needed.
FALSE Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY").
FALSE To use now, restart R or run `readRenviron("~/.Renviron")`
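Once the key has been installed this way, later sessions do not need the csv file at all: the key sits in `.Renviron` and can be read with `Sys.getenv()`, as the message above says. Here is a minimal sketch of a guard you could put at the top of a script. `get_census_key()` is a hypothetical helper name of my own, not part of tidycensus:

```r
# Hypothetical helper (not part of tidycensus): fetch the key from the
# environment and fail early with a readable message if it is missing.
get_census_key = function(var = "CENSUS_API_KEY") {
  key = Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No key found in ", var,
         ". Run census_api_key(key, install = TRUE) first.")
  }
  key
}
```

Calling `get_census_key()` before any download makes a missing key show up immediately instead of halfway through a long script.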
The Census Bureau publishes a number of datasets. Since we are interested in the ACS today, we will use either the 1-year or the 5-year ACS data. The 5-year version pools the responses collected over a 5-year period and produces a single estimate for your variable(s) of interest.
Because the ACS identifies variables by ‘codes’ rather than descriptive names, it is useful to check the codebook. Use view() to explore the available variables:
#to see the list of all variables:
df_vars = load_variables(2019,dataset='acs5')
view(df_vars)
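The codebook has thousands of rows, so scrolling through it gets tiring. Since `load_variables()` returns a regular tibble with `name`, `label` and `concept` columns, you can also search it with a pattern. A small sketch: `search_vars()` is my own hypothetical helper, and the toy rows below only imitate the codebook layout (with abbreviated labels) so that the example runs without an API call:

```r
library(dplyr)
library(stringr)

# Hypothetical helper: keep rows whose label or concept matches a pattern
search_vars = function(vars, pattern) {
  rx = regex(pattern, ignore_case = TRUE)
  vars %>%
    filter(str_detect(label, rx) | str_detect(concept, rx))
}

# Toy stand-in for df_vars: same column names, abbreviated made-up labels
toy_vars = tibble(
  name    = c('B01001_001', 'B01002_001', 'B19013_001'),
  label   = c('Estimate!!Total', 'Estimate!!Median age',
              'Estimate!!Median household income'),
  concept = c('SEX BY AGE', 'MEDIAN AGE BY SEX', 'MEDIAN HOUSEHOLD INCOME')
)

search_vars(toy_vars, 'median age')
```

With the real codebook you would call `search_vars(df_vars, 'median age')` and read the codes off the `name` column.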
You can collect multiple variables at a time, at various geographic levels. Today, we will collect information on metro areas.
Let’s start by defining your geographic unit, and creating a dataset with the variable codes and your assigned variable names:
geog = "metropolitan statistical area/micropolitan statistical area"
my_varnames = c('population','median.age')
my_vars = c('B01001_001','B01002_001')
df_myvars = tibble(varname = my_varnames, variable = my_vars)
df_myvars
## # A tibble: 2 x 2
## varname variable
## <chr> <chr>
## 1 population B01001_001
## 2 median.age B01002_001
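tidycensus also accepts a named vector for `variables`, in which case the names you supply show up in the `variable` column instead of the codes. Below is a sketch of building that vector from `df_myvars` (recreated here so the snippet runs on its own). The `get_acs()` call is left commented out because it needs a network connection and a key:

```r
library(tibble)

df_myvars = tibble(varname  = c('population', 'median.age'),
                   variable = c('B01001_001', 'B01002_001'))

# setNames(values, names): the codes become the values, our labels the names
named_vars = setNames(df_myvars$variable, df_myvars$varname)
named_vars

# df_named = get_acs(geography = geog, variables = named_vars,
#                    year = 2019, survey = "acs5")
```

Whether you rename up front like this or join the names in afterwards is a matter of taste; the lookup table has the advantage that the codes stay in the data for reference.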
Now, use the get_acs() function to gather the data you want:
#gather data
df_2019 = get_acs(geography = geog, #collect data at the metro/micro area level. Other options include 'state', 'county', 'tract' or 'block group'
                  variables = my_vars,
                  year = 2019,
                  survey = "acs5")
## Getting data from the 2015-2019 5-year ACS
head(df_2019)
## # A tibble: 6 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 10100 Aberdeen, SD Micro Area B01001_001 42824 NA
## 2 10100 Aberdeen, SD Micro Area B01002_001 37.3 0.8
## 3 10140 Aberdeen, WA Micro Area B01001_001 72779 NA
## 4 10140 Aberdeen, WA Micro Area B01002_001 44 0.3
## 5 10180 Abilene, TX Metro Area B01001_001 170669 NA
## 6 10180 Abilene, TX Metro Area B01002_001 34.1 0.2
Let’s make it more readable for our fellow researchers and teammates. First, let’s add a new column that holds our assigned variable names rather than the ACS variable codes:
df_2019 = merge(df_2019, df_myvars, by = 'variable')
head(df_2019)
## variable GEOID NAME estimate moe varname
## 1 B01001_001 10100 Aberdeen, SD Micro Area 42824 NA population
## 2 B01001_001 12660 Baraboo, WI Micro Area 63922 NA population
## 3 B01001_001 10140 Aberdeen, WA Micro Area 72779 NA population
## 4 B01001_001 12680 Bardstown, KY Micro Area 45650 NA population
## 5 B01001_001 10180 Abilene, TX Metro Area 170669 NA population
## 6 B01001_001 12700 Barnstable Town, MA Metro Area 213496 NA population
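Note that base `merge()` re-sorts the rows by the key and hands back a plain data frame, which is why the rows above come back ordered by variable code. If you would rather keep the original row order and the tibble class, `left_join()` from dplyr is one alternative. A sketch on a toy version of the data (values copied from the earlier output):

```r
library(dplyr)

toy_acs = tibble(
  GEOID    = c('10100', '10100', '10140', '10140'),
  NAME     = c('Aberdeen, SD Micro Area', 'Aberdeen, SD Micro Area',
               'Aberdeen, WA Micro Area', 'Aberdeen, WA Micro Area'),
  variable = c('B01001_001', 'B01002_001', 'B01001_001', 'B01002_001'),
  estimate = c(42824, 37.3, 72779, 44)
)
toy_names = tibble(varname  = c('population', 'median.age'),
                   variable = c('B01001_001', 'B01002_001'))

# left_join() keeps every row of toy_acs, in its original order
toy_joined = left_join(toy_acs, toy_names, by = 'variable')
toy_joined
```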
Next, because we are interested in the metro areas only, let’s remove the micro areas listed:
df_2019 = df_2019[grep('Metro',df_2019$NAME), ] #gets only the metro areas
df_2019$NAME = gsub('Metro Area','',df_2019$NAME) #deletes the phrase 'Metro Area'
df_2019$NAME = trimws(df_2019$NAME) #trims whitespace from left and right hand side of the character elements.
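The same three steps can also be written as a single dplyr pipeline. This is only a sketch on toy rows. Anchoring the pattern at the end of the string (`'Metro Area$'`) is my own precaution, so that a place name that merely contains 'Metro' cannot slip through:

```r
library(dplyr)
library(stringr)

toy = tibble(NAME = c('Aberdeen, SD Micro Area',
                      'Abilene, TX Metro Area',
                      'Akron, OH Metro Area'))

toy_metro = toy %>%
  filter(str_detect(NAME, 'Metro Area$')) %>%                # keep metro areas only
  mutate(NAME = str_trim(str_remove(NAME, 'Metro Area$')))   # drop the suffix, then the trailing space
toy_metro$NAME
```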
We may want to search metro areas by state, so it is useful to split the ‘NAME’ column into two columns: ‘metro’ and ‘state’.
x = strsplit(df_2019$NAME,', ') #split each name at the comma
df_2019$metro = sapply(x, `[`, 1) #first piece: the metro name
df_2019$state = sapply(x, `[`, 2) #second piece: the state abbreviation(s)
df_2019$NAME = NULL #drop the original column
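tidyr has a helper for exactly this kind of split. Here is a sketch with `separate()` on toy names. Note that metro areas spanning several states carry a combined label such as 'NY-NJ-PA' in the state column:

```r
library(dplyr)
library(tidyr)

toy = tibble(GEOID = c('10180', '35620'),
             NAME  = c('Abilene, TX', 'New York-Newark-Jersey City, NY-NJ-PA'))

# split NAME at the comma; the original column is dropped by default
toy_split = toy %>%
  separate(NAME, into = c('metro', 'state'), sep = ', ')
toy_split
```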
Ok, most importantly: our dataset is in ‘long’ format, with one row per metro area and variable. It would be useful to have one column for each variable (in our example, population and median age) so that we can use them easily in our analyses (for example, regressions or map visualizations!). So let’s make that conversion. We will use the ‘estimate’ values for each variable.
#remove the columns we will not use
df_2019$moe = NULL
df_2019$variable = NULL
#convert the data from long to wide version
df_2019 = df_2019 %>%
pivot_wider(names_from = varname,
values_from = estimate)
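If you would like to keep the margins of error rather than dropping them, `pivot_wider()` can spread several value columns at once. A sketch on toy rows taken from the output earlier in this post; with two `values_from` columns, the new column names are prefixed with the name of the source column:

```r
library(dplyr)
library(tidyr)

toy_long = tibble(
  GEOID    = c('10180', '10180'),
  metro    = c('Abilene', 'Abilene'),
  state    = c('TX', 'TX'),
  varname  = c('population', 'median.age'),
  estimate = c(170669, 34.1),
  moe      = c(NA, 0.2)
)

# one row per metro area, one estimate_* and one moe_* column per variable
toy_wide = toy_long %>%
  pivot_wider(names_from = varname,
              values_from = c(estimate, moe))
names(toy_wide)
```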
And we are ready to roll!
head(df_2019)
## # A tibble: 6 x 5
## GEOID metro state population median.age
## <chr> <chr> <chr> <dbl> <dbl>
## 1 10180 Abilene TX 170669 34.1
## 2 12700 Barnstable Town MA 213496 53.3
## 3 10380 Aguadilla-Isabela PR 301107 42.6
## 4 10420 Akron OH 703845 40.3
## 5 12940 Baton Rouge LA 854318 35.6
## 6 10500 Albany GA 148436 36.9
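With one row per metro area, everyday questions become one-liners. For instance, here is a sketch of looking up the metros in a given state, using a few of the rows printed above as toy data. `str_detect()` is used instead of `==` so that multi-state metros are matched too:

```r
library(dplyr)
library(stringr)

toy_wide = tibble(
  GEOID      = c('10180', '10420', '10500'),
  metro      = c('Abilene', 'Akron', 'Albany'),
  state      = c('TX', 'OH', 'GA'),
  population = c(170669, 703845, 148436),
  median.age = c(34.1, 40.3, 36.9)
)

# metros located (at least partly) in Ohio, largest first
ohio = toy_wide %>%
  filter(str_detect(state, 'OH')) %>%
  arrange(desc(population))
ohio
```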
Note: at this geographic level, GEOID is the Census Bureau’s five-digit code for the metro area itself (the CBSA code), not a block identifier. It is handy as a stable key when you merge this table with other metro-level datasets.