Collecting data from the American Community Survey

ACS data

American Community Survey (ACS) data provide demographic information such as population, age, gender, race, ethnicity… at the geographic unit level. These units can be as big as states or as fine-grained as tracts (I can hear you asking what a tract is…)

And actually, by defining the ACS this way I am not doing it justice, because the survey collects way more information than what I described here. Take a look at their webpage for more information: https://www.census.gov/programs-surveys/acs/data.html

My point is: it is a great resource for looking at geographical dynamics!

Below is the list of topics you can find variables for in the ACS:

Age and Sex
Business and Economy
Congressional Apportionment
Education
Emergency Management
Employment
Families and Living Arrangements
Geography
Health
Hispanic Origin
Housing
Income and Poverty
International Trade
Population
Population Estimates
Public Sector
Race
Research
Voting and Registration

Ok, why are we using R again? Because the Census data has an API that lets you pull this data straight into your R working environment!

We will use the tidycensus package. While the package page provides the documentation and tutorials, I will show some tricks to organize the data efficiently when you are collecting multiple variables at a time.

Get an API key

This is a super simple process. Simply go visit: http://api.census.gov/data/key_signup.html and get one.

Then, load the package and register your key. I saved mine to a csv file, so I will load my key from there. (Note: do not share your key with others. When you are creating a tutorial like this one, it is good practice to read such information from a file rather than typing it into the script.)

#load libraries
library(tidycensus)
library(tidyverse)

FALSE -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

FALSE v ggplot2 3.3.3     v purrr   0.3.4
FALSE v tibble  3.0.4     v dplyr   1.0.2
FALSE v tidyr   1.1.2     v stringr 1.4.0
FALSE v readr   1.4.0     v forcats 0.5.0

FALSE -- Conflicts ------------------------------------------ tidyverse_conflicts() --
FALSE x dplyr::filter() masks stats::filter()
FALSE x dplyr::lag()    masks stats::lag()

#load census key (I saved it as a csv file so that I do not have to type it in.)
#bucket is the path to the folder that holds the file; it must be defined beforehand.
key = read_csv(file.path(bucket,'census_api_key.csv'))$value[1]

FALSE 
FALSE -- Column specification --------------------------------------------------------
FALSE cols(
FALSE   value = col_character()
FALSE )

census_api_key(key, install=TRUE,overwrite=TRUE) #load key

FALSE Your original .Renviron will be backed up and stored in your R HOME directory if needed.

FALSE Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
FALSE To use now, restart R or run `readRenviron("~/.Renviron")`
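
Once the key is installed, later sessions do not need the csv file at all. The sketch below uses only base R to check whether the key is available through the environment variable named in the message above. The helper name has_census_key is hypothetical; it is not part of tidycensus.

```r
# Hypothetical helper (not part of tidycensus):
# TRUE if the CENSUS_API_KEY environment variable is set and non-empty
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (!has_census_key()) {
  message("No Census API key found; run census_api_key() first.")
}
```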

Choose your dataset type

The Census Bureau publishes a number of datasets. Since we are interested in the ACS today, we will use either the 1-year or the 5-year ACS data. The 5-year version pools responses collected over a 5-year period (2015-2019 for the 2019 release) and creates an estimate for your variable(s) of interest.

Because the ACS assigns ‘codes’ rather than variable names, it is useful to check the codebook. Use view() to explore the available variables in the codebook:

#to see the list of all variables:
df_vars = load_variables(2019,dataset='acs5')
view(df_vars)
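
Scrolling through thousands of rows gets tedious, so it can help to search the codebook by keyword. Below is a minimal sketch, assuming the codebook has 'name', 'label' and 'concept' columns. The small table df_vars_toy is a made-up stand-in for df_vars so that the example runs offline, and its label strings are illustrative rather than copied from the real codebook.

```r
# Toy stand-in for the table returned by load_variables();
# the three rows are illustrative, the real table has thousands.
df_vars_toy <- data.frame(
  name    = c("B01001_001", "B01002_001", "B19013_001"),
  label   = c("Estimate!!Total:",
              "Estimate!!Median age --!!Total:",
              "Estimate!!Median household income in the past 12 months"),
  concept = c("SEX BY AGE",
              "MEDIAN AGE BY SEX",
              "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS"),
  stringsAsFactors = FALSE
)

# Keep the rows whose label or concept mentions "median age"
keyword <- "median age"
hits <- df_vars_toy[
  grepl(keyword, df_vars_toy$label, ignore.case = TRUE) |
    grepl(keyword, df_vars_toy$concept, ignore.case = TRUE), ]
hits$name
```

With the real codebook, the same subsetting applied to df_vars should narrow the list to a handful of candidate codes.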

Gather data

You can collect multiple variables at a time, at various geographic levels. Today, we will collect information on metro areas.

Let’s start by defining your geographic unit, and creating a dataset with the variable codes and your assigned variable names:

geog = "metropolitan statistical area/micropolitan statistical area"

my_varnames = c('population','median.age')
my_vars = c('B01001_001','B01002_001')

df_myvars = tibble(varname = my_varnames, variable = my_vars)
df_myvars

## # A tibble: 2 x 2
##   varname    variable  
##   <chr>      <chr>     
## 1 population B01001_001
## 2 median.age B01002_001
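
Another way to keep codes and names together is a single named vector, which makes it harder for the two lists to get out of sync. The sketch below builds the same lookup table from one; my_named_vars and df_lookup are names introduced here for illustration. The tidycensus documentation also describes passing a named vector to the 'variables' argument of get_acs(), in which case the names appear in the 'variable' column instead of the codes, but check that against your installed version before relying on it.

```r
# Names are the labels we want, values are the ACS codes
my_named_vars <- c(population = "B01001_001", median.age = "B01002_001")

# Rebuild the lookup table from the named vector
df_lookup <- data.frame(
  varname  = names(my_named_vars),
  variable = unname(my_named_vars),
  stringsAsFactors = FALSE
)
df_lookup
```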

Now, use the get_acs() function to gather the data you want:

#gather data
df_2019 = get_acs(geography = geog, #collect data at the metro/micro area level. Other options include 'state', 'county', 'tract' or 'block group'
                  variables = my_vars,
                  year = 2019,
                  survey = "acs5")

## Getting data from the 2015-2019 5-year ACS

head(df_2019)

## # A tibble: 6 x 5
##   GEOID NAME                    variable   estimate   moe
##   <chr> <chr>                   <chr>         <dbl> <dbl>
## 1 10100 Aberdeen, SD Micro Area B01001_001  42824    NA  
## 2 10100 Aberdeen, SD Micro Area B01002_001     37.3   0.8
## 3 10140 Aberdeen, WA Micro Area B01001_001  72779    NA  
## 4 10140 Aberdeen, WA Micro Area B01002_001     44     0.3
## 5 10180 Abilene, TX Metro Area  B01001_001 170669    NA  
## 6 10180 Abilene, TX Metro Area  B01002_001     34.1   0.2

Beautify your data!

Let’s make it easier to read for our fellow researchers and teammates. First, let’s add a new column that holds our assigned variable names rather than the ACS variable codes:

df_2019 = merge(df_2019, df_myvars, by = 'variable')
head(df_2019)

##     variable GEOID                           NAME estimate moe    varname
## 1 B01001_001 10100        Aberdeen, SD Micro Area    42824  NA population
## 2 B01001_001 12660         Baraboo, WI Micro Area    63922  NA population
## 3 B01001_001 10140        Aberdeen, WA Micro Area    72779  NA population
## 4 B01001_001 12680       Bardstown, KY Micro Area    45650  NA population
## 5 B01001_001 10180         Abilene, TX Metro Area   170669  NA population
## 6 B01001_001 12700 Barnstable Town, MA Metro Area   213496  NA population
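
Notice that merge() sorted the rows by the 'variable' column, which is why the preview above shows only population rows. If the original row order matters to you, a lookup with match() adds the names without reordering anything. A minimal sketch on toy rows copied from the earlier get_acs() output (df_toy and lookup are illustrative names):

```r
# Toy long-format data, copied from the first rows shown earlier
df_toy <- data.frame(
  GEOID    = c("10100", "10100", "10140", "10140"),
  variable = c("B01001_001", "B01002_001", "B01001_001", "B01002_001"),
  estimate = c(42824, 37.3, 72779, 44),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  varname  = c("population", "median.age"),
  variable = c("B01001_001", "B01002_001"),
  stringsAsFactors = FALSE
)

# match() finds each code's position in the lookup table;
# the rows of df_toy stay in their original order
df_toy$varname <- lookup$varname[match(df_toy$variable, lookup$variable)]
df_toy
```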

Next, because we are interested in the metro areas only, let’s remove the micro areas listed:

df_2019 = df_2019[grep('Metro',df_2019$NAME), ] #gets only the metro areas
df_2019$NAME = gsub('Metro Area','',df_2019$NAME) #deletes the phrase 'Metro Area'
df_2019$NAME = trimws(df_2019$NAME) #trims whitespace from left and right hand side of the character elements.
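
The same filter can be written with an anchored pattern, so that only names ending in ' Metro Area' are kept and the suffix is removed together with its leading space in one step. A sketch on toy names taken from the output above (the vector nm is made up for the example):

```r
nm <- c("Aberdeen, SD Micro Area",
        "Abilene, TX Metro Area",
        "Barnstable Town, MA Metro Area")

# Keep only the names that end with " Metro Area"
is_metro <- grepl(" Metro Area$", nm)

# Drop the suffix; trimws() is not needed because the space is in the pattern
metro_nm <- sub(" Metro Area$", "", nm[is_metro])
metro_nm
```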

Maybe we would want to search metro areas by state. So it would be useful to divide the ‘NAME’ column into two columns: ‘metro’ and ‘state’.

x = strsplit(df_2019$NAME,', ')
metro_names = sapply(x, `[`, 1) #first piece of each name: the metro
metro_states = sapply(x, `[`, 2) #second piece of each name: the state(s)

df_2019$metro = metro_names
df_2019$state = metro_states
df_2019$NAME = NULL
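
The split can also be done without strsplit(), using two vectorized sub() calls. This is a sketch, assuming every name has the form 'metro, state' with the state part after the last comma. The names are toy values, and the multi-state one shows that the whole state string is kept together.

```r
nm <- c("Abilene, TX", "Akron, OH", "Cincinnati, OH-KY-IN")

# Everything before the last ", " is the metro name
metro <- sub(", [^,]*$", "", nm)
# Everything after the last ", " is the state part
state <- sub("^.*, ", "", nm)

data.frame(metro, state)
```

Keep in mind that metro areas spanning several states get a combined value such as 'OH-KY-IN', so a search by state needs grepl() rather than an exact match.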

Ok, most importantly: our dataset is in ‘long’ format, with one row per metro area and variable. It would be useful to have one column for each variable (in our example, population and median age) so that we can use them easily in our analyses (for example, regressions or map visualizations!). So let’s make that conversion. We will use the ‘estimate’ values for each variable.

#remove the columns we will not use
df_2019$moe = NULL
df_2019$variable = NULL

#convert the data from long to wide version
df_2019 = df_2019 %>%
  pivot_wider(names_from = varname,
              values_from = estimate)
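
If you would rather keep the margins of error, you can leave the 'moe' column in place and pass both columns to values_from. A sketch on toy rows (only tidyr is needed); the estimates come from the output above, while the second area's moe is made up. With two value columns, pivot_wider() prefixes each new column with the name of the column it came from.

```r
library(tidyr)

# Toy long-format data with both estimate and moe
# (the moe for GEOID 10420 is invented for the example)
df_toy <- data.frame(
  GEOID    = c("10180", "10180", "10420", "10420"),
  varname  = c("population", "median.age", "population", "median.age"),
  estimate = c(170669, 34.1, 703845, 40.3),
  moe      = c(NA, 0.2, NA, 0.2),
  stringsAsFactors = FALSE
)

# One row per GEOID; one estimate and one moe column per variable
df_wide <- pivot_wider(df_toy,
                       names_from  = varname,
                       values_from = c(estimate, moe))
names(df_wide)
```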

And we are ready to roll!

head(df_2019)

## # A tibble: 6 x 5
##   GEOID metro             state population median.age
##   <chr> <chr>             <chr>      <dbl>      <dbl>
## 1 10180 Abilene           TX        170669       34.1
## 2 12700 Barnstable Town   MA        213496       53.3
## 3 10380 Aguadilla-Isabela PR        301107       42.6
## 4 10420 Akron             OH        703845       40.3
## 5 12940 Baton Rouge       LA        854318       35.6
## 6 10500 Albany            GA        148436       36.9

Note: GEOID is the Census identifier of the geographic unit in each row. Here that is the metro area's own 5-digit code (its CBSA code), not a block, which makes it a handy key for merging with other datasets that use the same codes. Use it at your own discretion.