In this tutorial we will use R to first find a list of U.S. cities and then find the geolocations of these cities.
This requires two steps: (1) scraping a list of cities from a web page and (2) retrieving the geolocation of each city.
After doing a simple Google search, I found the following website where all the cities in each state are listed: https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068.
In order to scrape the information here, we will need the following packages:
library(rvest)
library(xml2)
library(stringr)
After we load the packages, the next step is to read the HTML page:
# Use the Britannica webpage to gather the list of cities in each U.S. state
url <- "https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068"
webpage <- read_html(url)
In the next step, we need to find where the information we are looking for is located. To do so, we will ‘inspect’ the website: while viewing the page in your browser, press Ctrl+Shift+I, or right-click and select ‘Inspect’. A developer panel will open on the right-hand side.
HTML documents have a nested structure. We need to find where the state and city text are located within this structure.
After some investigation, we find that each state has its own section element with a unique section id, and all the cities (or designated regions) in that state are listed inside it. The section ids run sequentially from ref326620 through ref326669.
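As a quick, optional sanity check (my own addition, not required for the rest of the tutorial), we can list the id attributes of all section elements on the page to confirm this pattern:
sections <- html_nodes(webpage, 'section')
# Peek at the first few section ids; they should follow the refNNNNNN pattern
head(html_attr(sections, 'id'))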
Let’s get the written text for Alabama only, using the section id ref326620.
alabama <-
  # 'section' is the element type; '#ref326620' selects the section id
  html_nodes(webpage, 'section#ref326620') %>%
  # The '.' prefix selects elements by class
  html_nodes('.md-crosslink') %>%
  # Extract the raw text as a character vector
  html_text()
alabama
## [1] "Alabama" "Alexander City" "Andalusia" "Anniston"
## [5] "Athens" "Atmore" "Auburn" "Bessemer"
## [9] "Birmingham" "Chickasaw" "Clanton" "Cullman"
## [13] "Decatur" "Demopolis" "Dothan" "Enterprise"
## [17] "Eufaula" "Florence" "Fort Payne" "Gadsden"
## [21] "Greenville" "Guntersville" "Huntsville" "Jasper"
## [25] "Marion" "Mobile" "Montgomery" "Opelika"
## [29] "Ozark" "Phenix City" "Prichard" "Scottsboro"
## [33] "Selma" "Sheffield" "Sylacauga" "Talladega"
## [37] "Troy" "Tuscaloosa" "Tuscumbia" "Tuskegee"
Note that the first entry is Alabama, the state name itself. If we wanted to convert this to a data frame where the first column lists the cities and the second column lists the state, we would do the following:
library(data.table)
s <- alabama[1]                 # the state name
p <- alabama[2:length(alabama)] # everything in alabama except the first element
tmp <- data.table(              # a data table
  city = p,                     # that binds the character vector of cities
  state = rep(s, length(p))     # with the state name repeated
                                # to match the length of the city vector
)
tmp
## city state
## 1: Alexander City Alabama
## 2: Andalusia Alabama
## 3: Anniston Alabama
## 4: Athens Alabama
## 5: Atmore Alabama
## 6: Auburn Alabama
## 7: Bessemer Alabama
## 8: Birmingham Alabama
## 9: Chickasaw Alabama
## 10: Clanton Alabama
## 11: Cullman Alabama
## 12: Decatur Alabama
## 13: Demopolis Alabama
## 14: Dothan Alabama
## 15: Enterprise Alabama
## 16: Eufaula Alabama
## 17: Florence Alabama
## 18: Fort Payne Alabama
## 19: Gadsden Alabama
## 20: Greenville Alabama
## 21: Guntersville Alabama
## 22: Huntsville Alabama
## 23: Jasper Alabama
## 24: Marion Alabama
## 25: Mobile Alabama
## 26: Montgomery Alabama
## 27: Opelika Alabama
## 28: Ozark Alabama
## 29: Phenix City Alabama
## 30: Prichard Alabama
## 31: Scottsboro Alabama
## 32: Selma Alabama
## 33: Sheffield Alabama
## 34: Sylacauga Alabama
## 35: Talladega Alabama
## 36: Troy Alabama
## 37: Tuscaloosa Alabama
## 38: Tuscumbia Alabama
## 39: Tuskegee Alabama
## city state
So, let’s do the same for all section ids from ref326620 through ref326669. Below is a loop that creates a list of data tables, one per state.
df <- list()
for (i in 1:50){
  # Section ids run from ref326620 (i = 1) to ref326669 (i = 50)
  places <- html_nodes(webpage, paste0('section#ref', i + 326619)) %>%
    # The '.' prefix selects elements by class
    html_nodes('.md-crosslink') %>%
    # Extract the raw text as a character vector
    html_text()
  s <- places[1]                # the state name
  p <- places[2:length(places)] # all the cities in that state
  df[[i]] <- data.table(
    city = p,                   # character vector of cities
    state = rep(s, length(p))   # the state name, repeated to match
  )
}
To row-bind this list of data tables into one large data table, I will use the rbindlist() function from the data.table package:
df <- rbindlist(df)
df #print df
## city state
## 1: Alexander City Alabama
## 2: Andalusia Alabama
## 3: Anniston Alabama
## 4: Athens Alabama
## 5: Atmore Alabama
## ---
## 1956: Sheridan Wyoming
## 1957: Ten Sleep Wyoming
## 1958: Thermopolis Wyoming
## 1959: Torrington Wyoming
## 1960: Worland Wyoming
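Before moving on, another optional sanity check (my own addition, not in the original output): counting the rows per state should show all 50 states, each with a plausible number of cities.
# data.table syntax: count the number of rows (.N) per state
df[, .N, by = state]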
Next, I will use the ggmap package to find the geolocation of each city in our list.
library(ggmap)
To use the Google API to retrieve this information, you need to get a Google API key. The documentation on how to get an API key can be found here: https://developers.google.com/maps/documentation/javascript/get-api-key.
Once you have your Google API key, you first need to register it and then use the mutate_geocode() function to retrieve the geolocation information.
register_google(key = "your_key_here") # replace with your own API key (as a string)
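If you plan to share your script, a safer pattern (an optional sketch, assuming you have stored the key in an environment variable named GOOGLE_API_KEY, for example in your ~/.Renviron file) is to avoid hard-coding the key:
# Read the API key from an environment variable instead of hard-coding it
register_google(key = Sys.getenv("GOOGLE_API_KEY"))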
It would make sense to use both the city and the state name as the search term, rather than the city name alone. So, let’s generate a new column for this. I will use the tidyverse (specifically dplyr) here. Please note that our df is a data.table, not a tibble, so we will need to convert it to a tibble first.
library(dplyr)
df <- as_tibble(df) %>%
mutate(search = paste(city,state))
df
## # A tibble: 1,960 x 3
## city state search
## <chr> <chr> <chr>
## 1 Alexander City Alabama Alexander City Alabama
## 2 Andalusia Alabama Andalusia Alabama
## 3 Anniston Alabama Anniston Alabama
## 4 Athens Alabama Athens Alabama
## 5 Atmore Alabama Atmore Alabama
## 6 Auburn Alabama Auburn Alabama
## 7 Bessemer Alabama Bessemer Alabama
## 8 Birmingham Alabama Birmingham Alabama
## 9 Chickasaw Alabama Chickasaw Alabama
## 10 Clanton Alabama Clanton Alabama
## # ... with 1,950 more rows
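As a side note, a variant you might try (my own suggestion, not tested here): putting a comma between the city and state can make the search term less ambiguous for the geocoder. The rest of this tutorial uses the space-separated version above.
# Hypothetical variant with a comma-separated search term
# df <- df %>% mutate(search = paste(city, state, sep = ", "))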
OK, so now we are ready to collect the geolocation information! Because we have a long list of cities, I will collect this information for the first 10 cities only in this tutorial. You can use the full data if you like, but it takes a while to run.
small_df <- df[1:10,] # only the first 10 rows of the data
geocode_df <- mutate_geocode(small_df, #small data
search) #search term column
geocode_df
## # A tibble: 10 x 5
## city state search lon lat
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Alexander City Alabama Alexander City Alabama -86.0 32.9
## 2 Andalusia Alabama Andalusia Alabama -86.5 31.3
## 3 Anniston Alabama Anniston Alabama -85.8 33.7
## 4 Athens Alabama Athens Alabama -87.0 34.8
## 5 Atmore Alabama Atmore Alabama -87.5 31.0
## 6 Auburn Alabama Auburn Alabama -85.5 32.6
## 7 Bessemer Alabama Bessemer Alabama -87.0 33.4
## 8 Birmingham Alabama Birmingham Alabama -86.8 33.5
## 9 Chickasaw Alabama Chickasaw Alabama -88.1 30.8
## 10 Clanton Alabama Clanton Alabama -86.6 32.8
So the mutate_geocode() function adds two columns to our data set: lon (for longitude) and lat (for latitude).
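One last practical note: geocoding queries take time (and count against your API quota), so it is worth caching the results. A minimal sketch (the file name is my own choice):
# Save the geocoded results so they can be reloaded without re-querying the API
write.csv(geocode_df, "city_geolocations.csv", row.names = FALSE)
# In a later session: geocode_df <- read.csv("city_geolocations.csv")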