Using the censusxy Package

Branson Fox and Christopher Prener, Ph.D.

2021-02-17

Overview

The censusxy package is designed to provide easy access to the U.S. Census Bureau Geocoding Tools in R.

Motivation

Few R packages offer free or reproducible geocoding. The Census Bureau Geocoding Tools, by contrast, allow unlimited free geocoding and offer a higher level of reproducibility than commercial geocoders. Because many geospatial workflows involve a large number of addresses, our core focus is on batch geocoding.

Responsible Use

The U.S. Census Bureau makes its geocoding API available without an API key, and this package allows for virtually unlimited batch geocoding. Please use this package responsibly, as others rely on this API for their research.

Installation

The easiest way to get censusxy is to install it from CRAN:

install.packages("censusxy")

The development version of censusxy can be accessed from GitHub with remotes:

# install.packages("remotes")
remotes::install_github("slu-openGIS/censusxy")

Workflow

Importing Data

If you plan on using the single address geocoding tools, your data do not need to be in any specific class. To use the batch geocoder, your data must be in a data.frame (or equivalent class). This package provides data on homicides in the City of St. Louis between 2008 and 2018 as example data.

data("stl_homicides")

Parsing Addresses

To use the batch API, your address data must be structured: your data must contain separate columns for street address, city, state, and ZIP code. You may find the postmastr package useful for this task. Only the street address is mandatory, but omitting the city, state, or ZIP code drastically lowers the speed and accuracy of the batch geocoder.
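
As a rough illustration of what "structured" means here, the following base R sketch splits one-line addresses into separate columns. It assumes every address follows the exact pattern "street, city, state zip"; the object names (raw, parts, addresses) are hypothetical, and real-world address data will usually need a dedicated parser such as postmastr.

```r
# Hypothetical input: one-line addresses in "street, city, state zip" form.
raw <- c("20 N Grand Blvd, St. Louis, MO 63103",
         "3700 Lindell Blvd, St. Louis, MO 63108")

# Split each address on commas and trim surrounding whitespace.
parts <- lapply(strsplit(raw, ",", fixed = TRUE), trimws)

# The third piece holds "STATE ZIP"; split it on whitespace.
state_zip <- strsplit(vapply(parts, `[`, character(1), 3), "\\s+")

# Assemble one column per address component.
addresses <- data.frame(
  street = vapply(parts, `[`, character(1), 1),
  city   = vapply(parts, `[`, character(1), 2),
  state  = vapply(state_zip, `[`, character(1), 1),
  zip    = vapply(state_zip, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
```

The resulting data.frame has one row per address and one column per component, which is the shape the batch geocoder expects.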

Pick a Suitable API

The Census Geocoder offers four primary functions: three for single address geocoding and one for batch geocoding. For interactive use cases, such as a Shiny application, the single line geocoder is recommended. For large quantities of addresses, the batch endpoint is preferable.

If your use case is locating coordinates within census geometries, only a single coordinate function is available for this task.

Pick a Return Type

If you are interested in census geometries (composed of FIPS codes for state, county, tract and block), you should specify ‘geographies’ in the return argument. This also necessitates the use of a vintage.

Consider the Benchmark and Vintage

Vintage is only important to consider when geocoding census geographies. It has no impact on geocoding coordinates (location). You can obtain a data.frame of valid benchmarks and vintages with their respective functions. For vintages, you must supply the name or ID of the benchmark you have chosen.

cxy_benchmarks()
cxy_vintages(benchmark = 'Public_AR_Census2010')

Parallelization (UNIX ONLY)

If you are on a UNIX platform (macOS, Linux, etc.), you may take advantage of multiple threads to greatly increase performance. This functionality is not currently supported on Windows. All you have to do is specify the number of cores you would like to use, and the function will automatically distribute the workload in the most efficient manner. The function will not allow you to specify more cores than are available; it will instead default to the maximum number of available cores.
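
A minimal sketch of choosing a core count before starting a batch is shown here. It uses parallel::detectCores() from base R; leaving one core free is only a suggestion, and the argument name parallel in the commented call is an assumption that should be checked against ?cxy_geocode.

```r
# Count the cores R can see; detectCores() may return NA on some systems.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Hypothetical choice: leave one core free for other work, but use at least one.
cores <- max(1L, available - 1L)

# Not run: pass the value to the batch geocoder (argument names are assumed).
# result <- censusxy::cxy_geocode(stl_homicides, street = "street_address",
#   city = "city", state = "state", zip = "postal_code", parallel = cores)
```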

Class and Output

When using the batch function, you may set class to “sf”, which returns the results as an sf object, allowing for quick preview or export of the spatial data. However, doing so returns only the addresses that the geocoder could successfully match. A message denoting how many rows were removed will print in the console.

You may also specify output as “simple” or “full”. Simple returns only coordinates (and a GEOID if return = "geographies"), which is suitable for most use cases. If you want all of the raw output from the geocoder, specify “full” instead.

Timeout

The function has a timeout argument, which specifies how many minutes may pass before the API query ends with an error. In this implementation, the timeout applies per 1000 addresses, not to the whole batch. The default is 30 minutes, which should be appropriate for most internet speeds.
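
Because the timeout applies per 1000 addresses, the longest possible wait grows with the batch size. The arithmetic below is a sketch for a hypothetical batch of 2500 addresses, assuming the chunks are processed one after another; it gives an upper bound, not a typical running time.

```r
n_addresses <- 2500   # hypothetical batch size
timeout_min <- 30     # default timeout, in minutes, per 1000 addresses

# Number of 1000-address chunks needed to cover the batch.
n_chunks <- ceiling(n_addresses / 1000)

# Upper bound on waiting time if every chunk ran right up to its timeout.
worst_case_min <- n_chunks * timeout_min
```

Here the batch needs 3 chunks, so the worst case is 90 minutes.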

If a batch times out, the function will terminate, and you will lose any geocoding progress.
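
One way to limit that loss is to geocode the data in pieces yourself and keep whatever succeeds. The helper below is a hypothetical sketch, not part of censusxy: geocode_in_chunks() and its arguments are names invented for illustration, and geocode_fun stands for any function that takes a data.frame and returns a data.frame.

```r
# Hypothetical helper: geocode `data` in pieces so that a failure in one
# piece does not discard the pieces that already succeeded.
geocode_in_chunks <- function(data, geocode_fun, chunk_size = 1000) {
  # Assign each row to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_len(nrow(data)) / chunk_size)
  chunks <- split(data, chunk_id)

  results <- lapply(chunks, function(chunk) {
    tryCatch(
      geocode_fun(chunk),
      error = function(e) {
        message("Chunk failed: ", conditionMessage(e))
        NULL  # keep going; failed chunks are dropped from the result
      }
    )
  })

  # Drop failed chunks and stack the rest into one data.frame.
  results <- Filter(Negate(is.null), unname(results))
  do.call(rbind, results)
}

# Not run: wrap the batch geocoder (argument names are assumed).
# result <- geocode_in_chunks(stl_homicides, function(chunk) {
#   censusxy::cxy_geocode(chunk, street = "street_address", city = "city",
#                         state = "state", zip = "postal_code")
# })
```

Rows from failed chunks are absent from the result, so comparing it against the input shows which addresses still need to be retried.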

Be aware that your computer may go to sleep during a long-running batch, which can cause the batch to never return. macOS users may find the app Amphetamine useful.

Usage

Geographies

If you would like to append census geographies, or have control of the benchmark in order to reproduce geocoding results, you will find it convenient to use the built-in functions for doing so. If you are not concerned about reproducibility or geographies, the functions will default to the latest benchmark, and you may ignore this section.

Get the current valid benchmarks; these are used to geocode and to show available vintages.

cxy_benchmarks()

Once you’ve selected a benchmark, and only if you intend to append geographies, you should choose a vintage based on the benchmark you selected (either by name or ID).

cxy_vintages('Public_AR_Census2010')
cxy_vintages(9) # Same as Above

Both of these should be supplied as arguments to your geocoding function.

Batch Geocoding

In this example, we will use the included stl_homicides data to show the full process for batch geocoding.

homicide_sf <- cxy_geocode(stl_homicides, street = "street_address", city = "city", state = "state", zip = "postal_code", class = "sf")

Note that with class = "sf", the function returns only matched addresses, including those approximated by street length. Any unmatched addresses are dropped from the output. Use class = "dataframe" to return all addresses, including those that are unmatched.
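
With class = "dataframe", matched and unmatched rows can be separated after the fact. The sketch below uses a small stand-in data.frame rather than a real geocoding result; the coordinate column names cxy_lon and cxy_lat are an assumption about the simple output, and the addresses and coordinate values are placeholders.

```r
# Stand-in for a result returned with class = "dataframe"; a real result
# would come from cxy_geocode(). Column names and values are placeholders.
result <- data.frame(
  street_address = c("1 Example St", "2 Example St"),
  cxy_lon = c(-90.2, NA),
  cxy_lat = c(38.6, NA),
  stringsAsFactors = FALSE
)

# Rows with a missing longitude were not matched by the geocoder.
matched   <- result[!is.na(result$cxy_lon), ]
unmatched <- result[is.na(result$cxy_lon), ]
```

Keeping the unmatched rows makes it possible to inspect or correct those addresses and try them again.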

Output returned as an sf object can be previewed with a package like mapview:

mapview::mapview(homicide_sf)

Single Address Tools

We’ll investigate a few other use cases, specifically those involving fewer or single addresses.

You would like to geocode a single structured address:

cxy_single('20 N Grand Blvd', 'St. Louis', 'MO', 63103)

You would like to geocode a single unstructured address and append census geographies:

cxy_oneline("3700 Lindell Blvd, St. Louis, MO 63108", return = 'geographies', vintage = 'Current_Current')

You would like to append census geographies to a given coordinate:

cxy_geography(-90.23324, 38.63593)

Iteration

For a handful of addresses, you may want to iterate using these functions. Two examples using base R are provided here.

addresses <- c("20 N Grand Blvd, St. Louis MO 63103", "3700 Lindell Blvd, St. Louis, MO 63108")

# With A For Loop
geocodes <- vector("list", length = length(addresses))
for (i in seq_along(addresses)){
  geocodes[[i]] <- cxy_oneline(addresses[i])
}

# With an Apply Function
geocodes <- lapply(addresses, cxy_oneline)

Getting Help