# Background

The study and conservation of the natural world relies on detailed information about the distributions, abundances, environmental associations, and population trends of species over time. For many taxa, this information is challenging to obtain at relevant geographic scales. The goal of the eBird Status and Trends project is to use data from eBird, the global community science bird monitoring program administered by The Cornell Lab of Ornithology, to generate a reliable, standardized source of biodiversity information for the world’s bird populations. To translate the eBird observations into robust data products, we use machine learning to fill spatiotemporal gaps, using local land cover descriptions derived from NASA MODIS and other remote sensing data, while controlling for biases inherent in species observations collected by community scientists.

This data set provides estimates of the full annual cycle distributions, abundances, and environmental associations for rscales::comma(nrow(ebirdst::ebirdst_runs)) species for the year 2020. For each species, distribution and abundance estimates are available for all 52 weeks of the year across a regular grid of locations that cover the globe at a resolution of 2.96 km X 2.96 km. Variation in detectability associated with the search effort is controlled by standardizing the estimates as the expected occurrence rate and count of the species on a 1 hour, 1 km checklist by an expert eBird observer at the optimal time of day for detecting the species.

To describe how each species is associated with features of its local environment, estimates of the relative importance of each remotely sensed variable (e.g. land cover, elevation, etc), are available throughout the year at a monthly temporal and regional spatial resolution. Additionally, to assess estimate quality, we provide upper and lower confidence bounds for abundance estimates and regional-seasonal scale validation metrics for the underlying statistical models. For more information about the data products see the FAQ and summaries. See Fink et al. (2019) for more information about the analysis used to generate these data.

# Data access

Data access is granted through an Access Request Form at: https://ebird.org/st/request. Access with this form generates a key to be used with this R package and is provided immediately (as long as commercial use is not requested). Our terms of use have been designed to be quite permissive in many cases, particularly academic and research use. When requesting data access, please be sure to carefully read the terms of use and ensure that your intended use is not restricted.

After completing the Access Request Form, you will be provided a Status and Trends access key, which you will need when downloading data. To store the key so the package can access it when downloading data, use the function set_ebirdst_access_key("XXXXX"), where "XXXXX" is the access key provided to you. Restart R after setting the access key.

Throughout the package vignettes, a simplified example dataset is used consisting of Yellow-bellied Sapsucker in Michigan. This dataset is designed to be small for faster download and is accessible without a key. The following will download the example dataset to you computer:

library(ebirdst)
library(raster)
library(sf)
library(dplyr)

path <- ebirdst_download(species = "example_data")

By default, ebirdst_download() downloads data to a sensible persistent directory on your computer. You can see what that directory is with the function ebirdst_data_dir(). You can change the default download location to a directory of your choosing by setting the environment variable EBIRDST_DATA_DIR, for example by calling usethis::edit_r_environ() and adding a line such as EBIRDST_DATA_DIR=/custom/download/directory/.

To download the data package for a species, provide the species code, English common name, or scientific name to ebirdst_download(). Note that data for all other species requires an API key to access. IMPORTANT: after downloading a data package, do not change the file structure or rename files. Doing so will prevent you from being able to work with the data in R.

# Species list

The full list of the species with data available for download is available in the data frame ebirdst_runs.

glimpse(ebirdst_runs)

which is included in this package. If you’re working in RStudio, you can use View() to interactively explore this data frame. You can also consult the Status and Trends landing page to see the full list of species.

All species go through a process of expert human review prior to being released. The ebirdst_runs data frame also contains information from this review process. Reviewers assess each of the four seasons: breeding, non-breeding, pre-breeding migration, and post-breeding migration. Resident (i.e., non-migratory) species are identified by having TRUE in the resident column of ebirdst_runs, and these species are assessed across the whole year rather than seasonally. ebirdst_runs contains two important pieces of information for each season: a quality rating and seasonal dates.

The seasonal dates define the weeks that fall within each season. Breeding and non-breeding season dates are defined for each species as the weeks during those seasons when the species’ population does not move. For this reason, these seasons are also described as stationary periods. Migration periods are defined as the periods of movement between the stationary non-breeding and breeding seasons. Note that for many species these migratory periods include not only movement from breeding grounds to non-breeding grounds, but also post-breeding dispersal, molt migration, and other movements.

Reviewers also examine the model estimates for each season to assess the amount of extrapolation or omission present in the model, and assign an associated quality rating ranging from 0 (lowest quality) to 3 (highest quality). Extrapolation refers to cases where the model predicts occurrence where the species is known to be absent, while omission refers to the model failing to predict occurrence where a species is known to be present.

A rating of 0 implies this season failed review and model results should not be used at all for this period. Ratings of 1-3 correspond to a gradient of more to less extrapolation and/or omission, and we often use a traffic light analogy when referring to them:

1. Red light: low quality, extensive extrapolation and/or omission and noise, but at least some regions have estimates that are accurate; can be used with caution in certain regions.
2. Yellow light: medium quality, some extrapolation and/or omission; use with caution.
3. Green light: high quality, very little or no extrapolation and/or omission; these seasons can be safely used.

# Data types and structure

eBird Status and Trends data packages are identified by the 6-letter eBird species code (e.g. yepsap for Yellow-bellied Sapsucker) and the Status and Trends estimation year (2020 for this version of the R package). They are stored within sub-directories that correspond to these variables, and you can get the path to this directory with:

# for non-example data use the species code or name instead of "example_data"
path <- get_species_path("example_data")

Within this data package directory, the following files and directories will be present:

• weekly/: a directory containing weekly estimates of occurrence, count, relative abundance, and percent of population on a regular grid in GeoTIFF format at three resolutions. See below for more details.
• seasonal/: a directory containing seasonal estimates of occurrence, count, relative abundance, and percent of population on a regular grid in GeoTIFF format at three resolutions. These are derived from the corresponding weekly raster data. Dates defining the boundary of each season are set on a species-specific basis by an expert reviewer familiar with the species. These dates are available in the ebirdst_runs data frame. Only seasons that passed the expert review process are included. See below for more details.
• ranges/: a directory containing GeoPackages storing range boundary polygons. See below for more details.
• stixel_summary.db: an SQLite database containing information on habitat associations, including predictor importance (PI) and partial dependence (PD) estimates.
• predictions.db: model predictions for a test dataset held out of the model fitting. These predictions are used for calculating predictive performance metrics (PPMs) using the ebirdst_ppms() function.
• config.json: run-specific parameters, mostly for internal use, but also containing useful parameters for mapping the abundance data. These parameters can be loaded with load_fac_map_parameters().

The vignette will cover the raster and range data.

## Weekly raster estimates

The core raster data products are the weekly estimates of occurrence, count, relative abundance, and percent of population. These are all stored in the widely used GeoTIFF raster format, and we refer to them as “weekly cubes” (e.g. the “weekly abundance cube”). All cubes have 52 weeks and cover the entire globe, even for species with ranges only covering a small region. They come with areas of predicted and assumed zeroes, such that any cells that are NA represent areas where we didn’t produce model estimates.

All estimates are the median expected value for a 1km, 1 hour eBird Traveling Count by an expert eBird observer at the optimal time of day and for optimal weather conditions to observe the given species.

• Occurrence occurrence: the expected probability of encountering a species.
• Count count: the expected count of a species, conditional on its occurrence at the given location.
• Relative abundance abundance: the expected relative abundance of a species, computed as the product of the probability of occurrence and the count conditional on occurrence. In addition to the median relative abundance, upper and lower confidence intervals (CIs) are provided, defined at the 10th and 90th quantile of relative abundance, respectively.
• Percent of population precent-population: the percent of the total relative abundance within each cell. This is a derived product calculated by dividing each cell value in the relative abundance raster by the sum of all cell values

All predictions are made on a standard 2.96 x 2.96 km global grid, however, for convenience lower resolution GeoTIFFs are also provided, which are typically much faster to work with. However, note that to keep file sizes small, the example dataset only contains low resolution data. The three resolutions are:

• High resolution (hr): the native 2.96 km resolution data
• Medium resolution (mr): the hr data aggregated by a factor of 3 in each direction resulting in a resolution of 8.89 km
• Low resolution (lr): the hr data aggregated by a factor of 9 in each direction resulting in a resolution of 26.7 km

The weekly cubes use the following naming convention:

weekly/<species_code>_<product>_<metric>_<resolution>_<year>.tif

where metric is typically median, except for the relative abundance CIs, which use lower and upper. The function load_raster() is used to load these data into R and takes arguments for product, metric and resolution. For example,

# weekly, low res, median occurrence
occ_lr <- load_raster(path, product = "occurrence", resolution = "lr")
occ_lr
# use parse_raster_dates() to get the date associated which each raster layer
parse_raster_dates(occ_lr)

# weekly, low res, abundance confidence intervals
abd_lower <- load_raster(path, product = "abundance", metric = "lower",
resolution = "lr")
abd_upper <- load_raster(path, product = "abundance", metric = "upper",
resolution = "lr")

The GeoTIFFs use the same Sinusoidal projection as NASA MODIS data. This projection is ideal for analysis, as it is an equal are projection, but is not ideal for mapping since it introduces significant distortion.

## Seasonal raster estimates

The seasonal raster estimates are provided for the same set of products and at the same three resolutions as the weekly estimates. They’re derived from the weekly data by taking the cell-wise mean or max across the weeks within each season. The seasonal boundary dates are defined through a process of expert review of each species, and are available in the data frame ebirdst_runs. Each season is also given a quality score from 0 (fail) to 3 (high quality), and seasons with a score of 0 are not provided.

The seasonal GeoTIFFs use the following naming convention:

seasonal/<species_code>_<product>_seasonal_<metric>_<resolution>_<year>.tif

where metric is either mean or max. The function load_raster(period = "seasonal") is used to load these data into R and takes arguments for product, metric and resolution. For example,

# seasonal, low res, mean relative abundance
abd_seasonal_mean <- load_raster(path, product = "abundance",
period = "seasonal", metric = "mean",
resolution = "lr")
# season that each layer corresponds to
names(abd_seasonal_mean)
# just the breeding season layer
abd_seasonal_mean[["breeding"]]

# seasonal, low res, max occurrence
occ_seasonal_max <- load_raster(path, product = "occurrence",
period = "seasonal", metric = "max",
resolution = "lr")

Finally, as a convenience, the data products include year-round rasters summarizing the mean or max across all weeks that fall within a season that passed the expert review process. These can be accessed similarly to the seasonal products, just with period = "full-year" instead. For example, these can layers be used in conservation planning to assess the most important sites across the full range and full annual cycle of a species.

# full year, low res, maximum relative abundance
abd_fy_max <- load_raster(path, product = "abundance",
period = "full-year", metric = "max",
resolution = "lr")

## Range boundaries

Seasonal range polygons are defined as the boundaries of non-zero seasonal relative abundance estimates, which are then (optionally) smoothed to produce more aesthetically pleasing polygons using the smoothr package. They are provided in the widely used GeoPackage format, with file naming convention:

ranges/<species_code>_range_seasonal_<raw/smooth>_<resolution>_<year>.gpkg

where raw refers to the polygons derived directly from the raster data and smooth refers to the smoothed polygons. Note that only low and medium resolution ranges are provided. These range polygons can be loaded with load_ranges():

# seasonal, low res, smoothed ranges
ranges <- load_ranges(path, resolution = "lr")
ranges

# subset to just the breeding season range using dplyr
range_breeding <- filter(ranges, season == "breeding")

## Habitat association and PPM data

The two SQLite database contained within the species data package, provide tabular data with information about modeled relationships between observations and the ecological covariates, as well as data that can be used to assess predictive performance. Note that these SQLite databases are quite large (many GBs in size) and are therefore note downloaded by default. To access these tabular data, you must use the parameter tifs_only = FALSE in the ebirdst_download()` function.

## References

Fink, D., T. Auer, A. Johnston, V. Ruiz‐Gutierrez, W.M. Hochachka, S. Kelling. 2019. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 00(00):e02056. https://doi.org/10.1002/eap.2056