Details of noaastormevents

Brooke Anderson and Ziyu Chen

2017-07-13

This vignette provides more details on how the noaastormevents package interacts with the online NOAA Storm Events database to pull storm event listings based on user queries.

Structure of NOAA Storm Events data

The NOAA Storm Events data is available online at https://www.ncdc.noaa.gov/stormevents/. That website includes documentation on the data, as well as a page that allows bulk download of yearly csv files through either ftp or http (https://www.ncdc.noaa.gov/stormevents/ftp.jsp). Data is available from January 1950 and tends to be updated to within a few months of the present.

Data is stored in bulk by year in compressed comma-separated files (.csv.gz files). Each year has three compressed files available: a “Details” file, a “Fatalities” file, and a “Locations” file.

File names for each file include both the year of the data (e.g., “1950”) and the date the file was last modified (e.g., “20170120”); apart from these two elements, file names follow a regular scheme. This regular naming allows the noaastormevents package to apply regular expressions to the listed file names to identify the exact name of the file for a specific year, as explained in the next section.
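This matching logic can be sketched in base R. The file names below are hypothetical stand-ins for the listing on the NOAA download page (only the 1999 “Details” name is taken from the real listing shown later in this vignette):

```r
# Hypothetical file listing, mimicking the naming scheme on the NOAA page:
# "_d<year>_" marks the data year, "_c<date>" the last-modified date
all_file_names <- c(
  "StormEvents_details-ftp_v1.0_d1998_c20170717.csv.gz",
  "StormEvents_details-ftp_v1.0_d1999_c20160223.csv.gz",
  "StormEvents_fatalities-ftp_v1.0_d1999_c20160223.csv.gz",
  "StormEvents_locations-ftp_v1.0_d1999_c20160223.csv.gz"
)

# Narrow first to the requested year, then to the requested file type
year_matches <- grep("_d1999_", all_file_names, value = TRUE)
grep("details", year_matches, value = TRUE)
## [1] "StormEvents_details-ftp_v1.0_d1999_c20160223.csv.gz"
```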

The size of all three file types has increased with time (see figure below; note that the y-axis is on a log-10 scale). The largest file for any given year is the “Details” file. Most file sizes increased substantially in 1996 (dotted vertical line), when the database dramatically expanded the types of events it included. Before 1996, the database covered tornadoes and, for some years, a few other types of events; from 1996, it expanded to include events like floods, tropical storms, and snow storms. While “Locations” files exist in the database for early years, they contain no information until 1996. See the documentation at the NOAA Storm Events database website for more information on the coverage of the database at different times across its history.

Downloading NOAA Storm Events data for a year

Because the data is stored in files separated by year, the file for an entire year is identified and downloaded whenever a user asks for event listings from any time span or event type within that year. For example, if a user wants to list flood events from the week of Hurricane Floyd in 1999, functions in the noaastormevents package first identify and download the full “Details” file for 1999 and then filter down to flood events starting in the correct week.

To identify the online file path for a specific year, the find_file_name function in the noaastormevents package uses the htmltab function (from the package of the same name) to create a dataframe listing all files available for download from the NOAA Storm Events database. The function then uses regular expressions to identify the file name in that listing for the requested year. For example, the name of the file with “Details” information for 1999 can be determined with:

find_file_name(year = "1999", file_type = "detail")
## [1] "StormEvents_details-ftp_v1.0_d1999_c20160223.csv.gz"

Here is the full definition of the find_file_name function:

find_file_name
## function (year = NULL, file_type = "details") 
## {
##     url <- paste0("http://www1.ncdc.noaa.gov/pub/data/swdi/", 
##         "stormevents/csvfiles/")
##     page <- htmltab::htmltab(doc = url, which = 1, rm_nodata_cols = FALSE)
##     all_file_names <- page$Name
##     file_year <- paste0("_d", year, "_")
##     file_name <- grep(file_type, grep(file_year, all_file_names, 
##         value = TRUE), value = TRUE)
##     if (length(file_name) == 0) {
##         stop("No file found for that year and / or file type.")
##     }
##     return(file_name)
## }
## <environment: namespace:noaastormevents>

Typically, this function will only be used internally rather than called directly by a user.

Once the file name has been determined, a function in the package downloads that file to the user’s computer. For some years the files are very large, so this download can take a little while. To avoid downloading data for the same year more than once within an R session, the downloading function stores the downloaded data in a temporary environment in the user’s R session. On later requests for the same year, the function first checks this temporary environment and only downloads the data from the online database if it is not already available on the user’s computer.

This environment is created to be temporary, which means that it is deleted at the end of the current R session. While some packages that access online databases cache downloaded data in a way that persists between R sessions, we chose to cache only within an R session and to delete all data when the session closes. This is because some of the Storm Events files are very large, and most users will likely only want to keep a small subset of the data for a given year (e.g., only flood events during the week of Hurricane Floyd). It would be wasteful of storage to cache all of the 1999 data indefinitely on the user’s computer in this case; instead, the user should use the package to create the desired subset of the data and then explicitly save that subset locally for use in future analyses.

The function for downloading the file for a year is called download_storm_data. Here is its full definition:

noaastormevents:::download_storm_data
## function (year, file_type = "details") 
## {
##     file_name <- find_file_name(year = year, file_type = file_type)
##     path_name <- paste0("https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/", 
##         "csvfiles/", file_name)
##     if (!exists("noaastormevents_package_env")) {
##         temp <- tempfile()
##         utils::download.file(path_name, temp)
##         noaastormevents_package_env <<- new.env()
##         noaastormevents_package_env$lst <- list()
##         noaastormevents_package_env$lst[[as.character(year)]] <- suppressWarnings(utils::read.csv(gzfile(temp), 
##             as.is = TRUE))
##         unlink(temp)
##     }
##     else if (is.null(noaastormevents_package_env$lst[[as.character(year)]])) {
##         temp <- tempfile()
##         utils::download.file(path_name, temp)
##         noaastormevents_package_env$lst[[as.character(year)]] <- suppressWarnings(utils::read.csv(gzfile(temp), 
##             as.is = TRUE))
##         unlink(temp)
##     }
##     return(NULL)
## }
## <environment: namespace:noaastormevents>

Finally, the noaastormevents package allows a user to query storm events either by a date range or by a named historical tropical storm, rather than by year. The create_storm_data function takes either a date range or a storm name, along with the requested file type, and downloads data for the appropriate year or years. If the user requests a date range, the function downloads the yearly data files for all years included in that range; if the user requests a tropical storm, the function pulls the data for that storm’s year. Here is the full definition of create_storm_data:

create_storm_data
## function (date_range = NULL, storm = NULL, file_type = "details") 
## {
##     if (!is.null(date_range)) {
##         date_range_years <- lubridate::year(date_range)
##         requested_years <- seq(from = date_range_years[1], to = date_range_years[2])
##         lapply(requested_years, download_storm_data)
##         for (i in 1:length(requested_years)) {
##             requested_year <- as.character(requested_years[i])
##             if (i == 1) {
##                 storm_data <- noaastormevents_package_env$lst[[requested_year]]
##             }
##             else {
##                 storm_data <- rbind(storm_data, noaastormevents_package_env$lst[[requested_year]])
##             }
##         }
##     }
##     else if (!is.null(storm)) {
##         storm_year <- as.numeric(gsub("[^0-9]", "", storm))
##         download_storm_data(year = storm_year, file_type = file_type)
##         storm_data <- noaastormevents_package_env$lst[[as.character(storm_year)]]
##     }
##     else {
##         stop("You must specify either `date_range` or `storm`.")
##     }
##     storm_data <- dplyr::tbl_df(storm_data)
##     return(storm_data)
## }
## <environment: namespace:noaastormevents>
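The date-range branch of this function can be illustrated with a small base R sketch, using made-up dates (base date functions stand in here for lubridate::year):

```r
# A user-requested date range spanning three calendar years
date_range <- c("1998-11-15", "2000-02-01")

# Extract the year from each endpoint of the range
date_range_years <- as.numeric(format(as.Date(date_range), "%Y"))

# Each year in the range requires its own yearly file download
seq(from = date_range_years[1], to = date_range_years[2])
## [1] 1998 1999 2000
```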

As a note, many of the functions in the noaastormevents package that allow linking events with tropical storms rely on historical data for the storms, including storm tracks, estimated distances to eastern U.S. counties, and dates when the storm was closest to each county. The package pulls this historical data from the hurricaneexposuredata package, through the interfacing package hurricaneexposure. The hurricane data covers 1988 through (currently) 2015 and includes all Atlantic-basin tropical storms that came within 250 km of at least one U.S. county. The following storms are included in that package and so are available for use with functions in noaastormevents:

Year Storms
1988 Alberto, Beryl, Chris, Florence, Gilbert, Keith
1989 Allison, Chantal, Hugo, Jerry
1990 Bertha, Marco
1991 Ana, Bob, Fabian, Notnamed
1992 Andrew, Danielle, Earl
1993 Arlene, Emily
1994 Alberto, Beryl, Gordon
1995 Allison, Dean, Erin, Gabrielle, Jerry, Opal
1996 Arthur, Bertha, Edouard, Fran, Josephine
1997 Subtrop, Ana, Danny
1998 Bonnie, Charley, Earl, Frances, Georges, Hermine, Mitch
1999 Bret, Dennis, Floyd, Harvey, Irene
2000 Beryl, Gordon, Helene, Leslie
2001 Allison, Barry, Gabrielle, Michelle
2002 Arthur, Bertha, Cristobal, Edouard, Fay, Gustav, Hanna, Isidore, Kyle, Lili
2003 Bill, Claudette, Erika, Grace, Henri, Isabel
2004 Alex, Bonnie, Charley, Frances, Gaston, Hermine, Ivan, Jeanne, Matthew
2005 Arlene, Cindy, Dennis, Emily, Katrina, Ophelia, Rita, Tammy, Wilma
2006 Alberto, Beryl, Chris, Ernesto
2007 Andrea, Barry, Erin, Gabrielle, Humberto, Noel
2008 Cristobal, Dolly, Edouard, Fay, Gustav, Hanna, Ike, Kyle, Paloma
2009 Claudette, Ida
2010 Alex, Bonnie, Earl, Hermine, Nicole, Paula
2011 Bret, Don, Emily, Irene, Lee
2012 Alberto, Beryl, Debby, Isaac, Sandy
2013 Andrea, Dorian, Karen
2014 Arthur
2015 Ana, Bill, Claudette
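In the storm branch of create_storm_data, the storm’s year is recovered from the storm identifier, which pairs a name from the table above with a year (e.g., “Floyd-1999”), by stripping everything that is not a digit:

```r
# Storm identifiers combine the storm's name and year, e.g. "Floyd-1999"
storm <- "Floyd-1999"

# Remove all non-digit characters, leaving just the year
storm_year <- as.numeric(gsub("[^0-9]", "", storm))
storm_year
## [1] 1999
```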

Structure of “Details” data files

While the noaastormevents package focuses on higher-level functions, which return a simplified and cleaned version of this storm events data, a user can use the create_storm_data function to pull the full dataset for a year into R and work with the raw, uncleaned version. For example, here is a call that pulls the raw data for 2015 into an R object called events_2015.

events_2015 <- create_storm_data(date_range = c("2015-01-01", "2015-12-31"))
slice(events_2015, 1:3)
## # A tibble: 3 x 51
##   BEGIN_YEARMONTH BEGIN_DAY BEGIN_TIME END_YEARMONTH END_DAY END_TIME
##             <int>     <int>      <int>         <int>   <int>    <int>
## 1          201501        27       1200        201501      28      400
## 2          201501        24        700        201501      24     2100
## 3          201501        27        600        201501      27     1200
## # ... with 45 more variables: EPISODE_ID <int>, EVENT_ID <int>,
## #   STATE <chr>, STATE_FIPS <int>, YEAR <int>, MONTH_NAME <chr>,
## #   EVENT_TYPE <chr>, CZ_TYPE <chr>, CZ_FIPS <int>, CZ_NAME <chr>,
## #   WFO <chr>, BEGIN_DATE_TIME <chr>, CZ_TIMEZONE <chr>,
## #   END_DATE_TIME <chr>, INJURIES_DIRECT <int>, INJURIES_INDIRECT <int>,
## #   DEATHS_DIRECT <int>, DEATHS_INDIRECT <int>, DAMAGE_PROPERTY <chr>,
## #   DAMAGE_CROPS <chr>, SOURCE <chr>, MAGNITUDE <dbl>,
## #   MAGNITUDE_TYPE <chr>, FLOOD_CAUSE <chr>, CATEGORY <int>,
## #   TOR_F_SCALE <chr>, TOR_LENGTH <dbl>, TOR_WIDTH <int>,
## #   TOR_OTHER_WFO <chr>, TOR_OTHER_CZ_STATE <chr>,
## #   TOR_OTHER_CZ_FIPS <int>, TOR_OTHER_CZ_NAME <chr>, BEGIN_RANGE <int>,
## #   BEGIN_AZIMUTH <chr>, BEGIN_LOCATION <chr>, END_RANGE <int>,
## #   END_AZIMUTH <chr>, END_LOCATION <chr>, BEGIN_LAT <dbl>,
## #   BEGIN_LON <dbl>, END_LAT <dbl>, END_LON <dbl>,
## #   EPISODE_NARRATIVE <chr>, EVENT_NARRATIVE <chr>, DATA_SOURCE <chr>

This raw data has 51 columns, including identifiers for each event and episode, the event’s timing, location, and type, measures of magnitude and impacts (injuries, deaths, and damage), and narrative descriptions of the episode and event.

Event types

The following sections provide some summary statistics for data from this database for a single year (2015), to help users better understand the available data. Users may want to conduct similar data analysis themselves with the set of data they pull from the NOAA Storm Events database relevant to a particular research project. The code from this vignette (available at the package’s GitHub repository) can serve as a starting point for that.

In the 2015 event listings, here are the types of events and the number of reported events for each:

Event type Number of events in 2015
Thunderstorm Wind 14,400
Hail 9,398
Flash Flood 5,063
Winter Weather 4,242
Winter Storm 3,537
Flood 2,587
Heavy Snow 2,557
High Wind 2,143
Marine Thunderstorm Wind 1,880
Heavy Rain 1,518
Drought 1,397
Tornado 1,319
Cold/Wind Chill 925
Extreme Cold/Wind Chill 853
Dense Fog 819
Strong Wind 711
High Surf 561
Lightning 403
Heat 397
Funnel Cloud 362
Blizzard 348
Frost/Freeze 345
Ice Storm 327
Excessive Heat 325
Wildfire 284
Coastal Flood 262
Waterspout 240
Sleet 95
Astronomical Low Tide 80
Rip Current 68
Lake-Effect Snow 63
Debris Flow 54
Tropical Storm 49
Dust Storm 41
Avalanche 27
Marine High Wind 24
Marine Hail 19
Dust Devil 10
Freezing Fog 10
Marine Strong Wind 9
Hurricane 7
Seiche 6
Storm Surge/Tide 4
Tropical Depression 4
Dense Smoke 2
Marine Dense Fog 2
Sneakerwave 1
Tsunami 1
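A tabulation like the one above can be computed directly from the EVENT_TYPE column of the raw “Details” data. The tiny data frame below is a made-up stand-in for the full events_2015 object:

```r
# Toy stand-in for the EVENT_TYPE column of the raw yearly data
events <- data.frame(
  EVENT_TYPE = c("Hail", "Flash Flood", "Hail", "Tornado", "Hail"),
  stringsAsFactors = FALSE
)

# Count events of each type, most common first
event_counts <- sort(table(events$EVENT_TYPE), decreasing = TRUE)
event_counts
```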

Here are how the start dates for listings for each event type are distributed over the year (event types are ordered by decreasing total count during the year; note that the y-axes vary depending on the range of events by date for each event type):

Many event types are clearly seasonal (e.g., winter weather, winter storms, heavy snow, cold, extreme cold, blizzards, ice storms, lake-effect snow, and avalanches are all much more common during winter months, while tropical depressions and tropical storms are limited to the hurricane season). For some events, however, the reported seasonal pattern may reflect not only the true pattern of events but also the timing of important exposures and impacts. For example, rip currents have many more listings during spring and summer, which may be because events are more likely to be listed when more people are swimming. Frost event listings are particularly high at the start and end of the frost season, rather than in the middle of winter, perhaps because the impacts of frost on crops are higher in spring and fall than during winter. When working with these data, it is important to keep in mind that they are based on reporting, so influences on the probability that an event is reported and included can introduce patterns that would not appear in, for example, data from a weather station.
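Monthly distributions like these can be derived from the BEGIN_YEARMONTH column, which encodes year and month as a single integer (e.g., 201501 for January 2015); the values below are made up for illustration:

```r
# BEGIN_YEARMONTH stores the year and month together, e.g. 201501
begin_yearmonth <- c(201501, 201501, 201507, 201512)

# Taking the value modulo 100 recovers the month number
event_month <- begin_yearmonth %% 100
table(event_month)
```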

Episodes versus events

“Episodes” seem to collect related “events”: events within an episode can vary in type or location, but all belong to the same large weather system. The following graph shows, for each episode listed in 2015, the number of events listed for the episode (x-axis) and the span (in days) of the range of begin dates across events in the episode.
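The two per-episode quantities in that graph can be sketched in base R, with a toy data frame standing in for events_2015:

```r
# Toy events: episode 1 has three events over three days, episode 2 has two
events <- data.frame(
  EPISODE_ID = c(1, 1, 1, 2, 2),
  begin_date = as.Date(c("2015-01-27", "2015-01-28", "2015-01-29",
                         "2015-06-03", "2015-06-03"))
)

# Number of events in each episode
n_events <- tapply(events$begin_date, events$EPISODE_ID, length)

# Span (in days) of begin dates within each episode
span_days <- tapply(events$begin_date, events$EPISODE_ID,
                    function(x) as.numeric(max(x) - min(x)))

data.frame(EPISODE_ID = names(n_events),
           n_events = as.vector(n_events),
           span_days = as.vector(span_days))
```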

An episode will never include events in more than one state, so a large weather system could potentially be described by multiple episodes in different states:

events_2015 %>% 
  select(EPISODE_ID, STATE) %>% 
  group_by(EPISODE_ID) %>% 
  summarize(n_states = length(unique(STATE))) %>% 
  ungroup() %>% 
  summarize(max_n_states = max(n_states))
## # A tibble: 1 x 1
##   max_n_states
##          <dbl>
## 1            1

Here are maps of the beginning locations of events in the episodes with the most events in 2015. Note that the beginning latitude and longitude are not listed for every event, which is why one of the episodes has no points on its map. From the other maps, it is clear that events within an episode were fairly close together.

For these episodes with the most events in 2015, the following graph shows the number of events reported for the episode. One of the episodes was a winter storm and another was heavy rains and floods, while the rest included high winds, hail, tornadoes, rain, and / or flooding.

After removing event types with fewer than 50 listings in 2015, we performed a cluster analysis of event types to group events that are more likely to occur together within an episode. The following plot shows the resulting cluster structure of these event types.

The next graph shows the number of events of each event type (excluding event types with fewer than 50 total listings in 2015); each row represents an episode.

How events are reported

The SOURCE column of the raw data gives information on how each event was reported.

Source of event report Number of events reported in 2015
Trained Spotter 9,904
Public 6,965
Law Enforcement 5,317
Emergency Manager 5,236
Mesonet 3,378
COOP Observer 2,952
911 Call Center 2,439
Broadcast Media 2,132
ASOS 1,918
Department of Highways 1,837
Amateur Radio 1,422
Social Media 1,356
AWOS 1,311
Official NWS Observations 1,100
State Official 1,065
NWS Storm Survey 1,057
Drought Monitor 1,034
CoCoRaHS 961
NWS Employee 839
Newspaper 715
Fire Department/Rescue 681
River/Stream Gage 634
Storm Chaser 528
Other Federal Agency 494
C-MAN Station 360
County Official 295
RAWS 282
SNOTEL 248
SHAVE Project 241
Park/Forest Service 229
Utility Company 171
Buoy 149
WLON 134
Unknown 93
AWSS 72
Local Official 58
Post Office 48
Lifeguard 38
Coast Guard 28
Airplane Pilot 24
Mariner 23
Insurance Company 7
Tribal Official 4

The majority of events in this database, at least for 2015, were reported by either a trained spotter or the public.

The following graph shows, for each type of event in 2015, the percent reported by each source. For some types of events, reporting is dominated by a specific source. For example, most high surf reports come from trained spotters, while most drought reports come from drought monitors and most tornado reports come from the NWS Storm Survey. For other types of events, reporting sources are more diversified. Both axes of the plot are ordered by overall frequency (i.e., overall number of each type of event and overall number of reports from each source).
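The percentages in that plot can be sketched with prop.table on a cross-tabulation of event type and source; the toy data frame below stands in for events_2015:

```r
# Toy stand-in for the EVENT_TYPE and SOURCE columns
events <- data.frame(
  EVENT_TYPE = c("Drought", "Drought", "Drought", "Tornado"),
  SOURCE = c("Drought Monitor", "Drought Monitor", "Public",
             "NWS Storm Survey")
)

# Cross-tabulate, then convert each row (event type) to percentages
counts <- table(events$EVENT_TYPE, events$SOURCE)
round(100 * prop.table(counts, margin = 1), 1)
```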

Event locations

Each event has a state listed (the STATE column). The following graph gives the number of reported events in each state for 2015: