Working with Historical Climate Data

Dewey Dunnington

2018-01-09

Fetching data from Environment Canada’s historical climate data archive has always been a bit of a chore. In the old days, it was necessary to download data one click at a time from the organization’s search page. Bulk downloading hourly data required a lot of clicks, with a good chance of making a mistake and having to start all over again. The workflow for which this package is designed is as follows:

  1. Find climate stations using ec_climate_search_locations() or ec_climate_geosearch_locations().
  2. Download the data using ec_climate_data() or ec_climate_mudata().

Finding climate stations

The Environment Canada historical climate network is made up of more than 8,000 stations with data from the mid-1800s to today. You can have a look at all of them using data(ec_climate_locations_all), or use the package’s built-in search via ec_climate_search_locations().
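
For a quick look at the full station table (a minimal sketch using base R’s head() on the packaged data frame mentioned above):

data(ec_climate_locations_all)
head(ec_climate_locations_all)

You can search using the station name,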

ec_climate_search_locations("gatineau")
#> Search results for ec_climate_search_locations(
#>   query = "gatineau"
#> ) 
#> [1] GATINEAU QC 5590           GATINEAU A QC 8375        
#> [3] OTTAWA GATINEAU A QC 50719 OTTAWA GATINEAU A QC 53001

…using the station identifier,

ec_climate_search_locations(5590)
#> Search results for ec_climate_search_locations(
#>   query = 5590
#> ) 
#> [1] GATINEAU QC 5590

…using a human-readable location with ec_climate_geosearch_locations():

ec_climate_geosearch_locations("gatineau QC", limit = 5)
#> Search results for ec_climate_geosearch_locations(
#>   query = "gatineau QC"
#>   limit = 5
#> ) 
#> [1] OTTAWA CITY HALL ON 4334 / 3.5 km     
#> [2] OTTAWA LA SALLE ACAD ON 4339 / 3.5 km 
#> [3] OTTAWA LEMIEUX ISLAND ON 4340 / 4.2 km
#> [4] OTTAWA U OF O ON 4346 / 5.3 km        
#> [5] OTTAWA STOLPORT A ON 7684 / 5.9 km

…or using a longitude/latitude pair (note the order lon, lat).

ec_climate_search_locations(c(-75.72327, 45.45724), limit = 5)
#> Search results for ec_climate_search_locations(
#>   query = c(-75.72327, 45.45724)
#>   limit = 5
#> ) 
#> [1] OTTAWA CITY HALL ON 4334 / 3.5 km     
#> [2] OTTAWA LA SALLE ACAD ON 4339 / 3.5 km 
#> [3] OTTAWA LEMIEUX ISLAND ON 4340 / 4.2 km
#> [4] OTTAWA U OF O ON 4346 / 5.3 km        
#> [5] OTTAWA STOLPORT A ON 7684 / 5.9 km

Typically, you will also want to look for stations that contain data for a given year at a given resolution. For this, use the year and timeframe arguments:

ec_climate_geosearch_locations(
  "gatineau QC",
  year = 2014:2016,
  timeframe = "daily",
  limit = 5
)
#> Search results for ec_climate_geosearch_locations(
#>   query = "gatineau QC"
#>   timeframe = "daily"
#>   year = 2014:2016
#>   limit = 5
#> ) 
#> [1] CHELSEA QC 5585 / 8.3 km (daily 1927-2017)        
#> [2] OTTAWA CDA ON 4333 / 8.6 km (daily 1889-2017)     
#> [3] OTTAWA CDA RCS ON 30578 / 8.6 km (daily 2000-2018)
#> [4] OTTAWA INTL A ON 49568 / 15.8 km (daily 2011-2018)
#> [5] ANGERS QC 5574 / 17.0 km (daily 1962-2017)

If you would like results as a data frame, you can use as.data.frame() or tibble::as_tibble() to get all available information about the result. For information about each column, see ?ec_climate_locations_all.

ec_climate_geosearch_locations(
  "gatineau QC",
  year = 2014:2016,
  timeframe = "daily",
  limit = 5
) %>%
  tibble::as_tibble()
#> # A tibble: 5 x 20
#>                  location longitude latitude     timezone_id
#>                     <chr>     <dbl>    <dbl>           <chr>
#> 1         CHELSEA QC 5585    -75.78    45.52 America/Toronto
#> 2      OTTAWA CDA ON 4333    -75.72    45.38 America/Toronto
#> 3 OTTAWA CDA RCS ON 30578    -75.72    45.38 America/Toronto
#> 4  OTTAWA INTL A ON 49568    -75.67    45.32 America/Toronto
#> 5          ANGERS QC 5574    -75.55    45.55 America/Toronto
#> # ... with 16 more variables: lst_utc_offset <dbl>, station_id <int>,
#> #   name <chr>, province <chr>, climate_id <chr>, wmo_id <int>,
#> #   tc_id <chr>, elevation_m <dbl>, first_year <int>, last_year <int>,
#> #   hly_first_year <int>, hly_last_year <int>, dly_first_year <int>,
#> #   dly_last_year <int>, mly_first_year <int>, mly_last_year <int>

Downloading data

The ec_climate_data() function is the primary method to download and read climate data. This function takes some liberties with the original data and makes some assumptions about what is useful output, including parsing values as numerics and transforming column names to use lowercase and underscores. As an example, I’ll use the station for Chelsea, QC, because I like the ice cream there.

The ec_climate_data() function can accept location identifiers in two ways: as the integer station ID, or as (an unambiguous abbreviation of) the location name; I suggest using the full name of the location to avoid typing the wrong station ID by accident. You will also need a start and end date (these can be actual Dates or strings in the form "YYYY-MM-DD") for daily and hourly requests.

# find the station ID (CHELSEA QC 5585)
ec_climate_search_locations("chelsea", timeframe = "daily", year = 2015)
#> Search results for ec_climate_search_locations(
#>   query = "chelsea"
#>   timeframe = "daily"
#>   year = 2015
#> ) 
#> [1] CHELSEA QC 5585 (daily 1927-2017)

# load the data
ec_climate_data(
  "CHELSEA QC 5585", timeframe = "daily", 
  start = "2015-01-01", end = "2015-12-31"
)
#> # A tibble: 365 x 29
#>             dataset        location  year month   day       date
#>               <chr>           <chr> <int> <int> <int>     <date>
#>  1 ec_climate_daily CHELSEA QC 5585  2015     1     1 2015-01-01
#>  2 ec_climate_daily CHELSEA QC 5585  2015     1     2 2015-01-02
#>  3 ec_climate_daily CHELSEA QC 5585  2015     1     3 2015-01-03
#>  4 ec_climate_daily CHELSEA QC 5585  2015     1     4 2015-01-04
#>  5 ec_climate_daily CHELSEA QC 5585  2015     1     5 2015-01-05
#>  6 ec_climate_daily CHELSEA QC 5585  2015     1     6 2015-01-06
#>  7 ec_climate_daily CHELSEA QC 5585  2015     1     7 2015-01-07
#>  8 ec_climate_daily CHELSEA QC 5585  2015     1     8 2015-01-08
#>  9 ec_climate_daily CHELSEA QC 5585  2015     1     9 2015-01-09
#> 10 ec_climate_daily CHELSEA QC 5585  2015     1    10 2015-01-10
#> # ... with 355 more rows, and 23 more variables: data_quality <chr>,
#> #   max_temp_c <dbl>, max_temp_flag <chr>, min_temp_c <dbl>,
#> #   min_temp_flag <chr>, mean_temp_c <dbl>, mean_temp_flag <chr>,
#> #   heat_deg_days_c <dbl>, heat_deg_days_flag <chr>,
#> #   cool_deg_days_c <dbl>, cool_deg_days_flag <chr>, total_rain_mm <dbl>,
#> #   total_rain_flag <chr>, total_snow_cm <dbl>, total_snow_flag <chr>,
#> #   total_precip_mm <dbl>, total_precip_flag <chr>, snow_on_grnd_cm <dbl>,
#> #   snow_on_grnd_flag <chr>, dir_of_max_gust_10s_deg <dbl>,
#> #   dir_of_max_gust_flag <chr>, spd_of_max_gust_km_h <dbl>,
#> #   spd_of_max_gust_flag <chr>

The package can also produce the data in parameter-long form so that you can easily visualize it using ggplot2. To “gather” the value and flag columns into long form, use ec_climate_long().

library(ggplot2)
df <- ec_climate_data(
  "CHELSEA QC 5585", timeframe = "daily", 
  start = "2015-01-01", end = "2015-12-31"
) %>%
  ec_climate_long()
  
ggplot(df, aes(date, value)) + 
  geom_line() + 
  facet_wrap(~param, scales = "free_y") +
  scale_x_date(date_labels = "%b")

The function can accept a vector for the location parameter, which it uses to combine data from multiple locations. How do Chelsea, QC and Kentville, NS stack up during the month of November 2015?

df <- ec_climate_data(
  c("CHELSEA QC 5585", "KENTVILLE CDA CS NS 27141"), 
  timeframe = "daily", 
  start = "2015-11-01", "2015-11-30"
) %>%
  ec_climate_long()
#> Downloading 2 files (use quiet = FALSE for details)

ggplot(df, aes(date, value, col = location)) + 
  geom_line() + 
  facet_wrap(~param, scales = "free_y") +
  scale_x_date(date_labels = "%d")

This function can download a whole lot of data, so it’s worth doing a little math before you overwhelm your computer with data that it can’t load into memory. As an example, I tested this function by downloading daily data for every station in Nova Scotia between 1900 and 2016, which took 2 hours and resulted in a 1.3 gigabyte data frame. If you’re trying to do something at this scale, have a look at ec_climate_data_base() to extract data from each file without loading the whole thing into memory.
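
Before making a request at that scale, it is worth estimating the number of rows you are about to load. As a rough sketch (the station count here is a hypothetical round number, not the actual Nova Scotia total):

# daily data means roughly one row per station per day
n_stations <- 300  # hypothetical
n_years <- 2016 - 1900 + 1
n_stations * n_years * 365
#> [1] 12811500

At the 29 columns of the daily output, a result like this easily reaches the gigabyte scale described above.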

Dates and times

The worst thing about historical climate data from Environment Canada is that the dates and times of hourly data are reported in local standard time, which makes comparing hourly data from one location to another error-prone. Because the hourly output from Environment Canada is confusing (in my opinion), the hourly output from ec_climate_data() includes both the UTC time and the local time, in addition to the EC “local standard time”. The local time will disagree with LST during daylight saving time, but the moment in time represented by both date_time_* columns is correct. To see these times in another timezone, use lubridate::with_tz() to change the tzone attribute. If you insist on using “local standard time”, you can construct a version from date + time_lst, but you may have to pretend that LST is UTC (I haven’t found an easy way to use a UTC offset as a timezone in R).

library(dplyr)

ec_climate_data(
  "KENTVILLE CDA CS NS 27141", timeframe = "hourly", 
  start = "1999-07-01", end = "1999-07-31"
) %>%
  select(date, time_lst, date_time_utc, date_time_local)
#> # A tibble: 744 x 4
#>          date time_lst       date_time_utc     date_time_local
#>        <date>   <time>              <dttm>              <dttm>
#>  1 1999-07-01 00:00:00 1999-07-01 04:00:00 1999-07-01 01:00:00
#>  2 1999-07-01 01:00:00 1999-07-01 05:00:00 1999-07-01 02:00:00
#>  3 1999-07-01 02:00:00 1999-07-01 06:00:00 1999-07-01 03:00:00
#>  4 1999-07-01 03:00:00 1999-07-01 07:00:00 1999-07-01 04:00:00
#>  5 1999-07-01 04:00:00 1999-07-01 08:00:00 1999-07-01 05:00:00
#>  6 1999-07-01 05:00:00 1999-07-01 09:00:00 1999-07-01 06:00:00
#>  7 1999-07-01 06:00:00 1999-07-01 10:00:00 1999-07-01 07:00:00
#>  8 1999-07-01 07:00:00 1999-07-01 11:00:00 1999-07-01 08:00:00
#>  9 1999-07-01 08:00:00 1999-07-01 12:00:00 1999-07-01 09:00:00
#> 10 1999-07-01 09:00:00 1999-07-01 13:00:00 1999-07-01 10:00:00
#> # ... with 734 more rows
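
As a sketch of the two approaches mentioned above, the following converts the UTC column to Halifax time (Kentville’s timezone) and reconstructs a “local standard time” datetime by pretending LST is UTC. The column names date_time_halifax and date_time_lst are illustrative choices, and the last step assumes time_lst is stored as an hms/difftime, as the printout above suggests.

library(lubridate)

ec_climate_data(
  "KENTVILLE CDA CS NS 27141", timeframe = "hourly", 
  start = "1999-07-01", end = "1999-07-31"
) %>%
  mutate(
    # the same instant, displayed in Halifax time (only the tzone attribute changes)
    date_time_halifax = with_tz(date_time_utc, "America/Halifax"),
    # "local standard time" as a datetime, pretending LST is UTC
    date_time_lst = as_datetime(date) + time_lst
  ) %>%
  select(date_time_utc, date_time_halifax, date_time_lst)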

Parsing problems

Not all values in Environment Canada historical climate CSVs are numeric. Occasionally, values in the form “>30” or “<30” appear, particularly in the wind speed columns. Several such values appear in the November 2015 output from the Kentville, NS station.

df <- ec_climate_data(
  "KENTVILLE CDA CS NS 27141", timeframe = "daily",
  start = "2015-11-01", end = "2015-11-30"
)
#> Warning in ec_climate_data("KENTVILLE CDA CS NS 27141", timeframe =
#> "daily", : One or more parsing error(s) occurred. See problems(...) or pass
#> value_parser = readr::parse_character to diagnose.

To have a look at the values that did not parse correctly, use problems() to extract the list of values:

problems(df)
#> # tibble [11 x 4]
#>      row   col expected actual
#>    <int> <int>    <chr>  <chr>
#>  1     3    28 a double    <31
#>  2     4    28 a double    <31
#>  3     5    28 a double    <31
#>  4     6    28 a double    <31
#>  5    10    28 a double    <31
#>  6    12    28 a double    <31
#>  7    18    28 a double    <31
#>  8    19    28 a double    <31
#>  9    25    28 a double    <31
#> 10    28    28 a double    <31
#> 11    30    28 a double    <31

The format for this output is from the readr package, and it may take a little sleuthing to figure out that the 28th column is, in fact, the spd_of_max_gust_km_h column:
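
colnames(df)[28]
#> [1] "spd_of_max_gust_km_h"

If these values are important to your analysis, you can circumvent the numeric parsing step using the value_parser argument.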

ec_climate_data(
  "KENTVILLE CDA CS NS 27141", timeframe = "daily",
  start = "2015-11-01", end = "2015-11-30",
  value_parser = readr::parse_character
) %>%
  select(date, spd_of_max_gust_km_h, spd_of_max_gust_flag)
#> # A tibble: 30 x 3
#>          date spd_of_max_gust_km_h spd_of_max_gust_flag
#>        <date>                <chr>                <chr>
#>  1 2015-11-01                   35                 <NA>
#>  2 2015-11-02                   43                 <NA>
#>  3 2015-11-03                  <31                 <NA>
#>  4 2015-11-04                  <31                 <NA>
#>  5 2015-11-05                  <31                 <NA>
#>  6 2015-11-06                  <31                 <NA>
#>  7 2015-11-07                   50                 <NA>
#>  8 2015-11-08                   54                 <NA>
#>  9 2015-11-09                   46                 <NA>
#> 10 2015-11-10                  <31                 <NA>
#> # ... with 20 more rows

Flag information

Almost every column in the Environment Canada historical climate CSV output has a paired “flag” column, containing cryptic letters such as “M” or “E” that presumably have something to say about the value. The legend for these is included in the raw CSV output, and is attached to the parsed output of ec_climate_data(), ec_climate_data_base(), and ec_climate_long() as the attribute attr(, "flag_info"). For the Chelsea, QC example, accessing the flag information would look like this:

df <- ec_climate_data(
  "CHELSEA QC 5585", timeframe="daily", 
  start = "2015-01-01", end = "2015-12-31"
)

attr(df, "flag_info")
#> # A tibble: 14 x 2
#>       flag
#>      <chr>
#>  1       A
#>  2       C
#>  3       E
#>  4       F
#>  5       L
#>  6       M
#>  7       N
#>  8       S
#>  9       T
#> 10       Y
#> 11 [empty]
#> 12       ^
#> 13       †
#> 14       ‡
#> # ... with 1 more variables: description <chr>

This table is designed to be left_join()-able to the output of ec_climate_long().
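
For example, to attach the flag descriptions to the long-form values (a minimal sketch, assuming the long output names its flag column flag so that the join key matches the flag_info table):

ec_climate_long(df) %>%
  left_join(attr(df, "flag_info"), by = "flag")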

Getting raw output

For huge climate data searches, it is probably not advisable to use ec_climate_data(), since this function loads all climate data into memory. It is possible to download one file at a time using ec_climate_data_base(), which one could do in a loop to avoid excessive memory use. In my year or so of using this package, I haven’t met a request too big to handle using ec_climate_data(), but I have received several emails from people attempting requests at that scale.
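
One way to approximate this chunked approach without ec_climate_data_base() (see ?ec_climate_data_base for its exact interface) is to loop over locations with ec_climate_data() and reduce each result before combining. This is a sketch only; the station list, date range, and yearly mean are placeholders for whatever reduction your analysis needs.

library(dplyr)

# hypothetical station list; substitute the stations you actually need
stations <- c("CHELSEA QC 5585", "KENTVILLE CDA CS NS 27141")

yearly_means <- stations %>%
  lapply(function(loc) {
    # download one location at a time and summarise it immediately,
    # so that only the reduced result is kept in memory
    ec_climate_data(
      loc, timeframe = "daily",
      start = "1990-01-01", end = "2016-12-31"
    ) %>%
      group_by(location, year) %>%
      summarise(mean_temp_c = mean(mean_temp_c, na.rm = TRUE))
  }) %>%
  bind_rows()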

The file cache

If you call ec_climate_data() to load climate data, you will notice that a folder called ec.cache has appeared in your working directory, containing the files that were downloaded from the Environment Canada site. You can disable this by passing cache = NULL, but I don’t suggest it: the cache speeds up re-running your code should you make a mistake the first time (not to mention sparing Environment Canada’s servers). To use a different folder, pass the path as the cache argument, or use set_default_cache() to set it for the whole session.
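
For example (the directory name here is just a placeholder):

# use a custom cache directory for a single call...
ec_climate_data(
  "CHELSEA QC 5585", timeframe = "daily", 
  start = "2015-01-01", end = "2015-12-31",
  cache = "my_ec_cache"
)

# ...or set it for the rest of the session
set_default_cache("my_ec_cache")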

Using with mudata2

The rclimateca package can also output data in mudata format, which includes both location data and climate data in an easily plottable object.

library(mudata2)
md <- ec_climate_mudata(
  "CHELSEA QC 5585", timeframe = "daily", 
  start = "2015-01-01", end = "2015-12-31"
)

autoplot(md) +
  scale_x_date(date_labels = "%b")
#> Using x = "date", y = "value"

Useful resources

The Environment Canada historical climate data homepage is an excellent place to get additional information about what Environment Canada collects and why. In addition, the data glossary is a great place to get information about the specific parameters and flag values that occur within the dataset. The search historic data page may be useful to find the station/data you are looking for, and the bulk data documentation may be useful to more technical users making large requests.