mlbgameday: Database Connections

Kris Eberwein

2018-04-30

Relational databases provide an easy way to work with data that are larger than memory. Gameday data can grow rather large, so the mlbgameday package processes the data in “chunks” to alleviate memory overhead. Despite this technique, large payload requests can be highly system intensive. The authors of mlbgameday suggest loading no more than a single season per R session.

Connection Types

The package uses database connections from the DBI package. The package documentation uses SQLite for examples, but DBI provides connection methods for most popular databases. Note, we will need to download the appropriate package to use the various connections. The various packages include:

Creating a Gameday Database

The easiest way to get going is to create an empty database and call the get_paylaod() function with the new database’s connection in the db_conn argument. There is no need to create a database schema before-hand. The data are returned as integer, double, or character data types, and most databases will recognize that. An example using SQLite would look like this:

library(mlbgameday)
library(RSQLite)
library(DBI)

# Create SQLite database.

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "gameday.sqlite3")

Now that the empty database is created in our working directory, we can populate it. For the sake of brevity, we will search for only the games played in the 2016 World Series.

# Create a list of game_ids for the 2016 World Series.

game_ids <- search_gids(start = "2016-10-01", end = "2016-12-01", game_type = "w")

# Gather data and write to database.

get_payload(game_ids = game_ids, dataset = "inning_all", db_con = con)

# Re-connect to database since the get_payload() function closed the connection.

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "gameday.sqlite3")

# Take a look at our tables.

dbListTables(con)

Working with the Database

There are several methods for working with data in-database. Most commonly, if the data are larger than memory, we would want to subset or parse it before bringing it into our R environment. The most straight-forward way to do this is to use our database connection to perform a standard SQL query. The query below creates a data frame containing all the pitches thrown by Andrew Miller in the 2016 World Series.

# Bring data from database into memory via standard SQL.
# Query a data frame of all pitches thrown by Andrew Miller in the 2016 World Series.

millerTime <- dbGetQuery(con,
                "SELECT * FROM pitch p
                LEFT JOIN atbat a ON a.num = p.num
                AND a.url = p.url
                WHERE a.pitcher_name = 'Andrew Miller'")

If we prefer not to work with SQL, the dbplyr package provides SQL backends that allow us to work with standard dplyr verbs. The data aren’t pulled from the database until we call the collect() function, therefore no memory is wasted by processing the data. For example, if we wanted to see all of Kyle Schwarber’s at-bats during the 2016 World Series, we could do the following:

library(dbplyr)
library(dplyr)

atbat <- tbl(con, "atbat")
pitch <- tbl(con, "pitch")

# Notice pitch and atbat are not data frames, rather they are database connections.

summary(atbat)

query <- pitch %>% left_join(atbat, by = c("num", "url")) %>% 
    filter(batter_name == "Kyle Schwarber")
# Construct a query using dplyr verbs. Notice here that the "query" still hasn't loaded any data.

query <- pitch %>% left_join(atbat, by = c("num", "url")) %>% 
    filter(batter_name == "Kyle Schwarber")

# Once we have the subset we want, we can pull that into the R enviornment as a data frame.

schwarber <- collect(query)