bigQueryR

Mark Edmondson

2017-10-23

Introduction

This is a package for interating with BigQuery from within R.

You may want instead to use bigrquery which is more developed with integration with dplyr etc. Some functions from bigrquery are used in this package.

Why this package then?

This package is here as it uses googleAuthR as backend, so has Shiny support, and compatibility with other googleAuthR dependent packages.

It also has support for data extracts to Google Cloud Storage, meaning you can download data and make the download URL available to a user via their Google email. If you do a query normally with over 100000 results it hangs and errors.

An example of a BigQuery Shiny app running OAuth2 is here, the BigQuery Visualiser

Authentication

Authentication is as used in other googleAuthR libraries:

library(bigQueryR)

## this will open your browser
## Authenticate with an email that has access to the BigQuery project you need
bqr_auth()

## verify under a new user
bqr_auth(new_user=TRUE)

If you are authenticating under several APIs via googleAuthRthen use gar_auth() instead with the appropriate scopes set.

You can also use service-to-service JSON files and multi-user authentication under Shiny, see the googleAuthR readme for details.

Listing BigQuery meta data

Various functions for listing what is in your BigQuery account.

library(bigQueryR)
  
## this will open your browser
## Authenticate with an email that has access to the BigQuery project you need
bqr_auth()
  
## verify under a new user
bqr_auth(new_user=TRUE)
  
## get projects
projects <- bqr_list_projects()
  
my_project <- projects[1]
  
## for first project, get datasets
datasets <- bqr_list_datasets[my_project]

my_dataset <- datasets[1]
## list tables
my_table <- bqr_list_tables(my_project, my_dataset)

## get metadata for table
meta_table <- bqr_table_meta(my_project, my_dataset, my_table)

Simple Queries

You can pass in queries that have results under ~ 100000 rows using this command:

bqr_query("big-query-r","samples",
          "SELECT COUNT(repository.url) FROM [publicdata:samples.github_nested]")

More than that, and the API starts to hang and you are limited by your download bandwidth.

Asynchronous Queries

For bigger queries, asynchronous queries save the results to another BigQuery table. You can check the progress of the job via bqr_get_job

library(bigQueryR)

## Auth with a project that has at least BigQuery and Google Cloud Storage scope
bqr_auth()

## make a big query
job <- bqr_query_asynch("your_project", 
                        "your_dataset",
                        "SELECT * FROM blah LIMIT 9999999", 
                        destinationTableId = "bigResultTable")
                        
## poll the job to check its status
## its done when job$status$state == "DONE"
bqr_get_job("your_project", job$jobReference$jobId)

##once done, the query results are in "bigResultTable"

You may now want to download this data. For large datasets, this is best done via extracting the BigQuery result to Google Cloud Storage, then downloading the data from there.

You can create a bucket at Google Cloud Storage see https://cloud.google.com/storage/docs/cloud-console, or you can use library(googleCloudStorageR)

Once created, you can extract your data via the below:

## Create the data extract from BigQuery to Cloud Storage
job_extract <- bqr_extract_data("your_project",
                                "your_dataset",
                                "bigResultTable",
                                "your_cloud_storage_bucket_name")
                                
## poll the extract job to check its status
## its done when job$status$state == "DONE"
bqr_get_job("your_project", job_extract$jobReference$jobId)

## to download via a URL and not logging in via Google Cloud Storage interface:
## Use an email that is Google account enabled
## Requires scopes:
##  https://www.googleapis.com/auth/devstorage.full_control
##  https://www.googleapis.com/auth/cloud-platform
## set via options("bigQueryR.scopes") and reauthenticate if needed

download_url <- bqr_grant_extract_access(job_extract, "your@email.com")

## download_url may be multiple if the data is > 1GB
> [1] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000000.csv"
> [2] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000001.csv"
> [3] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000002.csv"