explore

Roland Krasser

2019-08-27

The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!

There are three ways to use the package: interactive data exploration, automated reports, and manual exploration with individual functions.

explore package on Github: https://github.com/rolkra/explore

As the explore functions fit well into the tidyverse, we load the dplyr package as well.

library(dplyr)
library(explore)

Interactive data exploration

Explore your dataset (in this case the iris dataset) in one line of code:

explore(iris)

This launches a Shiny app in which you can inspect individual variables, explore their relation to a binary target, grow a decision tree, or create a fully automated report of all variables with a few mouse clicks.

You can choose any variable containing 0/1, FALSE/TRUE or “no”/“yes” as a target. As the iris dataset doesn’t contain a binary target, we create one:

iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% explore()
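As a minimal base R sketch of the three target encodings mentioned above, the same flag can be built as 0/1, FALSE/TRUE or "no"/"yes". The variable names below are illustrative only and are not part of the explore package.

```r
# Three equivalent encodings of the same binary flag (names are hypothetical)
is_versicolor_01  <- ifelse(iris$Species == "versicolor", 1, 0)        # 0/1
is_versicolor_lgl <- iris$Species == "versicolor"                      # FALSE/TRUE
is_versicolor_chr <- ifelse(iris$Species == "versicolor", "yes", "no") # "no"/"yes"

# Each encoding marks the same 50 of 150 rows
table(is_versicolor_01)
table(is_versicolor_lgl)
table(is_versicolor_chr)
```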

Report variables

Create a rich HTML report of all variables with one line of code:

# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())

Alternatively, add a target to the report. Here we use a binary target, but a categorical or numerical target works as well.

# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% 
  report(output_file = "report.html", 
         output_dir = tempdir(),
         target = is_versicolor)

If you use a binary target, the parameter split = FALSE will give you a different view of the data.

Grow a decision tree

Grow a decision tree with one line of code:

iris %>% select(-Species) %>% explain_tree(target = is_versicolor)

You can control the growth of the tree using the parameters maxdepth, minsplit and cp.
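To illustrate what these three parameters do, here is a sketch that calls rpart directly. It rests on an assumption: maxdepth, minsplit and cp are taken to behave like the parameters of the same names in rpart::rpart.control, which this document does not confirm. The values 2, 20 and 0.01 are arbitrary.

```r
library(rpart)

# Build the binary target as in the text above, then drop Species
d <- iris
d$is_versicolor <- factor(ifelse(d$Species == "versicolor", 1, 0))
d$Species <- NULL

# maxdepth: maximum depth of any node (root = depth 0)
# minsplit: minimum number of observations in a node to attempt a split
# cp:       minimum improvement in fit required to keep a split
fit <- rpart(
  is_versicolor ~ .,
  data = d,
  method = "class",
  control = rpart.control(maxdepth = 2, minsplit = 20, cp = 0.01)
)

# Node numbers encode the depth: depth = floor(log2(node number))
node_depth <- floor(log2(as.integer(rownames(fit$frame))))
max(node_depth)
```

A smaller maxdepth or a larger cp gives a shallower tree that is easier to read, at the cost of a coarser fit.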

Explore dataset

Explore your table with one line of code to see which type of variables it contains.

iris %>% explore_tbl()

You can also use describe_tbl() if you just need the main facts without visualisation.

iris %>% describe_tbl()
#> 150 observations with 6 variables
#> 0 variables containing missings (NA)
#> 0 variables with no variance

Explore variables

Explore a variable with one line of code. You don’t have to care if a variable is numerical or categorical.

iris %>% explore(Species)

iris %>% explore(Sepal.Length)

Explore variables with a target

Explore a variable and its relationship with a binary target with one line of code. You don’t have to care if a variable is numerical or categorical.

iris %>% explore(Sepal.Length, target = is_versicolor)

Using split = FALSE will change the plot to %target:

iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
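To make the %target view concrete, the sketch below computes the share of target = 1 per interval of Sepal.Length in base R. The bin edges are an assumption chosen for illustration; the binning that explore applies internally may differ.

```r
# Build the binary target as in the text above
d <- iris
d$is_versicolor <- ifelse(d$Species == "versicolor", 1, 0)

# Hypothetical bins of width 1 covering the range of Sepal.Length
bins <- cut(d$Sepal.Length, breaks = seq(4, 8, by = 1), include.lowest = TRUE)

# %target = mean of the 0/1 target within each bin, as a percentage
pct_target <- tapply(d$is_versicolor, bins, mean) * 100
round(pct_target, 1)
```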

The target can have more than two levels:

iris %>% explore(Sepal.Length, target = Species)

Or the target can even be numeric:

iris %>% explore(Sepal.Length, target = Petal.Length)

Explore multiple variables

iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  explore_all()

iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor)

iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor, split = FALSE)

iris %>% 
  select(Sepal.Length, Sepal.Width, Species) %>% 
  explore_all(target = Species)

iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length) %>% 
  explore_all(target = Petal.Length)

Explore correlation between two variables

Explore correlation between two variables with one line of code:

iris %>% explore(Sepal.Length, Petal.Length)

You can add a target too:

iris %>% explore(Sepal.Length, Petal.Length, target = is_versicolor)
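If you want a single number to go with the plot, base R can compute the Pearson correlation of the same two variables. This is a complementary sketch and not an output of explore.

```r
# Pearson correlation between the two variables plotted above
r <- cor(iris$Sepal.Length, iris$Petal.Length)
round(r, 2)
```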

Other options

If you use explore to explore a variable and want to set lower and upper limits for values, you can use the min_val and max_val parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.

iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
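The capping rule described above can be sketched in base R with pmax and pmin. This only mirrors the stated behaviour; it is not the package's own implementation.

```r
x <- iris$Sepal.Length

# Values below 4.5 are raised to 4.5, values above 7 are lowered to 7
clamped <- pmin(pmax(x, 4.5), 7)

range(x)        # original limits
range(clamped)  # limits after capping
```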

explore uses auto-scale by default. To deactivate it, use the parameter auto_scale = FALSE:

iris %>% explore(Sepal.Length, auto_scale = FALSE)

Describing data

Describe your data in one line of code:

iris %>% describe()
#>        variable type na na_pct unique min mean max
#> 1  Sepal.Length  dbl  0      0     35 4.3 5.84 7.9
#> 2   Sepal.Width  dbl  0      0     23 2.0 3.06 4.4
#> 3  Petal.Length  dbl  0      0     43 1.0 3.76 6.9
#> 4   Petal.Width  dbl  0      0     22 0.1 1.20 2.5
#> 5       Species  fct  0      0      3  NA   NA  NA
#> 6 is_versicolor  dbl  0      0      2 0.0 0.33 1.0

The result is a data-frame, where each row is a variable of your data. You can use filter from dplyr for quick checks:

# show all variables that contain less than 5 unique values
iris %>% describe() %>% filter(unique < 5)
#>        variable type na na_pct unique min mean max
#> 1       Species  fct  0      0      3  NA   NA  NA
#> 2 is_versicolor  dbl  0      0      2   0 0.33   1
# show all variables that contain NA values
iris %>% describe() %>% filter(na > 0)
#> [1] variable type     na       na_pct   unique   min      mean     max     
#> <0 rows> (or 0-length row.names)
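For intuition, a similar per-variable overview can be sketched in base R. The column names na and unique follow the output shown above, but the way they are computed here is an assumption and may differ from describe() in detail (for example in how NA is counted among unique values).

```r
# Build the data as in the text above
d <- iris
d$is_versicolor <- ifelse(d$Species == "versicolor", 1, 0)

# One row per variable: number of NA values and number of distinct values
overview <- data.frame(
  variable = names(d),
  na       = vapply(d, function(x) sum(is.na(x)), integer(1)),
  unique   = vapply(d, function(x) length(unique(x)), integer(1)),
  row.names = NULL
)

# Same quick check as above: variables with fewer than 5 unique values
overview[overview$unique < 5, ]
```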

You can use describe to describe a single variable too. You don’t need to care whether a variable is numerical or categorical. The output is plain text.

# describe a categorical variable
iris %>% describe(Species)
#> variable = Species 
#> type     = factor
#> na       = 0 of 150 (0%)
#> unique   = 3
#>  setosa     = 50 (33.3%)
#>  versicolor = 50 (33.3%)
#>  virginica  = 50 (33.3%)
# describe a numerical variable
iris %>% describe(Sepal.Length)
#> variable = Sepal.Length 
#> type     = double 
#> na       = 0 of 150 (0%)
#> unique   = 35
#> min|max  = 4.3 | 7.9
#> q05|q95  = 4.6 | 7.3
#> q25|q75  = 5.1 | 6.4
#> median   = 5.8 
#> mean     = 5.8

Data Dictionary

Create a data dictionary of a dataset (Markdown file data_dict.md):

iris %>% data_dict_md(output_dir = tempdir())

Add a title and detailed descriptions, and change the default filename:

description <- data.frame(
                  variable = c("Species"), 
                  description = c("Species of Iris flower"))
data_dict_md(iris, 
             title = "iris flower data set", 
             description =  description, 
             output_file = "data_dict_iris.md",
             output_dir = tempdir())

Basic data cleaning

To clean a variable you can use clean_var. With one line of code you can rename a variable, replace NA-values and set a minimum and maximum for the value.

iris %>% 
  clean_var(Sepal.Length, 
            min_val = 4.5, 
            max_val = 7.0, 
            na = 5.8, 
            name = "sepal_length") %>% 
  describe()
#>        variable type na na_pct unique min mean max
#> 1  sepal_length  dbl  0      0     26 4.5 5.81 7.0
#> 2   Sepal.Width  dbl  0      0     23 2.0 3.06 4.4
#> 3  Petal.Length  dbl  0      0     43 1.0 3.76 6.9
#> 4   Petal.Width  dbl  0      0     22 0.1 1.20 2.5
#> 5       Species  fct  0      0      3  NA   NA  NA
#> 6 is_versicolor  dbl  0      0      2 0.0 0.33 1.0

Connecting to a datawarehouse

The explore package comes with a set of easy-to-remember functions to connect to a datawarehouse (dwh) and to read from and write to it using odbc.

# connect to a dwh (odbc DSN must be defined)
dwh <- dwh_connect("DWH_DSN")

# if you need to pass user and password
dwh <- dwh_connect("DWH_DSN", 
                    user = "myuser",
                    pwd = rstudioapi::askForPassword()
                  )

# read table from a dwh
data <- dwh_read_table(dwh, "db.tablename")

# read data from a dwh using sql
data <- dwh_read_data(dwh, sql = "select * from db.tablename")

# disconnect from dwh
dwh_disconnect(dwh)

To write large data to a dwh you can use dwh_fastload(). It connects to a dwh, writes the data and disconnects.

# write data to a dwh (odbc DSN must be defined)
data %>% dwh_fastload("DWH_DSN", "db.tablename")