Using Skimr

Elin Waring

2018-01-09

The skimr package is designed to provide summary statistics about variables. In base R the most similar functions are summary() for vectors and data frames and fivenum() for numeric vectors. Skimr is opinionated in its defaults but easy to modify.

For comparison, here are examples of these base R functions.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
summary(iris$Sepal.Length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   5.100   5.800   5.843   6.400   7.900
fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9
summary(iris$Species)
##     setosa versicolor  virginica 
##         50         50         50

The skim function

The core function of skimr is skim(). skim() is an S3 generic function; the skimr package includes support for data frames and grouped data frames. Like summary() for data frames, skim() presents results for all of the columns, and the statistics depend on the class of the variable.

However, unlike summary.data.frame(), the printed results (those displayed in the console or in a knitted markdown file) are shown horizontally, with one row per variable, and split into a separate tibble for each class of variable. The actual results are stored in a skim_df object that is also a tibble. In summary.data.frame() the statistics are stored in a table with one column for each variable, and the standard table printing is used to display the results.

library(skimr)
skim(iris)
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: factor 
##  variable missing complete   n n_unique                       top_counts
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0
##  ordered
##    FALSE
## 
## Variable type: numeric 
##      variable missing complete   n mean   sd  p0 p25 median p75 p100
##  Petal.Length       0      150 150 3.76 1.77 1   1.6   4.35 5.1  6.9
##   Petal.Width       0      150 150 1.2  0.76 0.1 0.3   1.3  1.8  2.5
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1   5.8  6.4  7.9
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8   3    3.3  4.4
##      hist
##  ▇▁▁▂▅▅▃▁
##  ▇▁▁▅▃▃▂▂
##  ▂▇▅▇▆▅▂▂
##  ▁▂▅▇▃▂▁▁

This distinction is important because the skim_df object is easy to use for additional manipulation if desired and is pipeable. For example, all of the results for a particular statistic or for one variable could be selected, or an alternative printing method could be developed.

The skim_df object always contains 6 columns:

s <- skim(iris)
head(s, 15)
## # A tibble: 15 x 6
##    variable     type    stat     level   value formatted
##    <chr>        <chr>   <chr>    <chr>   <dbl> <chr>    
##  1 Sepal.Length numeric missing  .all    0     0        
##  2 Sepal.Length numeric complete .all  150     150      
##  3 Sepal.Length numeric n        .all  150     150      
##  4 Sepal.Length numeric mean     .all    5.84  5.84     
##  5 Sepal.Length numeric sd       .all    0.828 0.83     
##  6 Sepal.Length numeric p0       .all    4.30  4.3      
##  7 Sepal.Length numeric p25      .all    5.10  5.1      
##  8 Sepal.Length numeric median   .all    5.80  5.8      
##  9 Sepal.Length numeric p75      .all    6.40  6.4      
## 10 Sepal.Length numeric p100     .all    7.90  7.9      
## 11 Sepal.Length numeric hist     .all   NA     ▂▇▅▇▆▅▂▂ 
## 12 Sepal.Width  numeric missing  .all    0     0        
## 13 Sepal.Width  numeric complete .all  150     150      
## 14 Sepal.Width  numeric n        .all  150     150      
## 15 Sepal.Width  numeric mean     .all    3.06  3.06
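
Because the long skim_df shown above is a tibble, ordinary dplyr verbs can be applied to it. The following is a minimal sketch, assuming the skimr 1.x interface described in this vignette (the column names stat and variable are taken from the output above) and that dplyr is installed; the object names iris_means and sepal_length_stats are illustrative only.

```r
library(skimr)
library(dplyr)

# Keep only the rows that hold the mean of each numeric variable.
iris_means <- skim(iris) %>%
  dplyr::filter(stat == "mean")

# Keep every statistic that was computed for a single variable.
sepal_length_stats <- skim(iris) %>%
  dplyr::filter(variable == "Sepal.Length")

iris_means
sepal_length_stats
```

Filtering on the long format avoids parsing the printed display, and the numeric results remain available in the value column.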

skim() also supports grouped data. For grouped data one additional column for each grouping variable is added to the skim object.

mtcars %>%
  dplyr::group_by(gear) %>%
  skim()
## Skim summary statistics
##  n obs: 32 
##  n variables: 11 
##  group variables: gear 
## 
## Variable type: numeric 
##  gear variable missing complete  n   mean     sd     p0    p25 median
##     3       am       0       15 15   0      0      0      0      0   
##     3     carb       0       15 15   2.67   1.18   1      2      3   
##     3      cyl       0       15 15   7.47   1.19   4      8      8   
##     3     disp       0       15 15 326.3   94.85 120.1  275.8  318   
##     3     drat       0       15 15   3.13   0.27   2.76   3.04   3.08
##     3       hp       0       15 15 176.13  47.69  97    150    180   
##     3      mpg       0       15 15  16.11   3.37  10.4   14.5   15.5 
##     3     qsec       0       15 15  17.69   1.35  15.41  17.04  17.42
##     3       vs       0       15 15   0.2    0.41   0      0      0   
##     3       wt       0       15 15   3.89   0.83   2.46   3.45   3.73
##     4       am       0       12 12   0.67   0.49   0      0      1   
##     4     carb       0       12 12   2.33   1.3    1      1      2   
##     4      cyl       0       12 12   4.67   0.98   4      4      4   
##     4     disp       0       12 12 123.02  38.91  71.1   78.92 130.9 
##     4     drat       0       12 12   4.04   0.31   3.69   3.9    3.92
##     4       hp       0       12 12  89.5   25.89  52     65.75  94   
##     4      mpg       0       12 12  24.53   5.28  17.8   21     22.8 
##     4     qsec       0       12 12  18.96   1.61  16.46  18.46  18.75
##     4       vs       0       12 12   0.83   0.39   0      1      1   
##     4       wt       0       12 12   2.62   0.63   1.61   2.13   2.7 
##     5       am       0        5  5   1      0      1      1      1   
##     5     carb       0        5  5   4.4    2.61   2      2      4   
##     5      cyl       0        5  5   6      2      4      4      6   
##     5     disp       0        5  5 202.48 115.49  95.1  120.3  145   
##     5     drat       0        5  5   3.92   0.39   3.54   3.62   3.77
##     5       hp       0        5  5 195.6  102.83  91    113    175   
##     5      mpg       0        5  5  21.38   6.66  15     15.8   19.7 
##     5     qsec       0        5  5  15.64   1.13  14.5   14.6   15.5 
##     5       vs       0        5  5   0.2    0.45   0      0      0   
##     5       wt       0        5  5   2.63   0.82   1.51   2.14   2.77
##     p75   p100     hist
##    0      0    ▁▁▁▇▁▁▁▁
##    4      4    ▅▁▆▁▁▅▁▇
##    8      8    ▁▁▁▁▁▁▁▇
##  380    472    ▂▁▂▇▃▆▂▆
##    3.18   3.73 ▃▃▇▆▁▁▁▃
##  210    245    ▅▁▃▁▇▂▂▅
##   18.4   21.5  ▃▁▃▇▃▃▂▃
##   17.99  20.22 ▃▁▆▇▆▁▂▃
##    0      1    ▇▁▁▁▁▁▁▂
##    3.96   5.42 ▁▁▇▅▁▁▁▃
##    1      1    ▃▁▁▁▁▁▁▇
##    4      4    ▇▁▇▁▁▁▁▇
##    6      6    ▇▁▁▁▁▁▁▃
##  160    167.6  ▇▁▁▂▂▂▂▇
##    4.09   4.93 ▁▇▃▁▁▁▁▁
##  110    123    ▂▇▁▁▃▁▆▃
##   28.08  33.9  ▅▇▅▂▂▁▂▅
##   19.58  22.9  ▃▁▇▆▃▁▁▂
##    1      1    ▂▁▁▁▁▁▁▇
##    3.16   3.44 ▇▃▃▃▃▇▇▇
##    1      1    ▁▁▁▇▁▁▁▁
##    6      8    ▇▁▃▁▁▃▁▃
##    8      8    ▇▁▁▃▁▁▁▇
##  301    351    ▇▃▁▁▁▁▃▃
##    4.22   4.43 ▇▁▃▁▁▁▃▃
##  264    335    ▇▁▃▁▁▃▁▃
##   26     30.4  ▇▁▃▁▁▃▁▃
##   16.7   16.9  ▇▁▁▃▁▁▁▇
##    0      1    ▇▁▁▁▁▁▁▂
##    3.17   3.57 ▇▁▇▁▇▁▇▇

Individual columns from a data frame may be selected using tidyverse-style selectors.

skim(iris, Sepal.Length, Species)
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: factor 
##  variable missing complete   n n_unique                       top_counts
##   Species       0      150 150        3 set: 50, ver: 50, vir: 50, NA: 0
##  ordered
##    FALSE
## 
## Variable type: numeric 
##      variable missing complete   n mean   sd  p0 p25 median p75 p100
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1    5.8 6.4  7.9
##      hist
##  ▂▇▅▇▆▅▂▂
skim(iris, starts_with("Sepal"))
## Skim summary statistics
##  n obs: 150 
##  n variables: 5 
## 
## Variable type: numeric 
##      variable missing complete   n mean   sd  p0 p25 median p75 p100
##  Sepal.Length       0      150 150 5.84 0.83 4.3 5.1    5.8 6.4  7.9
##   Sepal.Width       0      150 150 3.06 0.44 2   2.8    3   3.3  4.4
##      hist
##  ▂▇▅▇▆▅▂▂
##  ▁▂▅▇▃▂▁▁

If an individual column is of an unsupported class, it is treated as a character variable, with a warning.

Alternatives to skim

The skim() function for a data frame returns a long, six-column data frame. This long data frame is printed horizontally as a separate summary for each data type found in the data frame, but the object itself is not transformed during the print.

Three other functions are available that may prove useful as part of skim workflows.

The skim_tee() function produces the same printed version as skim() but returns the unmodified data frame. This allows for continued piping of the original data.
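
As a small sketch of this behaviour, assuming the skimr 1.x interface described here and that dplyr is installed (the object name setosa is illustrative), skim_tee() can be placed in the middle of a pipeline:

```r
library(skimr)
library(dplyr)

# skim_tee() prints the summary and then passes iris through unchanged,
# so the next step still receives the original data frame.
setosa <- iris %>%
  skim_tee() %>%
  dplyr::filter(Species == "setosa")

nrow(setosa)  # 50 rows, one per setosa observation
```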

The skim_to_list() function returns a list of the wide data frames, one for each data type. The data frames contain the formatted values, meaning that they are character data and are most useful for display. In general, users will want to store the results in an object for further handling.
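
A minimal sketch of storing and inspecting the list, assuming the skimr 1.x interface described here; that the list elements are named after the variable types is an assumption, and iris_list is an illustrative name.

```r
library(skimr)

# Store the per-type wide data frames for later handling.
iris_list <- skim_to_list(iris)

# Assumption: elements are named after the variable types found in iris.
names(iris_list)

# Each element holds formatted (character) values, suited to display.
iris_list[[1]]
```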

The skim_to_wide() function returns a single data frame with each variable in a row. Variables that do not report a given statistic are assigned NA for that statistic. Formatted values are returned and all data are character.
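
A minimal sketch, again assuming the skimr 1.x interface described here (iris_wide is an illustrative name):

```r
library(skimr)

# One row per variable; statistics that do not apply to a variable are NA.
iris_wide <- skim_to_wide(iris)

nrow(iris_wide)   # 5, one row for each column of iris
names(iris_wide)  # the statistics reported across all variable types
```

Because every column is character, convert values back to numeric before doing arithmetic on them, or work from the long skim_df instead.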

Skimming vectors

The skim function also handles individual vectors that are not part of a data frame. For example the lynx data set is class ts.

skim(datasets::lynx)
## Skim summary statistics
## 
## Variable type: ts 
##        variable missing complete   n start  end frequency deltat    mean
##  datasets::lynx       0      114 114  1821 1934         1      1 1538.02
##       sd min  max median line_graph
##  1585.84  39 6991    771   ⡈⢄⡠⢁⣀⠒⣀⠔

If you attempt to use skim() on a class that does not have support, it will coerce the data to character (with a warning) and report the number of NAs, the number complete (non-missing), the number of rows, the number empty (i.e. ""), the minimum length of non-empty strings, the maximum length of non-empty strings, and the number of unique values.

lynx <- datasets::lynx
class(lynx) <- "unkown_class"
skim(lynx)
## Warning: No summary functions for vectors of class: unkown_class.
## Coercing to character
## Skim summary statistics
## 
## Variable type: character 
##  variable missing complete   n min max empty n_unique
##      lynx       0      114 114   2   4     0      110
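
The character statistics in this output can be checked by hand. The sketch below uses base R only and assumes that the coercion is equivalent to as.character(); the values in the comments are the ones printed above.

```r
# Recompute the character summary for the coerced lynx series.
x <- as.character(datasets::lynx)

sum(is.na(x))      # missing:  0
length(x)          # n:        114
min(nchar(x))      # min:      2
max(nchar(x))      # max:      4
sum(x == "")       # empty:    0
length(unique(x))  # n_unique: 110
```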

The skim_with function

Skimr is opinionated in its choice of defaults, but users can easily add to, replace, or remove the statistics for a class.

To add a statistic, use a named list for each class, using the format below.

classname = list(mad_name = mad)
skim_with(numeric = list(mad_name = mad))
skim(datasets::chickwts)
## Skim summary statistics
##  n obs: 71 
##  n variables: 2 
## 
## Variable type: factor 
##  variable missing complete  n n_unique                         top_counts
##      feed       0       71 71        6 soy: 14, cas: 12, lin: 12, sun: 12
##  ordered
##    FALSE
## 
## Variable type: numeric 
##  variable missing complete  n   mean    sd  p0   p25 median   p75 p100
##    weight       0       71 71 261.31 78.07 108 204.5    258 323.5  423
##      hist mad_name
##  ▃▅▅▇▃▇▂▂    91.92

The skim_with_defaults() function resets the list to the defaults. By default skim_with() appends the new statistics, but setting append = FALSE replaces the defaults.

skim_with_defaults()
skim_with(numeric = list(mad_name = mad), append = FALSE)
skim(datasets::chickwts)
## Skim summary statistics
##  n obs: 71 
##  n variables: 2 
## 
## Variable type: factor 
##  variable missing complete  n n_unique                         top_counts
##      feed       0       71 71        6 soy: 14, cas: 12, lin: 12, sun: 12
##  ordered
##    FALSE
## 
## Variable type: numeric 
##  variable mad_name
##    weight    91.92
skim_with_defaults() # Reset to defaults

You can also use skim_with() to remove specific statistics by setting them to NULL.

skim_with(numeric = list(hist = NULL))
skim(datasets::chickwts)
## Skim summary statistics
##  n obs: 71 
##  n variables: 2 
## 
## Variable type: factor 
##  variable missing complete  n n_unique                         top_counts
##      feed       0       71 71        6 soy: 14, cas: 12, lin: 12, sun: 12
##  ordered
##    FALSE
## 
## Variable type: numeric 
##  variable missing complete  n   mean    sd  p0   p25 median   p75 p100
##    weight       0       71 71 261.31 78.07 108 204.5    258 323.5  423
skim_with_defaults() # Reset to defaults
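
A statistic does not have to be an existing function such as mad(); a small wrapper can be supplied in the same named list. The sketch below assumes the skimr 1.x skim_with() interface used in this section, that each summary function is called with the vector as its only argument, and that dplyr is installed. The names iqr, s and iqr_row are illustrative.

```r
library(skimr)
library(dplyr)

# Add an interquartile range for numeric columns via an anonymous wrapper.
skim_with(numeric = list(iqr = function(x) IQR(x, na.rm = TRUE)))

s <- skim(datasets::chickwts)

# Pull the new statistic out of the long skim_df.
iqr_row <- dplyr::filter(s, variable == "weight", stat == "iqr")
iqr_row$value  # equals p75 - p25 from the output above: 323.5 - 204.5

skim_with_defaults()  # Reset to defaults
```

Resetting at the end keeps the customisation from leaking into later calls to skim() in the same session.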

Formatting individual values

Skimr does opinionated formatting of the statistics displayed when printing. These values are stored in the formatted column of the skim_df object and are always character. Skim attempts to use a reasonable number of decimal places for calculated values based on the data type (integer or numeric) and number of stored decimals. For statistics such as max() and min() the actual stored values are displayed. Decimals in a column are aligned. Date formats are used for date statistics.

Users can override the formats using the skim_format() function. Using show_formats() will display the current options in use for each data type. Using skim_format_defaults() will reset the formats to their default settings.
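
A minimal sketch of that cycle follows. The three function names come from the text above; the argument form passed to skim_format() (a named list of format() options per class, here digits) is an assumption about the skimr 1.x interface and should be checked against ?skim_format.

```r
library(skimr)

show_formats()                           # inspect the formats currently in use

# Assumed argument form: one named list of formatting options per class.
skim_format(numeric = list(digits = 3))

skim(iris)                               # printed values use the new setting

skim_format_defaults()                   # restore the default formats
```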

Rendering the results of skim()

The skim_df object is a long data frame with one row for each combination of variable and statistic (and, for grouped data, group). The horizontal display is created by default by print.skim_df(). It can also be called explicitly by applying print() to a skim_df object, which allows options to be passed in. In addition, kable() and pander() are supported. Both provide more control over the rendered results, particularly when used with knitr. The options for these functions are documented in the knitr package for kable() and in the pander package for pander(). Using either of them may require setting document or chunk options and fonts.

This topic is addressed in more detail in the Using Fonts vignette.

skim(iris) %>% kable()

Skim summary statistics
n obs: 150
n variables: 5

Variable type: factor

variable  missing  complete  n    n_unique  top_counts                        ordered
Species   0        150       150  3         set: 50, ver: 50, vir: 50, NA: 0  FALSE

Variable type: numeric

variable      missing  complete  n    mean  sd    p0   p25  median  p75  p100  hist
Petal.Length  0        150       150  3.76  1.77  1    1.6  4.35    5.1  6.9   ▇▁▁▂▅▅▃▁
Petal.Width   0        150       150  1.2   0.76  0.1  0.3  1.3     1.8  2.5   ▇▁▁▅▃▃▂▂
Sepal.Length  0        150       150  5.84  0.83  4.3  5.1  5.8     6.4  7.9   ▂▇▅▇▆▅▂▂
Sepal.Width   0        150       150  3.06  0.44  2    2.8  3       3.3  4.4   ▁▂▅▇▃▂▁▁

library(pander)
panderOptions('knitr.auto.asis', FALSE)
skim(iris) %>% pander() 
## Skim summary statistics  
##    n obs: 150    
##  n variables: 5    
## 
## ------------------------------------------------
##  variable   missing   complete    n    n_unique 
## ---------- --------- ---------- ----- ----------
##  Species       0        150      150      3     
## ------------------------------------------------
## 
## Table: Table continues below
## 
##  
## ------------------------------------------
##            top_counts             ordered 
## -------------------------------- ---------
##  set: 50, ver: 50, vir: 50, NA:    FALSE  
##                0                          
## ------------------------------------------
## 
## 
## ----------------------------------------------------------------------------
##    variable     missing   complete    n    mean    sd    p0    p25   median 
## -------------- --------- ---------- ----- ------ ------ ----- ----- --------
##  Petal.Length      0        150      150   3.76   1.77    1    1.6    4.35  
## 
##  Petal.Width       0        150      150   1.2    0.76   0.1   0.3    1.3   
## 
##  Sepal.Length      0        150      150   5.84   0.83   4.3   5.1    5.8   
## 
##  Sepal.Width       0        150      150   3.06   0.44    2    2.8     3    
## ----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------
##  p75   p100     hist   
## ----- ------ ----------
##  5.1   6.9    ▇▁▁▂▅▅▃▁ 
## 
##  1.8   2.5    ▇▁▁▅▃▃▂▂ 
## 
##  6.4   7.9    ▂▇▅▇▆▅▂▂ 
## 
##  3.3   4.4    ▁▂▅▇▃▂▁▁ 
## -----------------------

Solutions to common rendering problems

The details of rendering are dependent on the operating system R is running on, the locale of the installation, and the fonts installed. Rendering may also differ based on whether it occurs in the console or when knitting to specific types of documents such as HTML and PDF. The most commonly reported problems involve rendering the spark graphs (inline histogram). This section will summarize known issues.

Currently pander() does not support inline histograms on Windows. Also, Windows does not support sparkline graphs.

In order to render the spark graphs or inline histograms in HTML or PDF, you may need to change to a font that supports block or braille characters (depending on which you need). Please review the separate vignette and associated template for details on this.