Introduction to groupdata2

Ludvig Renbo Olsen

2017-10-22

Abstract

This vignette is an introduction to the package groupdata2.
groupdata2 is a set of subsetting methods for easy grouping, windowing, folding and splitting of data.  
For a more extensive description of groupdata2, please see Description of groupdata2.
 
Contact author at r-pkgs@ludvigolsen.dk  
 


Introduction

When working with data, you sometimes want to divide it into groups and subgroups for processing or descriptive statistics. Grouping can reduce the amount of information and let you compare measurements on different scales, e.g. income per year instead of per month. groupdata2 is a set of tools for creating groups from your data. It consists of five easy-to-use main functions: group_factor(), group(), splt(), partition(), and fold().

group_factor() is at the heart of it all. It creates the groups and is used by the other functions. It returns a grouping factor with group numbers, i.e. 1s for all elements in group 1, 2s for group 2, etc. So if you ask it to create 2 groups from a vector (‘Hans’,‘Dorte’,‘Mikkel’,‘Leif’) it will return a factor (1,1,2,2).
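
The example with the four names can be tried out with a short sketch. It assumes groupdata2 is installed and that the number of groups is passed as n; the variable names are only illustrative.

```r
library(groupdata2)

# The four names from the example above
people <- c("Hans", "Dorte", "Mikkel", "Leif")

# Ask for 2 groups; the result is a factor with one group number per element
groups <- group_factor(people, n = 2)
groups
```

The returned factor can be added to a dataframe as a new column, or passed on to other functions that expect a grouping factor.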

group() takes in either a dataframe or vector and returns a dataframe with a grouping factor added to it. The dataframe is grouped by the grouping factor (using dplyr::group_by), which makes it very easy to use in dplyr pipelines.
If, for instance, you have a column in a dataframe with quarterly measurements, and you would like to see the average measurement per year, you can simply create groups with a size of 4, and take the mean of each group, all within a 3-line pipeline.
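
A minimal sketch of the quarterly example could look like this. The data and column names are made up for illustration, and it assumes that the 'greedy' method interprets n as the group size and that dplyr is installed alongside groupdata2.

```r
library(groupdata2)
library(dplyr)

# Hypothetical quarterly measurements covering three years
df <- data.frame(
  quarter = rep(1:4, times = 3),
  measurement = c(10, 12, 14, 16, 20, 22, 24, 26, 30, 32, 34, 36)
)

# Groups of 4 consecutive rows (one year each), then the mean of each group
yearly_means <- df %>%
  group(n = 4, method = "greedy") %>%
  summarise(mean_measurement = mean(measurement))

yearly_means
```

Because group() returns a grouped dataframe, summarise() works per group without an extra group_by() step.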

splt() takes in either a dataframe or vector, creates a grouping factor, and splits the given data by this factor using base::split. Often it will be faster to use group() instead of splt(). I also find it easier to work with the output of group().
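
For comparison, here is a small sketch of splt() on a made-up dataframe. The column names and values are hypothetical, and the default grouping method is assumed.

```r
library(groupdata2)

# Hypothetical data with six rows
df <- data.frame(
  participant = c("a", "b", "c", "d", "e", "f"),
  score = c(3, 5, 2, 8, 7, 4)
)

# Split the rows into 2 groups; the result is a list with one element per group
parts <- splt(df, n = 2)

length(parts)
parts[[1]]
```

Each list element has to be handled separately (e.g. with lapply()), which is one reason the grouped dataframe from group() tends to be easier to work with.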

partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable and/or keep all datapoints with a shared ID in the same partition.
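
As a rough sketch, a train/test split that balances on a category and keeps participants together might look like this. The dataset, the column names (participant, diagnosis, score) and the 20% test size are assumptions made for the example.

```r
library(groupdata2)

set.seed(1)

# Hypothetical data: 12 participants, 6 per diagnosis, 2 rows each
df <- data.frame(
  participant = factor(rep(1:12, each = 2)),
  diagnosis = factor(rep(c("a", "b"), each = 12)),
  score = rnorm(24)
)

# Put roughly 20% of the participants in the first partition,
# balanced on diagnosis, with all rows of a participant kept together
parts <- partition(df, p = 0.2, cat_col = "diagnosis", id_col = "participant")

test_set <- parts[[1]]
train_set <- parts[[2]]
```

With the default settings the partitions come back as a list, so the test and train sets are picked out by position.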

fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable and/or keep all datapoints with a shared ID in the same fold.
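
The same idea for folds, again as a sketch on made-up data. The column names and the choice of k = 3 are assumptions; the fold numbers are expected in a column called .folds.

```r
library(groupdata2)

set.seed(1)

# Hypothetical data: 12 participants, 6 per diagnosis, 2 rows each
df <- data.frame(
  participant = factor(rep(1:12, each = 2)),
  diagnosis = factor(rep(c("a", "b"), each = 12)),
  score = rnorm(24)
)

# Create 3 folds, balanced on diagnosis, with all rows of a participant
# placed in the same fold
folded <- fold(df, k = 3, cat_col = "diagnosis", id_col = "participant")

# Count rows per fold and diagnosis to inspect the balance
table(folded$.folds, folded$diagnosis)
```

In a cross-validation loop, the rows of one fold are held out as the test set in turn, while the remaining folds are used for training.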

Use cases

I came up with too many use cases to present them all neatly in one vignette. To give each example more space, I instead aim to create a separate vignette for each of them. For now, these are the available vignettes, each dealing with its own topic:

Cross-validation with groupdata2
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with partition() and balanced folds with fold(). We also write up a simple cross-validation function and compare multiple linear regression models.

Time series with groupdata2
In this vignette, we divide up a time series into groups (windows) and subgroups using group() with the ‘greedy’ and ‘staircase’ methods. We do some basic descriptive stats of each group and use them to reduce the data size.

Automatic groups with groupdata2
In this vignette, we will use the ‘l_starts’ method with group() to allow transferring of information from one dataset to another. We will use the automatic grouping function that finds group starts all by itself.
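
To give a feel for the methods named in these vignettes, here is a small sketch on toy data. The input values and the session column are hypothetical, and the comments describe my reading of each method, so please check the vignettes themselves for the details.

```r
library(groupdata2)

# 'greedy': n is the group size; the last group takes whatever is left
greedy <- group_factor(1:12, n = 5, method = "greedy")

# 'staircase': n is the step size; group sizes grow as n, 2n, 3n, ...
staircase <- group_factor(1:12, n = 2, method = "staircase")

# 'l_starts' with n = "auto": a new group starts whenever the value
# in starts_col differs from the previous row
df <- data.frame(
  session = c("s1", "s1", "s2", "s2", "s2", "s3", "s3"),
  stringsAsFactors = FALSE
)
auto <- group(df, n = "auto", method = "l_starts", starts_col = "session")

greedy
staircase
auto
```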

For a more extensive description of the features in groupdata2, see Description of groupdata2.

Outro

Well done, you made it to the end of this introduction to groupdata2! If you want to know more about the various methods and arguments, you can read the Description of groupdata2.
If you have any questions or comments on this vignette (tutorial) or groupdata2, please send them to me at r-pkgs@ludvigolsen.dk, or open an issue on the GitHub page https://github.com/LudvigOlsen/groupdata2 so I can make improvements.