Accelerometer data processing with GGIR

Vincent van Hees

May 21 2017

1 Introduction

1.1 What is GGIR?

GGIR is an R package to process multi-day raw accelerometer data for physical activity and sleep research. The term raw refers to data being expressed in m/s² or in units of gravitational acceleration (g), as opposed to previous-generation accelerometers, which stored data in brand-specific units. The signal processing includes automatic calibration, detection of sustained abnormally high values, detection of non-wear, and calculation of the average magnitude of dynamic acceleration based on a variety of metrics. Next, GGIR uses this information to describe the data per day of measurement or per measurement, including estimates of physical activity, inactivity and sleep.
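To make "magnitude of dynamic acceleration" concrete, here is a minimal sketch of one such metric, ENMO (Euclidean Norm Minus One), which is the main summary measure used later in this document. It follows the description in van Hees et al. (PLoS ONE 2013): the vector magnitude of the three axes minus 1 g, with negative values set to zero. The function name enmo_mg, the simulated signals and the 5 second epoch are assumptions for illustration; this is not GGIR's internal code.

```r
# Hypothetical helper, not a GGIR function: ENMO per epoch in milli-g.
# x, y, z are raw accelerations in g; sf is the sample frequency in Hz.
enmo_mg <- function(x, y, z, sf, epoch_s = 5) {
  en <- sqrt(x^2 + y^2 + z^2)          # Euclidean norm of the three axes
  enmo <- pmax(en - 1, 0)              # subtract 1 g, set negatives to 0
  n_per_epoch <- sf * epoch_s
  n_epochs <- floor(length(enmo) / n_per_epoch)
  enmo <- enmo[seq_len(n_epochs * n_per_epoch)]
  # Average within each epoch (one column per epoch) and convert g to mg
  colMeans(matrix(enmo, nrow = n_per_epoch)) * 1000
}

# Simulated device lying still: only gravity on the z-axis
sf <- 10
still <- enmo_mg(x = rep(0, 100), y = rep(0, 100), z = rep(1, 100), sf = sf)
# Simulated movement: a constant extra 0.2 g on the z-axis
moving <- enmo_mg(x = rep(0, 100), y = rep(0, 100), z = rep(1.2, 100), sf = sf)
print(still)
print(moving)
```

With 100 samples at 10 Hz and 5 second epochs, each call returns two epoch values.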

1.2 Who has been using GGIR?

GGIR is increasingly being used by research groups across the world. A non-exhaustive overview of academic publications related to GGIR can be found here.

A non-exhaustive list of academic institutes that have been using GGIR:

  • United Kingdom: University College London, Exeter University, Newcastle University, Edge Hill University, University of Cambridge, University of Chester, Edinburgh Napier University, University of Birmingham, Liverpool John Moores University
  • United States: University of California San Diego, Iowa State University, University of Southern California, University of Michigan, West Virginia University, Arizona State University, University of Utah, National Institutes of Health
  • The Netherlands: Erasmus Medical Centre Rotterdam, VU University Medical Centre, Leiden University Medical Centre
  • France: French Institute for Health and Medical Research, Institute of Myology, University of Dijon, University of Montpellier
  • Brazil: Federal University of Pelotas
  • Australia: Newcastle University, Australian Catholic University, University of Sydney, Queensland University of Technology, University of Western Australia
  • Spain: University of Granada
  • Norway: University of Oslo
  • Sweden: University of Lund
  • Switzerland: Centre Hospitalier Universitaire Vaudois
  • Germany: Technische Universität Dresden, Technical University Munich

1.3 Acknowledgements

R package GGIR would not have been possible without the support of the following people, institutions, and funders:

People:

Institutions:

Funders:

1.4 How can I contribute to the GGIR development?

The development version of GGIR can be found on GitHub, which is also where you will find guidance on how to contribute to the package development.

2 Setting up your work environment

2.1 Install R and RStudio

Download and install R

Download and install RStudio (optional, but recommended)

Install GGIR with its dependencies. You can do this with one command from the R console:

install.packages("GGIR", dependencies = TRUE)

2.2 Prepare folder structure

  1. GGIR works with the following accelerometer brands and formats:
    • GENEActiv .bin and .csv
    • ActiGraph .csv. Note for ActiGraph users: In ActiLife you have the option to export data with timestamps. Please do not do this, as it causes memory issues. To cope with the absence of timestamps, GGIR will recalculate timestamps from the sample frequency and the start time and date as presented in the file header.
    • Axivity .wav and .cwa
    • Genea (an accelerometer that is not available anymore, but which was used for some studies between 2007 and 2012) .bin and .csv
  2. All accelerometer data that needs to be analysed should be stored in one folder, or subfolders of that folder.
  3. Give the folder an appropriate name, preferably with a reference to the study or project it is related to rather than just ‘data’, because the name of this folder will be used later on as an identifier of the dataset.
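Before running GGIR it can help to check what is actually in the folder. The sketch below uses only base R to list files with the extensions named above, including those in subfolders. The folder and file names are made up, and GGIR's own file discovery may work differently; this is only a quick sanity check.

```r
# Build a small throw-away folder that mimics the advice above
datadir <- file.path(tempdir(), "mystudy_example")
dir.create(file.path(datadir, "site_a"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(datadir, c("p001.bin", "p002.cwa", "notes.txt")))
file.create(file.path(datadir, "site_a", "p003.csv"))

# File extensions named in the list above
pattern <- "\\.(bin|csv|wav|cwa)$"
files <- list.files(datadir, pattern = pattern, recursive = TRUE,
                    ignore.case = TRUE)
print(files)

# The folder name is what would later identify the dataset
print(basename(datadir))
```

Files with other extensions, such as notes.txt here, are not listed.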

2.3 GGIR shell function

Copy and paste the following code into a new R script (a file ending with .R). It is a shell function that allows you to keep all your decisions in one place without having to worry about separate scripts and configurations.

library(GGIR)
g.shell.GGIR(#=======================================
             # INPUT NEEDED:
             mode=c(1,2,3,4,5),
             datadir="C:/mystudy/mydata",
             outputdir="D:/myresults",
             f0=1, f1=2,
             #-------------------------------
             # Part 1:
             #-------------------------------
             # Key functions: reading file, auto-calibration, and extracting features
             do.enmo = TRUE,             do.anglez=TRUE,
             chunksize=1,                printsummary=TRUE,
             #-------------------------------
             # Part 2:
             #-------------------------------
             strategy = 2,               ndayswindow=7,
             hrs.del.start = 0,          hrs.del.end = 0,
             maxdur = 9,                 includedaycrit = 16,
             winhr = c(5,10),
             qlevels = c(c(1380/1440),c(1410/1440)),
             qwindow=c(0,24),
             ilevels = c(seq(0,400,by=50),8000),
             mvpathreshold =c(100,120),
             bout.metric = 4,
             closedbout=FALSE,
             #-------------------------------
             # Part 3:
             #-------------------------------
             # Key functions: Sleep detection
             timethreshold= c(5),        anglethreshold=5,
             ignorenonwear = TRUE,
             #-------------------------------
             # Part 4:
             #-------------------------------
             # Key functions: Integrating sleep log (if available) with sleep detection
             # storing day and person specific summaries of sleep
             excludefirstlast = TRUE,
             includenightcrit = 16,
             def.noc.sleep = c(),
             loglocation= "C:/mydata/sleeplog.csv",
             outliers.only = TRUE,
             criterror = 4,
             relyonsleeplog = FALSE,
             sleeplogidnum = TRUE,
             colid=1,
             coln1=2,
             do.visual = TRUE,
             nnights = 9,
             #-------------------------------
             # Part 5:
             #-------------------------------
             # Key functions: Merging physical activity with sleep analyses
             threshold.lig = c(30), threshold.mod = c(100),  threshold.vig = c(400),
             boutcriter = 0.8,      boutcriter.in = 0.9,     boutcriter.lig = 0.8,
             boutcriter.mvpa = 0.8, boutdur.in = c(1,10,30), boutdur.lig = c(1,10),
             boutdur.mvpa = c(1),   timewindow = c("WW"),
             #-----------------------------------
             # Report generation
             #-------------------------------
             # Key functions: Generating reports based on meta-data
             do.report=c(2,4,5),
             visualreport=TRUE,     dofirstpage = TRUE,
             viewingwindow=1)
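Two of the Part 2 arguments above are easier to read once evaluated. The sketch below only uses base R to show what the values of ilevels and qlevels expand to. It does not show how GGIR uses them internally. The interpretation in the comments (bins closed on the left, matching output variable names such as [100,150)) is inferred from the variable dictionary later in this document, and the acceleration values are simulated.

```r
ilevels <- c(seq(0, 400, by = 50), 8000)
qlevels <- c(1380 / 1440, 1410 / 1440)

# Label consecutive ilevels as half-open intervals [A,B)
bins <- paste0("[", head(ilevels, -1), ",", tail(ilevels, -1), ")")
print(bins)

# A day has 1440 minutes: these quantiles leave 60 and 30 minutes above them
minutes_above <- (1 - qlevels) * 1440
print(round(minutes_above))

# Assign some simulated acceleration values (mg) to the bins
acc_mg <- c(12, 49.9, 50, 130, 420)
bin_of <- cut(acc_mg, breaks = ilevels, labels = bins, right = FALSE)
print(table(bin_of))
```

The final value of 8000 in ilevels acts as an upper bound, so that all values above 400 end up in one last bin.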

2.4 Updating the GGIR shell input arguments

The function arguments need to be tailored to your experimental protocol. There are many function arguments you can specify which are explained in the package tutorial. GGIR is structured in 5 parts and the arguments to g.shell.GGIR can be structured accordingly:

By looking up the corresponding functions g.part1, g.part2, g.part3, g.part4, and g.part5 you can see what arguments are possible. All of these arguments are also accepted by the shell function g.shell.GGIR, because g.shell.GGIR is nothing more than a wrapper around those functions.

Below I have highlighted a few of the key arguments you may want to be aware of. Please see the package manual for more detailed documentation:

2.4.1 do.cal

Autocalibration is the process of investigating the acceleration signals for calibration error based on free-living data. do.cal is a boolean argument to indicate whether autocalibration needs to be performed. If set to TRUE (default) GGIR will propose calibration correction factors to minimize calibration error.

2.4.2 datadir

GGIR needs to know where your data is. It will automatically detect which accelerometer brand the data comes from and in what data format the information is stored.

2.4.3 strategy

Argument strategy allows you to give GGIR your knowledge about the study design:
  • strategy = 1: Exclude ‘hrs.del.start’ number of hours at the beginning and ‘hrs.del.end’ number of hours at the end of the measurement, and never allow for more than ‘maxdur’ number of days. These three parameters are set by their respective function arguments.
  • strategy = 2: Only the data between the first midnight and the last midnight is used for imputation.
  • strategy = 3: Only the most active X days in the file are selected. X is specified by argument ‘ndayswindow’.
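The idea behind strategy = 2 can be sketched in a few lines of base R. The helper between_midnights is hypothetical and not part of GGIR; it only illustrates which samples lie between the first and the last midnight of a recording. The timestamps are simulated and in UTC, so daylight saving time is not handled.

```r
# Hypothetical helper, not part of GGIR: flag the samples that fall between
# the first and the last midnight of a recording.
between_midnights <- function(time) {
  first_day <- as.POSIXct(trunc(min(time), units = "days"))
  # If the recording does not start exactly at midnight, use the next midnight
  first_midnight <- if (first_day < min(time)) first_day + 24 * 3600 else first_day
  last_midnight <- as.POSIXct(trunc(max(time), units = "days"))
  time >= first_midnight & time < last_midnight
}

# Simulated hourly timestamps from Sunday 14:00 until Wednesday 09:00
time <- seq(as.POSIXct("2017-05-21 14:00:00", tz = "UTC"),
            as.POSIXct("2017-05-24 09:00:00", tz = "UTC"), by = "1 hour")
keep <- between_midnights(time)
print(range(time[keep]))
print(sum(keep))
```

In this example the partial first and last days are dropped and the two complete days in between are kept.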

2.4.4 do.imp

GGIR detects when the accelerometer is not worn. Argument ‘do.imp’ indicates whether you want those missing periods to be imputed by measurements from similar time points on different days of the measurement.
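As a rough illustration of this kind of imputation, the sketch below replaces each missing value by the average of the same time point on the other days. The helper impute_by_timepoint and the numbers are made up; GGIR's actual implementation is more involved and is not reproduced here.

```r
# Simulated metric values: 3 days (rows) x 4 time points per day (columns)
acc <- matrix(c(10, 20, 30, 40,
                12, NA, 28, 44,   # NA marks a period detected as non-wear
                14, 24, NA, 42),
              nrow = 3, byrow = TRUE)

# Hypothetical helper, not GGIR's implementation: replace each missing value
# by the mean of the same time point on the other days.
impute_by_timepoint <- function(m) {
  col_means <- colMeans(m, na.rm = TRUE)
  missing <- which(is.na(m), arr.ind = TRUE)
  m[missing] <- col_means[missing[, "col"]]
  m
}

imputed <- impute_by_timepoint(acc)
print(imputed)
```

The missing value on day 2 becomes the mean of 20 and 24, and the missing value on day 3 becomes the mean of 30 and 28.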

2.4.5 loglocation

If you used a sleep log in your experiments, then GGIR can use it to improve the sleep estimates. Argument ‘loglocation’ is the location of the spreadsheet (csv) with sleep log information. The spreadsheet needs to have the following structure: one column for participant id, followed by alternating columns for onset time and waking time (see example below). There can be multiple sleep logs in the same spreadsheet. The first row of the spreadsheet needs to be filled with column names; it does not matter what these column names are. Timestamps are to be stored without date, as in hh:mm:ss. If onset corresponds to lights out or intention to fall asleep, then it is the end-user's responsibility to account for this in the interpretation of the results.
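A made-up spreadsheet with this structure can be written from R as follows. The participant ids, times and column names are invented for illustration, and the file is written to a temporary folder rather than to the loglocation used in the shell function.

```r
# Made-up sleep log for two participants and two nights
sleeplog <- data.frame(
  id       = c(123, 345),
  onset_N1 = c("22:00:00", "21:30:00"),
  wake_N1  = c("09:00:00", "07:15:00"),
  onset_N2 = c("22:10:00", "23:45:00"),
  wake_N2  = c("06:00:00", "06:30:00")
)
logfile <- file.path(tempdir(), "sleeplog.csv")
write.csv(sleeplog, logfile, row.names = FALSE)

# Read it back to check the layout: id first, then onset/wake pairs
check <- read.csv(logfile, colClasses = "character")
print(check)

# All time columns should be formatted as hh:mm:ss
times <- unlist(check[, -1])
all_ok <- all(grepl("^[0-9]{2}:[0-9]{2}:[0-9]{2}$", times))
print(all_ok)
```

Each row is one participant; each pair of columns after the id is one night.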

3 Time for action: How to run your analysis?

3.1 From the console

You can use

source("pathtoscript/myshellscript.R")

or use the Source button in RStudio if you use RStudio.

3.2 In a cluster

It is possible to run GGIR on a computing cluster to process multiple files in parallel. The way I did it is described below. Please note that some of these commands are specific to the computing cluster you are working on, so consult your local cluster specialist to tailor this to your situation. In my case, I had three files:

submit.sh

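# Submit one cluster job per block of n files (707 jobs in total)
n=1
for i in {1..707}; do
    s=$(( n * (i - 1) + 1 ))   # index of the first file for this job
    e=$(( i * n ))             # index of the last file for this job
    qsub /home/nvhv/WORKING_DATA/bashscripts/run-mainscript.sh "$s" "$e"
done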

run-mainscript.sh

#! /bin/bash
#$ -cwd -V
#$ -l h_vmem=12G
/usr/bin/R --vanilla --args f0=$1 f1=$2 < /home/nvhv/WORKING_DATA/test/myshellscript.R

myshellscript.R

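library(GGIR)
options(echo=TRUE)
# Read the arguments passed after --args, e.g. "f0=1" "f1=2"
args = commandArgs(TRUE)
if(length(args) > 0) {
  for (i in seq_along(args)) {
    # Evaluate each "name=value" string so that f0 and f1 become R variables
    eval(parse(text = args[[i]]))
  }
}
g.shell.GGIR(f0=f0,f1=f1,...)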

You will need to update the … in the last line with the arguments you used for g.shell.GGIR. Note that f0=f0,f1=f1 is essential for this to work. The values of f0 and f1 are passed on from the bash script.
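The two mechanisms used above, turning "f0=..." strings into R variables and computing the file range per job, can be tried out without a cluster. The sketch below simulates the arguments instead of reading them from the command line. The helpers parse_arg and job_ranges are hypothetical additions for illustration; in particular, capping the last range at the number of files is not something submit.sh does.

```r
# Simulate what commandArgs(TRUE) would return for: --args f0=5 f1=8
args <- c("f0=5", "f1=8")

# Same mechanism as in myshellscript.R: each string is evaluated as R code
for (a in args) eval(parse(text = a))
print(c(f0 = f0, f1 = f1))

# A stricter alternative (an assumption, not from the original script):
# accept only name=integer pairs instead of evaluating arbitrary code
parse_arg <- function(a) {
  stopifnot(grepl("^[A-Za-z0-9_.]+=[0-9]+$", a))
  kv <- strsplit(a, "=", fixed = TRUE)[[1]]
  setNames(as.integer(kv[2]), kv[1])
}
parsed <- unlist(lapply(args, parse_arg))
print(parsed)

# Hypothetical helper: first and last file index for each cluster job,
# mirroring s = n * (i - 1) + 1 and e = i * n from submit.sh
job_ranges <- function(n_files, n_per_job = 1) {
  n_jobs <- ceiling(n_files / n_per_job)
  i <- seq_len(n_jobs)
  data.frame(job = i,
             f0 = n_per_job * (i - 1) + 1,
             f1 = pmin(i * n_per_job, n_files))  # cap the last job
}
ranges <- job_ranges(n_files = 10, n_per_job = 4)
print(ranges)
```

With n = 1, as in submit.sh, every job gets f0 equal to f1, so each job processes exactly one file.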

Once this is all set up, you will need to call bash submit.sh from the command line. Note: Please make sure that you process only one GGIR part at a time on a cluster, because each part assumes that the preceding parts have been run. You can ensure this by always setting argument mode to a single part of GGIR. Once the analysis stops, update argument mode to the next part, and repeat until all parts are done. The speed of the parallel processing depends on the capacity of your computing cluster and the size of your dataset. I have been able to process all 4000 files from the Whitehall II Study in just a couple of hours.

4 Inspecting the results

4.1 Output part 2

Part 2 generates:

4.1.1 Dictionary of variables in part2_summary.csv

Variable | Description
ID | Participant id
device_sn | Device serial number
bodylocation | Body location extracted from file header
filename | Name of the data file
start_time | Start time of the experiment
startday | Day of the week on which measurement started
samplefreq | Sample frequency (Hz)
device | Accelerometer brand, e.g. GENEActiv
clipping_score | Clipping score: fraction of 15 minute windows per file for which the acceleration in one of the three axes was close to the maximum for at least 80% of the time. This should be 0.
Measurement duration (days) | -
complete_24hcycle | Completeness score: fraction of 15 minute windows per 24 hours for which no valid data is available at any day of the measurement.
meas_dur_def_proto_day | Measurement duration according to protocol (days): measurement duration (days) minus the hours that are ignored at the beginning and end of the measurement, motivated by protocol design
wear_dur_def_proto_day | Wear duration according to protocol (days): if the protocol was seven days of measurement, then wearing the accelerometer for 8 days and recording data for 8 days will still result in a wear duration of 7 days
calib_err | Calibration error (static estimate): estimated based on all ‘non-movement’ periods in the measurement after applying the autocalibration.
calib_status | Calibration status: summary statement about the status of the calibration error minimisation
ENMO (only available if set to true in part1.R) | ENMO is the main summary measure of acceleration. The value presented is the average ENMO over all the available data normalised per 24 hour cycles, with invalid data imputed by the average at similar time points on different days of the week. In addition to ENMO it is possible to extract other acceleration metrics in part1.R (e.g. BFEN, HFEN, HFENplus). See also van Hees PLoSONE April 2013 for a detailed description and comparison of these techniques.
pX_A_mg_0-24h | The Xth percentile in the distribution of short epoch values of metric A over the average day. The average day may not be ideal. Therefore, the code also extracts similar variables per day and then takes the average over days (see daysummary)
L5_A_mg_0-24 | Average of metric A during the least active five* hours in the day, that is, the lowest rolling average value of metric A. (* window size is modifiable in part2.R)
M5_A_mg_0-24 | Average of metric A during the most active five* hours in the day, that is, the highest rolling average value of metric A. (* window size is modifiable in part2.R)
L5hr_A_mg_0-24 | Starting time in hours and fractions of hours of L5_A_mg_0-24
M5hr_A_mg_0-24 | Starting time in hours and fractions of hours of M5_A_mg_0-24
1to6am_ENMO_mg | Average value of metric ENMO between 1am and 6am
N valid WEdays | Number of valid weekend days
N valid WKdays | Number of valid week days
IS_interdailystability | Interdaily stability
IV_intradailyvariability | Intradaily variability
IVIS_windowsize_minutes | Size of the windows based on which IV and IS are calculated (note that this is modifiable)
IVIS_epochsize_seconds | Size of the epochs based on which IV and IS are calculated (note that this is modifiable)
AD_… | All days (plain average of all available days, no weighting). The variable … was calculated per day and then averaged over all the available days
WE_… | Weekend days (plain average of all available weekend days, no weighting). The variable … was calculated per day and then averaged over weekend days only
WD_… | Week days (plain average of all available week days, no weighting). The variable … was calculated per day and then averaged over week days only
WWE_… | Weekend days (weighted average). The variable … was calculated per day and then averaged over weekend days. Double weekend days are averaged. This is only relevant for experiments that last for more than seven days.
WWD_… | Week days (weighted average). The variable … was calculated per day and then averaged over week days. Double week days are averaged. This is only relevant for experiments that last for more than seven days.
WWD_MVPA_E5S_T100_ENMO | Time spent in moderate-to-vigorous activity based on 5 second epoch size and an ENMO metric threshold of 100
WWE_MVPA_E5S_B1M80%_T100_ENMO | Time spent in moderate-to-vigorous activity based on 5 second epoch size and an ENMO metric threshold of 100, counted in bouts (B1M80%: bout duration of 1 minute, bout criterion of 80%)
WE_[100,150)_mg_0-24h_ENMO | Time spent between (and including) 100 mg and 150 mg (excluding 150 itself) between 0 and 24 hours (the full day) using metric ENMO
data exclusion strategy | value=1, ignore specific hours; value=2, ignore all data before the first midnight and after the last midnight
n hours ignored at start of meas (if strategy=1) | Number of hours ignored at the start of the measurement (if strategy = 1). A log of a decision made in part2.R
n hours ignored at end of meas (if strategy=1) | Number of hours ignored at the end of the measurement (if strategy = 1). A log of a decision made in part2.R
n days of measurement after which all data is ignored (if strategy=1) | Number of days of measurement after which all data is ignored (if strategy = 1). A log of a decision made in part2.R
epoch size to which acceleration was averaged (seconds) | A log of a decision made in part1.R
pdffilenumb | Indicator of in which pdf-file the plot was stored
pdfpagecount | Indicator of in which pdf-page the plot was stored
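The L5 and M5 variables in this dictionary are rolling-window summaries. A minimal base R sketch of the idea is given below: compute the mean over every 5 hour window in a day, then report the lowest and highest window mean together with their starting hour. The helper lm_window and the simulated day are assumptions for illustration; GGIR's own calculation (for example how it treats the window at the end of the day) may differ.

```r
# Hypothetical helper, not GGIR's implementation: rolling-window mean over
# one day of epoch values, returning the least (L) and most (M) active window.
lm_window <- function(acc, epoch_s, win_hr = 5) {
  w <- win_hr * 3600 / epoch_s                 # window length in epochs
  cs <- c(0, cumsum(acc))
  starts <- seq_len(length(acc) - w + 1)
  roll <- (cs[starts + w] - cs[starts]) / w    # mean of each window
  hour_of <- function(i) (i - 1) * epoch_s / 3600
  list(L = min(roll), Lhr = hour_of(which.min(roll)),
       M = max(roll), Mhr = hour_of(which.max(roll)))
}

# Simulated day in 1 minute epochs: quiet night, active afternoon
epoch_s <- 60
acc <- rep(20, 1440)
acc[1:360] <- 2          # 00:00-06:00, low values
acc[781:1080] <- 80      # 13:00-18:00, high values
res <- lm_window(acc, epoch_s)
print(res)
```

When several windows share the lowest mean, as during the simulated night, this sketch reports the earliest one.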

4.1.2 Dictionary of variables in part2_daysummary.csv

This is a non-exhaustive list, because most concepts have already been explained for part2_summary.csv.

Variable | Description
ID | Participant id
filename | Name of the data file
calender_date | Timestamp and date on which measurement started
bodylocation | Location of the accelerometer as extracted from file header
N valid hours | Number of hours with valid data in the day
N hours | Number of hours of measurement in a day, which typically is 24, unless it is a day on which the clock changes (DST), resulting in 23 or 25 hours. The value can be less than 23 if the measurement started or ended on this day
weekday | Name of weekday
measurement | Day of measurement: day number relative to start of the measurement
L5hr_ENMO_mg_0-24h | Hour on which L5 starts for these 24 hours (defined with metric ENMO)
L5_ENMO_mg_0-24h | Average acceleration for L5 (defined with metric ENMO)
[A,B)_mg_0-24h_ENMO | Time spent in minutes between (and including) acceleration value A in mg and (excluding) acceleration value B in mg based on metric ENMO
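The day-level values described here are what the AD_, WE_, WD_, WWE_ and WWD_ variables in part2_summary.csv aggregate. The sketch below shows one reading of those descriptions on made-up data: the plain variants average the available days directly, while the weighted variants first average days of the week that occur twice. This is an interpretation of the dictionary text, not GGIR's code.

```r
# Made-up day-level values for a 9 day recording starting on a Friday
day <- data.frame(
  weekday = c("Friday", "Saturday", "Sunday", "Monday", "Tuesday",
              "Wednesday", "Thursday", "Friday", "Saturday"),
  value = c(30, 50, 40, 20, 22, 24, 26, 34, 70)
)
is_we <- day$weekday %in% c("Saturday", "Sunday")

AD <- mean(day$value)            # all days, no weighting
WE <- mean(day$value[is_we])     # weekend days only
WD <- mean(day$value[!is_we])    # week days only

# Weighted variants: first average days that occur twice (here Friday and
# Saturday), then average over the distinct days of the week
per_weekday <- tapply(day$value, day$weekday, mean)
we_names <- names(per_weekday) %in% c("Saturday", "Sunday")
WWE <- mean(per_weekday[we_names])
WWD <- mean(per_weekday[!we_names])

print(round(c(AD = AD, WE = WE, WD = WD, WWE = WWE, WWD = WWD), 2))
```

The plain and weighted values only differ here because the recording lasts more than seven days, so Friday and Saturday are measured twice.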

4.2 Output part 4

Part 4 generates the following output:

When input argument do.visual is set to TRUE GGIR can show the following visual comparison between the time window of being asleep (or in bed) according to the sleeplog and the detected sustained inactivity bouts according to the accelerometer data. This visualisation is stored in the results folder as visualisation_sleep.pdf.

Explanation of the image: Each line represents one night. Colors are used to distinguish definitions of sustained inactivity bouts (2 definitions in this case) and to indicate existence or absence of overlap with the sleeplog. When argument outliers.only is set to FALSE it will visualise all available nights in the dataset. If outliers.only is set to TRUE it will visualise only nights with a difference in onset or waking time between sleeplog and sustained inactivity bouts larger than the value of argument criterror.

This visualisation with outliers.only set to TRUE and criterror set to 4 was very powerful for identifying entry errors in sleeplog data in van Hees et al PLoSONE 2015. We had over 25 thousand nights of data, and this visualisation allowed us to quickly zoom in on the most problematic nights to investigate possible mistakes in GGIR or mistakes in data entry.
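The selection rule behind outliers.only and criterror can be sketched as follows. The nights, the time scale (hours counted from noon) and the column names are made up; the sketch only illustrates the rule as described above, namely that a night is shown when onset or waking time differs by more than criterror hours between sleeplog and accelerometer.

```r
# Made-up onset and wake times per night, in hours counted from noon
nights <- data.frame(
  night = 1:4,
  log_onset = c(10.5, 11.0, 10.0, 11.5),
  acc_onset = c(10.8, 11.2, 15.0, 11.4),
  log_wake  = c(19.0, 18.5, 19.0, 20.0),
  acc_wake  = c(19.2, 13.0, 19.1, 20.3)
)
criterror <- 4
outliers_only <- TRUE

# Largest disagreement (hours) between sleep log and accelerometer per night
err <- pmax(abs(nights$log_onset - nights$acc_onset),
            abs(nights$log_wake - nights$acc_wake))
to_plot <- if (outliers_only) nights[err > criterror, ] else nights
print(to_plot$night)
```

Here nights 2 and 3 are selected: night 2 because of the waking time, night 3 because of the onset time.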