psycho for R

Dominique Makowski

2018-02-05

Abstract

psycho is an R package that provides tools for psychologists, neuropsychologists and neuroscientists to transform statistical outputs into something readable that can be, almost directly, copied and pasted into a report. It also implements various functions useful in psychological science, such as correlation matrices, assessment plot creation and normalization. The package revolves around the psychobject. The package's main functions return this type, and the analyze() function transforms other R objects into psychobjects. Four functions can then be applied to a psychobject: summary(), print(), plot() and values(). Contrary to many other packages whose goal is to produce statistical analyses, psycho aims at filling the gap between statistical R outputs and statistical report writing, with a focus on APA formatting guidelines, to enhance the standardization of results reporting. Complex outputs, such as those of Bayesian and frequentist mixed models, are automatically transformed into readable text, tables and plots that illustrate the effects. Thus, the results can easily be incorporated into shareable reports and publications, promoting data exploration, saving time and preventing errors for better, reproducible science.


Overview

Installation

# Do this once (uncomment if needed)
# install.packages("devtools") 
# library(devtools)
# devtools::install_github("neuropsychology/psycho.R")

# Load psycho (at the beginning of every script)
library(psycho)

General Workflow

The package mainly revolves around the psychobject. The package's main functions return this type, and the analyze() function transforms other R objects into psychobjects. Four functions can then be applied to a psychobject: summary(), print(), plot() and values().
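As a minimal sketch of this workflow, using the correlation() function presented below (any function returning a psychobject works the same way):

```r
library(psycho)

# Main functions return a psychobject directly
cor <- psycho::correlation(iris)

# The four generic functions can then be applied to it
summary(cor)  # formatted summary table
print(cor)    # readable text output
plot(cor)     # visualization
values(cor)   # underlying values as a list
```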


Examples

Correlation Table and Plot

It is possible to quickly run a correlation analysis on a dataframe with the flexible and powerful correlation() function.

Basic Correlations

library(psycho)

df <- iris

cor <- psycho::correlation(df, 
                           type = "full",
                           method = "pearson",
                           adjust = "none")

summary(cor)
             Sepal.Length Sepal.Width Petal.Length
Sepal.Length                                      
Sepal.Width        -0.12                          
Petal.Length      0.87***    -0.43***             
Petal.Width       0.82***    -0.37***      0.96***

You can save this correlation matrix using write.csv(print(cor), "correlation_table.csv"). That is very useful for copy/pasting it from Excel into a paper or a report :)

You can also draw a quick visualization:

plot(cor)

Partial, Corrected, Correlations

correlation() offers the possibility to run partial or semi-partial correlations, as well as printing them pairwise.

library(psycho)

df <- iris

pcor <- psycho::correlation(df, 
                           type = "partial",
                           method = "pearson",
                           adjust = "bonferroni")

summary(pcor)
             Sepal.Length Sepal.Width Petal.Length
Sepal.Length                                      
Sepal.Width       0.63***                         
Petal.Length      0.72***    -0.62***             
Petal.Width      -0.34***     0.35***      0.87***

You can also have access to the individual correlations as follows:

print(pcor)
Pearson Partial Correlation (p value correction: bonferroni):
   - Sepal.Length - Sepal.Width:   Results of the Pearson correlation showed a significant and strong positive association between Sepal.Length and Sepal.Width (r(148) = 0.63, p < .001***).
   - Sepal.Length - Petal.Length:   Results of the Pearson correlation showed a significant and strong positive association between Sepal.Length and Petal.Length (r(148) = 0.72, p < .001***).
   - Sepal.Width - Petal.Length:   Results of the Pearson correlation showed a significant and strong negative association between Sepal.Width and Petal.Length (r(148) = -0.62, p < .001***).
   - Sepal.Length - Petal.Width:   Results of the Pearson correlation showed a significant and moderate negative association between Sepal.Length and Petal.Width (r(148) = -0.34, p < .001***).
   - Sepal.Width - Petal.Width:   Results of the Pearson correlation showed a significant and moderate positive association between Sepal.Width and Petal.Width (r(148) = 0.35, p < .001***).
   - Petal.Length - Petal.Width:   Results of the Pearson correlation showed a significant and strong positive association between Petal.Length and Petal.Width (r(148) = 0.87, p < .001***).

Standardize / Normalize / Z-score / Scale

The standardize() function allows you to easily scale and center all numeric variables of a dataframe. It is similar to the base function scale(), but presents some advantages: it is tidyverse-friendly, data-type friendly (i.e., does not transform it into a matrix) and can handle dataframes with categorical data.

library(psycho)
library(tidyverse)

iris %>% 
  select(Species, Sepal.Length, Petal.Length) %>% 
  psycho::standardize() %>% 
  summary()
       Species    Sepal.Length       Petal.Length    
 setosa    :50   Min.   :-1.86378   Min.   :-1.5623  
 versicolor:50   1st Qu.:-0.89767   1st Qu.:-1.2225  
 virginica :50   Median :-0.05233   Median : 0.3354  
                 Mean   : 0.00000   Mean   : 0.0000  
                 3rd Qu.: 0.67225   3rd Qu.: 0.7602  
                 Max.   : 2.48370   Max.   : 1.7799  

Assess

This function is useful in clinical practice. It is sometimes necessary to show the patient, their family or other members of staff a visual representation of a score. The assess() function also computes the percentile and the Z-score, often needed for neuropsychological reports.

library(psycho)

results <- psycho::assess(124, mean=100, sd=15)

# Print it
print(results)
The participant (score = 124) is positioned at 1.6 standard deviations from the mean (M = 100, SD = 15). The participant's score is greater than 94.35 % of the general population.
# Plot it
plot(results)

It is also possible to customize the plot a bit (see the documentation for available parameters).

library(psycho)

results <- psycho::assess(85, mean=100, sd=15, linecolor = "orange", fillcolor = "#4CAF50")

# Plot it
plot(results)


Custom Plots

In general, the plot() function returns a ggplot object, which means it remains quite flexible. Here’s an example.

library(psycho)
library(ggplot2)

# Let's create a correlation plot
p <- plot(psycho::correlation(iris))

# Custom theme and colours
p <- p + 
  scale_fill_gradientn(colors = c("#4CAF50", "#FFEB3B", "#FF5722")) +
  ylab("Variables\n") +
  labs(fill = "r") +
  theme(plot.background = element_rect(fill = "#607D8B"),
        axis.title.y = element_text(size = 20, angle = 90, colour="white"),
        axis.text = element_text(size = 15, colour="white"),
        legend.title = element_text(size = 20, colour="white"),
        legend.text = element_text(size = 15, colour="white"),
        title = element_text(size = 16, colour="white"))
p


How many factors/components to retain?

The n_factors() function is useful before running a principal component (PCA) or factor (FA) analysis. As many statistical methods exist for this purpose, this function gathers them together and gives an overview of the most frequent result. It also draws a nice plot with the eigenvalues and the proportion of explained variance.

library(psycho)
library(tidyverse)

results <- attitude %>%
  select_if(is.numeric) %>% 
  psycho::n_factors()

# Get a summary
summary(results)
n.Factors n.Methods Eigenvalues Exp.Variance Cum.Variance
        1         5   3.7163758    0.5309108    0.5309108
        2         3   1.1409219    0.1629888    0.6938997
        3         1   0.8471915    0.1210274    0.8149270
        4         0   0.6128697    0.0875528    0.9024798
        5         0   0.3236728    0.0462390    0.9487188
        6         0   0.2185306    0.0312187    0.9799375
        7         0   0.1404378    0.0200625    1.0000000

We can also extract the final result (the optimal number of factors) for each method:

Method                          n_optimal
Optimal Coordinates                     1
Acceleration Factor                     1
Parallel Analysis                       1
Eigenvalues (Kaiser Criterion)          2
Velicer MAP                             1
BIC                                     2
Sample Size Adjusted BIC                3
VSS Complexity 1                        1
VSS Complexity 2                        2
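The per-method results above are stored inside the psychobject. A hedged sketch of extracting them with the values() accessor mentioned in the overview (the exact element name, methods here, is an assumption and may differ across package versions):

```r
# Extract the underlying values from the psychobject
vals <- values(results)

# Per-method optimal number of factors ("methods" element name assumed)
vals$methods
```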

And, of course, plot it :)

plot(results)


Analyze the Mixed-Modelling Framework

This is possibly the most important function of the psycho package. Its goal is to transform the complex outputs of statistical routines into something readable, interpretable and formatted. It is designed to work with frequentist and Bayesian mixed models, a central statistical method in psychological science.

Data Creation

Let’s start by creating a dataframe similar to those found in psychological science.

set.seed(666)
df <- data.frame(Participant = as.factor(rep(1:25, each = 4)), 
                 Item = rep_len(c("i1", "i2", "i3", "i4"), 100), 
                 Condition = rep_len(c("A", "B", "A", "B", "B"), 20), 
                 Error = as.factor(sample(c(0, 1), 100, replace = T)),
                 RT = rnorm(100, 30, .2), 
                 Stress = runif(100, 3, 5))

# Standardize the numeric variables.
df <- psycho::standardize(df)

# Take a look at the first rows
head(df)
Participant Item Condition Error         RT     Stress
          1   i1         A     1  0.2610666 -1.5032539
          1   i2         B     0  1.2180393 -1.3381086
          1   i3         A     1 -0.6122813 -1.6359922
          1   i4         B     0 -0.5209097 -0.1893384
          2   i1         B     0 -0.4227089 -0.2383872
          2   i2         A     1 -0.7291277  1.1834562

This dataframe contains the data of 25 participants (labelled from 1 to 25). Each saw 4 different items (i1-i4) in two conditions (A and B). For each trial, we measured whether the response was correct (Error), the reaction time (RT) and the stress associated with the trial.
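A quick sanity check of this design can be done with base R (a sketch; the exact counts depend on the seed above):

```r
# Each participant should contribute 4 trials (one per item)
table(df$Participant)

# How items cross with conditions
table(df$Item, df$Condition)
```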

Ancient Approach

In order to investigate the effect of the condition on the reaction time RT, the traditional, ancient and obsolete routine is 1) to compute the mean RT for each participant in each condition, and 2) to run an ANOVA.

# Format data
df_for_anova <- df %>% 
  dplyr::group_by(Participant, Condition) %>% 
  dplyr::summarise(RT = mean(RT))

# Run the anova
anova <- aov(RT ~ Condition + Error(Participant), df_for_anova)
summary(anova)

Error: Participant
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 24  15.13  0.6304               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Condition  1  0.725  0.7254   1.102  0.304
Residuals 24 15.793  0.6580               

As we can see, the effect of the condition is not significant (unsurprisingly, as the data were generated randomly). One of the many flaws of this approach is that we lose information about intra-individual and item-related variability.

Mixed Linear Regressions (lme4, lmerTest)

The use of the mixed-modelling framework allows us to add the items as random factors.

library(lme4)

fit <- lme4::lmer(RT ~ Condition + (1|Participant) + (1|Item), data=df)

# Traditional output
summary(fit)
Linear mixed model fit by REML ['lmerMod']
Formula: RT ~ Condition + (1 | Participant) + (1 | Item)
   Data: df

REML criterion at convergence: 282.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8414 -0.6905 -0.0882  0.7010  2.4840 

Random effects:
 Groups      Name        Variance Std.Dev.
 Participant (Intercept) 0.0000   0.0000  
 Item        (Intercept) 0.1036   0.3219  
 Residual                0.9251   0.9618  
Number of obs: 100, groups:  Participant, 25; Item, 4

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.09239    0.22142   0.417
ConditionB  -0.15398    0.19633  -0.784

Correlation of Fixed Effects:
           (Intr)
ConditionB -0.532

As the output is a bit messy, the analyze() function will munge this into something nicely formatted.

results <- psycho::analyze(fit)

# We can extract a formatted summary table
summary(results, round = 2)
            Effect_Size  Coef   SE     t    df Coef.std SE.std    p
(Intercept)  Very Small  0.09 0.22  0.42  5.81     0.00    0.0 0.69
ConditionB   Very Small -0.15 0.20 -0.78 95.00    -0.08    0.1 0.43

We can also print it in a text format!

print(results)
The overall model predicting ... successfully converged and explained 10.57% of the variance of the endogen (the conditional R2). The variance explained by the fixed effects was of 0.56% (the marginal R2) and the one explained by the random effects of 10.01%.
   - The effect of (Intercept) was [NOT] significant (beta = 0.092, SE = 0.22, t(5.81) = 0.42, p > .1) and can be considered as very small (std. beta = 0, std. SE = 0).
   - The effect of ConditionB was [NOT] significant (beta = -0.15, SE = 0.20, t(95.00) = -0.78, p > .1) and can be considered as very small (std. beta = -0.076, std. SE = 0.097).

Bayesian Mixed Linear Regressions (rstanarm)

However, as the frequentist framework has been criticized, it is advised to switch to the Bayesian framework. Unfortunately, the interpretation of these models remains unfamiliar to regular psychologists. But stay calm, because analyze() handles this for you.

library(rstanarm)

fit <- rstanarm::stan_lmer(RT ~ Condition + (1|Participant) + (1|Item), data=df)

# Traditional output
results <- psycho::analyze(fit, effsize=T)
summary(results, round=2)
Variable      MPE Median  MAD  Mean   SD 95_CI_lower 95_CI_higher Very_Large Large Medium Small Very_Small Opposite
(Intercept) 64.18   0.09 0.26  0.10 0.34       -0.57         0.72          0  0.02   0.05  0.26       0.31     0.36
ConditionB  79.50  -0.16 0.19 -0.16 0.19       -0.52         0.23          0  0.00   0.04  0.38       0.38     0.20
print(results)
We fitted a Markov Chain Monte Carlo [type] model to predict [Y] with [X] (formula = RT ~ Condition + (1 | Participant) + (1 | Item)). Priors were set as follows: [INSERT INFO ABOUT PRIORS].

   - Concerning the effect of (Intercept), there is a probability of 64.18% that its coefficient is between 0 and 3.05 (Median = 0.092, MAD = 0.26, 95% CI [-0.57, 0.72], MPE = 64.18%).
   - Based on Cohen (1988) recommendations, there is a probability of 0.48% that this effect size is very large, 1.98% that this effect size is large, 5.30% that this effect size is medium, 25.50% that this effect size is small, 30.93% that this effect is very small and 35.83% that it has an opposite direction (between 0 and 3.2e-05).

   - Concerning the effect of ConditionB, there is a probability of 79.50% that its coefficient is between -0.87 and 0 (Median = -0.16, MAD = 0.19, 95% CI [-0.52, 0.23], MPE = 79.50%).
   - Based on Cohen (1988) recommendations, there is a probability of 0% that this effect size is very large, 0.050% that this effect size is large, 3.52% that this effect size is medium, 38.12% that this effect size is small, 37.80% that this effect is very small and 20.50% that it has an opposite direction (between 0 and 0.52).

We can also plot the effects:

plot(results)

Obviously, you need to learn more about Bayesian analyses before running them. You can find more information in the rstanarm’s vignettes.