Installation

DEVis can be installed using the following commands. This section will be updated after CRAN submission.

#Install DESeq2 dependency from bioconductor.
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

#Install devtools to allow installation from GitHub
if (!require("devtools")) install.packages("devtools")

#Install DEvis from GitHub repository.
devtools::install_github("price0416/DEvis/DEvis")

#Load the package.
library(DEVis)

Introduction

The DEVis packakge is a comprehensive toolset for data aggregation, visual analytics, exploratory analysis, and project management that builds upon the DESeq2 differential expression package. DEVis provides a framework for the analysis of multi-factor experiments, increasing the power and potential of differential expression analysis while reducing workload and potential for human error. This package implements a wide range of data visualization tools, as well as data aggregation, transformation, and selection methods that smoothly integrate with the DESeq2, making extensive analysis and interpretation a more seamless process. DEVis also implements an automated result management system that automatically organizes and stores results, further simplifying the challenge of complex transcriptomic analysis and the creation of publication-quality figures. The visualizations provided by DEVis incorporate parameters that can be used to provide control over the data being visualized in sensible ways, with each visualization offering several layout and color schemes, unique filtering, sorting, and subset parameters for displaying and retrieving data, making DEVis visualizations powerful functions for navigating and retrieving transcriptomic data within the context of a complex experiment, rather than just simple plots.

Data Aggregation

Data aggregation is one of the core features of the DEVis package that makes it possible to examine and extract meaningful data from complex result sets. For example, analysis of time-series data currently requires the identification of differentially expressed genes for data from each time point with regard to a control sample, which may or may not also consist of time controlled data. This type of experiment will produce a potentially different set of differentially expressed genes for each time point contrast, meaning that direct comparisons between gene sets will not be immediately possible. DEVis makes it possible to combine multiple result sets based on experimental conditions using set-based merging. This aggregated result set can then be used to examine changes in expression across multiple result sets and provides a master result set that can be filtered, subset, sorted, and visualized using other DEVis methods.

Visualization

DEVis provides a set of visualization tools for exploring and displaying transcriptomic data sets. Visualizations are configurable through user defined parameters, which provide options for data merging, subsetting, sorting, and selection, allowing DEVis visualizations to be used as data subset and retrieval tools. All plots can be saved as high-resolution png or pdf files, and include several color and layout themes that can be used to customize resulting plots and produce publication-ready figures.

Project Management

Several data organization functions are built into the DEVis package. A directory structure is created on initialization containing folders to house data files, such as differentially expressed gene lists, and visualized plots. Users can toggle whether data and plots should be automatically saved to the appropriate directory whenever a visualization is employed, standardizing project results and simplifying project management. Additionally, several functions for exporting selected data are implemented in DEVis.

DEVis Analysis

Example Data

For the purposes of this vignette, we will using sample data from a study titled, “Ebolaviruses Associated with Differential Pathogenicity Induce Distinct Host Responses in Human Macrophages” (Olejnik, J. et. al, 2017).

The aim of this experiment was to uncover differences in host response of human monocyte-derived macrophages (MDMs) to infection with the highly pathogenic EBOV and the presumably nonpathogenic RESTV. Ebola virus (EBOV) and Reston virus (RESTV) are members of the Ebolavirus genus which greatly differ in their pathogenicity. However, while EBOV causes a severe disease in humans, there are no reported disease-associated human cases of RESTV infection, suggesting that RESTV is nonpathogenic for humans.

Macrophages derived from 3 different donors were either infected with Ebola virus (EBOV/Kikwit-95), or infected with Reston virus (RESTV). As controls, cells were treated with lipopolysaccharide (LPS) or were mock infected, with total cellular RNA isolated for analysis at 6h, 1d, and 2d post-infection.

Summary of Data
Sample Type Treatment 6 Hours 1 Day 2 Days
Case Reston Virus (RESTV) 3 3 3
Case Ebola Virus (EBOV) 3 3 3
Control Uninfected (Mock) 3 4 3
Control Lipopolysaccharide (LPS) 3 3 3

Input Data Format

DEVis requires as input data a matrix of counts for sequencing reads/fragments and a metadata matrix that provides information regarding samples. There are several ways to arrive at such a matrix, such as alignment of data using a read aligner such as STAR and extraction of read counts using the rSubread package’s featureCounts function.

For the purposes of this vignette, this count matrix will be referred to as count data, and the accompanying metadata will be referred to as target data, as the primary function of the metadata matrix is to allow for selective targeting of count data. For example, the target data may contain a field identifying which sample belongs to a control condition. Using that information, we can target the corresponding count data for the control condition.

Count data and target data must therefore correspond to one another. That is, the columns of count data, each of which represent a single sample, must correspond to the rows in the target data. This is due to the way that DESeq2 structures data as an extention of the summarizedExperiment data format. For more in depth information, see the DESeq2 vignette on preparing count data and the summarizedExperiment data format.

Examples of count and target data are provided below. In this case, 4 columns of count data are shown and 4 rows of target data are shown. Note the correspondance between row and column names.

Sample format for count data:
Donor10_EBOV_1D_LPS_6H_S27 Donor10_EBOV_1D_S21 Donor10_EBOV_2D_S25 Donor10_EBOV_6H_S17
1060P11.3 0 0 0 0
A1BG 65 101 87 93
A1BG_AS1 132 113 212 93
A1CF 0 0 0 0
A2M 14149 11579 10482 11431
A2M_AS1 5 0 10 0
Sample format for target data:
donor treatment time sample targetID
Donor10_EBOV_1D_LPS_6H_S27 m10 ebov restim 27 Donor10_EBOV_1D_LPS_6H_S27
Donor10_EBOV_1D_S21 m10 ebov 1d 21 Donor10_EBOV_1D_S21
Donor10_EBOV_2D_S25 m10 ebov 2d 25 Donor10_EBOV_2D_S25
Donor10_EBOV_6H_S17 m10 ebov 6h 17 Donor10_EBOV_6H_S17

Note that the targetID field will automatically be generated by the prep_targets() function to correspond to the row IDs provided in the target metadata file.

Initialization

Let’s get started on our example analysis by initializing a new project and loading data. For the purposes of this example we will assume that a count file and target file are available and data will be read from these files.

library(DEVis)

#Define the base directory for this analysis.  
base_dir <- "/path/to/base/"

#Generate the directory structure for results.
create_dir_struct(base_dir)

#Specify input directories.
cnt_dir        <- "/path/to/base/counts/"
tgt_dir        <- "/path/to/base/targets/"
init_data_paths(cnt_dir, tgt_dir)

#Specify count and target file names.
count_matrix  <- "count_matrix.txt"   
target_matrix <- "target_matrix.csv"   

#Set significance cutoffs for p-values and log fold-change.
init_cutoffs(p_signif=0.05, lfc_cut=1.5)

#Read count and target data 
count_data  <- prep_counts(count_matrix)
target_data <- prep_targets(target_matrix, delim="c")

#Initialize output mode.  
set_output_mode("both")

This represents a typical project initialization. Here, the base directory, as well as count and target directories are defined. When the create_dir_struct(base_dir) function is run, the directory structure for housing DEVis results is created. Note that if this directory already exists that it will not be overwritten upon subsequent runs of this command at the same base directory.

The following directory structure is created using ~DEVis as a base directory:

#' ~/DEVis/
#'   /results/
#'     /DE/
#'       /boxplot/
#'       /counts/
#'       /data/
#'       /density_plots/
#'       /divergence/
#'       /heatmaps/
#'       /profile_plots/
#'       /series_plots/
#'       /volcano/
#'    /dendrograms/
#'    /geneplots/
#'    /group_stats/
#'    /MDS/
#'      /standard/
#'      /hulls/
#'    /sample_distance/
#'      /euclidian/
#'      /poisson/

Next, cutoffs are initialized. The init_cutoffs() function initializes the p-value cutoffs and log2foldChange values for most visualizations, filtering, and aggregation steps. Significant differentially expressed genes will be initially determined based on the provided p-value cutoff, and when appropriate, functions that employ a minimum log fold-change cutoff will use the provided log fold-change cutoff initialized here. In this example we are setting a 0.05 minimum p-value for differentially expressed genes, however this could be reduced to identify only more statistically significant genes if desired.

The output mode is then initialized, which determines how DEVis will display and store visualizations. Three output modes are available, “screen”, “file”, or “both”. Screen will plot visualizations to the screen only, whereas “file” will save the visualization to the appropriate directory in the previously generated directory structure. The “both” option will save the visualization to file and also display it on screen. The “screen” option is ideal for exploratory analysis, as not every visualization will require storage. The “file” option can be used when automating analyses with DEVis. The “both” option is ideal for most analyses, however, automatically organizing and storing data as you proceed with analysis.

Finally, the count and target data are read. DEVis provides functions for reading count and target data in tab or comma delimited formats. In addition to assuring the data are properly formatted, these functions provide the visualizations and functions in DEVis access to count and target data so that behind-the-scenes calculations and reformatting can be performed when necessary without the user having to keep track of formats, subsets, and data transformations required for visualization. For this reason, many functions require that these functions be executed prior to analysis as a part of initialization.

Preparing composite fields for experimental design

Once initialization is complete, we can prepare for differential expression calculation. The first step is to determine the contrasts that need to be made based on the target data. In this case, we have fields representing donor, treatment, time, and sampleID available in our target data. For an experiment such as this, however, it is likely that we will want to look at multiple factors. In this case, we are interested in looking at treatment (infection with evoc, lps, restv, or mock samples) and time points, so we will first create a composite column of target data that merges these two fields.

  • Note: This is a good time in analysis to consider any other fields you may want to look at in tandem downstream. By making several composite fields that may be of interest now, additional options for exploratory analysis will be available later.

Starting target data:

~>head(target_data)
donor treatment time sample targetID
Donor10_EBOV_1D_LPS_6H_S27 m10 ebov restim 27 Donor10_EBOV_1D_LPS_6H_S27
Donor10_EBOV_1D_S21 m10 ebov 1d 21 Donor10_EBOV_1D_S21
Donor10_EBOV_2D_S25 m10 ebov 2d 25 Donor10_EBOV_2D_S25
Donor10_EBOV_6H_S17 m10 ebov 6h 17 Donor10_EBOV_6H_S17
Donor10_LPS_1D_S20 m10 lps 1d 20 Donor10_LPS_1D_S20
Donor10_LPS_2D_S24 m10 lps 2d 24 Donor10_LPS_2D_S24
#Create a composite field for treatment and time.
target_data <- make_composite_field(c("time","treatment"))

#Let's also create a composite field for treatment, time, and donor ID.
target_data <- make_composite_field(c("time","treatment", "donor"))

Composite fields added to target data:

~>head(target_data)
donor treatment time sample targetID time_treatment time_treatment_donor
Donor10_EBOV_1D_LPS_6H_S27 m10 ebov restim 27 Donor10_EBOV_1D_LPS_6H_S27 restim_ebov restim_ebov_m10
Donor10_EBOV_1D_S21 m10 ebov 1d 21 Donor10_EBOV_1D_S21 1d_ebov 1d_ebov_m10
Donor10_EBOV_2D_S25 m10 ebov 2d 25 Donor10_EBOV_2D_S25 2d_ebov 2d_ebov_m10
Donor10_EBOV_6H_S17 m10 ebov 6h 17 Donor10_EBOV_6H_S17 6h_ebov 6h_ebov_m10
Donor10_LPS_1D_S20 m10 lps 1d 20 Donor10_LPS_1D_S20 1d_lps 1d_lps_m10
Donor10_LPS_2D_S24 m10 lps 2d 24 Donor10_LPS_2D_S24 2d_lps 2d_lps_m10

Filtering data

In this experiment, some samples time point data was unclear at sequencing time, and those samples were marked “restim” in in the target data’s “time” column. For this case, we want to remove these samples from both our count and target data so they are not included in the analysis. DEVis provides a function for this kind of filtering.

#Keep timepoints 1d, 2d, and 6h, but throw away "restim" samples.
subset  <- keep_data_subset(counts=count_data,
                                targets=target_data,
                                target_count_id_map="targetID",
                                target_keep_col="time",
                                target_keep_val=c("1d","2d","6h"))

#Update count and target data to the filtered set.
count_data  <- subset[[1]]
target_data <- subset[[2]]

The data are now filtered to have only the samples we are interested in. For more information on this function, see ?keep_data_subset.

DESeq2 Object Preparation

We can now use the composite column we created previously, “time_treatment”, as the experimental design for differential expression. You can generally determine what your design field should be by considering the contrasts you want to make with your data. In this case, we want to ask, “How do data differ for patients infected with different viruses at different time points?” Based on this question, we can know to use a composite field that takes both time and treatment into account for our experimental design. Design choice can be a deep topic, however, for more information on experimental designs, see the DESeq2 vignette.

#Prepare our DESeq2 Object based on our data and experimental design.
dds <- prep_dds_from_data(count_input=count_data, 
                          target_input=target_data, 
                          experiment_design= ~ time_treatment,
                          stabilization="vst")

The prep_dds_from_data() function provided by DEVis combines several steps involved in creating and preparing a DESeq2 object. This function creates the DESeq2 object based on count and target data using the provided experimental design. It also performs TMM normalization, and variance or rlog stabilization. This function can optionally merge replicates as well. For more information see, ?prep_dds_from_data.

Once this step is complete, DEVis should have calculated and stored all of the necessary information to begin visualization of the data.

Examination of Data as a Whole

Before proceeding to differential expression analysis, it’s best to look at the data set as a whole first to see overall patterns, find outliers, and identify batch effects. DEVis offers several tools to achieve this.

Distance Analysis (Two-way Hierarchical Clustering)

Let’s first take a look at the euclidian distances between samples.

plot_euclid_dist(row_labels="time_treatment_donor", filename="euclidian_distance.pdf", 
                 theme=2, returnData=FALSE)

This plot performs two-way hierarchical clustering of a euclidian distance matrix and displays the results as a heat map. By using the previously created composite field, “time_treatment_donor”, we can see how strongly each factor influences each sample. In this symmetrical plot, each row (or column because of symmetry) represents a single sample and its relative distance to all other samples in the data set. The diagonal line that passes through the center of the plot with 0 distance between samples indicates the sample represented in that row.

In this plot, an outlying sample would be identified by its entire row/column showing very high distance (very light yellow color). In those cases, the sample could be eliminated from the data set with keep_data_subset() function provided with DEVis. The sample, “1d_mock_10”, has some potential to be an outlying sample in this case, and does not cluster similarly to other mock samples. It does however, appear to have some similarity with samples in the upper left clustered block, and is not uniformly distant from all samples. This puts this sample in a grey area, and its removal or inclusion in the data set is up to the researcher in this case. With this being a mock sample, however, it is quite possible that the inclusion or exclusion of this sample will impact the results of differential expression. It is, at this point, something to keep in mind as analysis continues. For the purposes of this vignette, we will include this sample in subsequent analysis.

The euclidian distance plot is also useful in indicating batch effects. For instance, if these data were prepared in two seperate labs or by three different technicians on four different sequence runs, that information could be included in the targets file and used as the labeling parameter here. Strong clustering by a particular batch metric would indicate that the data set may be biased and subject to a batch effect.

In this case, we see that a strong factor appears to be the donor. This makes sense because individuals are likely to be most similar to themselves regardless of the change of a few hours or their infection status. Next we can see that many of the “treatment” factors also tend to cluster together, indicating that after donor id, the type of infection seems to be the next cause of similarity between samples. There appears to be a strong clustering effect between restv and mock samples (seen in the large upper left cluster) and between the ebola and lps samples (lower right). As mocks are essentially serving as the control samples in this experiment, this might indicate that the restv type treatment does not cause as substantial differences in gene expression as do ebov and lps treatments.

This first step begins to tell the story of this data, providing a high-level view of the overall similarity for the data set as a whole, and allowing us to ask if the story the data tells agrees with a the expectations we might have for our experiment. In this case, so far the data appear to make sense, however if this plot behaved differently than what we would expect, we would need to investigate the cause of that discrepancy, or adjust our hypothesis to account for it.

As with all DEVis visualizations, the generated plot will be automatically saved to the appropriate output directory, in this case, to the /sample_distance/ folder as a .pdf file.

  • Note: A similar plot that uses poisson distance measurements instead of euclidian distance can be used as well. See ?plot_poisson_dist() for more information. Furthermore, independent one-way hierarchical clustering dendrograms without heatmaps can also be generated using the plot_dendro function, see ?plot_dendro for more information.

Multi-dimensional Scaling Analysis (MDS)

Next, let’s look at the data through the lens of multi-dimensional scaling, in which each sample is given a x/y coordinate in an arbitrary space. MDS plots are excellent for finding separation and overlap between sub-groups within data, and by drawing MDS plots with reference to different subgroups, it is possible to identify the factor responsible for large scale changes in a data set.

#Create a MDS plot showing effects of treatment and time.
plot_mds(filename="MDS.pdf", 
         color_var="treatment", 
         shape_var="time", 
         theme=1)