valr


valr provides tools to read and manipulate genome intervals and signals, similar to the BEDtools suite. valr enables analysis in the R/RStudio environment, leveraging modern R tools in the tidyverse for a terse, expressive syntax. Compute-intensive algorithms are implemented in Rcpp/C++, and many methods take advantage of the speed and grouping capability provided by dplyr.

Installation

The latest stable version can be installed from CRAN:

install.packages('valr')

The latest development version can be installed from GitHub:

# install.packages("devtools")
devtools::install_github('rnabioco/valr')

Why valr?

Why another tool set for interval manipulations? Based on our experience teaching genome analysis, we were motivated to develop interval arithmetic software that facilitates genome analysis in a single environment (RStudio), eliminating the need to master both command-line and exploratory analysis tools.

Note: valr can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.

Familiar tools, natively in R

The functions in valr have similar names to their BEDtools counterparts, and so will be familiar to users coming from the BEDtools suite. Unlike other tools that wrap BEDtools and write temporary files to disk, valr tools run natively in memory. Similar to pybedtools, valr has a terse syntax:

library(valr)
library(dplyr)

snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)

# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)

nearby %>%
  select(starts_with('name'), .overlap, .dist) %>%
  filter(abs(.dist) < 5000)
#> # A tibble: 1,047 x 4
#>    name.x      name.y   .overlap .dist
#>    <chr>       <chr>       <int> <int>
#>  1 rs530458610 P704P           0  2579
#>  2 rs2261631   P704P           0 - 268
#>  3 rs570770556 POTEH           0 - 913
#>  4 rs538163832 POTEH           0 - 953
#>  5 rs190224195 POTEH           0 -1399
#>  6 rs2379966   DQ571479        0  4750
#>  7 rs142687051 DQ571479        0  3558
#>  8 rs528403095 DQ571479        0  3309
#>  9 rs555126291 DQ571479        0  2745
#> 10 rs5747567   DQ571479        0 -1778
#> # ... with 1,037 more rows
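The same subtract-then-closest pattern can be tried on a small hand-made data set, which makes it easier to see which intervals are dropped and which are paired. The sketch below is illustrative only: the gene and SNP coordinates and names are hypothetical and do not come from the package's example files, and the intervals are built with tibble::tribble() on the assumption that valr accepts any tibble with chrom, start and end columns.

```r
library(valr)
library(dplyr)

# Hypothetical toy data: two genes and three 1-bp SNPs on one chromosome.
genes <- tibble::tribble(
  ~chrom, ~start, ~end, ~name,
  "chr1", 100,    200,  "geneA",
  "chr1", 500,    600,  "geneB"
)

snps <- tibble::tribble(
  ~chrom, ~start, ~end, ~name,
  "chr1", 150,    151,  "snp_in_geneA",
  "chr1", 250,    251,  "snp_near_geneA",
  "chr1", 450,    451,  "snp_near_geneB"
)

# Remove SNPs that fall inside a gene; only intergenic SNPs remain.
intergenic <- bed_subtract(snps, genes)

# Pair each remaining SNP with its closest gene.
nearby <- bed_closest(intergenic, genes)

# Columns from x and y are suffixed .x and .y; .overlap and .dist are added.
nearby %>%
  select(starts_with("name"), .overlap, .dist)
```

Here the SNP placed inside geneA should be absent from intergenic, and each remaining SNP is reported alongside a gene name, an overlap and a distance.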

Visual documentation

valr includes helpful glyphs to illustrate the results of specific operations, similar to those found in the BEDtools documentation. For example, bed_glyph() illustrates the result of intersecting x and y intervals with bed_intersect():

library(valr)

x <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)

y <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)

bed_glyph(bed_intersect(x, y))
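To see the values behind the glyph, the same intersection can be run without plotting. This is a minimal sketch that reuses the x and y intervals above but builds them with tibble::tribble() rather than trbl_interval(), on the assumption that plain tibbles with chrom, start and end columns are accepted as input.

```r
library(valr)

# Same intervals as in the glyph example, built as plain tibbles.
x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 25,     50,
  "chr1", 100,    125
)

y <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 30,     75
)

# Each row of the result pairs an x interval with a y interval it overlaps.
res <- bed_intersect(x, y)
res
```

Only the first x interval (25 to 50) overlaps the y interval (30 to 75), so the result should contain a single row with an .overlap of 20 bases; the second x interval has no partner and is not reported.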

Reproducible reports

valr can be used in RMarkdown documents to generate reproducible workflows for data processing. Because computations in valr are fast, it can be used for exploratory analysis with RMarkdown and for interactive analysis using shiny.

Remote databases

Remote databases can be accessed with db_ucsc() (to access the UCSC Genome Browser) and db_ensembl() (to access Ensembl databases).

library(valr)
library(dplyr)

# access the `refGene` tbl on the `hg38` assembly
ucsc <- db_ucsc('hg38')
tbl(ucsc, 'refGene')

API

Function names are similar to their BEDtools counterparts, with some additions.

Data types

Reading data

Transforming single interval sets

Comparing multiple interval sets

Randomizing intervals

Interval statistics

Utilities