tokenizers

An R package that collects functions with a consistent interface to convert natural language text into tokens.

Author: Lincoln Mullen and Dmitriy Selivanov
License: MIT

CRAN_Status_Badge CRAN_Downloads Travis-CI Build Status AppVeyor Build Status Coverage Status

Installation

You can install this package from CRAN:

install.packages("tokenizers")

To get the development version from GitHub, use devtools.

# install.packages("devtools")
devtools::install_github("ropensci/tokenizers")

Examples

The tokenizers in this package have a consistent interface. They all take either a character vector of any length, or a list where each element is a character vector of length one. The idea is that each element comprises a text. Then each function returns a list with the same length as the input vector, where each element in the list contains the tokens generated by the function. If the input character vector or list is named, then the names are preserved, so that the names can serve as identifiers.

library(magrittr)
library(tokenizers)

james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)

tokenize_characters(james)[[1]] %>% head(50)
#>  [1] "t" "h" "e" "q" "u" "e" "s" "t" "i" "o" "n" "t" "h" "u" "s" "b" "e"
#> [18] "c" "o" "m" "e" "s" "a" "v" "e" "r" "b" "a" "l" "o" "n" "e" "a" "g"
#> [35] "a" "i" "n" "a" "n" "d" "o" "u" "r" "k" "n" "o" "w" "l" "e" "d"
tokenize_character_shingles(james)[[1]] %>% head(20)
#>  [1] "the" "heq" "equ" "que" "ues" "est" "sti" "tio" "ion" "ont" "nth"
#> [12] "thu" "hus" "usb" "sbe" "bec" "eco" "com" "ome" "mes"
tokenize_words(james)[[1]] %>% head(10)
#>  [1] "the"      "question" "thus"     "becomes"  "a"        "verbal"  
#>  [7] "one"      "again"    "and"      "our"
tokenize_word_stems(james)[[1]] %>% head(10)
#>  [1] "the"      "question" "thus"     "becom"    "a"        "verbal"  
#>  [7] "one"      "again"    "and"      "our"
tokenize_sentences(james) 
#> [[1]]
#> [1] "The question thus becomes a verbal one again; and our knowledge of all these early stages of thought and feeling is in any case so conjectural and imperfect that farther discussion would not be worth while."                                               
#> [2] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean for us _the feelings, acts, and experiences of individual men in their solitude, so far as they apprehend themselves to stand in relation to whatever they may consider the divine_."
#> [3] "Since the relation may be either moral, physical, or ritual, it is evident that out of religion in the sense in which we take it, theologies, philosophies, and ecclesiastical organizations may secondarily grow."
tokenize_paragraphs(james)
#> [[1]]
#> [1] "The question thus becomes a verbal one again; and our knowledge of all these early stages of thought and feeling is in any case so conjectural and imperfect that farther discussion would not be worth while."                                                                                                                                                                                                                                                                   
#> [2] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean for us _the feelings, acts, and experiences of individual men in their solitude, so far as they apprehend themselves to stand in relation to whatever they may consider the divine_. Since the relation may be either moral, physical, or ritual, it is evident that out of religion in the sense in which we take it, theologies, philosophies, and ecclesiastical organizations may secondarily grow. "
tokenize_ngrams(james, n = 5, n_min = 2)[[1]] %>% head(10)
#>  [1] "the question"                   "the question thus"             
#>  [3] "the question thus becomes"      "the question thus becomes a"   
#>  [5] "question thus"                  "question thus becomes"         
#>  [7] "question thus becomes a"        "question thus becomes a verbal"
#>  [9] "thus becomes"                   "thus becomes a"
tokenize_skip_ngrams(james, n = 5, k = 2)[[1]] %>% head(10)
#>  [1] "the becomes one our all"          "question a again knowledge these"
#>  [3] "thus verbal and of early"         "becomes one our all stages"      
#>  [5] "a again knowledge these of"       "verbal and of early thought"     
#>  [7] "one our all stages and"           "again knowledge these of feeling"
#>  [9] "and of early thought is"          "our all stages and in"
tokenize_lines(james)[[1]] %>% head(5)
#> [1] "The question thus becomes a verbal one"                                   
#> [2] "again; and our knowledge of all these early stages of thought and feeling"
#> [3] "is in any case so conjectural and imperfect that farther discussion would"
#> [4] "not be worth while."                                                      
#> [5] "Religion, therefore, as I now ask you arbitrarily to take it, shall mean"
tokenize_regex(james, pattern = "[,.;]")[[1]] %>% head(5)
#> [1] "The question thus becomes a verbal one\nagain"                                                                                                                     
#> [2] " and our knowledge of all these early stages of thought and feeling\nis in any case so conjectural and imperfect that farther discussion would\nnot be worth while"
#> [3] "\n\nReligion"                                                                                                                                                      
#> [4] " therefore"                                                                                                                                                        
#> [5] " as I now ask you arbitrarily to take it"

Contributing

Contributions to the package are more than welcome. One way that you can help is by using this package in your R package for natural language processing. If you want to contribute a tokenization function to this package, it should follow the same conventions as the rest of the functions whenever it makes sense to do so.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


rOpenSCi logo