gutenbergr: Search and download public domain texts from Project Gutenberg

David Robinson

2018-01-26

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing its Gutenberg ID with title, author, language, and other fields:

library(gutenbergr)
gutenberg_metadata
## # A tibble: 51,997 x 8
##    gutenberg_id title         author   guten… lang… gutenber… rights has_…
##           <int> <chr>         <chr>     <int> <chr> <chr>     <chr>  <lgl>
##  1            0 <NA>          <NA>         NA en    <NA>      Publi… T    
##  2            1 The Declarat… Jeffers…   1638 en    United S… Publi… T    
##  3            2 "The United … United …      1 en    American… Publi… T    
##  4            3 John F. Kenn… Kennedy…   1666 en    <NA>      Publi… T    
##  5            4 "Lincoln's G… Lincoln…      3 en    US Civil… Publi… T    
##  6            5 The United S… United …      1 en    American… Publi… T    
##  7            6 Give Me Libe… Henry, …      4 en    American… Publi… T    
##  8            7 The Mayflowe… <NA>         NA en    <NA>      Publi… T    
##  9            8 Abraham Linc… Lincoln…      3 en    US Civil… Publi… T    
## 10            9 Abraham Linc… Lincoln…      3 en    US Civil… Publi… T    
## # ... with 51,987 more rows

For example, you could find the Gutenberg ID of Wuthering Heights with:

library(dplyr)

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")
## # A tibble: 1 x 8
##   gutenberg_id title             author guten… langu… gutenb… rights has_…
##          <int> <chr>             <chr>   <int> <chr>  <chr>   <chr>  <lgl>
## 1          768 Wuthering Heights Bront…    405 en     Gothic… Publi… T

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()
## # A tibble: 40,737 x 8
##    gutenberg_id title         author   guten… lang… gutenber… rights has_…
##           <int> <chr>         <chr>     <int> <chr> <chr>     <chr>  <lgl>
##  1            0 <NA>          <NA>         NA en    <NA>      Publi… T    
##  2            1 The Declarat… Jeffers…   1638 en    United S… Publi… T    
##  3            2 "The United … United …      1 en    American… Publi… T    
##  4            3 John F. Kenn… Kennedy…   1666 en    <NA>      Publi… T    
##  5            4 "Lincoln's G… Lincoln…      3 en    US Civil… Publi… T    
##  6            5 The United S… United …      1 en    American… Publi… T    
##  7            6 Give Me Libe… Henry, …      4 en    American… Publi… T    
##  8            7 The Mayflowe… <NA>         NA en    <NA>      Publi… T    
##  9            8 Abraham Linc… Lincoln…      3 en    US Civil… Publi… T    
## 10            9 Abraham Linc… Lincoln…      3 en    US Civil… Publi… T    
## # ... with 40,727 more rows
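
As a rough illustration of what this pre-filtering amounts to, the sketch below applies the three conditions described above to a small made-up table. It is not the package's implementation: the column names language and has_text are assumptions based on the truncated headers, and treating rows with the same title as duplicates is a simplification chosen for the example.

```r
library(dplyr)

# Toy stand-in for gutenberg_metadata (all values are made up).
# `language` and `has_text` are assumed column names.
toy_metadata <- tibble(
  gutenberg_id = c(1L, 2L, 3L, 4L, 5L),
  title        = c("Book A", "Book B", "Book B", "Livre C", "Book D"),
  language     = c("en", "en", "en", "fr", "en"),
  has_text     = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

# Keep English works with downloadable text, then keep one row per title.
toy_works <- toy_metadata %>%
  filter(language == "en", has_text) %>%
  distinct(title, .keep_all = TRUE)

toy_works
```

In practice gutenberg_works() is the simpler choice, since it applies this kind of filtering for you.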

gutenberg_works() also accepts filter conditions as arguments:

gutenberg_works(author == "Austen, Jane")
## # A tibble: 10 x 8
##    gutenberg_id title           author gutenb… lang… gutenbe… rights has_…
##           <int> <chr>           <chr>    <int> <chr> <chr>    <chr>  <lgl>
##  1          105 Persuasion      Auste…      68 en    <NA>     Publi… T    
##  2          121 Northanger Abb… Auste…      68 en    Gothic … Publi… T    
##  3          141 Mansfield Park  Auste…      68 en    <NA>     Publi… T    
##  4          158 Emma            Auste…      68 en    <NA>     Publi… T    
##  5          161 Sense and Sens… Auste…      68 en    <NA>     Publi… T    
##  6          946 Lady Susan      Auste…      68 en    <NA>     Publi… T    
##  7         1212 Love and Frein… Auste…      68 en    <NA>     Publi… T    
##  8         1342 Pride and Prej… Auste…      68 en    Best Bo… Publi… T    
##  9        31100 "The Complete … Auste…      68 en    <NA>     Publi… T    
## 10        42078 "The Letters o… Auste…      68 en    <NA>     Publi… T

# or with a regular expression
library(stringr)
gutenberg_works(str_detect(author, "Austen"))
## # A tibble: 13 x 8
##    gutenberg_id title          author  gutenb… lang… gutenbe… rights has_…
##           <int> <chr>          <chr>     <int> <chr> <chr>    <chr>  <lgl>
##  1          105 Persuasion     Austen…      68 en    <NA>     Publi… T    
##  2          121 Northanger Ab… Austen…      68 en    Gothic … Publi… T    
##  3          141 Mansfield Park Austen…      68 en    <NA>     Publi… T    
##  4          158 Emma           Austen…      68 en    <NA>     Publi… T    
##  5          161 Sense and Sen… Austen…      68 en    <NA>     Publi… T    
##  6          946 Lady Susan     Austen…      68 en    <NA>     Publi… T    
##  7         1212 Love and Frei… Austen…      68 en    <NA>     Publi… T    
##  8         1342 Pride and Pre… Austen…      68 en    Best Bo… Publi… T    
##  9        17797 Memoir of Jan… Austen…    7603 en    <NA>     Publi… T    
## 10        31100 "The Complete… Austen…      68 en    <NA>     Publi… T    
## 11        33513 The Frightene… Austen…   36446 en    <NA>     Publi… T    
## 12        39897 Discoveries A… Layard…   40288 en    <NA>     Publi… T    
## 13        42078 "The Letters … Austen…      68 en    <NA>     Publi… T
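
Note that the unanchored pattern also matches authors who merely have "Austen" somewhere in their name, which is why the result grows from 10 to 13 rows. The sketch below shows the difference an anchored pattern makes. The author strings are illustrative stand-ins in the "Surname, Forename" style, not values copied from the dataset, and base R's grepl() is used only to keep the example dependency-free.

```r
# Illustrative author strings (hypothetical values).
authors <- c(
  "Austen, Jane",
  "Austen-Leigh, James Edward",
  "Layard, Austen Henry",
  "Dickens, Charles"
)

# Unanchored: matches "Austen" anywhere in the string.
loose <- grepl("Austen", authors)

# Anchored to the start and followed by a comma: matches only
# authors whose surname is exactly "Austen".
strict <- grepl("^Austen,", authors)

authors[loose]
authors[strict]
```

The same anchored pattern should work inside str_detect(), since both functions accept regular expressions.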

The metadata currently in the package was last updated on 05 May 2016.

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that “Wuthering Heights” has ID 768 (the ID also appears in the book's Project Gutenberg URL), so gutenberg_download(768) downloads this text.

wuthering_heights <- gutenberg_download(768)

wuthering_heights
## # A tibble: 12,085 x 2
##    gutenberg_id text                                                      
##           <int> <chr>                                                     
##  1          768 WUTHERING HEIGHTS                                         
##  2          768 ""                                                        
##  3          768 ""                                                        
##  4          768 CHAPTER I                                                 
##  5          768 ""                                                        
##  6          768 ""                                                        
##  7          768 1801.--I have just returned from a visit to my landlord--…
##  8          768 neighbour that I shall be troubled with.  This is certain…
##  9          768 country!  In all England, I do not believe that I could h…
## 10          768 situation so completely removed from the stir of society.…
## # ... with 12,075 more rows

The result is a tbl_df (a type of data frame) with two columns: gutenberg_id (useful when multiple books are returned) and text, a character vector with one row per line. The header and footer added by Project Gutenberg have been stripped away.
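
Because each row is one line of text, ordinary data frame operations can annotate the book. The following is a minimal sketch on a made-up six-line "book": it adds a line number and a running chapter count. The assumption that chapter headings start with "CHAPTER " is taken from the output above and will not hold for every text.

```r
library(dplyr)

# Toy stand-in for a downloaded book: one row per line of text.
toy_book <- tibble(
  gutenberg_id = 768L,
  text = c("CHAPTER I", "", "First line.", "CHAPTER II", "", "Second line.")
)

annotated <- toy_book %>%
  mutate(
    linenumber = row_number(),
    # Every line that starts with "CHAPTER " increments the counter.
    chapter = cumsum(grepl("^CHAPTER ", text))
  )

annotated
```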

Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:

books <- gutenberg_download(c(768, 1260), meta_fields = "title")

books
## # A tibble: 32,744 x 3
##    gutenberg_id text                                          title       
##           <int> <chr>                                         <chr>       
##  1          768 WUTHERING HEIGHTS                             Wuthering H…
##  2          768 ""                                            Wuthering H…
##  3          768 ""                                            Wuthering H…
##  4          768 CHAPTER I                                     Wuthering H…
##  5          768 ""                                            Wuthering H…
##  6          768 ""                                            Wuthering H…
##  7          768 1801.--I have just returned from a visit to … Wuthering H…
##  8          768 neighbour that I shall be troubled with.  Th… Wuthering H…
##  9          768 country!  In all England, I do not believe t… Wuthering H…
## 10          768 situation so completely removed from the sti… Wuthering H…
## # ... with 32,734 more rows

The meta_fields argument adds one or more fields from gutenberg_metadata, such as title or author, to the downloaded text. Counting rows by title confirms that both books are present:

books %>%
  count(title)
## # A tibble: 2 x 2
##   title                           n
##   <chr>                       <int>
## 1 Jane Eyre: An Autobiography 20659
## 2 Wuthering Heights           12085
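
One way to picture the effect of meta_fields is as a join between the downloaded text and the metadata on gutenberg_id. The sketch below reproduces that effect on made-up tables. It illustrates the resulting shape only and makes no claim about how gutenberg_download() is implemented.

```r
library(dplyr)

# Toy downloaded text: one row per line, keyed by gutenberg_id.
toy_text <- tibble(
  gutenberg_id = c(768L, 768L, 1260L),
  text = c("WUTHERING HEIGHTS", "CHAPTER I", "JANE EYRE")
)

# Toy metadata: one row per work.
toy_metadata <- tibble(
  gutenberg_id = c(768L, 1260L),
  title = c("Wuthering Heights", "Jane Eyre: An Autobiography")
)

# Attach the title to every line of the matching work.
toy_books <- toy_text %>%
  left_join(toy_metadata, by = "gutenberg_id")

toy_books %>%
  count(title)
```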

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. The gutenberg_subjects dataset pairs works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress Subject Headings:

gutenberg_subjects
## # A tibble: 140,173 x 3
##    gutenberg_id subject_type subject                                      
##           <int> <chr>        <chr>                                        
##  1            1 lcc          E201                                         
##  2            1 lcsh         United States. Declaration of Independence   
##  3            1 lcsh         United States -- History -- Revolution, 1775…
##  4            1 lcc          JK                                           
##  5            2 lcc          KF                                           
##  6            2 lcsh         Civil rights -- United States -- Sources     
##  7            2 lcsh         United States. Constitution. 1st-10th Amendm…
##  8            2 lcc          JK                                           
##  9            3 lcsh         Presidents -- United States -- Inaugural add…
## 10            3 lcsh         United States -- Foreign relations -- 1961-1…
## # ... with 140,163 more rows

This is useful for extracting texts on a particular topic or in a particular genre, such as detective stories, or texts featuring a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
## # A tibble: 521 x 3
##    gutenberg_id subject_type subject                      
##           <int> <chr>        <chr>                        
##  1          170 lcsh         Detective and mystery stories
##  2          173 lcsh         Detective and mystery stories
##  3          244 lcsh         Detective and mystery stories
##  4          305 lcsh         Detective and mystery stories
##  5          330 lcsh         Detective and mystery stories
##  6          481 lcsh         Detective and mystery stories
##  7          547 lcsh         Detective and mystery stories
##  8          863 lcsh         Detective and mystery stories
##  9          905 lcsh         Detective and mystery stories
## 10         1155 lcsh         Detective and mystery stories
## # ... with 511 more rows

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
## # A tibble: 47 x 3
##    gutenberg_id subject_type subject                                      
##           <int> <chr>        <chr>                                        
##  1          108 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  2          221 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  3          244 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  4          834 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  5         1661 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  6         2097 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  7         2343 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  8         2344 lcsh         Holmes, Sherlock (Fictitious character) -- F…
##  9         2345 lcsh         Holmes, Sherlock (Fictitious character) -- F…
## 10         2346 lcsh         Holmes, Sherlock (Fictitious character) -- F…
## # ... with 37 more rows
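
Linking a subject filter back to the metadata can be sketched as follows. The tables are small made-up stand-ins with the same key column, and semi_join() is one reasonable choice here because it keeps the matching works without duplicating rows when a work has several subjects.

```r
library(dplyr)

# Toy stand-ins for gutenberg_subjects and gutenberg_metadata.
toy_subjects <- tibble(
  gutenberg_id = c(10L, 10L, 11L, 12L),
  subject_type = c("lcc", "lcsh", "lcsh", "lcsh"),
  subject = c("PR", "Detective and mystery stories",
              "Detective and mystery stories", "Sea stories")
)

toy_metadata <- tibble(
  gutenberg_id = c(10L, 11L, 12L),
  title = c("Mystery One", "Mystery Two", "Voyage Three")
)

# IDs of works tagged with the subject of interest.
detective_ids <- toy_subjects %>%
  filter(subject == "Detective and mystery stories") %>%
  distinct(gutenberg_id)

# Keep only the metadata rows for those works.
detective_works <- toy_metadata %>%
  semi_join(detective_ids, by = "gutenberg_id")

detective_works
```

With the real datasets, a vector such as detective_ids$gutenberg_id could then be passed to gutenberg_download(), which accepts a vector of IDs.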

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors
## # A tibble: 16,236 x 7
##    gutenberg_author_id author   alias    birth… deat… wikipe… aliases     
##                  <int> <chr>    <chr>     <int> <int> <chr>   <chr>       
##  1                   1 United … <NA>         NA    NA <NA>    <NA>        
##  2                   3 Lincoln… <NA>       1809  1865 http:/… United Stat…
##  3                   4 Henry, … <NA>       1736  1799 http:/… <NA>        
##  4                   5 Adam, P… <NA>         NA    NA <NA>    <NA>        
##  5                   7 Carroll… Dodgson…   1832  1898 http:/… <NA>        
##  6                   8 United … <NA>         NA    NA <NA>    Agency, Uni…
##  7                   9 Melvill… Melvill…   1819  1891 http:/… <NA>        
##  8                  10 Barrie,… Barrie,…   1860  1937 http:/… <NA>        
##  9                  12 Smith, … Smith, …   1805  1844 http:/… <NA>        
## 10                  14 Madison… United …   1751  1836 http:/… <NA>        
## # ... with 16,226 more rows
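
Author information can be combined with the work metadata through the author ID. The sketch below selects works by authors born before 1800, using made-up tables. The column names gutenberg_author_id and birthdate are assumptions based on the truncated headers in the output above.

```r
library(dplyr)

# Toy stand-in for gutenberg_authors (`birthdate` is an assumed name).
toy_authors <- tibble(
  gutenberg_author_id = c(1L, 2L, 3L),
  author = c("Author, Early", "Author, Late", "Author, Unknown"),
  birthdate = c(1775L, 1850L, NA)
)

# Toy stand-in for gutenberg_metadata.
toy_metadata <- tibble(
  gutenberg_id = c(100L, 101L, 102L, 103L),
  title = c("Work W", "Work X", "Work Y", "Work Z"),
  gutenberg_author_id = c(1L, 1L, 2L, 3L)
)

# Join on the author ID, then filter; rows with a missing
# birth year are dropped by filter().
early_works <- toy_metadata %>%
  inner_join(toy_authors, by = "gutenberg_author_id") %>%
  filter(birthdate < 1800)

early_works
```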

Analysis

Once you have retrieved a book’s text, having it as a data frame is especially useful for working with the tidytext package for text analysis.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words
## # A tibble: 305,532 x 3
##    gutenberg_id title             word     
##           <int> <chr>             <chr>    
##  1          768 Wuthering Heights wuthering
##  2          768 Wuthering Heights heights  
##  3          768 Wuthering Heights chapter  
##  4          768 Wuthering Heights i        
##  5          768 Wuthering Heights 1801     
##  6          768 Wuthering Heights i        
##  7          768 Wuthering Heights have     
##  8          768 Wuthering Heights just     
##  9          768 Wuthering Heights returned 
## 10          768 Wuthering Heights from     
## # ... with 305,522 more rows

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts
## # A tibble: 21,201 x 3
##    title                       word           n
##    <chr>                       <chr>      <int>
##  1 Wuthering Heights           heathcliff   421
##  2 Wuthering Heights           linton       346
##  3 Jane Eyre: An Autobiography jane         342
##  4 Wuthering Heights           catherine    336
##  5 Jane Eyre: An Autobiography rochester    317
##  6 Jane Eyre: An Autobiography sir          315
##  7 Jane Eyre: An Autobiography miss         310
##  8 Jane Eyre: An Autobiography time         244
##  9 Jane Eyre: An Autobiography day          232
## 10 Jane Eyre: An Autobiography looked       221
## # ... with 21,191 more rows
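
Raw counts favor the longer book (Jane Eyre has many more lines than Wuthering Heights), so a common next step is to convert counts to within-book proportions. This is a minimal sketch on made-up counts with the same columns as word_counts. It is one possible follow-up, not part of the package.

```r
library(dplyr)

# Toy stand-in for word_counts (values are made up).
toy_counts <- tibble(
  title = c("Book A", "Book A", "Book B", "Book B"),
  word = c("moor", "storm", "moor", "school"),
  n = c(30L, 10L, 5L, 15L)
)

# Divide each count by the total for its book, so that books of
# different lengths can be compared on the same scale.
toy_props <- toy_counts %>%
  group_by(title) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  arrange(desc(proportion))

toy_props
```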

You may also find these resources useful: