Using lexRankr to find a user’s most representative tweets

Adam Spannbauer


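Packages Used

# packages whose functions are called without a namespace prefix below
library(lexRankr) # bind_lexrank
library(dplyr)    # mutate, select, arrange, %>%
library(stringr)  # str_replace_all
# httr, jsonlite, and purrr must also be installed; they are called with `::`
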
In this document we download tweets through the Twitter API and then analyze them with lexRankr to find a user’s most representative tweets. If you don’t care about interacting with the Twitter API, you can skip ahead to the Lexrank Analysis section.

Get user tweets

Before we can analyze tweets, we’ll need some tweets to analyze. We’ll be using Twitter’s API, so you’ll need to set up an account to obtain the required credentials: consumer key, consumer secret, token, and token secret. The code below shows how to set up these credentials for use in this vignette.

# set api tokens/keys/secrets as environment vars
# Sys.setenv(cons_key     = 'my_cons_key')
# Sys.setenv(cons_secret  = 'my_cons_sec')
# Sys.setenv(token        = 'my_token')
# Sys.setenv(token_secret = 'my_token_sec')

#sign oauth
auth <- httr::oauth_app("twitter", key=Sys.getenv("cons_key"), secret=Sys.getenv("cons_secret"))
sig  <- httr::sign_oauth1.0(auth, token=Sys.getenv("token"), token_secret=Sys.getenv("token_secret"))
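Because Sys.getenv returns an empty string for a variable that is not set, a missing credential would otherwise only surface later as a failed request. A minimal sketch of a check to run before signing follows; missing_env_vars is a hypothetical helper (not part of httr), and the variable names are the ones used above.

```r
# hypothetical helper: return the names of environment variables that are unset or empty
missing_env_vars <- function(var_names) {
  values <- Sys.getenv(var_names, unset = "")
  var_names[values == ""]
}

needed  <- c("cons_key", "cons_secret", "token", "token_secret")
not_set <- missing_env_vars(needed)
if (length(not_set) > 0) {
  message("Missing twitter credentials: ", paste(not_set, collapse = ", "))
}
```

If the message lists any names, set them with Sys.setenv before building the signature.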

Now that we have our credentials set up, let’s write a function to get a user’s tweets from the API. The function get_timeline_df, defined below, takes a user’s Twitter handle, the number of tweets to download, and the credentials we just set up. It returns a dataframe with the columns id, text, favorite_count, retweet_count, and created_at. The Twitter API returns at most 200 tweets per GET request, so the function loops until it has requested the desired number of tweets.

get_timeline_df <- function(user, n_tweets=200, oauth_sig) {
  i <- 0
  n_left <- n_tweets
  timeline_df <- NULL
  #loop until all n_tweets have been requested
  while (n_left > 0) {
    n_to_get <- min(200, n_left)
    i <- i+1
    #incorporate max id in get_url (so as not to download same 200 tweets repeatedly)
    #the first paste0 argument is the url of the user timeline endpoint (left blank here)
    if (i==1) {
      get_url <- paste0("",
                       user,"&count=", n_to_get)
    } else {
      get_url <- paste0("",
                       user,"&count=",n_to_get,"&max_id=", max_id)
    }
    #GET tweets
    response <- httr::GET(get_url, oauth_sig)
    #extract content and clean up
    response_content <- httr::content(response)
    json_content     <- jsonlite::toJSON(response_content)
    #clean out evil special chars
    json_conv <- iconv(json_content, "UTF-8", "ASCII", sub = "") %>%
      stringr::str_replace_all("\003", "") #special character (^C) not caught by above clean
    timeline_list <- jsonlite::fromJSON(json_conv)
    #extract desired fields
    fields_i_care_about <- c("id", "text", "favorite_count", "retweet_count", "created_at")
    timeline_df <- purrr::map(fields_i_care_about, ~unlist(timeline_list[[.x]])) %>% 
      purrr::set_names(fields_i_care_about) %>% 
      dplyr::as_tibble() %>% 
      dplyr::bind_rows(timeline_df) %>% 
      dplyr::distinct() #max_id is inclusive, so drop the tweet repeated at each page boundary
    #store min id (oldest tweet) to set as max id for next GET
    max_id <- min(purrr::map_dbl(timeline_list$id, 1))
    #update number of tweets left
    n_left <- n_left-n_to_get
  }
  return(timeline_df)
}
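The paging logic can be tried out without any network calls. The sketch below swaps the API for a toy vector of ids; fake_get and all_ids are hypothetical stand-ins, and the sketch assumes that max_id is inclusive (a request returns tweets with ids less than or equal to max_id), which is how Twitter documents the parameter.

```r
# toy timeline: tweet ids, newest first (stand-in for the api, no network calls)
all_ids <- 1000:1

# hypothetical stand-in for one GET: the newest `count` ids that are <= max_id
fake_get <- function(count, max_id = Inf) {
  head(all_ids[all_ids <= max_id], count)
}

collected <- integer(0)
max_id <- Inf
for (n_to_get in c(200, 200, 200)) {
  page      <- fake_get(n_to_get, max_id)
  collected <- c(collected, page)
  max_id    <- min(page) # oldest id seen so far becomes the next max_id
}

length(collected)         # ids downloaded across the three requests
length(unique(collected)) # distinct ids among them
```

Under that assumption the oldest tweet of each page is returned again at the top of the next page, so three requests of 200 give 600 rows but only 598 distinct tweets. De-duplicating the combined result (for example with dplyr::distinct() or unique()) keeps each tweet once.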

We can now use our function to gather a user’s tweets along with each tweet’s date-time, favorite count, and retweet count. Let’s use one of the most famous Twitter accounts as of late: @realDonaldTrump.

tweets_df <- get_timeline_df("realDonaldTrump", 600, sig) %>% 
    mutate(text = str_replace_all(text, "\n", " ")) #clean out newlines for display

tweets_df %>% 
  head(n=3) %>% 
  select(text, created_at) %>% 
  knitr::kable()

|text |created_at |
|:----|:----------|
|Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY! |Tue Dec 20 20:27:57 +0000 2016 |
|especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states |Tue Dec 20 13:09:18 +0000 2016 |
|Bill Clinton stated that I called him after the election. Wrong, he called me (with a very nice congratulations). He “doesn’t know much” … |Tue Dec 20 13:03:59 +0000 2016 |

Lexrank Analysis

We now have a dataframe that contains a column of tweets, which will be the subject of the rest of the analysis. With the data in this format, we only need to call the bind_lexrank function to apply the lexrank algorithm to the tweets. The function adds a column of lexrank scores; the higher the score, the more representative the tweet is of the tweets we downloaded.

Note: typically one would parse documents into sentences before applying lexrank (see ?unnest_sentences); however, we will treat each tweet as a sentence for this analysis.

tweets_df %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @realDonaldTrump Tweets")
Most Representative @realDonaldTrump Tweets

|text | lexrank|
|:----|-------:|
|Well, the New Year begins. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0085258|
|Happy Thanksgiving to everyone. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0060486|
|Hopefully, all supporters, and those who want to MAKE AMERICA GREAT AGAIN, will go to D.C. on January 20th. It will be a GREAT SHOW! | 0.0059713|
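
To make the score less of a black box, here is a rough base-R sketch of the idea behind lexrank: treat each tweet as a node in a graph, link tweets whose word counts have a high cosine similarity, and score the nodes with a PageRank-style power iteration. This is a simplified illustration and not lexRankr’s implementation; the toy tweets, the 0.2 threshold, the 0.85 damping factor, and the helper name toy_lexrank are all assumptions made for this example, and it uses plain word counts where real implementations typically weight words by idf.

```r
# hypothetical helper: simplified lexrank-style scores for a character vector of texts
toy_lexrank <- function(texts, threshold = 0.2, damping = 0.85, n_iter = 100) {
  n <- length(texts)
  # tokenize: lowercase, split on anything that is not a letter
  tokens <- lapply(strsplit(tolower(texts), "[^a-z]+"), function(x) x[x != ""])
  vocab  <- unique(unlist(tokens))
  # term-count matrix: one row per text, one column per word
  counts <- matrix(0, nrow = n, ncol = length(vocab), dimnames = list(NULL, vocab))
  for (i in seq_len(n)) {
    tab <- table(tokens[[i]])
    counts[i, names(tab)] <- as.numeric(tab)
  }
  # cosine similarity between every pair of texts
  norms <- sqrt(rowSums(counts^2))
  simil <- (counts %*% t(counts)) / (norms %o% norms)
  # keep only strong links and drop self-links
  adj <- (simil >= threshold) * 1
  diag(adj) <- 0
  # row-normalize into transition probabilities; unlinked texts jump uniformly
  row_tot <- rowSums(adj)
  trans <- adj / ifelse(row_tot == 0, 1, row_tot)
  trans[row_tot == 0, ] <- 1 / n
  # power iteration with damping, as in PageRank
  scores <- rep(1 / n, n)
  for (k in seq_len(n_iter)) {
    scores <- (1 - damping) / n + damping * as.numeric(t(trans) %*% scores)
  }
  scores
}

toy_tweets <- c("make america great again",
                "we will make america great again together",
                "great show on friday",
                "my dog is a good dog")
toy_scores <- toy_lexrank(toy_tweets)
toy_tweets[order(toy_scores, decreasing = TRUE)]
```

The first toy tweet shares enough words with two of the others to be linked to both, so it ranks first, while the tweet that shares no words with the rest ranks last. The same intuition explains the table above: tweets built around a phrase the account repeats often end up with the highest scores.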

Repeating the lexrank analysis for other users

With our get_timeline_df function we can easily repeat this analysis for other users. Below we repeat the whole analysis in a single magrittr pipeline.

get_timeline_df("dog_rates", 600, sig) %>% 
  mutate(text = str_replace_all(text, "\n", " ")) %>% 
  bind_lexrank(text, id, level="sentences") %>% 
  arrange(desc(lexrank)) %>% 
  head(n=5) %>% 
  select(text, lexrank) %>% 
  knitr::kable(caption = "Most Representative @dog_rates Tweets")
Most Representative @dog_rates Tweets

|text | lexrank|
|:----|-------:|
|@Lin_Manuel good day good dog | 0.0167123|
|Please keep loving | 0.0099864|
|Here we h*ckin go | 0.0085708|
|Last day to get anything from our Valentine’s Collection by Valentine’s Day! Shop: | 0.0077583|
|Even if I tried (which I would never), I’d last like 17 seconds | 0.0073899|