ANLP Package

Achal Shah

2016-07-10

ANLP is an R package that provides the functionality needed to build a text prediction model.

Functions

The ANLP package provides the following functions:

readTextFile

This function reads text data from a file in the specified encoding.
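For example, assuming readTextFile takes a file path and an encoding string (check ?readTextFile for the exact signature), reading a raw text file might look like this; the file name below is only a placeholder:

myText <- readTextFile("my_corpus.txt", "UTF-8")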

The package also bundles a sample Twitter dataset, twitter.data, which becomes available once the package is loaded:

library(ANLP)
print(length(twitter.data))
## [1] 109091

There are more than 100k tweets in the dataset. We will sample about 10k of them to build our model.

sampleTextData

We will sample 10% of the data using the sampleTextData function:

train.data <- sampleTextData(twitter.data,0.1)
print(length(train.data))
## [1] 10839
head(train.data)
## [1] "Desk put together, room all set up. Oh boy, oh boy"                                                                      
## [2] "ya ik and i never asked him to follow me i only mentioned him once in one of my tweets- i didnt do anything else"        
## [3] "Small market baseball. You, know...for the 99%."                                                                         
## [4] "nice I watched the whole series, LOVED Julia and her mom Erica was such a badass"                                        
## [5] "I know, I know. Then you kick yourself when the fight goes lopsided. But if the upset DOES happen, wow. Nothing like it."
## [6] "love chris brown"

Now we have about 10k tweets, but we can see that the data is quite noisy: it contains punctuation, abbreviations, and contractions.

cleanTextData

This function cleans the corpus; as the output below shows, it converts the text to lower case and removes punctuation and numbers.

train.data.cleaned <- cleanTextData(train.data)
train.data.cleaned[[1]]$content[1:5]
## [1] "desk put together room all set up oh boy oh boy"                                                                   
## [2] "ya ik and i never asked him to follow me i only mentioned him once in one of my tweets i didnt do anything else"   
## [3] "small market baseball you knowfor the "                                                                            
## [4] "nice i watched the whole series loved julia and her mom erica was such a badass"                                   
## [5] "i know i know then you kick yourself when the fight goes lopsided but if the upset does happen wow nothing like it"

As we can see, all the text is now cleaned and looks good :)

The next step is to build N-gram models from our cleaned data corpus.

generateTDM

We will build 1-, 2-, and 3-gram models, generating a term frequency matrix for each.

unigramModel <- generateTDM(train.data.cleaned,1)
head(unigramModel)
##       word freq
## 13722  the 4256
## 15553  you 2761
## 497    and 1853
## 5129   for 1736
## 9345   not 1452
## 13712 that 1212
bigramModel <- generateTDM(train.data.cleaned,2)
head(bigramModel)
##          word freq
## 28339    i am  702
## 31422   it is  470
## 29809  in the  376
## 16407  do not  372
## 21435 for the  332
## 42993  of the  273
trigramModel <- generateTDM(train.data.cleaned,3)
head(trigramModel)
##                     word freq
## 37375           i do not  131
## 77823     thanks for the  110
## 15410       can not wait   68
## 37222          i can not   65
## 37008           i am not   56
## 49144 looking forward to   53
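To build intuition for what these term frequency matrices contain, here is a toy illustration of bigram counting in base R. This is only for illustration and is not how generateTDM is implemented internally:

text <- "i know i know then you kick yourself"
words <- strsplit(text, " ")[[1]]
# Pair each word with its successor to form bigrams
bigrams <- paste(head(words, -1), tail(words, -1))
sort(table(bigrams), decreasing = TRUE)
# "i know" occurs twice; every other bigram occurs once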

Good work :) Now that we have all three models, let's predict.

predict_Backoff

This function accepts a list of all the N-gram models, so let's merge them into a single list.
Note: remember to order the N-gram models in descending order of N (trigram, bigram, unigram).

nGramModelsList <- list(trigramModel,bigramModel,unigramModel)
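Conceptually, backoff tries the longest model first and falls back to shorter ones when the context is unseen. The sketch below is only for intuition, assuming each model is a data frame with columns word (a space-separated n-gram) and freq, sorted by decreasing frequency; the package's actual predict_Backoff implementation may differ in its details:

predictSketch <- function(input, models) {
  tokens <- strsplit(tolower(input), " +")[[1]]
  for (model in models) {
    # Infer this model's n from its first entry
    n <- length(strsplit(as.character(model$word[1]), " ")[[1]])
    if (n == 1) return(as.character(model$word[1]))  # last resort: top unigram
    # Match the last n-1 input words against the start of each stored n-gram
    context <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- model[startsWith(as.character(model$word), paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      completion <- strsplit(as.character(hits$word[1]), " ")[[1]]
      return(tail(completion, 1))  # most frequent continuation
    }
  }
}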

Let's predict some strings:

testString <- "I am the one who"
predict_Backoff(testString,nGramModelsList)
## [1] "blew"
testString <- "what is my"
predict_Backoff(testString,nGramModelsList)
## [1] "favorite"
testString <- "the best movie"
predict_Backoff(testString,nGramModelsList)
## [1] "about"

Enjoy, and feel free to send feedback to achalshah20@gmail.com.