The epitweetr package allows you to automatically monitor trends of tweets by time, place and topic. This automated monitoring aims to detect public health threats early through the detection of signals (e.g. an unusual increase in the number of tweets for a specific time, place and topic). The epitweetr package was designed to focus on infectious diseases, but it can be extended to all hazards or other fields of study by modifying the topics and keywords.
The general principle behind epitweetr is that it collects tweets and related metadata from the Twitter Standard API version 1.1 (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview) and the Twitter API version 2.0 (https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) according to specified topics, and stores these tweets on your computer in a database that can be used to calculate statistics or as a search engine. epitweetr geolocalises the tweets and collects information on keywords, URLs and hashtags within a tweet, as well as entities and context detected by the Twitter API 2.0. Tweets are aggregated according to topic and geographical location. Next, a signal detection algorithm identifies where the number of tweets (by topic and geographical location) exceeds what is expected for a given day. If the number of tweets exceeds what is expected, epitweetr sends out email alerts to notify those who need to further investigate these signals following the epidemic intelligence processes (filtering, validation, analysis and preliminary assessment).
The package includes an interactive web application (Shiny app) with six pages: the dashboard, where you can visualise and explore tweets (Fig 1); the alerts page, where you can view the current alerts and train machine learning models for alert classification on user-defined categories (Fig 2); the geotag page, where you can evaluate the geolocation algorithm and provide annotations for improving its performance (Fig 3); the data protection page, where you can search, anonymise and delete tweets from the epitweetr database to support data deletion requests (Fig 4); the configuration page, where you can change settings and check the status of the underlying processes (Fig 5); and the troubleshoot page, with automatic checks and hints for using epitweetr with all its functionalities (Fig 6).
On the dashboard, users can view the aggregated number of tweets over time, the location of these tweets on a map and the most frequent elements found in or extracted from these tweets (words, hashtags, URLs, contexts and entities). These visualisations can be filtered by the topic, location and time period you are interested in. Other filters let you adjust the time unit of the timeline, whether retweets/quotes should be included, which geolocation types you are interested in, the sensitivity of the prediction interval for the signal detection, and the number of days used to calculate the threshold for signals. This information can also be downloaded directly from this interface in the form of data, pictures and/or reports.
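To make the last two filters more concrete, the following is a minimal sketch of how an upper threshold can be derived from a baseline of previous days and a prediction-interval sensitivity. It is a simplified illustration and not epitweetr's actual implementation; the function name `upper_threshold`, the parameter `alpha` and the daily counts are all hypothetical.

```r
# Simplified illustration only: NOT the algorithm implemented in epitweetr.
# Upper bound of a one-sided prediction interval computed from baseline counts.
upper_threshold <- function(baseline, alpha = 0.025) {
  n <- length(baseline)
  stopifnot(n >= 2)
  mean(baseline) + qt(1 - alpha, df = n - 1) * sd(baseline) * sqrt(1 + 1 / n)
}

# Hypothetical daily tweet counts for one topic and location (7 baseline days).
baseline <- c(12, 15, 11, 14, 13, 16, 12)
today <- 31

threshold <- upper_threshold(baseline)
is_signal <- today > threshold  # TRUE when today's count exceeds the threshold
```

In this sketch, a smaller `alpha` widens the interval and yields fewer signals, while a longer baseline changes how many past days contribute to the expected value.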
More information on the methodology used is available here
Fig 1: Shiny app dashboard
Fig 2: Shiny app alerts page
Fig 3: Shiny app geotag evaluation page
Fig 4: Shiny app data protection page
Fig 5: Shiny app configuration page
Fig 6: Shiny app troubleshoot page
Article 3 of the European Centre for Disease Prevention and Control (ECDC) founding regulation and Decision No 1082/2013/EU on serious cross-border threats to health have established the detection of public health threats as a core activity of ECDC.
ECDC performs Epidemic Intelligence (EI) activities aimed at rapidly detecting and assessing public health threats, focusing on infectious diseases, to ensure the EU’s health security. ECDC uses social media as one of its sources for the early detection of signals of public health threats. Until 2020, the monitoring of social media was mainly performed through the screening and analysis of posts from pre-selected experts or organisations, mainly on Twitter and Facebook.
More information and an online tutorial are available:
The primary objective of epitweetr is to use the Twitter Standard Search API version 1.1 and the Twitter Recent Search API version 2 to detect early signals of potential threats by topic and by geographical unit.
Its secondary objective is to enable the user through an interactive web interface to explore the trend of tweets by time, geographical location and topic, including information on top words and numbers of tweets from trusted users, using charts and tables.
The minimum and suggested hardware requirements for the computer are in the table below:
Hardware requirement | Minimum | Suggested |
---|---|---|
RAM | 8 GB | 16 GB |
CPU | 4 cores | 12 cores |
Disk space for 3 years of storage | 3 TB | 5 TB |
The CPU and RAM usage can be configured on the Shiny app configuration page (see section The interactive user application (Shiny app)>The configuration page). The RAM, CPU and space needed depend on the number and size of the topics you request in the collection process.
epitweetr is designed to be platform independent, working on Windows, Linux and Mac. We recommend that you use epitweetr on a computer that can run continuously. You can switch the computer off, but you may miss some tweets if the downtime is long enough, which will have implications for the alert detection.
If you need to reinstall epitweetr after activating its tasks, you must restart the machine running epitweetr first.
Before using epitweetr, the following items need to be installed:
R version 3.6.3 or higher
Java 1.8, e.g. openjdk version “1.8” https://www.java.com/download/. The 64-bit version is preferred over the 32-bit version because of memory limitations. On Mac, the Java Development Kit is also needed: https://docs.oracle.com/javase/9/install/installation-jdk-and-jre-macos.htm
If you are running epitweetr on Windows, you will also need Microsoft Visual C++, although in most cases it is likely to be pre-installed.
Pandoc, for exporting PDFs and Markdown
TeX installation (TinyTeX, MiKTeX or another TeX distribution) for exporting PDFs
Easiest: https://yihui.org/tinytex/ install from R, logoff/logon required after installation
https://miktex.org/download full installation required, logoff/logon required after installation
Machine learning optimisation (only for advanced users)
OpenBLAS (BLAS optimizer), which will speed up some of the geolocation processes: https://www.openblas.net/ Installation instructions: https://github.com/fommil/netlib-Java
or Intel MKL (https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html)
A scheduler
If using Windows, you need to install the R package: taskscheduleR
If using Linux, you need to plan the tasks manually
If using a Mac, you need to plan the tasks manually
If you would like to develop epitweetr further, then the following development tools are needed:
Git (source code control) https://git-scm.com/downloads
sbt (compiling Scala code) https://www.scala-sbt.org/download.html
If you are using Windows, then you will additionally need Rtools: https://cran.r-project.org/bin/windows/Rtools/
epitweetr will need to download some dependencies in order to work. The tool will do this automatically the first time the alert detection process is launched. The Shiny app configuration page will allow you to change the target URLs of these dependencies, which are the following:
CRAN JARs: Transitive dependencies for running Spark, Lucene and embedded Scala code. [https://repo1.maven.org/maven2]
Winutils.exe (Windows only): This is a Hadoop binary necessary for running Spark locally on Windows [https://github.com/steveloughran/winutils/raw/master/hadoop-3.0.0/bin/winutils.exe].
Please note that during the dependency download you will be prompted first to stop the embedded database and then to enable it again. If you are on Windows and you have activated the tasks using the ‘activate’ buttons on the configuration page, you can perform these steps by disabling and enabling the tasks in the ‘Windows Task Scheduler’. For more information see the section ‘Setting up tweet collection and the alert detection loop’.
After installing all required dependencies listed in the section “Prerequisites for running epitweetr”, you can install epitweetr:
Additionally, the R environment needs to know where the Java installation home is. To check this, type in the R console:
If the command returns null or an empty value, you will need to set the Java Home environment variable for your operating system (OS); please see the instructions for your specific OS. In some cases, epitweetr can work without setting the Java Home environment variable.
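A minimal sketch of this check, assuming the variable is named `JAVA_HOME` (the conventional name for the Java Home environment variable). It only reads the value visible to the current R session and does not change any system setting.

```r
# Read JAVA_HOME as seen by the current R session ("" when it is not set).
java_home <- Sys.getenv("JAVA_HOME")

if (nzchar(java_home)) {
  message("JAVA_HOME is set to: ", java_home)
} else {
  message("JAVA_HOME is empty; set it following the instructions for your OS.")
}
```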
The first time you run the application, if the tool cannot identify a secure password store provided by the operating system, you will see a pop-up window requesting a keyring password (Linux and Mac). This is a password necessary for storing encrypted Twitter credentials. Please choose a strong password and remember it. You will be asked for this password each time you run the tool. You can avoid this by setting a system environment variable named ecdc_twitter_tool_kr_password containing the chosen password.
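As a sketch, the variable can be set for the current R session with base R before launching the tool. The password value below is a placeholder, and this assumes that a variable set in the session environment is picked up in the same way as a system-wide one. For a persistent setting, the same name/value pair can be placed in the operating system's environment settings or in an `.Renviron` file.

```r
# Placeholder value for illustration only: use your own keyring password and
# avoid committing it to scripts or version control.
Sys.setenv(ecdc_twitter_tool_kr_password = "replace-with-your-keyring-password")

# Confirm that the variable is visible to this R session.
kr_password_is_set <- nzchar(Sys.getenv("ecdc_twitter_tool_kr_password"))
```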
You can launch the epitweetr Shiny app from an R session by typing the launch command in the R console. Replace “data_dir” with the designated data directory, which is a local folder you choose to store tweets, time series and configuration files in:
Please note that the data directory entered in R should have ‘/’ instead of ‘\’ (an example of a correct path would be ‘C:/user/name/Documents’). This applies especially in Windows if you copy the path from the File Explorer.
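The sketch below shows one way to convert a path copied from the Windows File Explorer and then launch the app. The helper `to_forward_slashes` is hypothetical (it is not part of epitweetr), and the example path is the one used above. The launch call is guarded so that it only runs in an interactive session where epitweetr is installed.

```r
# Hypothetical helper: convert Windows-style backslashes to forward slashes.
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# In R source code a single backslash inside a string is written as "\\".
data_dir <- to_forward_slashes("C:\\user\\name\\Documents")

# Launch the Shiny app only in an interactive session with epitweetr installed.
if (interactive() && requireNamespace("epitweetr", quietly = TRUE)) {
  epitweetr::epitweetr_app(data_dir)
}
```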
Alternatively, you can use a launcher: in an executable .bat or shell file, type the following (replacing “data_dir” with the designated data directory):
R --vanilla -e "epitweetr::epitweetr_app('data_dir')"
You can check that all requirements are properly installed in the troubleshoot page. More information is available in section The interactive user application (Shiny app)>Dashboard:The interactive user interface for visualisation>The troubleshoot page
Migrating epitweetr from previous versions to version 2.0 or higher is possible without any data loss. This section describes the steps needed to perform the migration.
This migration is not necessary if you are installing epitweetr for the first time.
In epitweetr v2, we have redesigned the way tweets and series are stored. In previous versions, tweets were saved as compressed JSON files and series as RDS data frames in the ‘tweets’ and ‘series’ folders, respectively. We have now moved to a different storage system that allows epitweetr to work as a search engine and supports efficient updates, deletions and faster aggregation. To do so, data is stored using Apache Lucene indexes in the ‘fs’ folder. Note that during migration, Twitter data are moved to the ‘fs’ folder and series are left as they are. epitweetr reports will combine data from the old and new storage systems.
If you have an existing installation that contains data in the previous format, you have to migrate it following the steps detailed in this section. This applies to any epitweetr version before v2.0.0. You can also check this by looking in ‘tweets/geo’ or ‘tweets/search’ folders. If there is a json.gz file, migration is needed.
The migration steps are the following:
epitweetr::detect_loop("data_dir")
In order to use epitweetr, you will need to collect and process tweets, run the epitweetr database and run the requirements and alerts pipeline. Further details are also available in subsequent sections of the user documentation. A summary of the steps needed is as follows:
Set up the Twitter authentication using a Twitter account or a Twitter developer app, see section Collection of tweets>Twitter authentication for more details
If you want to use the Twitter API v2, you have to request access from the Twitter developer portal. More information is available at [https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api]
Activate the embedded database
Windows: Click on the “Epitweetr database” activate button
Other platforms: In a new R session run the following command
You can confirm that the embedded database is running if the “Embeded database” status is “Running” on the Shiny app configuration page and “true” on the Shiny app troubleshoot page.
Activate the tweet collection and data processing
Windows: Click on the “Data collection & processing” activate button
Other platforms: In a new R session run the following command
You can confirm that the tweet collection is running if the “Data collection & processing” status is “Running” on the Shiny app configuration page (green text in screenshot above) and “true” in the Shiny app troubleshoot page.
Activate the epitweetr database
Windows: Click on the “epitweetr database” activate button
Other platforms: In a new R session run the following command
You can confirm that the epitweetr database is active if the “epitweetr database” status is “Running” on the Shiny app configuration page (green text in screenshot above) and “true” in the Shiny app troubleshoot page.
Activate the Requirements & alerts pipeline:
You can confirm that the requirements & alerts pipeline is running if the “Requirements & alerts pipeline” status is “Running” in the Shiny app configuration page and “true” in the Shiny app troubleshoot page.
You will be able to visualise tweets after “Data collection & processing” and “epitweetr database” are activated and the languages task has finished successfully.
You can start working with the generated signals. Happy signal detection!
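For platforms other than Windows, the sketch below builds one launcher command per background process, following the launcher pattern shown earlier in this guide. `detect_loop()` appears in this guide; `fs_loop()` and `search_loop()` are assumed here to be the corresponding epitweetr functions for the embedded database and the tweet collection, so check the package reference before relying on them. Each loop blocks its session, so each command is meant to run in its own R session or terminal.

```r
# One R call per background process; "data_dir" is a placeholder for your
# data directory. fs_loop() and search_loop() are assumptions (see above).
loop_calls <- c(
  database   = "epitweetr::fs_loop('data_dir')",
  collection = "epitweetr::search_loop('data_dir')",
  pipeline   = "epitweetr::detect_loop('data_dir')"
)

# Wrap each call in a launcher command suitable for a shell script.
launch_commands <- sprintf('R --vanilla -e "%s"', loop_calls)
names(launch_commands) <- names(loop_calls)

cat(launch_commands, sep = "\n")
```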
For more details you can go through the section How does it work? General architecture behind epitweetr, which describes the underlying processes behind the tweet collection and the signal detection. Also, the section “The interactive Shiny application (Shiny app)>The configuration page” describes the different settings on the configuration page.
The following sections describe in detail the above general principles. The settings of many of these elements can be configured in the Shiny app configuration page, which is explained in the section The interactive Shiny application (Shiny app)>The configuration page.
epitweetr uses the Twitter Standard Search API version 1.1 and/or the Twitter Recent Search API version 2.0. The advantage of these APIs is that they are a free service provided by Twitter, enabling users of epitweetr to access tweets free of charge. The search API is not meant to be an exhaustive source of tweets. It searches against a sample of recent tweets published in the past 7 days and it focuses on relevance and not completeness. This means that some tweets and users may be missing from search results.
While this may be a limitation in other fields of public health or research, the epitweetr development team believe that, for the objective of signal detection, a sample of tweets is sufficient to detect potential threats of importance in combination with other types of sources.
Other attributes of the Twitter Standard Search API version 1.1 include:
Only tweets from the last 5–8 days are indexed by Twitter
A maximum of 180 requests every 15 minutes are supported by the Twitter Standard Search API (450 requests every 15 minutes if you are using the Twitter developer app credentials; see next section)
Each request returns a maximum of 100 tweets and/or retweets
Other attributes of the Twitter Recent Search API version 2.0 include:
Only tweets from the last seven days are indexed by Twitter
A maximum of 300 requests every 15 minutes are supported
Each request returns a maximum of 100 tweets and/or retweets
A cap of 500,000 tweets per month at the Essential access level
If you are using both endpoints, epitweetr will alternate between them when the limits are hit.
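The limits listed above translate into upper bounds on collection volume. The following sketch works through the arithmetic, assuming every request returns the full 100 tweets (in practice many requests return fewer). The variable names are illustrative only.

```r
tweets_per_request <- 100
windows_per_day <- 24 * 60 / 15  # number of 15-minute windows in a day

# Maximum tweets per 15-minute window for each access type listed above.
max_tweets_per_window <- c(
  v1_user_account  = 180,
  v1_developer_app = 450,
  v2_recent_search = 300
) * tweets_per_request

# Theoretical daily ceiling if every window were fully used.
max_tweets_per_day <- max_tweets_per_window * windows_per_day

# Twitter API v2 monthly cap at the Essential access level.
monthly_cap <- 500000
requests_until_cap <- monthly_cap / tweets_per_request
minutes_until_cap <- requests_until_cap / 300 * 15

print(max_tweets_per_window)
print(max_tweets_per_day)
print(minutes_until_cap)
```

Under these assumptions the v2 monthly cap could be reached within a few hours of collection at full rate, which is one reason alternating between the two endpoints can matter.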
You can authenticate the collection of tweets by using a Twitter account (this approach utilises the rtweet package app) or by using a Twitter application. For the latter, you will need a Twitter developer account, which can take some time to obtain, due to verification procedures. We recommend using a Twitter account via the rtweet package for testing purposes and short-term use, and the Twitter developer application for long-term use.
Using a Twitter account: delegated via rtweet (user authentication)
You will need a Twitter account (username and password)
The rtweet package will send a request to Twitter, so it can access your Twitter account on your behalf
A pop-up window will appear where you can enter your Twitter user name and password to confirm that the application can access Twitter on your behalf. Twitter then issues a token, which is sent each time tweets are accessed. If you are already logged in to Twitter, this pop-up window may not appear and the credentials of the ‘active’ Twitter account on the machine will be used automatically
You can only use Twitter API version 1.1
Using a Twitter developer app: via epitweetr (app authentication)
If you have not done so already, you will need to create a Twitter developer account: [https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api]
Follow the instructions and answer the questions to activate the Twitter API v2 using Essential access.
Next, you will create a Project and an associated developer App during the onboarding process, which will provide you a set of credentials that you will use to authenticate all requests to the API.
Make a note of your OAuth settings
Add them to the configuration page in the Shiny app (see image below)
With this information epitweetr can request a token at any time directly from Twitter. The advantage of this method is that the token is not connected to any user information and tweets are returned independently of any user context.
With this app, you can perform 450 requests every 15 minutes instead of the 180 requests every 15 minutes that a Twitter account allows.
You can activate Twitter API version 2.0 on the configuration page
After the Twitter authentication, you need to specify a list of topics in epitweetr to indicate which tweets to collect. For each topic, you have one or more queries that epitweetr uses to collect the relevant tweets (e.g. several queries for a topic using different terminology and/or languages).
A query consists of keywords and operators that are used to match tweet attributes. Keywords separated by a space indicate an AND clause. You can also use an OR operator. A minus sign before the keyword (with no space between the sign and the keyword) indicates that the keyword should not be in the tweet attributes. While queries can be up to 512 characters long, best practice is to limit your query to 10 keywords and operators and to limit the complexity of the query, meaning that sometimes you need more than one query per topic. If a query exceeds these limits, it is recommended to split the topic into several queries.
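The rules above can be checked with a small helper before adding a query to the topics list. This is a sketch only: `check_query` is a hypothetical function (not part of epitweetr), the example query is invented, and the count is a naive whitespace split, so a quoted multi-word phrase would be counted as several terms.

```r
# Hypothetical helper: check a query against the 512-character limit and the
# suggested maximum of 10 keywords and operators.
check_query <- function(query, max_chars = 512, max_terms = 10) {
  terms <- strsplit(trimws(query), "\\s+")[[1]]
  terms <- terms[nzchar(terms)]
  list(
    n_chars = nchar(query),
    n_terms = length(terms),
    # Keywords prefixed with a minus sign are excluded from matching tweets.
    excluded = sub("^-", "", terms[startsWith(terms, "-")]),
    within_limits = nchar(query) <= max_chars && length(terms) <= max_terms
  )
}

# Invented example: (measles OR rubeola) AND outbreak, excluding "vaccine".
result <- check_query("measles OR rubeola outbreak -vaccine")
str(result)
```

A query for which `within_limits` is `FALSE` is a candidate for splitting into several queries under the same topic.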
epitweetr comes with a default list of topics as used by the ECDC Epidemic Intelligence team at the date of package generation (15th of December, 2021). You can view details of the list of topics on the Shiny app configuration page (see screenshot below). In addition, the colour coding in the downloadable file allows users to see if the query for a topic is too long (red colour), in which case the topic should be split into several queries.