Twitter Sentiment Analysis in R

In this post I use R to perform sentiment analysis of Twitter data. Sentiment analysis is part of a broader set of tools available in the realm of NLP (natural language processing); for a more comprehensive overview of this area, this course is very helpful. Here I'll use the twitteR package, which interfaces with the Twitter API, to obtain thousands of tweets about a few different presidential candidates. I'll then explore the content of these tweets in a few different ways.

Sentiment analysis is often used to quantify user or consumer sentiment about online products or services. The input could be user reviews on Amazon or Google, or tweets about particular products or companies. Sentiment analysis can also be used to measure, and even predict, public opinion about political candidates, markets, or brands; here too the input could be tweets or web search terms, for example. The output ranges from a simple binary measure (positive vs. negative sentiment) to more complex multidimensional measures of affect and attitude. In this post I'll focus solely on a continuous measure of valence, or how words make people feel, which is available from a large database of human ratings.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE)

Interface with Twitter API

The R package twitteR provides an interface to the Twitter API, which requires authentication via OAuth (an open standard for authorization). The R package ROAuth enables this authentication. To authenticate, I'll need the Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Secret from my app settings on the Twitter Application Management page. I keep these in a local file called 'creds'.
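
Loading 'creds' is a one-liner. Here's a minimal sketch, assuming the credentials are stored as a small CSV with columns C_key, C_secret, A_token, and A_secret (this layout is just an illustration):

# assumed format: a one-row CSV named 'creds' with columns C_key, C_secret, A_token, A_secret
creds <- read.csv('creds', stringsAsFactors = FALSE)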

require(twitteR)
require(ROAuth)

consumerKey <- creds$C_key
consumerSecret <- creds$C_secret
accessToken <- creds$A_token
accessSecret <- creds$A_secret

# This function wraps the OAuth authentication handshake functions
# from the httr package for a twitteR session
setup_twitter_oauth(
  consumer_key = consumerKey,
  consumer_secret = consumerSecret, 
  access_token = accessToken, 
  access_secret = accessSecret)
## [1] "Using direct authentication"

This function wraps the searchTwitter function to collect up to N tweets about a 'topic' from a circular area, 2400 km in radius, centered on the continental US (so it roughly encompasses the USA, and then some). The function removes retweets and returns a data frame whose first column is the name of the topic; we need this index when we supply multiple topics. The 'since' and 'until' parameters determine the time window within which tweets will be pulled.

library(dplyr)

tweet2df = function(topic, N, since=NULL, until=NULL) {
  t <- searchTwitter(topic, n=N, since=since, until=until, 
                     geocode='39.8,-95.6,2400km',
                     retryOnRateLimit=10, lang="en")
  # retain 'original' tweets only
  t <- strip_retweets(t, strip_manual = TRUE, strip_mt = TRUE)
  # convert to data frame
  df <- twListToDF(t)
  df$longitude <- as.numeric(df$longitude)
  df$latitude <- as.numeric(df$latitude)
  return(cbind(topic, df))
}

The lapply call cycles through a few topics (I chose presidential candidates) and time windows, returning a data frame.

# search parameters 
days_ago <- 9:1
N <- 2500
topic <- c('Donald Trump', 'Ted Cruz', 'Hillary Clinton', 'Bernie Sanders')

# create data frame containing all search parameters
since <- format(Sys.Date()-days_ago)
parms <- data.frame(expand.grid(since=since, N=N, topic=topic))
parms$until <- format(Sys.Date()-(days_ago-1))
i <- sapply(parms, is.factor)
parms[i] <- lapply(parms[i], as.character)

# Call custom function with all combinations of search parms
# use dplyr::bind_rows instead of do.call(rbind(...))
timedTweets <- bind_rows(lapply(1:nrow(parms), 
  function(x) tweet2df(parms$topic[x], N=N, 
    since=parms$since[x], 
    until=parms$until[x]))) %>% 
  as.data.frame()

# Call function again, not restricting the time range
untimedTweets <- bind_rows(lapply(1:length(topic), 
  function(x) tweet2df(topic[x], N=N))) %>% 
  as.data.frame()

# Combine into single data frame 
allTweets <- rbind(timedTweets, untimedTweets)

Let's add a factor corresponding to the date of each tweet, extracted from the 'created' field.

library(stringr)
allTweets$day <- str_split_fixed(allTweets$created," ",2)[,1]

Text processing

The tweets in the 'text' column need to be tokenized. One way to do this is with the tm package, a comprehensive text mining framework for R. First we'll create a corpus from the tweets.

library(tm)

# create a vector source, which interprets each element of its argument as a document
v <- VectorSource(allTweets$text)

# create an object of class 'corpus'; a collection of documents containing NL text
docs <- Corpus(v)

# convert corpus to plain text
docs <- tm_map(docs, PlainTextDocument)

The goal now is to tokenize the raw text into units of some kind. Although there are ways to assess the sentiment of phrases and sentences, here the tokens will be individual words.
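
As a quick illustration of word-level tokenization, here's a sketch using stringr's boundary-based splitting, the same approach used later in this post to match tweet words against the lexicon (the example tweet is made up):

library(stringr)

# split an example tweet into lower-cased word tokens at word boundaries
example <- "Voters seem hopeful about the upcoming primaries!"
unlist(str_split(tolower(example), boundary("word")))
# "voters" "seem" "hopeful" "about" "the" "upcoming" "primaries"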

Using base R functions, first convert the text to ASCII and then change it all to lower case.

docs <- tm_map(docs, content_transformer(function(x) iconv(x, to='ASCII', sub='byte')))
docs <- tm_map(docs, content_transformer(function(x) tolower(x)))

Then remove some of the most common words in English (e.g. you, me, my, the), which won't add anything to a sentiment analysis. These are called stop words in the NLP literature.

docs <- tm_map(docs, removeWords, stopwords('en'))

Next, let's remove strings that start with URL indicators.

# remove URLs
stripURL = function(x) {
  gsub("www[^[:space:]]+|htt[^[:space:]]+", "", x)
}
docs <- tm_map(docs, content_transformer(stripURL))

We also don't need any numbers or punctuation marks, and extra whitespace can go. I won't deal with emoticons in this post, but one generally should. I could continue to use base R functions to clean up the text (as above), but the tm package has ready-made functions for these common tasks.

docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
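
A side note on emoticons and other non-ASCII symbols: the iconv call above, with sub='byte', replaces characters it can't convert with hex markers like '<f0>'. If you don't want those markers in the corpus, one option is a small content transformer along these lines (a sketch, not run here):

# strip the '<xx>' hex byte markers left behind by iconv(sub='byte')
removeByteCodes = function(x) gsub("<[0-9a-f]{2}>", "", x)
# docs <- tm_map(docs, content_transformer(removeByteCodes))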

It's often a good idea to stem the words in the corpus, that is, to strip affix morphemes and keep the stems (e.g., running becomes run). The affective word rating dataset I'll use here contains “love”, for example, but not “loved” or “loving”. We could use the relatively simple but very common Porter algorithm, included as the default in the tm package. I'll show the code here but I won't use it, because I've found that the algorithm changes too many words incorrectly (e.g. 'immigrants' gets turned into 'immigr').

# docs <- tm_map(docs, stemDocument)
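
To see the trade-off concretely, here's a quick check with the SnowballC package, which implements the Porter stemmer that tm's stemDocument relies on:

library(SnowballC)

# 'loved' and 'loving' map onto 'love', which is in the lexicon,
# but 'immigrants' gets truncated to 'immigr', which isn't
wordStem(c("loved", "loving", "running", "immigrants"))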

Convert the cleaned-up text back to a character vector and attach it to the original data frame, then remove the original 'text' column. Also remove duplicate tweets within each topic. Now we're ready to go.

allTweets$tweet <- as.character(unlist(sapply(docs, `[`, "content")))
allTweets$text <- NULL

# remove rows with duplicate tweets
allTweets <- allTweets %>% 
  group_by(topic) %>% 
  distinct(., tweet)

Word clouds

First I'll make some word clouds, which are a good way to see the most frequent words in a corpus. Since the size of each word will reflect its frequency, I'll first remove the topic (the Twitter search term) from each tweet; otherwise these terms would dominate the word clouds.

# function to remove topic from all tweets about that topic
removeTopic = function(topic, tweets) {
  words <- unlist(str_split(tolower(topic), boundary("word")))
  pattern <- paste(words,sep="",collapse = "|")
  out <- gsub(pattern, '', tweets)
  return(out)
}

# call function rowwise
allTweets <- allTweets %>% 
  rowwise() %>% 
  mutate(tweet = removeTopic(topic, tweet)) %>%
  as.data.frame()

library(wordcloud)
col=brewer.pal(8, 'Set1')

topics <- unique(allTweets$topic)

lapply(1:length(topics), function(x) {
  print(topics[x])
  dat2cloud <- subset(allTweets, topic==topics[x])
  text2cloud <- dat2cloud$tweet
  corp <- Corpus(VectorSource(text2cloud))
  print(wordcloud(corp, max.words=75, random.color=F, 
        random.order=F, colors=col))
  }
)
## [1] "Donald Trump"

plot of chunk unnamed-chunk-14

## NULL
## [1] "Ted Cruz"

plot of chunk unnamed-chunk-14

## NULL
## [1] "Hillary Clinton"

plot of chunk unnamed-chunk-14

## NULL
## [1] "Bernie Sanders"

plot of chunk unnamed-chunk-14

## NULL
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL

Sentiment analysis

There are several large lexicons containing valenced (good/bad) word measures. I downloaded the affective ratings, concreteness ratings, subtitle word frequencies, and part-of-speech (POS) tags available from Marc Brysbaert's research group at Ghent University, and merged them into a single dataset containing over 13,000 English words.
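
The merge itself is straightforward. Here's a rough sketch, assuming each set of ratings was downloaded as a CSV sharing a 'Word' column (the file names are placeholders):

# hypothetical file names; each file is assumed to share a 'Word' column
affect <- read.csv('affective_ratings.csv', stringsAsFactors = FALSE)
conc   <- read.csv('concreteness_ratings.csv', stringsAsFactors = FALSE)
freq   <- read.csv('subtlex_freq_pos.csv', stringsAsFactors = FALSE)

merged <- affect %>%
  inner_join(conc, by = 'Word') %>%
  inner_join(freq, by = 'Word')

write.csv(merged, 'Brysbaert_ratings.csv', row.names = FALSE)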

lex <- read.csv('Brysbaert_ratings.csv', stringsAsFactors = F)
head(lex)
##          Word V.Mean.Sum V.SD.Sum A.Mean.Sum A.SD.Sum D.Mean.Sum D.SD.Sum
## 1    aardvark       6.26     2.21       2.41     1.40       4.27     1.75
## 2     abandon       2.84     1.54       3.73     2.43       3.32     2.50
## 3 abandonment       2.63     1.74       4.95     2.64       2.64     1.81
## 4     abdomen       5.43     1.75       3.68     2.23       5.15     1.94
## 5   abdominal       4.48     1.59       3.50     1.82       5.32     2.11
## 6      abduct       2.42     1.61       5.90     2.57       2.75     2.13
##   Bigram Conc.M Conc.SD SUBTLEX   Dom_Pos
## 1      0   4.68    0.86      21      Noun
## 2      0   2.54    1.45     413      Verb
## 3      0   2.54    1.29      49      Noun
## 4      0   4.70    0.67     171      Noun
## 5      0   4.28    1.16     174 Adjective
## 6      0   2.88    1.51      36      Verb

First I'll retain valenced words with ratings towards either pole of the 9-point valence rating scale. That is, words that people rated as making them feel at least somewhat happy/unhappy.

valence <- dplyr::filter(lex, V.Mean.Sum <= 4 | V.Mean.Sum >= 6)

Next, for each tweet, I compute the mean of each numeric lexicon measure (valence included) across all of the tweet's words that appear in the valence lexicon; averaging by topic and day comes later.

# remove 'trump' from all tweets because it's also a verb and noun
allTweets <- allTweets %>% 
  rowwise() %>% 
  mutate(tweet = gsub('trump','',tweet)) %>%
  as.data.frame()

# by-tweet averages: for each row of the original df, take the mean of each numeric measure
# across all words in that tweet that appear in the valence lexicon
measures <- allTweets %>% 
  rowwise() %>% 
  do({
    tweets <- unlist(str_split(.$tweet, boundary("word")))
    dplyr::filter(valence, Word %in% tweets) %>%
    summarise_each(funs(mean), which(sapply(., is.numeric))) %>%
    as.data.frame()
  })
codedTweets <- bind_cols(allTweets, measures)

Let's look at the distribution of sentiment for each topic.

library(ggplot2)
codedTweets$topic <- as.factor(codedTweets$topic)
means <- codedTweets %>% group_by(topic) %>%
  summarise(mean = mean(V.Mean.Sum, na.rm = T))
print(means)
## Source: local data frame [4 x 2]
## 
##             topic     mean
##            (fctr)    (dbl)
## 1  Bernie Sanders 6.100546
## 2    Donald Trump 5.727789
## 3 Hillary Clinton 5.416180
## 4        Ted Cruz 5.763873

ggplot(codedTweets, aes(x=V.Mean.Sum)) +
  geom_histogram(bins=8) +
  geom_vline(data=means, aes(xintercept=mean), col=2) +
  facet_grid(topic ~ .)

[plot: histograms of tweet valence (V.Mean.Sum) by topic, with topic means marked in red]

Are these sentiments changing day by day? Let's visualize it.

byDay <- codedTweets %>% 
  group_by(topic, day) %>% 
  summarise_each(funs(mean(., na.rm = TRUE)), V.Mean.Sum, V.SD.Sum)

ggplot(byDay, aes(x=as.Date(day), y=V.Mean.Sum, color=topic)) +
  geom_point() + geom_line()

[plot: daily mean tweet valence (V.Mean.Sum) by topic]

It looks rather noisy. We'd need more data and probably more time to make any reliable inferences or construct a reliable time series.

Despite the relatively small size of these datasets, let's see how sentiment varies geographically. We'll need to bin the continuous sentiment variable first.

library(ggmap)

# retain rows of data frame with geo data and valence 
geoTweets <- dplyr::filter(codedTweets, !is.na(longitude), 
  !is.na(latitude), !is.na(V.Mean.Sum))

# For each topic, split tweet valence into 6 bins (3 negative, 3 positive) for plotting.
geoTweets <- geoTweets %>% 
  group_by(topic) %>% 
  mutate(colorBins = cut(V.Mean.Sum, breaks=c(1,3,4,5,6,7,9), labels=F, include.lowest=T))
plot(geoTweets$V.Mean.Sum, geoTweets$colorBins)

[plot: V.Mean.Sum against the resulting colorBins]

These maps via ggmap show the location of every tweet about each topic, where the color of the point varies from more negative sentiment (black) to more positive sentiment (red).

usa <- as.numeric(geocode("United States"))
topics <- unique(allTweets$topic)

lapply(1:4, function(x) {
  Map <- qmap("usa", zoom=4, color="bw", legend="bottomright")
  Map + geom_point(aes(x = longitude, y = latitude, color=colorBins), 
    size=4, data=dplyr::filter(geoTweets, topic==topics[x])) +
    scale_colour_gradient(low = "black", high = "red") +
    labs(title = topics[x])
})
[maps: tweet locations colored by sentiment (black = more negative, red = more positive), one map per topic]

There's good reason to believe that residents of certain states hold particularly positive or negative sentiments towards particular politicians. The Twitter API provides longitudes and latitudes for most of the scraped tweets, but we want to convert these coordinates to US states. One way to reverse geocode coordinates is by querying the Google Maps API; the ggmap package interfaces with it, which makes reverse geocoding the coordinates very easy. Note, however, that the Google Maps API only allows 2500 queries per day, and the returns are slow. Another way is with geonames.org, via the geonames package.

Rather than execute the code block below again at runtime, which would take a long time, I'll just load a previously geocoded dataset of tweets about Donald Trump from March 15, 2016, to demonstrate state-by-state sentiment mapping.

## Not run

#1 GGMAP reverse geo function to extract state & country
getGeo = function(lon,lat) {
  info <- revgeocode(c(lon,lat), "more")
  country <- info$country
  state <- info$administrative_area_level_1
  return(paste(state, country, sep=","))
}
#2 GEONAMES reverse geo function to extract state & country
library(geonames)
options(geonamesUsername="user")

getGeo = function(lon,lat) {
  info <- GNfindNearestAddress(lat,lon)
  country <- info$countryCode
  state <- info$adminName1
  return(paste(state, country, sep=","))
}

# group identical coordinates
geoTweets$coord <- as.factor(paste(
  geoTweets$longitude, geoTweets$latitude, sep="."))

# reverse geocode each tweet's coordinates (rowwise)
geoTweets <- geoTweets %>% rowwise() %>% 
  mutate(geoinfo = getGeo(longitude, latitude))

# give them their own columns
str <- str_split_fixed(geoTweets$geoinfo, ",", 2)
geoTweets$state <- str[,1]
geoTweets$country <- str[,2]

So I'll look at tweet sentiment about Donald Trump at the US state-level, for only those states where I have at least 2 tweets. I'll load my Trump data stored in a local file.

load('Trump_03-15-2016.Rda')

# tweets per region
geo_counts <- trump %>% 
  group_by(state) %>%
  summarise(., count=n()) 

# retain regions that have at least 2 tweets
geo_counts <- dplyr::filter(geo_counts, count >= 2)

# combine count data back with original data
df <- left_join(geo_counts, trump, by='state')

Compute by-state means of every numeric variable in the data. (Here we'll only look at valence).

state_means <- df %>% 
  group_by(state) %>%
  summarise_each(., 
  funs(mean(., na.rm = TRUE)), 
  which(sapply(., is.numeric)))

Make 6 state-level bins of valence.

state_means <- state_means %>% mutate(stateBins = cut(V.Mean.Sum, breaks = 6, labels=F))

Add missing states to my data.

library(maps)
all_states <- data.frame(state = map("state")$names)
all_states$state <- gsub(':main', '', all_states$state)
myStates <- merge(all_states, state_means, by="state",all.x=T)

Create the map of by-state sentiment. Deeper reds signal more positive sentiment. The white states are those for which I have no data.

library(RColorBrewer)
colors = brewer.pal(6, 'Reds')

map("state", col = colors[myStates$stateBins], fill = TRUE, resolution = 0, 
    lty = 0, projection = "polyconic")
# Add border around each State
map("state", col = "black", fill = FALSE, add = TRUE, 
    lty = 1, lwd = 1, projection = "polyconic")
# Add extras
title("Donald Trump Twitter Sentiment")

[map: Donald Trump Twitter sentiment by US state; deeper red = more positive, white = no data]

So on the eve of the March 15 primaries, Trump may be eliciting especially positive sentiment in Colorado, Louisiana, Alabama, Indiana and Ohio…