In this post I use R to perform sentiment analysis of Twitter data. Sentiment analysis is part of a broader set of tools available in the realm of NLP (natural language processing). For a more comprehensive overview of this area, this course is very helpful. Here I'll use the twitteR package (which interfaces with the Twitter API) to obtain thousands of tweets about a few presidential candidates. I'll then explore the content of these tweets in a few different ways. Sentiment analysis is often used to quantify user or consumer sentiment about online products or services. The input could be user reviews on Amazon or Google, or tweets about particular products or companies. Sentiment analysis can also be used to measure and even predict public opinion about political candidates, markets, or brands; the input to these analyses could be tweets or web search terms, for example. The output of sentiment analysis ranges from a simple binary measure (positive vs. negative sentiment) to more complex multidimensional measures of affect and attitude. Here I'll focus solely on a continuous measure of valence, or how positive or negative words make people feel, which is available from a large database of human ratings.
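To make the valence idea concrete before touching any real data, here's a toy sketch that scores a made-up sentence against a few made-up word ratings (the real lexicon and real tweets come later in the post):

# toy example: score a sentence by averaging made-up valence ratings (1 = very negative, 9 = very positive)
toy_lexicon <- c(love = 8.0, sunshine = 7.5, traffic = 3.0, terrible = 1.9)
words <- c("i", "love", "the", "sunshine", "but", "terrible", "traffic")
mean(toy_lexicon[words], na.rm = TRUE)  # words missing from the toy lexicon (e.g. "the") are ignored
# ~5.1, i.e. slightly positive overall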
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE)
Interface with Twitter API
The R package twitteR provides an interface to the Twitter API, which requires authentication via OAuth (an open standard for authorization). The R package ROAuth enables that authentication. To authenticate, I'll need my Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret, all available from my app settings on the Twitter Application Management page; I keep them in a local file 'creds'.
require(twitteR)
require(ROAuth)

consumerKey <- creds$C_key
consumerSecret <- creds$C_secret
accessToken <- creds$A_token
accessSecret <- creds$A_secret

# This function wraps the OAuth authentication handshake functions
# from the httr package for a twitteR session
setup_twitter_oauth(
  consumer_key = consumerKey,
  consumer_secret = consumerSecret,
  access_token = accessToken,
  access_secret = accessSecret)
## [1] "Using direct authentication"
This function wraps the searchTwitter function to collect tweets about a 'topic' from a circular geographic area roughly encompassing the USA (and then some). It strips retweets and returns a data frame whose first column is the name of the topic; we need this index when we supply multiple topics. The 'since' and 'until' parameters determine the time window within which tweets are pulled.
library(dplyr)

tweet2df = function(topic, N, since=NULL, until=NULL) {
  t <- searchTwitter(topic, n=N, since=since, until=until,
                     geocode='39.8,-95.6,2400km',
                     retryOnRateLimit=10, lang="en")
  # retain 'original' tweets only
  t <- strip_retweets(t, strip_manual = TRUE, strip_mt = TRUE)
  # convert to data frame
  df <- twListToDF(t)
  df$longitude <- as.numeric(df$longitude)
  df$latitude <- as.numeric(df$latitude)
  return(cbind(topic, df))
}
The lapply
call cycles through a few topics (I chose presidential candidates) and time windows, returning a data frame.
# search parameters
days_ago <- 9:1
N <- 2500
topic <- c('Donald Trump', 'Ted Cruz', 'Hillary Clinton', 'Bernie Sanders')

# create data frame containing all search parameters
since <- format(Sys.Date() - days_ago)
parms <- data.frame(expand.grid(since=since, N=N, topic=topic))
parms$until <- format(Sys.Date() - (days_ago - 1))
i <- sapply(parms, is.factor)
parms[i] <- lapply(parms[i], as.character)

# Call custom function with all combinations of search parms
# use dplyr::bind_rows instead of do.call(rbind, ...)
timedTweets <- bind_rows(lapply(1:nrow(parms), function(x)
  tweet2df(parms$topic[x], N=N,
           since=parms$since[x],
           until=parms$until[x]))) %>%
  as.data.frame()

# Call function again, not restricting the time range
untimedTweets <- bind_rows(lapply(1:length(topic), function(x)
  tweet2df(topic[x], N=N))) %>%
  as.data.frame()

# Combine into single data frame
allTweets <- rbind(timedTweets, untimedTweets)
Let's add a factor corresponding to the date of each tweet, extracted from the 'created' field.
library(stringr)
allTweets$day <- str_split_fixed(allTweets$created, " ", 2)[,1]
Text processing
The tweets in the 'text' column need to be tokenized. One way to go uses the tm
package, a comprehensive text mining framework for R. First we'll create a corpus from the tweets.
library(tm)

# create a vector source, which interprets each element of its argument as a document
v <- VectorSource(allTweets$text)

# create an object of class 'corpus'; a collection of documents containing NL text
docs <- Corpus(v)

# convert corpus to plain text
docs <- tm_map(docs, PlainTextDocument)
The goal now is to tokenize the raw text into units of some kind. Although there are ways to assess the sentiment of phrases and sentences, here the tokens will be individual words.
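For example, here's a minimal sketch of word-level tokenization with stringr's boundary-based splitter (the tweet text is invented); the same approach is used later when matching a tweet's words to the valence lexicon:

library(stringr)
# split an example tweet on word boundaries; punctuation and spaces are dropped
unlist(str_split("Feeling great about the debate tonight!", boundary("word")))
# "Feeling" "great" "about" "the" "debate" "tonight"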
Using base R functions, first convert the text to ASCII and then change it all to lower case.
docs <- tm_map(docs, content_transformer(function(x) iconv(x, to='ASCII', sub='byte')))
docs <- tm_map(docs, content_transformer(function(x) tolower(x)))
Then remove some of the most common words in English (e.g., you, me, my, the) that won't add anything to a sentiment analysis. These are called stop words in the NLP literature.
docs <- tm_map(docs, removeWords, stopwords('en'))
Next, let's remove strings that start with URL indicators.
# remove URLs
stripURL = function(x) {
  gsub("www[^[:space:]]+|htt[^[:space:]]+", "", x)
}
docs <- tm_map(docs, content_transformer(stripURL))
We also don't need numbers or punctuation marks, and extra whitespace can go. I won't deal with emoticons in this post, but one generally should (see the sketch after the next code chunk). I could continue to use base R functions to clean up the text (as above), but the tm package has ready-made functions for these common tasks.
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
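As noted above, I'm not handling emoticons here. For reference, one crude option (a sketch only, not applied in this post): after the iconv() call with sub='byte', emoji show up as byte markers like '&lt;f0&gt;&lt;9f&gt;&lt;98&gt;&lt;8a&gt;', which can simply be deleted with a regular expression.

# sketch: delete the byte markers left behind by iconv(..., sub='byte') (not run)
stripBytes = function(x) { gsub("<[0-9a-f]{2}>", "", x) }
# docs <- tm_map(docs, content_transformer(stripBytes))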
It's often a good idea to stem the words in the corpus, that is, to strip affix morphemes from stem morphemes (e.g., 'running' becomes 'run'). The affective word rating dataset I'll use here contains “love”, for example, but not “loved” or “loving”. We could use the relatively simple but very common Porter's algorithm, included as the default in the tm package. I'll show the code here, but I won't use it because I've found that the algorithm changes too many words incorrectly (e.g., 'immigrants' gets turned into 'immigr'); there's a small demonstration after the chunk below.
# docs <- tm_map(docs, stemDocument)
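For reference, here's roughly what the Porter stemmer does to a few example words, using the SnowballC package (which tm's stemDocument relies on); the output shown in the comment is approximate:

library(SnowballC)
wordStem(c("running", "loved", "loving", "immigrants"))
# roughly: "run" "love" "love" "immigr"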
Convert the cleaned-up text back to a character vector and attach it to the original data frame, then drop the original 'text' column. Also remove duplicate tweets within each topic. Now we're ready to go.
allTweets$tweet <- as.character(unlist(sapply(docs, `[`, "content")))
allTweets$text <- NULL

# remove rows with duplicate tweets
allTweets <- allTweets %>%
  group_by(topic) %>%
  distinct(., tweet)
Word clouds
First I'll make some word clouds, which are a good way to see the most frequent words in a corpus. Since the size of each word will reflect its frequency, I'll first remove the topic (the Twitter search term) from each tweet; otherwise the search terms would dominate the word clouds.
# function to remove topic from all tweets about that topic
removeTopic = function(topic, tweets) {
  words <- unlist(str_split(tolower(topic), boundary("word")))
  pattern <- paste(words, sep="", collapse = "|")
  out <- gsub(pattern, '', tweets)
  return(out)
}

# call function rowwise
allTweets <- allTweets %>%
  rowwise() %>%
  mutate(tweet = removeTopic(topic, tweet)) %>%
  as.data.frame()

library(wordcloud)
col = brewer.pal(8, 'Set1')
topics <- unique(allTweets$topic)

lapply(1:length(topics), function(x) {
  print(topics[x])
  dat2cloud <- subset(allTweets, topic==topics[x])
  text2cloud <- dat2cloud$tweet
  corp <- Corpus(VectorSource(text2cloud))
  print(wordcloud(corp, max.words=75, random.color=F,
                  random.order=F, colors=col))
})
[Word clouds for Donald Trump, Ted Cruz, Hillary Clinton, and Bernie Sanders appear here.]
Sentiment analysis
There are several large lexicons containing valenced (good/bad) word measures. I downloaded the affective ratings, concreteness ratings, subtitle word frequencies, and POS tags available from Marc Brysbaert's research group at Ghent University, and merged them into a single dataset containing over 13,000 English words.
lex <- read.csv('Brysbaert_ratings.csv', stringsAsFactors = F)
head(lex)
##          Word V.Mean.Sum V.SD.Sum A.Mean.Sum A.SD.Sum D.Mean.Sum D.SD.Sum
## 1    aardvark       6.26     2.21       2.41     1.40       4.27     1.75
## 2     abandon       2.84     1.54       3.73     2.43       3.32     2.50
## 3 abandonment       2.63     1.74       4.95     2.64       2.64     1.81
## 4     abdomen       5.43     1.75       3.68     2.23       5.15     1.94
## 5   abdominal       4.48     1.59       3.50     1.82       5.32     2.11
## 6      abduct       2.42     1.61       5.90     2.57       2.75     2.13
##   Bigram Conc.M Conc.SD SUBTLEX   Dom_Pos
## 1      0   4.68    0.86      21      Noun
## 2      0   2.54    1.45     413      Verb
## 3      0   2.54    1.29      49      Noun
## 4      0   4.70    0.67     171      Noun
## 5      0   4.28    1.16     174 Adjective
## 6      0   2.88    1.51      36      Verb
First I'll retain valenced words with ratings towards either pole of the 9-point valence rating scale. That is, words that people rated as making them feel at least somewhat happy/unhappy.
valence <- dplyr::filter(lex, V.Mean.Sum <= 4 | V.Mean.Sum >= 6)
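A couple of quick sanity checks on the filtered lexicon (illustrative only; the exact counts and ratings depend on the merged ratings file):

nrow(valence)                                         # number of strongly valenced words that remain
any(valence$V.Mean.Sum > 4 & valence$V.Mean.Sum < 6)  # should be FALSE after the filter
dplyr::filter(valence, Word %in% c("love", "hate"))   # inspect a couple of familiar words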
Next I'll score each tweet. After removing the string 'trump' from the tweets (because it's also a common noun and verb), I take, for each tweet, the mean of each numeric rating (including valence) across all of the tweet's words that appear in the valence lexicon.
# remove 'trump' from all tweets because it's also a verb and noun
allTweets <- allTweets %>%
  rowwise() %>%
  mutate(tweet = gsub('trump', '', tweet)) %>%
  as.data.frame()

# by-tweet averages: for each row of the original df, take the mean of each numeric measure
# across all words in that tweet that appear in the valence lexicon
measures <- allTweets %>% rowwise() %>%
  do({
    tweets <- unlist(str_split(.$tweet, boundary("word")))
    dplyr::filter(valence, Word %in% tweets) %>%
      summarise_each(funs(mean), which(sapply(., is.numeric))) %>%
      as.data.frame()
  })

codedTweets <- bind_cols(allTweets, measures)
Let's look at the distribution of sentiment for each topic.
library(ggplot2)

codedTweets$topic <- as.factor(codedTweets$topic)
means <- codedTweets %>%
  group_by(topic) %>%
  summarise(mean = mean(V.Mean.Sum, na.rm = T))
print(means)
## Source: local data frame [4 x 2]
##
##             topic     mean
##            (fctr)    (dbl)
## 1  Bernie Sanders 6.100546
## 2    Donald Trump 5.727789
## 3 Hillary Clinton 5.416180
## 4        Ted Cruz 5.763873

ggplot(codedTweets, aes(x=V.Mean.Sum)) +
  geom_histogram(bins=8) +
  geom_vline(data=means, aes(xintercept=mean), col=2) +
  facet_grid(topic ~ .)
Are these sentiments changing day by day? Let's visualize it.
byDay <- codedTweets %>%
  group_by(topic, day) %>%
  summarise_each(funs(mean(., na.rm = TRUE)), V.Mean.Sum, V.SD.Sum)

ggplot(byDay, aes(x=as.Date(day), y=V.Mean.Sum, color=topic)) +
  geom_point() + geom_line()
It looks rather noisy. We'd need more data and probably more time to make any reliable inferences or construct a reliable time series.
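With a longer collection window, one way to build a more stable series would be to smooth the daily means, e.g. with a rolling average. Here's a sketch assuming the zoo package (with only about nine days of data it won't show much):

library(zoo)

# 3-day centered rolling mean of daily sentiment, computed within each topic
smoothedByDay <- byDay %>%
  group_by(topic) %>%
  arrange(as.Date(day)) %>%
  mutate(V.roll = rollmean(V.Mean.Sum, k = 3, fill = NA))

ggplot(smoothedByDay, aes(x=as.Date(day), y=V.roll, color=topic)) +
  geom_point() + geom_line()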
Despite the relatively small size of these datasets, let's see how sentiment varies geographically. We'll need to bin the continuous sentiment variable first.
library(ggmap)

# retain rows of the data frame with geo data and valence
geoTweets <- dplyr::filter(codedTweets, !is.na(longitude),
                           !is.na(latitude), !is.na(V.Mean.Sum))

# For each topic, bin each tweet's valence into 3 negative and 3 positive bins for plotting
geoTweets <- geoTweets %>%
  group_by(topic) %>%
  mutate(colorBins = cut(V.Mean.Sum, breaks=c(1,3,4,5,6,7,9),
                         labels=F, include.lowest=T))

plot(geoTweets$V.Mean.Sum, geoTweets$colorBins)
These maps via ggmap
show the location of every tweet about each topic, where the color of the point varies from more negative sentiment (black) to more positive sentiment (red).
usa <- as.numeric(geocode("United States"))
topics <- unique(allTweets$topic)

lapply(1:4, function(x) {
  Map <- qmap("usa", zoom=4, color="bw", legend="bottomright")
  Map + geom_point(aes(x = longitude, y = latitude, color=colorBins),
                   size=4, data=dplyr::filter(geoTweets, topic==topics[x])) +
    scale_colour_gradient(low = "black", high = "red") +
    labs(title = topics[x])
})
[Maps of tweet locations, colored by sentiment, for each of the four candidates appear here.]
There's good reason to believe that residents of certain states hold particularly positive or negative sentiments towards particular politicians. The Twitter API provides longitudes and latitudes for most of the scraped tweets, but we want to convert those coordinates to US states. One way to reverse geocode coordinates is by querying the Google Maps API; the ggmap package interfaces with the Google Maps API and makes reverse geocoding very easy. Note, however, that the Google Maps API only allows 2,500 queries per day, and the queries are slow. Another option is the geonames.org web service.
Rather than execute the code block below again at runtime, which would take a long time, I'll just load a previously geocoded dataset of tweets about Donald Trump from March 15, 2016, to demonstrate state-by-state sentiment mapping.
## Not run:

# 1. GGMAP reverse geo function to extract state & country
getGeo = function(lon, lat) {
  info <- revgeocode(c(lon, lat), "more")
  country <- info$country
  state <- info$administrative_area_level_1
  return(paste(state, country, sep=","))
}

# 2. GEONAMES reverse geo function to extract state & country
library(geonames)
options(geonamesUsername="user")
getGeo = function(lon, lat) {
  info <- GNfindNearestAddress(lat, lon)
  country <- info$countryCode
  state <- info$adminName1
  return(paste(state, country, sep=","))
}

# group identical coordinates
geoTweets$coord <- as.factor(paste(
  geoTweets$longitude, geoTweets$latitude, sep="."))

# reverse geocode each tweet's coordinates
geoTweets <- geoTweets %>%
  rowwise() %>%
  mutate(geoinfo = getGeo(longitude, latitude))

# give state and country their own columns
str <- str_split_fixed(geoTweets$geoinfo, ",", 2)
geoTweets$state <- str[,1]
geoTweets$country <- str[,2]
So I'll look at tweet sentiment about Donald Trump at the US state level, using only those states with at least 2 tweets. I'll load my Trump data from a local file.
load('Trump_03-15-2016.Rda')

# tweets per region
geo_counts <- trump %>%
  group_by(state) %>%
  summarise(., count=n())

# retain regions that have at least 2 tweets
geo_counts <- dplyr::filter(geo_counts, count >= 2)

# combine count data back with original data
df <- left_join(geo_counts, trump, by='state')
Compute by-state means of every numeric variable in the data. (Here we'll only look at valence).
state_means <- df %>%
  group_by(state) %>%
  summarise_each(., funs(mean(., na.rm = TRUE)), which(sapply(., is.numeric)))
Make 6 state-level bins of valence.
state_means <- state_means %>% mutate(stateBins = cut(V.Mean.Sum, breaks = 6, labels=F))
Add missing states to my data.
library(maps)

all_states <- data.frame(state = map("state")$names)
all_states$state <- gsub(':main', '', all_states$state)
myStates <- merge(all_states, state_means, by="state", all.x=T)
Create the map of by-state sentiment. Deeper reds signal more positive sentiment. The white states are those for which I have no data.
library(RColorBrewer)
colors = brewer.pal(6, 'Reds')

map("state", col = colors[myStates$stateBins], fill = TRUE,
    resolution = 0, lty = 0, projection = "polyconic")

# Add border around each State
map("state", col = "black", fill = FALSE, add = TRUE,
    lty = 1, lwd = 1, projection = "polyconic")

# Add extras
title("Donald Trump Twitter Sentiment")
So on the eve of the March 15 primaries, Trump may be eliciting especially positive sentiment in Colorado, Louisiana, Alabama, Indiana and Ohio…