Twitter Sentiment Analysis in R

In this post I use R to perform sentiment analysis of Twitter data. Sentiment analysis is part of a broader set of tools available in the realm of natural language processing (NLP); for a more comprehensive overview of this area, this course is very helpful. Here I'll use the twitteR package (which interfaces with the Twitter API) to obtain thousands of tweets about a few different presidential candidates, and then explore the content of these tweets in a few different ways.

Sentiment analysis is often used to quantify user or consumer sentiment about online products or services; the input could be user reviews on Amazon or Google, or tweets about particular products or companies. It can also be used to measure and even predict public opinion about political candidates, markets, or brands, with tweets or web search terms as the input. The output of sentiment analysis ranges from a simple binary measure (positive vs. negative sentiment) to more complex multidimensional measures of affect and attitude. Here I'll focus solely on a continuous measure of valence (how words make people feel), which is available from a large database of human ratings.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE)

Interface with Twitter API

The R package twitteR provides an interface to the Twitter API, which requires authentication via OAuth (an open standard for authorization). The R package ROAuth enables this authentication. To authenticate, I'll need the Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Secret from my app settings on my Twitter Application Management page; I keep these in a local file called 'creds'.

require(twitteR)
require(ROAuth)
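# 'creds' is a data frame holding my API keys, loaded from a local file (loading step not shown)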

consumerKey <- creds$C_key
consumerSecret <- creds$C_secret
accessToken <- creds$A_token
accessSecret <- creds$A_secret

# This function wraps the OAuth authentication handshake functions
# from the httr package for a twitteR session
setup_twitter_oauth(
  consumer_key = consumerKey,
  consumer_secret = consumerSecret, 
  access_token = accessToken, 
  access_secret = accessSecret)
## [1] "Using direct authentication"

This function wraps the searchTwitter function to collect several tweets about a 'topic' from a circular region (a 2400 km radius centered on a point in the middle of the US) that roughly encompasses the USA (and then some). The function removes retweets, and returns a data frame whose first column is the name of the topic–we need this index when we supply multiple topics. The 'since' and 'until' parameters determine the time window within which tweets will be pulled.

library(dplyr)

tweet2df = function(topic, N, since=NULL, until=NULL) {
  t <- searchTwitter(topic, n=N, since=since, until=until, 
                     geocode='39.8,-95.6,2400km',
                     retryOnRateLimit=10, lang="en")
  # retain 'original' tweets only
  t <- strip_retweets(t, strip_manual = TRUE, strip_mt = TRUE)
  # convert to data frame
  df <- twListToDF(t)
  df$longitude <- as.numeric(df$longitude)
  df$latitude <- as.numeric(df$latitude)
  return(cbind(topic, df))
}

The lapply call cycles through a few topics (I chose presidential candidates) and time windows, returning a data frame.

# search parameters 
days_ago <- 9:1
N <- 2500
topic <- c('Donald Trump', 'Ted Cruz', 'Hillary Clinton', 'Bernie Sanders')

# create data frame containing all search parameters
since <- format(Sys.Date()-days_ago)
parms <- data.frame(expand.grid(since=since, N=N, topic=topic))
parms$until <- format(Sys.Date()-(days_ago-1))
i <- sapply(parms, is.factor)
parms[i] <- lapply(parms[i], as.character)

# Call custom function with all combinations of search parms
# use dplyr::bind_rows instead of do.call(rbind(...))
timedTweets <- bind_rows(lapply(1:nrow(parms), 
  function(x) tweet2df(parms$topic[x], N=N, 
    since=parms$since[x], 
    until=parms$until[x]))) %>% 
  as.data.frame()

# Call function again, not restricting the time range
untimedTweets <- bind_rows(lapply(1:length(topic), 
  function(x) tweet2df(topic[x], N=N))) %>% 
  as.data.frame()

# Combine into single data frame 
allTweets <- rbind(timedTweets, untimedTweets)

Let's add a factor corresponding to the date of each tweet, extracted from the 'created' field.

library(stringr)
allTweets$day <- str_split_fixed(allTweets$created," ",2)[,1]

Text processing

The tweets in the 'text' column need to be tokenized. One way to go uses the tm package, a comprehensive text mining framework for R. First we'll create a corpus from the tweets.

library(tm)

# create a vector source, which interprets each element of its argument as a document
v <- VectorSource(allTweets$text)

# create an object of class 'corpus'; a collection of documents containing NL text
docs <- Corpus(v)

# convert corpus to plain text
docs <- tm_map(docs, PlainTextDocument)

The goal now is to tokenize the raw text into units of some kind. Although there are ways to assess the sentiment of phrases and sentences, here the tokens will be individual words.
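For example, word-level tokenization of a short string looks like this (a quick illustration of my own using stringr's boundary(), the same approach I use later when scoring tweets; stringr is already loaded above):

str_split("Make America great again", boundary("word"))[[1]]
# "Make"    "America" "great"   "again"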

Using base R functions, first convert the text to ASCII encoding and then change all text to lower case.

docs <- tm_map(docs, content_transformer(function(x) iconv(x, to='ASCII', sub='byte')))
docs <- tm_map(docs, content_transformer(function(x) tolower(x)))

Then remove some of the most common words in English (e.g., you, me, my, the) that won't add anything to a sentiment analysis. These are called stop words in the NLP literature.

docs <- tm_map(docs, removeWords, stopwords('en'))

Next, let's remove strings that start with URL indicators.

# remove URLs
stripURL = function(x) {
  gsub("www[^[:space:]]+|htt[^[:space:]]+", "", x)
}
docs <- tm_map(docs, content_transformer(stripURL))

We also don't need any numbers or punctuation marks, and extra whitespace can go. I won't deal with emoticons in this post, but one generally should. I could continue to use base R functions to clean up the text (like above), but the tm package has ready-made functions for these common tasks.

docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

It's often a good idea to stem the words in the corpus–to remove affix morphemes from stem morphemes (e.g., running becomes run). The affective word rating dataset I'll use here contains “love”, for example, but not “loved” or “loving”. We could use the relatively simple but very common Porter's algorithm, included as the default in the tm package. I'll show the code here but I won't use it because I've found that the algorithm changes too many words incorrectly (e.g. 'immigrants' gets turned into 'immigr').

# docs <- tm_map(docs, stemDocument)
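To see what Porter-style stemming does, here's a quick check on a few words outside the pipeline (my own aside; tm's stemDocument also accepts a plain character vector):

stemDocument(c("running", "loved", "immigrants"))
# "run" "love" "immigr"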

Convert the cleaned up text back to a character vector and attach to original data frame, then remove the original 'text' column. Also remove duplicate tweets within each topic. Now we're ready to go.

allTweets$tweet <- as.character(unlist(sapply(docs, `[`, "content")))
allTweets$text <- NULL

# remove rows with duplicate tweets
allTweets <- allTweets %>% 
  group_by(topic) %>% 
  distinct(., tweet)

Word clouds

First I'll make some word clouds, which is a good way to see the most frequent words in a corpus. Since the size of the words will reflect their frequency, I'll remove the topic (twitter search term) from each tweet first; otherwise these would dominate the word clouds.

# function to remove topic from all tweets about that topic
removeTopic = function(topic, tweets) {
  words <- unlist(str_split(tolower(topic), boundary("word")))
  pattern <- paste(words,sep="",collapse = "|")
  out <- gsub(pattern, '', tweets)
  return(out)
}

# call function rowwise
allTweets <- allTweets %>% 
  rowwise() %>% 
  mutate(tweet = removeTopic(topic, tweet)) %>%
  as.data.frame()

library(wordcloud)
col=brewer.pal(8, 'Set1')

topics <- unique(allTweets$topic)

lapply(1:length(topics), function(x) {
  print(topics[x])
  dat2cloud <- subset(allTweets, topic==topics[x])
  text2cloud <- dat2cloud$tweet
  corp <- Corpus(VectorSource(text2cloud))
  print(wordcloud(corp, max.words=75, random.color=F, 
        random.order=F, colors=col))
  }
)
## [1] "Donald Trump"

plot of chunk unnamed-chunk-14

## NULL
## [1] "Ted Cruz"

plot of chunk unnamed-chunk-14

## NULL
## [1] "Hillary Clinton"

plot of chunk unnamed-chunk-14

## NULL
## [1] "Bernie Sanders"

plot of chunk unnamed-chunk-14

## NULL
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL

Sentiment analysis

There are several large lexicons containing valenced (good/bad) word measures. I downloaded the affective ratings, concreteness ratings, subtitle word frequency ratings, and POS tags available from Marc Brysbaert's research group at Ghent University. I merged these datasets into a single one that contains over 13,000 English words.

lex <- read.csv('Brysbaert_ratings.csv', stringsAsFactors = F)
head(lex)
##          Word V.Mean.Sum V.SD.Sum A.Mean.Sum A.SD.Sum D.Mean.Sum D.SD.Sum
## 1    aardvark       6.26     2.21       2.41     1.40       4.27     1.75
## 2     abandon       2.84     1.54       3.73     2.43       3.32     2.50
## 3 abandonment       2.63     1.74       4.95     2.64       2.64     1.81
## 4     abdomen       5.43     1.75       3.68     2.23       5.15     1.94
## 5   abdominal       4.48     1.59       3.50     1.82       5.32     2.11
## 6      abduct       2.42     1.61       5.90     2.57       2.75     2.13
##   Bigram Conc.M Conc.SD SUBTLEX   Dom_Pos
## 1      0   4.68    0.86      21      Noun
## 2      0   2.54    1.45     413      Verb
## 3      0   2.54    1.29      49      Noun
## 4      0   4.70    0.67     171      Noun
## 5      0   4.28    1.16     174 Adjective
## 6      0   2.88    1.51      36      Verb

First I'll retain valenced words with ratings towards either pole of the 9-point valence rating scale. That is, words that people rated as making them feel at least somewhat happy/unhappy.

valence <- dplyr::filter(lex, V.Mean.Sum <= 4 | V.Mean.Sum >= 6)

I define a function that computes the mean and standard deviation of valence for all tweets about a given topic, on a given day.

# remove 'trump' from all tweets because it's also a verb and noun
allTweets <- allTweets %>% 
  rowwise() %>% 
  mutate(tweet = gsub('trump','',tweet)) %>%
  as.data.frame()

# by-tweet averages: for each row of the original df, take the mean of each numeric measure
# across all words in that tweet that appear in the valence lexicon
measures <- allTweets %>% 
  rowwise() %>% 
  do({
    tweets <- unlist(str_split(.$tweet, boundary("word")))
    dplyr::filter(valence, Word %in% tweets) %>%
    summarise_each(funs(mean), which(sapply(., is.numeric))) %>%
    as.data.frame()
  })
codedTweets <- bind_cols(allTweets, measures)

Let's look at the distribution of sentiment for each topic.

library(ggplot2)
codedTweets$topic <- as.factor(codedTweets$topic)
means <- codedTweets %>% group_by(topic) %>%
  summarise(mean = mean(V.Mean.Sum, na.rm = T))
print(means)
## Source: local data frame [4 x 2]
## 
##             topic     mean
##            (fctr)    (dbl)
## 1  Bernie Sanders 6.100546
## 2    Donald Trump 5.727789
## 3 Hillary Clinton 5.416180
## 4        Ted Cruz 5.763873

ggplot(codedTweets, aes(x=V.Mean.Sum)) +
  geom_histogram(bins=8) +
  geom_vline(data=means, aes(xintercept=mean), col=2) +
  facet_grid(topic ~ .)

plot of chunk unnamed-chunk-18

Are these sentiments changing day by day? Let's visualize it.

byDay <- codedTweets %>% 
  group_by(topic, day) %>% 
  summarise_each(funs(mean(., na.rm = TRUE)), V.Mean.Sum, V.SD.Sum)

ggplot(byDay, aes(x=as.Date(day), y=V.Mean.Sum, color=topic)) +
  geom_point() + geom_line()

plot of chunk unnamed-chunk-19

It looks rather noisy. We'd need more data and probably more time to make any reliable inferences or construct a reliable time series.

Despite the relatively small size of these datasets, let's see how sentiment varies geographically. We'll need to bin the continuous sentiment variable first.

library(ggmap)

# retain rows of data frame with geo data and valence 
geoTweets <- dplyr::filter(codedTweets, !is.na(longitude), 
  !is.na(latitude), !is.na(V.Mean.Sum))

# For each topic, split tweet valence into 3 negative and 3 positive bins for plotting.
geoTweets <- geoTweets %>% 
  group_by(topic) %>% 
  mutate(colorBins = cut(V.Mean.Sum, breaks=c(1,3,4,5,6,7,9), labels=F, include.lowest=T))
plot(geoTweets$V.Mean.Sum, geoTweets$colorBins)

plot of chunk unnamed-chunk-20

These maps via ggmap show the location of every tweet about each topic, where the color of the point varies from more negative sentiment (black) to more positive sentiment (red).

usa <- as.numeric(geocode("United States"))
topics <- unique(allTweets$topic)

lapply(1:4, function(x) {
  Map <- qmap("usa", zoom=4, color="bw", legend="bottomright")
  Map + geom_point(aes(x = longitude, y = latitude, color=colorBins), 
    size=4, data=dplyr::filter(geoTweets, topic==topics[x])) +
    scale_colour_gradient(low = "black", high = "red") +
    labs(title = topics[x])
})
## [[1]]

plot of chunk unnamed-chunk-21

## 
## [[2]]

plot of chunk unnamed-chunk-21

## 
## [[3]]

plot of chunk unnamed-chunk-21

## 
## [[4]]

plot of chunk unnamed-chunk-21

There's good reason to believe that residents of certain states hold particularly positive or negative sentiments towards particular politicians. The Twitter API provides longitudes and latitudes for most of the scraped tweets. But we want to convert these coordinates to US states. One way to reverse geocode coordinates is by querying the Google Maps API. The package ggmap interfaces with the Google Maps API, and it makes reverse geocoding the coordinates very easy. Note however that the Google Maps API only allows 2500 queries per day, and the returns are slow. Another way is with geonames.org.

Rather than execute the code block below again at runtime, which would take a long time, I'll just load a previously geocoded dataset of tweets about Donald Trump from March 15, 2016, to demonstrate state-by-state sentiment mapping.

## Not run

#1 GGMAP reverse geo function to extract state & country
getGeo = function(lon,lat) {
  info <- revgeocode(c(lon,lat), "more")
  country <- info$country
  state <- info$administrative_area_level_1
  return(paste(state, country, sep=","))
}
#2 GEONAMES reverse geo function to extract state & country
library(geonames)
options(geonamesUsername="user")

getGeo = function(lon,lat) {
  info <- GNfindNearestAddress(lat,lon)
  country <- info$countryCode
  state <- info$adminName1
  return(paste(state, country, sep=","))
}

# group identical coordinates
geoTweets$coord <- as.factor(paste(
  geoTweets$longitude, geoTweets$latitude, sep="."))

# reverse geocode each row's coordinates
geoTweets <- geoTweets %>% rowwise() %>% 
  mutate(geoinfo = getGeo(longitude, latitude))

# give them their own columns
str <- str_split_fixed(geoTweets$geoinfo, ",", 2)
geoTweets$state <- str[,1]
geoTweets$country <- str[,2]

So I'll look at tweet sentiment about Donald Trump at the US state-level, for only those states where I have at least 2 tweets. I'll load my Trump data stored in a local file.

load('Trump_03-15-2016.Rda')

# tweets per region
geo_counts <- trump %>% 
  group_by(state) %>%
  summarise(., count=n()) 

# retain regions that have at least 2 tweets
geo_counts <- dplyr::filter(geo_counts, count >= 2)

# combine count data back with original data
df <- left_join(geo_counts, trump, by='state')

Compute by-state means of every numeric variable in the data. (Here we'll only look at valence).

state_means <- df %>% 
  group_by(state) %>%
  summarise_each(., 
  funs(mean(., na.rm = TRUE)), 
  which(sapply(., is.numeric)))

Make 6 state-level bins of valence.

state_means <- state_means %>% mutate(stateBins = cut(V.Mean.Sum, breaks = 6, labels=F))

Add missing states to my data.

library(maps)
all_states <- data.frame(state = map("state")$names)

plot of chunk unnamed-chunk-26

all_states$state <- gsub(':main', '', all_states$state)
myStates <- merge(all_states, state_means, by="state",all.x=T)

Create the map of by-state sentiment. Deeper reds signal more positive sentiment. The white states are those for which I have no data.

library(RColorBrewer)
colors = brewer.pal(6, 'Reds')

map("state", col = colors[myStates$stateBins], fill = TRUE, resolution = 0, 
    lty = 0, projection = "polyconic")
# Add border around each State
map("state", col = "black", fill = FALSE, add = TRUE, 
    lty = 1, lwd = 1, projection = "polyconic")
# Add extras
title("Donald Trump Twitter Sentiment")

plot of chunk unnamed-chunk-27

So on the eve of the March 15 primaries, Trump may be eliciting especially positive sentiment in Colorado, Louisiana, Alabama, Indiana and Ohio…

Predictive modeling: Kaggle Titanic competition (part 3)

In my first Kaggle Titanic post and the followup post, I walked through some R code to perform data preprocessing, feature engineering, data visualization, and model building for a few different kinds of models. In this post I'll build a new feature set (including model-based missing data imputation), train and tune different models including random forests and neural networks, assess variable importance, perform feature selection, and submit a few entries to the Kaggle contest along the way. Some of the code and methodology in this post is reused from the previous posts where I provided more detail, so I recommend reading those first. But this post can stand on its own if you're already comfortable with R.

Preprocessing

I'll start by making a few necessary changes to the raw data downloaded from the Kaggle Titanic competition website.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=FALSE)

# maintain the same random seed throughout 
rand.seed <- 99
set.seed(rand.seed)

# Kaggle files
train <- read.csv('train.csv', header=TRUE, na.strings = c("NA",""))
test <- read.csv('test.csv', header=TRUE, na.strings = c("NA",""))

changeTypes = function(df){
  df$Pclass <- as.factor(df$Pclass)
  df$Ticket <- as.character(df$Ticket)
  df$Cabin <- as.character(df$Cabin)
  df$Name <- as.character(df$Name)
  if("Survived" %in% names(df)){
    df$Survived <- as.factor(df$Survived)
  }
  return(df)
}
train <- changeTypes(train)
test <- changeTypes(test)

# extract honorifics and surnames
NameExtract = function(df) {
  # remove first names following period
  lastnames <- sub('\\..*', '', df$Name)
  # extract honorifics
  honorific <- sub('.*\\,', '', lastnames)
  # extract surnames
  surname <- sub('\\,.*', '', lastnames)
  # clean up white space etc.
  df$honorific  <- as.factor(gsub("^\\s+|\\s+$", "", honorific))
  df$surname  <- as.factor(gsub("^\\s+|\\s+$", "", surname))
  # we don't need the names anymore
  df$Name <- NULL
  return(df)
}
train <- NameExtract(train)
test <- NameExtract(test)

# transform honorifics into useable feature
honorificBuild = function(df){

  # function to replace old honorific(s) with new one
  replace.string <- function(df, original, replacement) {
    for (x in original) {
      if (x %in% levels(df$honorific)) {
        levels(df$honorific)[levels(df$honorific)==x] <- replacement
      }
    }
    return(df$honorific)
  }

  # create 'types' of honorifics, try not to mix genders
  df$honorific <- replace.string(df, c('Dona','Ms','the Countess','Jonkheer','Lady'),'Mrs')
  df$honorific <- replace.string(df, c('Mme','Mlle'), 'Miss')
  df$honorific <- replace.string(df, c('Sir','Capt','Rev','Dr','Col','Don','Major'), 'Noble')
  df$honorific <- droplevels(df$honorific)

  return(df)
}
train <- honorificBuild(train)
test <- honorificBuild(test)

Here I'll retain the letters from the Cabin names because they correspond to the deck of the ship, and later impute a few of the missing deck values based on family membership. It seems like deck could be a useful feature.

DeckFeature = function(df){

  # remove digits from cabin names
  df$Cabin <- gsub('[[:digit:]]+', "", df$Cabin)
  # retain single letter (corresponds to the deck)
  df$Deck <- sub("^(\\w{1}).*$", "\\1", df$Cabin)
  df$Cabin <- NULL
  return(df)
}
train <- DeckFeature(train)
test <- DeckFeature(test)

There are a lot of missing 'Cabin' rows. One minor fix we can do is check if only one member of a given family was assigned a cabin. If so, we can assume that the other family members were in the same cabin. But finding members of a given family isn't as simple as picking unique surnames–more than one family might have the same surname. I'll create a variable 'FamID' that attempts to find surnames belonging to a single family. By pasting together the surname, the class, and the point of departure, I hope that these new ID variables accurately pick out unique families.

pickFams = function(df){
  df$FamID <- as.factor(paste(df$surname, df$Pclass, df$Embarked, sep=""))
  return(df)
}
train <- pickFams(train)
test <- pickFams(test)

There are a few cases in the 'train' dataset where at least one member of a family has a deck assigned to them, and at least one member of the same family does not. For example, in the subset of 'train' data where last names are 'Brown'.

subset(train, surname=='Brown')[, (ncol(train) - 3):(ncol(train))]
##     honorific surname Deck   FamID
## 195       Mrs   Brown    B Brown1C
## 346      Miss   Brown    F Brown2S
## 671       Mrs   Brown <NA> Brown2S
## 685        Mr   Brown <NA> Brown2S

Let's create a dplyr chain that fills in the 'NA' family members with the correct Deck. The 'yes' value for the ifelse function uses the zoo::na.locf function, which stands for 'last observation carried forward'.
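As a toy illustration of what na.locf does (my own example, separate from the pipeline), it fills each NA with the most recent non-missing value, and by default drops leading NAs:

zoo::na.locf(c("B", NA, NA))   # "B" "B" "B"
zoo::na.locf(c(NA, "F", NA))   # leading NA dropped: "F" "F"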

library(dplyr)

ImputeDeck = function(df){
  new_df <- df %>% 
    group_by(FamID) %>% 
    mutate(Deck =
             ifelse((n_distinct(Deck) > 1 & any(is.na(Deck))),
                    yes = zoo::na.locf(Deck),
                    no = Deck))  %>%
    as.data.frame()

  # change remaining NAs to 'U' for Unknown
  new_df$Deck[which(is.na(new_df$Deck))] <- 'U'
  # change to a factor
  new_df$Deck <- as.factor(new_df$Deck)
  # don't need FamID anymore
  new_df$FamID <- NULL
  return(new_df)
}
train <- ImputeDeck(train)
test <- ImputeDeck(test)

Here are those 'Brown' family members again. Looks good.

subset(train, surname=='Brown')[, (ncol(train) - 3):(ncol(train))]
##     Embarked honorific surname Deck
## 195        C       Mrs   Brown    B
## 346        S      Miss   Brown    F
## 671        S       Mrs   Brown    F
## 685        S        Mr   Brown    F

Next, ticket numbers might have some useful information. This function cleans them up, and extracts a few features.

TicketFeature = function(df) {
 # remove punctuation
  df$Ticket <- gsub('[^[:alnum:]]+', "", df$Ticket)
 # retain numbers only
  df$Tix.dig <- as.numeric(gsub('[[:alpha:]]+', "", df$Ticket))
 # perhaps the number of digits in the ticket is information
 # any tickets with less than 4 digits are relatively rare,
 # so subsume those into one level of this feature
  df <- df %>% mutate(tix.grp = ifelse(nchar(as.character(df$Tix.dig)) < 4, 
        3, nchar(as.character(df$Tix.dig)))) %>%
        as.data.frame()
  df$tix.grp <- as.factor(df$tix.grp)
  df$Tix.dig <- NULL
  df$Ticket <- NULL
  return(df)
}
train <- TicketFeature(train)
test <- TicketFeature(test)

Rather than replacing missing values of a numeric variable with its mean or median as we did in the previous post, we can treat the missing data problem as another predictive modeling problem. We can train a model to learn a functional relationship between predictors x,y,z and predictor n, using the subset of data in which n is non-missing. Then we can use this model to predict the missing values in n. Let's build a random forest to impute the missing values for 'Age', and the missing values (i.e., '0.00') for 'Fare' using a subset of predictors that I think are relevant.

library(partykit)

imputeForest = function(df, missing.var) {
  #special case for 'Fare'
  if(missing.var=="Fare"){
    df$Fare[df$Fare < 0.01] <- NA
  }
  missing.df <- df[is.na(df[ , missing.var]), ]
  missing.df[, missing.var] <- NULL
  existing.df <- df[!is.na(df[ , missing.var]), ]

  # make the response a variable 'y'
  existing.df$y <- existing.df[, missing.var]
  existing.df[, missing.var] <- NULL

  forest <- cforest(y ~ Sex+Pclass+honorific+SibSp+Parch, data=existing.df, ntree = 100)
  missing.df[ ,missing.var] <- predict(forest, newdata = missing.df)
  existing.df[, missing.var] <- existing.df$y
  existing.df$y <- NULL
  newdf <- rbind(missing.df, existing.df)
  return(newdf)
}
train <- imputeForest(train, "Age")
test <- imputeForest(test, "Age")
train <- imputeForest(train, "Fare")
test <- imputeForest(test, "Fare")

Are Age and Fare ready to be added to predictive models? First, there's no need for age to have the precision of several decimal places so let's make Age an integer variable. Infants < 1 year old will be coded as 1.

Age2Int = function(df){
  df$Age <- as.integer(df$Age)
  df$Age[which(df$Age==0)] <- 1
  return(df)
}
train <- Age2Int(train)
test <- Age2Int(test)

Second, a quick look at the distribution of Fare reveals that it's highly positively skewed–a few tickets were way more expensive than most.

library(ggplot2)
ggplot(train, aes(x=Fare)) + 
  geom_histogram(fill="grey") +
  theme_light()

plot of chunk unnamed-chunk-9

Taking the logarithm of Fare transforms it into a less skewed variable that may be more transparently related to survival, as is visible here.

ggplot(train, aes(x=Fare)) + 
  geom_histogram(aes(y=..density..), fill="grey") +
  geom_density(alpha=.2, fill="red") +
  scale_x_log10() +
  theme_light()

plot of chunk unnamed-chunk-10

A quick way to see the difference in how log(Fare) is related to Survival is to run logistic regression models with both versions of Fare. I included 'Pclass' as well to soak up shared variance.

summary(glm(Survived ~ Fare + Pclass, data=train, family='binomial'))
## 
## Call:
## glm(formula = Survived ~ Fare + Pclass, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9052  -0.7679  -0.7330   1.0807   1.7026  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.052002   0.226587   0.230   0.8185    
## Fare         0.006027   0.002386   2.526   0.0115 *  
## Pclass2     -0.288297   0.240812  -1.197   0.2312    
## Pclass3     -1.275935   0.227394  -5.611 2.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance: 1075.1  on 887  degrees of freedom
## AIC: 1083.1
## 
## Number of Fisher Scoring iterations: 4
summary(glm(Survived ~ log(Fare) + Pclass, data=train, family='binomial'))
## 
## Call:
## glm(formula = Survived ~ log(Fare) + Pclass, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6769  -0.8572  -0.6928   1.0835   1.7778  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.1892     0.5227  -2.275 0.022912 *  
## log(Fare)     0.4153     0.1223   3.396 0.000685 ***
## Pclass2      -0.1322     0.2521  -0.525 0.599907    
## Pclass3      -0.9671     0.2672  -3.620 0.000295 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance: 1071.6  on 887  degrees of freedom
## AIC: 1079.6
## 
## Number of Fisher Scoring iterations: 4

Compared to the 'raw' Fare model, the log Fare model has a lower AIC and a smaller p-value for the fare term, both of which suggest that this transformation is a good idea. So let's make it a feature in the dataframes. I'll also round it to 2 decimal places.

logFare = function(df) {
  df$Fare <- round(log(df$Fare), 2)
  return(df)
}
train <- logFare(train)
test <- logFare(test)

We're almost ready to build models. I'll fill in a few missing 'Embark' values, create a family size variable by adding SibSp and Parch, and remove the surname variable.

FinalTouches = function(df) {
  # impute 'Embark' by replacing any NAs with most common destination
  df$Embarked[which(is.na(df$Embarked))] <- 'S'
  # Family size
  df$FamSize <- df$SibSp + df$Parch
  # Get rid of surnames
  df$surname <- NULL
  return(df)
}
train <- FinalTouches(train)
test <- FinalTouches(test)

Decision tree

The first model I want to see is a conditional inference tree grown from the full training data. Since it's only a single tree, it's likely overfitting the data, but I'm interested here in understanding the relationships between the predictors and survival, not prediction (yet).

decision_tree <- ctree(Survived ~ . -PassengerId -Name, data = train)
plot(decision_tree, gp = gpar(fontsize=8))

plot of chunk unnamed-chunk-14

Consistent with the previous post, passenger class and honorific title are key predictors. The two largest terminal nodes in the tree are node 8, which contains 182 women and children in higher passenger classes (of which over 95% survived), and node 13, which contains the 319 men in the lowest passenger class (of which only 11% survived). Besides upper class women and children, the only other group that was more likely to survive than perish was lower class women and children belonging to families of less than four people–but even this group didn't fare that well.

Random forest

Now on to prediction. We saw in the previous post that an ensemble of decision trees called a random forest was our best predictive model so far. To review, random forests are the products of growing several trees from bootstrapped samples of the training data, where each tree has access to a randomly selected subset of available features. Here we'll use the caret package to fit several random forests, each having access to a different number of features. Based on the forest parameter sweep I did in the previous post, which suggested that several features were better, I'll only sample a small subset of this parameter's range. The 'best' forest is simply that with the highest averaged resampling (e.g., k-fold cross validation) accuracy. If you're wondering why best is in quotations, skip to the end of the post!

detach("package:partykit", unload=TRUE)
library(caret)
RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10)
customGrid <- expand.grid(mtry=seq(16,20,by = 2))
t1 <- train(Survived ~ . -PassengerId, data = train, method="cforest",
            trControl = RfoldCV,  tuneGrid=customGrid)
print(t1)
## Conditional Inference Random Forest 
## 
## 891 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 803, 802, 802, 801, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   16    0.8254549  0.6191893  0.03498533   0.07824130
##   18    0.8292827  0.6269744  0.03441446   0.07789530
##   20    0.8278144  0.6238986  0.03396997   0.07640288
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 18.

The best accuracy here is a bit better than my previous random forest, so I submitted this model's predictions on the Kaggle test data. I scored about 76% accuracy, which is worse than several of my previous models. Why? I have a hunch that this model is over-parameterized–there were 28 features that made it into the forest. If we decide to simplify the model, which features should be transformed or dropped? One way to decide is to assess variable importance.

Variable importance

The importance of a feature in a predictive model can be estimated in several different ways (often depending on the type of model and its assumptions). In general, one can measure loss of prediction accuracy after removing a feature from a model. This version is similar in spirit to the common case of model comparison with the Chi-square test. Another example is linear models, where variable importance can be estimated with the absolute value of the t-statistic (ratio of effect size to standard error). The caret package includes several model-specific variable importance measures, as well as a generic version that doesn't rely on any particular model. Let's estimate variable importance for the random forest.
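As a quick aside on the linear-model version (my own illustration, separate from the random forest): the absolute z statistics from a simple logistic regression give a rough importance ordering of its terms.

glm.fit <- glm(Survived ~ Sex + Pclass + Fare + Age, data = train, family = "binomial")
# absolute value of each coefficient's z statistic (effect size / standard error)
sort(abs(coef(summary(glm.fit))[-1, "z value"]), decreasing = TRUE)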

We'll use the permutation-based measure available in the party package. Each predictor is randomly permuted (shuffled row-wise) to remove its association with the response. Then this new variable along with the remainder of predictors are used to predict the response in the out-of-bag sample. The average drop in prediction accuracy after the variable has been permuted (across all trees) is the measure of importance. Let's see where our predictors stand.
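To make the permutation idea concrete before looking at the results, here's a minimal sketch that uses a simple logistic model and a single holdout split instead of cforest's out-of-bag samples (the model and split are my own simplifications, purely for illustration):

set.seed(99)
holdout <- sample(nrow(train), 200)
fit <- glm(Survived ~ Sex + Pclass + Fare, data = train[-holdout, ], family = "binomial")
accuracy <- function(d) mean((predict(fit, d, type = "response") > 0.5) == (d$Survived == "1"))
base.acc <- accuracy(train[holdout, ])
shuffled <- train[holdout, ]
shuffled$Sex <- sample(shuffled$Sex)   # permute Sex to break its link with Survived
base.acc - accuracy(shuffled)          # drop in accuracy = importance of Sex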

model.imp <- varImp(t1, scale=F, useModel = T, conditional=F)
plot(model.imp)

plot of chunk unnamed-chunk-16

Consistent with the single tree grown above, variables related to socioeconomic status (Pclass, Fare), sex (Sex, honorific), and age (Age, honorific) seem to be the key predictors. What's not important? Notice that 'U' is the only level of Deck that's moderately important–but remember that this just means unknown. So one idea is to transform the deck variable into known vs. unknown. In addition, the ticket group variable seems fairly unimportant–we could just drop it.

One thing to note is that Strobl and colleagues have shown that the standard cforest permutation-based importance measure can be misled by spurious correlations (e.g., inability to recognize that age but not shoe size is the relevant predictor of reading skill). To remedy this concern, they created a conditional importance measure that takes into account the correlation structure among predictors in the permutation scheme. The result is that variable importance is conditional on the other predictors (c.f., semi-partial regression coefficients). This method is implemented by setting conditional to true in the varImp function; it's not run here because it's very time intensive.

Some predictive models don't have an obvious way to measure variable importance, in which case one can compute a 'generic' measure of variable importance that relies on ROC curve analysis. The general idea is to try several different cutoff values (decision thresholds) of each predictor, computing sensitivity and specificity for each (both measures are available from the confusion matrix; see part 1). The results can be plotted as an ROC curve (i.e., true positives plotted against false positives at each of these thresholds). Variable importance is taken as the area under the ROC curve. We can implement this measure by setting 'useModel' to FALSE.
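For a single predictor, the underlying computation looks roughly like this (using the pROC package, which is my choice for illustration; caret does the equivalent internally for every predictor):

library(pROC)
roc.fare <- roc(response = train$Survived, predictor = train$Fare)
auc(roc.fare)   # area under the ROC curve, used as the importance of Fare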

auc.imp <- varImp(t1, scale=F, useModel = F)
plot(auc.imp)

plot of chunk unnamed-chunk-17

Many of the same conclusions can be drawn from this measure. But ticket group seems to be moderately important here, so let's take a look with a mosaic plot.

library(vcd)
mosaicplot(train$tix.grp ~ train$Survived, shade=F, color=T)

plot of chunk unnamed-chunk-18

Perhaps there's something unique about tickets with 5 digits: the majority of those people survived as opposed to every other ticket group. So let's transform deck and ticket group to binary variables that not only may be more predictive, but will reduce model complexity.

Simplify = function(df) {
  df$Deck <- ifelse(df$Deck=='U', 'unknown','known')
  df$tix.grp <- ifelse(df$tix.grp==5, 'five','other')
  df$Deck <- as.factor(df$Deck)
  df$tix.grp <- as.factor(df$tix.grp)
  return(df)
}
train <- Simplify(train)
test <- Simplify(test)

Now I'll build another random forest using these features, but this time I'll let train choose the 'mtry' parameters.

set.seed(rand.seed)
RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10, allowParallel = TRUE)
t2 <- train(Survived ~ . -PassengerId, data = train, method="cforest", trControl = RfoldCV)
print(t2, details=T)
## Conditional Inference Random Forest 
## 
## 891 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    2    0.7947205  0.5243128  0.03712724   0.09209961
##    9    0.8287313  0.6262203  0.03717407   0.08255746
##   16    0.8289648  0.6276099  0.03652213   0.08175874
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 16. 
## 
## ----------------------------------------------------------
## 
## The final model:
## 
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  500 
## 
## Response:  .outcome 
## Inputs:  Pclass2, Pclass3, Sexmale, SibSp, Parch, EmbarkedQ, EmbarkedS, honorificMrs, honorificMaster, honorificMiss, honorificMr, Deckunknown, tix.grpother, Age, Fare, FamSize 
## Number of observations:  891

Accuracy is similar to the more complex random forest. Let's look again at model-based variable importance.

auc.imp2 <- varImp(t2, scale=F)
plot(auc.imp2)

plot of chunk unnamed-chunk-21

It looks like those new binary features might be adding some predictive power. On the other hand, there are at least three features that aren't adding much to the model. We could just drop the least important variables, and keep fitting models. But there's another way to choose the 'best' subset of features.

Feature selection

When a prediction problem involves only a handful of possible features, like this Titanic dataset, the modeler might have sufficient knowledge of the problem to hand select the most relevant predictor variables. But sometimes prediction problems come with tens of thousands of candidate predictors, for example in classification problems involving DNA sequences or large text documents. In these cases it is not usually feasible to hand select the relevant features. Automated feature selection techniques present a potential solution, but they come with some problems of their own–at least in their use and interpretation.

Two main classes of feature selection techniques are 'wrapper' techniques and 'filter' techniques. Here I'll focus on wrapper techniques. Most generally, wrapper techniques evaluate different subsets of available variables–sometimes including the interactions between them–to find a “best” model. If you're in the social or natural sciences, chances are you've already heard of or used one such technique–stepwise regression. In my experience, experimental scientists working with smaller datasets almost exclusively use wrapper techniques, and most commonly, forward stepwise regression. In this technique, a model containing a single predictor plus the intercept is compared to the null model containing only the intercept term. If there are 5 predictors, for example, 5 of these models are fitted. The model that explains the most additional variance in the response variable (or sometimes the variable with the lowest p-value) wins. Then this new model with two terms becomes the base model that is compared to each model containing one of each remaining candidate predictors.
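Base R's step() function implements this kind of search. Here's a quick sketch of forward selection over a handful of our predictors; note that step() adds terms based on AIC rather than explained variance or p-values, so this shows the flavor of the technique rather than exactly the procedure described above.

null.model <- glm(Survived ~ 1, data = train, family = "binomial")
full.model <- glm(Survived ~ Sex + Pclass + Fare + Age + FamSize, data = train, family = "binomial")
# add one term at a time, keeping the addition that most improves AIC
forward <- step(null.model, scope = formula(full.model), direction = "forward", trace = 0)
formula(forward)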

Despite their popularity, stepwise regression methods have been criticized heavily for producing biased estimates, overfitting, and generally being a bit of a joke among professional statisticians. The problem may be most dramatic when no validation data is available, no resampling or cross-validation is used during selection, and the researcher reports estimates or predictions from the final selected model with no mention of how it was selected. Just as we saw in model selection, one way to alleviate some of these concerns is to assess out-of-sample prediction. This caret page addresses the importance of external validation not only for model selection, but for feature selection too:

"It is important to realize that feature selection is part of the model building process and, as such, should be externally validated. Just as parameter tuning can result in over-fitting, feature selection can over-fit to the predictors (especially when search wrappers are used). In each of the caret functions for feature selection, the selection process is included in any resampling loops."

One such automatic wrapper feature selection technique is called recursive feature elimination (RFE), which has some parallels to stepwise regression methods. Let's see how RFE might help us find a better predictive model. RFE starts in the opposite position of forward selection methods, by first fitting the saturated model (containing all predictors), and then removing variables one at a time, comparing variable importance and model prediction for each smaller model. But the same issue arises that we saw in model selection, where the best set of predictors for a particular training dataset may not be replicated in a new dataset–hence the need for external validation. Caret provides the ability to nest the RFE procedure in an outer cross validation loop. Here's pseudocode for the algorithm:

  1. FOR each resampling iteration:
  2.     Partition training data into test/train subsets
  3.     Fit/tune model with all k predictors to the partitioned train data
  4.     Predict model responses on test data
  5.     Note model accuracy and variable importances
  6.     FOR each subset size S, from 1 … k-1 predictors:
  7.         Keep the S most important predictors
  8.         Fit/tune model to train data using S predictors
  9.         Predict model responses on test data
  10.         (optional: recompute variable importance)
  11.     END
  12. END
  13. Compute performance across the S using the held-out test data
  14. Determine the appropriate number of predictors
  15. Estimate the final set of predictors to keep in the full model
  16. Fit the "best" model to the original training data using the optimal S

The caret rfe function implements the RFE algorithm. Its first two arguments are a dataframe of predictors and vector or factor of responses, so let's create those.

predictors <- select(train, c(-PassengerId, -Survived))
responses <- (select(train, Survived))$Survived

The function also takes a control object that includes the class of model to fit (and from which variable importance measures will be extracted), the method of resampling, number of folds (if you choose CV), and number of repeats. The caret documentation explains how to use any of the available models (or a custom model), but let's use the functions available in the randomForest package by specifying rfFuncs. The variable importance measure available in this implementation of random forests is based on Gini node impurity, which is a measure of the quality of a split in a decision tree. The greater the homogeneity of a node from that split, the lesser the impurity.
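As a quick numerical aside, Gini impurity for a node is 1 minus the sum of the squared class proportions; for example, a node whose observations are split 80/20 between the two classes:

p <- c(0.80, 0.20)   # class proportions in the node
1 - sum(p^2)         # Gini impurity = 0.32; a pure node (1, 0) gives 0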

library(randomForest)
ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv",
  number=10, repeats=10, verbose=FALSE, allowParallel=TRUE)

The rfe function also takes a sizes argument specifying which subset sizes to include. Let's run the algorithm over subset sizes of 3 to 11 predictors, and print the output.

rf.profile <- rfe(predictors, responses, sizes = 3:11, rfeControl = ctrl)
print(rf.profile)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 10 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          3   0.7942 0.5578    0.03931 0.08761         
##          4   0.8206 0.6118    0.04196 0.09276         
##          5   0.8367 0.6456    0.03750 0.08336        *
##          6   0.8306 0.6326    0.03751 0.08301         
##          7   0.8284 0.6287    0.03746 0.08241         
##          8   0.8293 0.6322    0.03668 0.07946         
##          9   0.8342 0.6429    0.03544 0.07791         
##         10   0.8341 0.6430    0.03363 0.07382         
##         11   0.8330 0.6403    0.03239 0.07127         
## 
## The top 5 variables (out of 5):
##    honorific, Sex, Fare, Pclass, Age
# NOT RUN:
# trellis.par.set(caretTheme())
# plot(rf.profile, type = c("g", "o"))

The best number of predictors is shown with a star in the 'Selected' column and can be found in rf.profile$bestSubset. Notice that the selected model is also fairly simple, using only 5 of the 11 variables. Let's also plot the variable importance of each of the 'best' predictors–in this case a measure of node impurity from randomForest.

winners <- data.frame(var = rownames(rf.profile$fit$importance),
                      imp = rf.profile$fit$importance)
p <- arrange(winners, -winners[2])
barplot(p[,2], names.arg = p[,1], cex.names = .6, cex.axis = .6)

plot of chunk unnamed-chunk-25

Consistent with the cforest variable importance measure (and the single tree) from earlier, Pclass, Fare, Sex, Honorific, and Age seem to be the key predictors. Let's see if we can build a model with just these 5 predictors that fares as well as the more complex models. But let's leave the forest and enter the realm of a very different type of predictive model.

Neural network

Artificial neural networks have been around since the 1950s. The basic idea is that inputs are mapped onto output(s) by one or more layers of processing units (neurons). Each unit's output is expressed as a function of the input, the weights of the connections to that unit, and sometimes a constant bias term. The function itself can take any form, but in practice a sigmoid is often used. The weights are learned from repeated exposure to labeled training data.
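For a single unit with a logistic (sigmoid) activation, the computation is just a weighted sum of the inputs plus a bias, passed through the squashing function (toy numbers, my own illustration):

sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(0.5, -1.2)          # inputs
w <- c(0.8, 0.3)           # connection weights
b <- 0.1                   # bias term
sigmoid(sum(w * x) + b)    # the unit's output, a value between 0 and 1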

A single layer perceptron was among the first artificial neural networks. Consider the simplest case when a single processing unit takes a set of inputs, and learns to produce an output for each input. When the activation function is the nonlinear logistic (sigmoid) function, the output is constrained between 0 and 1 and can be interpreted as the probability that the input belongs to one or another class. This simple one unit "network" basically implements logistic regression. But the power of neural nets becomes apparent when one or more hidden layers of processing units are placed between the inputs and output. In a 3-layer perceptron, a hidden layer of processing units acts as an intermediary between the input and output. One way to think of the hidden layer units' role is that they represent the inputs in a new way that the output can use to solve the learning problem… even when it is not linearly separable. The standard example is the non-linearly separable XOR problem. A multilayer perceptron can learn a nonlinear mapping via the hidden layer, whereas a single layer perceptron (and likewise logistic regression) cannot.

Let's load the R package 'nnet', which includes the function nnet to fit a multilayer perceptron. First let's confirm that whereas logistic regression can't solve XOR, a 3-layer perceptron with 2 hidden units can.

xor <- data.frame(x1=c(1,1,0,0), x2=c(1,0,1,0), y=c(1,0,0,1))
xor
##   x1 x2 y
## 1  1  1 1
## 2  1  0 0
## 3  0  1 0
## 4  0  0 1

# logistic regression does not predict the correct output
log.reg <- glm(y ~ x1 + x2, data=xor, family = 'binomial')
round(predict(log.reg, type = 'response'), 1)
##   1   2   3   4 
## 0.5 0.5 0.5 0.5

# a multilayer perceptron does
library(nnet)
set.seed(rand.seed)
mlp <- nnet(y ~ x1 + x2, data=xor, size=2)
## # weights:  9
## initial  value 1.000608 
## iter  10 value 0.996987
## iter  20 value 0.757799
## iter  30 value 0.601836
## iter  40 value 0.002959
## iter  50 value 0.001166
## iter  60 value 0.000850
## final  value 0.000088 
## converged
round(mlp$fitted.values, 1)
##   [,1]
## 1    1
## 2    0
## 3    0
## 4    1

Now back to the Titanic dataset. Let's train a multilayer perceptron to map those 5 most important variables we found above, to the binary response. Since the response (coded as a factor) is binary, the nnet function will build a neural net with an output layer consisting of a single output unit (the default activation function is logistic). We get to decide how many hidden units to include in the hidden layer with the 'size' parameter.

set.seed(rand.seed)
n.net <- nnet(Survived ~ honorific+Sex+Fare+Pclass+FamSize, data=train, size=2)
## # weights:  23
## initial  value 597.332481 
## iter  10 value 425.155736
## iter  20 value 363.864709
## iter  30 value 349.029447
## iter  40 value 345.704494
## iter  50 value 343.522282
## iter  60 value 342.907737
## iter  70 value 342.751372
## iter  80 value 342.642992
## iter  90 value 342.586077
## iter 100 value 342.383923
## final  value 342.383923 
## stopped after 100 iterations

We can visualize the structure of the network with NeuralNetTools.

library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
plot.nnet(n.net, bias = F, node.labs = F, cex.val = .8, circle.cex = 2.5)

plot of chunk unnamed-chunk-28

The positive weights are black, the negative weights are grey, and thicker lines correspond to larger weights. Just as in logistic regression, the predicted values from the network's output unit are probabilities between 0 and 1. As we did with logistic regression in the first post, we can create the confusion matrix by rounding the outputs to either 0 or 1. How well does this network fit the training data in terms of observed accuracy?

# confusion matrix
cm <- as.matrix(table(train$Survived, round(n.net$fitted.values, 0)))
# observed accuracy
round((cm[1,1] + cm[2,2]) / sum(cm),2)
## [1] 0.83

Actually it's about the same as the more complex random forest models. But how does this feedforward multilayer neural network do as a predictive model? Let's move back into the caret framework, which supports a number of R neural network packages including nnet. The two parameters under control for nnet are 'size' (# hidden units), and decay (weight decay). The weight decay parameter controls the degree to which large weights are penalized. Shrinking the weights (coefficients) towards zero can prevent overfitting. Therefore we might expect the 'optimal' cross validated model to include at least some weight decay.
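As I understand nnet's weight decay, it adds decay times the sum of squared weights to the fitting criterion, so larger weights cost more (toy numbers, my own illustration):

w <- c(2.0, -1.5, 0.3)   # some hypothetical network weights
decay <- 0.09
decay * sum(w^2)         # the penalty added to the loss for these weights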

set.seed(rand.seed)
RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10, allowParallel=TRUE)
myGrid <- expand.grid(size=1:3, decay=seq(.01, .1, by=.01))

train.nnet <- train(Survived ~ honorific+Sex+Fare+Pclass+FamSize, data = train, 
  method="nnet", trControl = RfoldCV, tuneGrid = myGrid, trace=F)
print(train.nnet, details=TRUE)
## Neural Network 
## 
## 891 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   1     0.01   0.8194152  0.6057693  0.03550613   0.08019971
##   1     0.02   0.8198685  0.6066374  0.03527554   0.07959338
##   1     0.03   0.8207649  0.6086724  0.03529387   0.07948190
##   1     0.04   0.8209883  0.6091715  0.03544306   0.07971195
##   1     0.05   0.8210982  0.6095437  0.03625613   0.08130311
##   1     0.06   0.8215451  0.6106394  0.03596764   0.08050056
##   1     0.07   0.8213216  0.6102566  0.03564968   0.07972147
##   1     0.08   0.8214315  0.6105840  0.03617944   0.08099885
##   1     0.09   0.8218809  0.6117630  0.03576466   0.07971096
##   1     0.10   0.8215438  0.6110636  0.03580264   0.07988065
##   2     0.01   0.8220033  0.6126249  0.03703068   0.08345254
##   2     0.02   0.8196475  0.6073290  0.03687800   0.08277538
##   2     0.03   0.8206537  0.6095724  0.03618922   0.08196159
##   2     0.04   0.8214302  0.6110706  0.03669082   0.08208453
##   2     0.05   0.8216562  0.6115778  0.03687526   0.08289307
##   2     0.06   0.8198609  0.6071371  0.03586885   0.08137523
##   2     0.07   0.8226599  0.6135260  0.03499636   0.07936349
##   2     0.08   0.8208759  0.6096534  0.03560528   0.07914850
##   2     0.09   0.8231219  0.6142658  0.03416244   0.07754455
##   2     0.10   0.8208772  0.6089246  0.03474508   0.07888405
##   3     0.01   0.8169483  0.6036379  0.03602721   0.08023585
##   3     0.02   0.8166061  0.6020818  0.03670291   0.08157090
##   3     0.03   0.8202043  0.6106096  0.03691266   0.08147574
##   3     0.04   0.8183964  0.6059040  0.03556919   0.07989092
##   3     0.05   0.8195226  0.6087452  0.03654531   0.08026656
##   3     0.06   0.8193067  0.6083571  0.03599841   0.08035233
##   3     0.07   0.8196262  0.6086705  0.03706570   0.08278153
##   3     0.08   0.8200794  0.6090686  0.03586894   0.08112879
##   3     0.09   0.8198521  0.6090701  0.03741066   0.08343753
##   3     0.10   0.8197473  0.6090644  0.03528446   0.07860332
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were size = 2 and decay = 0.09. 
## 
## ----------------------------------------------------------
## 
## The final model:
## 
## a 9-2-1 network with 23 weights
## options were - entropy fitting  decay=0.09
##  b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1 i9->h1 
##  -0.63  -0.39  -1.50  -0.02   0.72  -0.08  -0.94  -1.40   2.84   0.85 
##  b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2 i9->h2 
##  -2.77  -0.23  -3.31  -0.91   0.15   2.92   0.05   1.52  -1.21  -0.11 
##  b->o h1->o h2->o 
##  3.41 -5.05 -6.32
ggplot(train.nnet)

plot of chunk unnamed-chunk-30

Looking at the parameter plot, there's a general pattern of increasing cross validated accuracy with increasing weight decay, which we might have expected given the negative relationship between parameter penalization and overfitting. It also looks like even 3 hidden units is too many. Let's fit this neural net to the test data and submit the predicted responses to Kaggle.

Survived <- predict(train.nnet, test)
submission <- data.frame(PassengerId = test$PassengerId, Survived=Survived)
write.csv(submission, 'nnet_5-predictors.csv', row.names = FALSE)

This submission scored 78.9%; better than some models I've submitted so far, but worse than the best random forest from the 2nd post. Remember that this neural net only had the 5 'most important' features to choose from, but according to the feature selection technique results, there were 9 predictors in the best model. Let's tune another set of nets using these 9 best predictors (i.e., all of them except Parch and SibSp), but also expanding the range of the weight decay parameter.

set.seed(rand.seed)
RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10, allowParallel=TRUE)
myGrid <- expand.grid(size=1:3, decay=seq(.05, .5, by=.05))

train.2.nnet <- train(Survived ~ . -PassengerId -SibSp -Parch, 
  data = train, method="nnet", trControl = RfoldCV, tuneGrid = myGrid, trace=F)
print(train.2.nnet, details=TRUE)
## Neural Network 
## 
## 891 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa      Accuracy SD  Kappa SD  
##   1     0.05   0.8283966  0.6262284  0.03666416   0.08169154
##   1     0.10   0.8296389  0.6289615  0.03638087   0.08101925
##   1     0.15   0.8304267  0.6308685  0.03662570   0.08165744
##   1     0.20   0.8306514  0.6313702  0.03670954   0.08176644
##   1     0.25   0.8305403  0.6314395  0.03722916   0.08277337
##   1     0.30   0.8298661  0.6301451  0.03761542   0.08372796
##   1     0.35   0.8297588  0.6299828  0.03735633   0.08332867
##   1     0.40   0.8295354  0.6296764  0.03798923   0.08459876
##   1     0.45   0.8289748  0.6285396  0.03792286   0.08443648
##   1     0.50   0.8289748  0.6287698  0.03819255   0.08504797
##   2     0.05   0.8291869  0.6270824  0.03895579   0.08725155
##   2     0.10   0.8287336  0.6257477  0.03533538   0.07927731
##   2     0.15   0.8304166  0.6284736  0.03759343   0.08497172
##   2     0.20   0.8306502  0.6299061  0.03626282   0.08130310
##   2     0.25   0.8321058  0.6334814  0.03711895   0.08284028
##   2     0.30   0.8312158  0.6317397  0.03844051   0.08602783
##   2     0.35   0.8305253  0.6308421  0.03672171   0.08233385
##   2     0.40   0.8321058  0.6346600  0.03758149   0.08403713
##   2     0.45   0.8301919  0.6306094  0.03731799   0.08322976
##   2     0.50   0.8300845  0.6308015  0.03805407   0.08470974
##   3     0.05   0.8258236  0.6197355  0.03714375   0.08402461
##   3     0.10   0.8269446  0.6220689  0.03564389   0.08152226
##   3     0.15   0.8300657  0.6280479  0.03795216   0.08591197
##   3     0.20   0.8313104  0.6315346  0.03727429   0.08440573
##   3     0.25   0.8320983  0.6331346  0.03785046   0.08506085
##   3     0.30   0.8291869  0.6274166  0.03656718   0.08187601
##   3     0.35   0.8297538  0.6290897  0.03825896   0.08519744
##   3     0.40   0.8290684  0.6282978  0.03595013   0.08035171
##   3     0.45   0.8292879  0.6285565  0.03736792   0.08317332
##   3     0.50   0.8289684  0.6283507  0.03901034   0.08642523
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were size = 2 and decay = 0.4. 
## 
## ----------------------------------------------------------
## 
## The final model:
## 
## a 14-2-1 network with 33 weights
## options were - entropy fitting  decay=0.4
##   b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1  i7->h1  i8->h1 
##   -0.51    1.06    0.38    0.36    0.09    0.35   -0.31   -0.95   -0.23 
##  i9->h1 i10->h1 i11->h1 i12->h1 i13->h1 i14->h1 
##    0.39    1.26    0.18    0.07   -1.40    0.95 
##   b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2  i7->h2  i8->h2 
##    1.08    0.96   -0.76   -1.82    0.06   -0.46    1.19    2.27    0.49 
##  i9->h2 i10->h2 i11->h2 i12->h2 i13->h2 i14->h2 
##   -1.68   -0.42   -0.60   -0.01    0.10   -0.20 
##  b->o h1->o h2->o 
## -0.25 -2.51  4.71
ggplot(train.2.nnet)

plot of chunk unnamed-chunk-32

This time the sweet spot of weight decay appears to be between about .2 and .4. The best network had 2 hidden units and a weight decay of 0.4 (tied in cross-validated accuracy with 0.25). It's interesting that even quite large weight decay (up to .5) doesn't seem to impair cross-validated accuracy very much.

Okay, here goes my last Kaggle submission. The results are … that this network's predictions scored 78.5%: no real change from the simpler network, and worse than my best submission from the previous post. But I don't care, and here's why.

The best model?

Maybe it would've been nice to end with my best submission to Kaggle… but the thing is, maybe I did and I just don't know it. Kaggle holds back two validation sets: one that is used for the leaderboard and one that isn't revealed until the end of the contest. Although it may be tempting to take the Kaggle scores on the public leaderboard as arbiters of pure unbiased classification performance (…the website spins, you take a breath, and then you get a quick hit of feedback…), they are not. If you submit 10 models a day, each tuned a little differently to carve out small gains on the leaderboard, then you have almost surely created a biased classifier. That is, you've fostered dependence between the classifier and the test data, and your classifier is more likely to perform poorly on new (previously unseen) validation data. A quick Google search reveals a few proposed solutions to the leaderboard issue, including this one. Beyond prediction competitions, leaderboard bias is related to a more general and (as far as I can tell) unresolved issue in predictive modeling.

The issue is how to use a subset of data (OLD) to make the most accurate predictions about new data (NEW), where both datasets were generated by the same population (P). In practice, OLD and NEW are often random subsets of an existing dataset. For example, the Titanic data we've been working with are OLD, the leaderboard test data are NEW, and both are subsets of the entire passenger list (P). An implicit assumption I've been making in these posts is that estimating a model's prediction accuracy on OLD by averaging cross-validated out-of-sample accuracy estimates is better than no cross-validation, and that repeated k-fold cross-validation is better than a single run of cross-validation. By better, I mean a less biased estimate of a given model's accuracy in predicting P (not just NEW). But these assumptions may not be valid.

Vanwinckelen and Blockeel (2012) showed in simulation experiments on several large datasets that increasing the repetitions of cross-validation (increasing the number of prediction accuracy estimates to aggregate) on OLD does not converge on the true / population accuracy (i.e., using the whole dataset, or P). Although the confidence intervals of the estimates did shrink with increasing repetition (i.e., the variance of the CV estimator decreased), these tighter intervals were shown to be less likely to contain the population accuracy. The authors also provide evidence that since the training data is always smaller than the full dataset during cross-validation, the accuracy estimates exhibit pessimistic bias–they underestimate the true accuracy. This is just one paper of many in this area, and their findings may not generalize to all cases. But it highlights the difficulty of answering certain questions. How do I know if my model is the best model? How accurate are my model's predictions? The answer of course is that you can't know, you can only estimate and revise.
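To make that distinction concrete, here's a minimal simulation sketch (my own toy example, not the authors' experiments): draw an 'OLD' sample from a simulated population, repeat 10-fold CV on it, and compare those estimates with the accuracy the same fitted model achieves on the whole population.

set.seed(1)
n.pop <- 100000
pop <- data.frame(x1 = rnorm(n.pop), x2 = rnorm(n.pop))
pop$y <- rbinom(n.pop, 1, plogis(1.5 * pop$x1 - pop$x2))

#our 'OLD' sample from the population P
old <- pop[sample(n.pop, 500), ]

#single run of k-fold CV for a logistic regression, returning mean accuracy
cv.acc <- function(df, k = 10) {
  folds <- cut(sample(seq_len(nrow(df))), breaks = k, labels = FALSE)
  mean(sapply(1:k, function(i) {
    fit <- glm(y ~ x1 + x2, data = df[folds != i, ], family = binomial)
    pred <- round(predict(fit, df[folds == i, ], type = "response"))
    mean(pred == df$y[folds == i])
  }))
}

#'population' accuracy of the model fit to OLD
fit.old <- glm(y ~ x1 + x2, data = old, family = binomial)
pop.acc <- mean(round(predict(fit.old, pop, type = "response")) == pop$y)

#averaging more repetitions stabilizes the CV estimate, but any gap between
#it and the population accuracy won't shrink just by repeating
reps <- replicate(50, cv.acc(old))
c(population = pop.acc, mean.cv = mean(reps), sd.cv = sd(reps))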

Predictive modeling: Kaggle Titanic competition (part 2)

In a previous post, I documented data preprocessing, feature engineering, data visualization, and model building that culminated in a submission to the Kaggle Titanic competition. I used a linear logistic regression model with a set of features that included some engineered features and some 'out-of-the-box' features to classify passengers into those who survived and those who perished. This model resulted in about 78% classification accuracy, placing me within the top 40% of submissions… not bad, but not great. Let's see if we can do better here by delving into classification trees, ensemble methods like random forests, and the versatile caret package.

Decision tree learning

We've all seen a decision tree before. They often appear as flowcharts, where for example a series of yes/no decisions lead to a particular action. In predictive analysis, the goal of decision tree analysis is to learn or “grow” a tree that effectively predicts a particular outcome (among other outcomes), as a function of input variables (features). Trees are typically learned by a process called recursive partitioning, where the outcome variable is partitioned into subsets of similar or identical responses based on a “test” derived from a particular feature (e.g., Age > 30?). Each branch of the tree represents the outcome of a test (e.g., Yes/No), and each leaf/terminal node is one level of the outcome variable (e.g., survived/perished).
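As a toy illustration of a single partitioning step (hypothetical numbers, not the Titanic data), one yes/no test on a feature splits the outcome into two purer groups, each of which can then be tested and split again:

toy <- data.frame(Age = c(8, 10, 15, 25, 30, 45, 60, 70),
                  Survived = c(1, 1, 1, 0, 1, 0, 0, 0))
#the "test" at the root node
split <- toy$Age > 20
#rows are the two branches; counts show how pure each branch is
table(split, toy$Survived)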

From my background designing and analyzing experiments, one particular feature of decision trees is especially attractive: the ability to include one or several two-way or higher-order interactions between variables/features. A hypothetical example of an interaction in a sinking-ship disaster is that age matters differently for each sex: older males are more likely to perish than survive, while younger females are more likely to perish than older females. That is, gender could interact with age. None of the models in the previous post included any interaction terms, whether two-way, three-way, or higher. Nor did I even examine any interactions. When there are many possible features and not so many observations (small n, large p problems), but even in the present case, including (and inspecting) all possible combinations of higher-order interactions is not feasible. Additionally, the parameters of substantially more complex models with several and/or higher-order interactions often cannot be reliably estimated with ordinary least squares or maximum likelihood estimation.

Preprocessing

I'll load the preprocessed training and test data I created in the previous post. I'll include some of the original features that I removed before model fitting in the previous post. This time around, I'll rely less on my domain knowledge / hunches about which variables to include, and more on the models. I did however remove a few of the original variables that wouldn't make good features, as well as my 'CabinAlpha', 'Good' and 'Bad' cabin features that I'm not happy with. As we did before, I'll perform the identical operations on both the training data and test data provided by Kaggle.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE)

train <- read.csv('preprocessed_train_data.csv', header = TRUE)
test <- read.csv('preprocessed_test_data.csv', header = TRUE)

CleanDF = function(df){
  delete <- c("honorific", "Name", "Ticket", "Cabin", "CabinAlpha", "BadCabin", "GoodCabin")
  return(df[, !(names(df) %in% delete)])
}

train <- CleanDF(train)
test <- CleanDF(test)

Upon inspecting the variable types in the new dataframe I found a few variable types that need to be changed. This function will do the trick.

#function to change some variable types
changeVars = function(df){
  #change ordinal variables coded as numeric to factors
  df$Pclass <- as.factor(df$Pclass)
  #the outcome variable will only exist in the training data
  if("Survived" %in% names(df)){
    df$Survived <- as.factor(df$Survived)
  }
  #return new data frame
  return(df)
}
train <- changeVars(train)
test <- changeVars(test)

Classification trees

Although several different algorithms are available for decision tree learning, I'll only use one type here, called conditional inference trees. In writing this post I relied heavily on an excellent introduction to classification trees and recursive partitioning more generally by Carolin Strobl and colleagues. Let's start by learning some conditional inference trees. The R functions ctree (which implements classification and regression trees) and cforest (which implements random forests) are both available in the partykit package. As explained in the documentation, the ctree algorithm works roughly like this:

1) Test the global null hypothesis of independence between any of the input variables and the response. Stop if this hypothesis cannot be rejected. Otherwise select the input variable with the strongest association to the response. This association is measured by a p-value corresponding to a test for the partial null hypothesis of a single input variable and the response.
2) Implement a binary split in the selected input variable.
3) Recursively repeat steps 1) and 2).

# set a random seed to ensure repeatability
set.seed(99)
require(partykit)

The ctree function takes an argument for an object of class “TreeControl”, returned by the function ctree_control, containing various parameters for tree learning. Let's start simple, with a tree that is forced to have only 1 split (i.e., only 1 input variable will be selected to partition the outcome variable). I'll call it m5 for model 5, carried over from the previous post using logistic regression models.

tc_d1 <- ctree_control(maxdepth = 1)
m5 <- ctree(Survived ~ ., data = train, control = tc_d1)
print(m5)
## 
## Model formula:
## Survived ~ Pclass + PassengerId + Sex + Age + SibSp + Parch + 
##     Fare + Embarked + median.hon.age + median.class.fare + honorificType + 
##     CabinAlphaUno + FamSize + Fare.pp + wcFirst
## 
## Fitted party:
## [1] root
## |   [2] CabinAlphaUno in A, T, U: 0 (n = 703, err = 30.3%)
## |   [3] CabinAlphaUno in B, C, D, E, F, G: 1 (n = 188, err = 31.4%)
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2

The printed output shows that the single split was on the variable 'CabinAlphaUno', meaning it exhibited the strongest association with Survival. I crudely created this feature from the existing 'Cabin' variable, by simply taking the first letter of the string. So this tells me that there's probably better feature engineering to be done with the Cabin variable. I'm going to bypass 'CabinAlphaUno' for a moment and grow a tree that doesn't have access to it.

m6 <- ctree(Survived ~ . -CabinAlphaUno, data = train, control = tc_d1)
print(m6)
## 
## Model formula:
## Survived ~ Pclass + PassengerId + Sex + Age + SibSp + Parch + 
##     Fare + Embarked + median.hon.age + median.class.fare + honorificType + 
##     FamSize + Fare.pp + wcFirst
## 
## Fitted party:
## [1] root
## |   [2] honorificType in Master, Miss, Mrs: 1 (n = 351, err = 27.9%)
## |   [3] honorificType in Mr, Noble: 0 (n = 540, err = 16.5%)
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2

Now the strongest association with the outcome variable was the honorifics I extracted from the passenger names. This feature was used to split the outcomes into 2 terminal nodes, with what appears to be mostly men (save a few female noblepersons) on one side of the split, and women plus boys on the other. Let's plot a simple version of this classification tree.

plot(m6, type="simple")

plot of chunk unnamed-chunk-5

The grey terminal nodes provide the number of observations per node, plus the classification error. So 28% of the women and boys did not survive, whereas only 17% of the men and “nobles” did survive. The default plot helps visualize these ratios, with color representing the outcome.

plot(m6)

plot of chunk unnamed-chunk-6

It looks to me like the main effect of 'honorificType' is actually capturing an Age by Sex interaction. Let's grow a tree that is forced to include just those two variables.

m7 <- ctree(Survived ~ Sex+Age, data = train)
plot(m7)

plot of chunk unnamed-chunk-7

Yep. This tree represents a statistically significant Age x Sex interaction. Females generally survived regardless of age, but for males age mattered. Even so, the boys (males under 12) fared worse than the females as a monolithic group; just slightly more than half of the boys survived. So the “women and children first” strategy didn't work completely. Still, my hunch is that if we don't let the tree select either the CabinAlphaUno or the honorific variables, then “wcFirst” will be selected first among the other candidates. Let's find out.

m8 <- ctree(Survived ~ wcFirst, 
  data = train[, !(names(train)%in% c('CabinAlphaUno', 'honorificType'))], control = tc_d1)
plot(m8)

plot of chunk unnamed-chunk-8

Indeed, the “women and children first” feature is the 3rd runner up for the initial split. Ok, curiosity satisfied. Now let's stop restricting the tree, and let the algorithm decide on its own which features to choose, and when to stop growing the tree.

m9 <- ctree(Survived ~ ., data = train)
plot(m9)

plot of chunk unnamed-chunk-9

Now that's a tree. There are several higher-order interactions here, including something that you don't see in linear models with interaction terms (including the experimenters' model of choice: analysis of variance). That is, 'CabinAlphaUno' is split once at the first level, and then again two levels deeper on a branch that stems from itself. This is entirely possible in decision trees, because at each split every variable is tested once again.

Does this tree reveal anything interpretable over and above the Age x Sex interaction we already saw (and that is driving the second-level splits here)? For one thing, cabin has a particularly strong association with survival, winning the spot for the first split, despite the fact that over ¾ of the passengers' cabin assignments were unknown ('U'). The other striking relationship I notice is that among the women and children (leftmost branch) holding a third-class ticket (Pclass = 3), family size matters a lot. For example, among the 45 people belonging to larger families (4+ people), almost none survived.

So how well does this tree's predicted/fitted outcomes capture the observed outcomes? I'll call the ClassPerform function defined in the previous post, which takes a confusion matrix as its argument and returns two measures of binary classification performance: observed accuracy and kappa. Also note that the tree's predicted responses (which we need to build the confusion matrix) are obtained in the usual way with R's predict function.

# Confusion matrix 
m9.cm <- as.matrix(table(train$Survived, predict(m9, train, type = "response")))
ClassPerform(m9.cm)
##    ClassAcc     Kappa
## 1 0.8338945 0.6421279

This tree fits the observed data no better than the much simpler logistic regression model in the previous post. But does this necessarily mean that its predictive accuracy is no better? No, because a model's fit to the observed data is a measure of its explanatory power (assuming an underlying causal theory, but that's another issue). At the very least, these measures can help sort out which model best describes known data. Prediction and explanation are often conflated or poorly specified, including in the professional scientific literature (e.g., probably in one of my papers). I like this description of each by Shmueli:

In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory. In contrast, predictive modeling seeks to minimize the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision… the “wrong” model can sometimes predict better than the correct one.

We could derive predicted outcomes for a new dataset, like the Kaggle test data here, but there's an issue with single classification trees that decreases their utility as classifiers. What's the problem? Here are Strobl and colleagues:

The main flaw of simple tree models is their instability to small changes in the learning data: In recursive partitioning, the exact position of each cutpoint in the partition, as well as the decision about which variable to split in, determines how the observations are split up in new nodes, in which splitting continues recursively. However, the exact position of the cutpoint and the selection of the splitting variable strongly depend on the particular distribution of observations in the learning sample. Thus, as an undesired side effect of the recursive partitioning approach, the entire tree structure could be altered if the first splitting variable, or only the first cutpoint, was chosen differently because of a small change in the learning data. Because of this instability, the predictions of single trees show a high variability.
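Here's a quick sketch of that instability using the training data from above: grow the same depth-1 tree on two bootstrap resamples and compare the first split. The selected variable and/or its cutpoint can change from one resample to the next.

set.seed(1)
boot1 <- train[sample(nrow(train), replace = TRUE), ]
boot2 <- train[sample(nrow(train), replace = TRUE), ]
#compare the variable and cutpoint chosen for the single split in each tree
print(ctree(Survived ~ ., data = boot1, control = ctree_control(maxdepth = 1)))
print(ctree(Survived ~ ., data = boot2, control = ctree_control(maxdepth = 1)))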

Bagging and random forests

So single trees are prone to high-variance predictions (i.e., overfitting). Did anybody come up with a better idea? Yes. A better approach involves repeatedly growing trees from randomly selected bootstrapped samples of the training data (samples of the same size, but drawn with replacement), and making predictions based on the whole ensemble of trees. This bootstrap aggregating, or bagging, is a popular ensemble approach. There is another method that aggregates even more diverse trees. In addition to growing several trees from bootstrapped samples, each tree can be made to choose from a randomly restricted subset of all available features. This ensemble approach is called a random forest. Among the parameters of random forests are the number of trees to grow and the number of randomly selected features to choose from. Theoretical and empirical evidence suggests that individual classification trees are unbiased (but high-variance) predictors, so these ensemble methods mitigate the high variance by growing many trees.
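To make the bagging idea concrete before handing the work over to cforest, here's a minimal hand-rolled sketch using the ctree function and training data from above (an illustration of the principle only, not exactly what cforest does internally):

set.seed(99)
n.trees <- 25
#each column holds one bootstrapped tree's 0/1 predictions for every passenger
votes <- sapply(1:n.trees, function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- ctree(Survived ~ ., data = boot)
  as.integer(as.character(predict(fit, newdata = train, type = "response")))
})
#majority vote across trees
bagged.pred <- ifelse(rowMeans(votes) > 0.5, 1, 0)
#training-set (not out-of-bag) agreement, so this will look optimistic
mean(bagged.pred == as.integer(as.character(train$Survived)))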

Let's build a bagged ensemble, which we can implement in cforest by setting mtry to Inf (i.e., the full set of available features).
We'll also build a random forest, using the default for 'mtry'. We'll grow 500 trees (the default) for each model.

m10 <- cforest(Survived ~ ., data = train, ntree = 500, mtry = Inf)
m11 <- cforest(Survived ~ ., data = train, ntree = 500)

So where's the tree? Well, while there are ways to plot one of the 500 trees that make up the ensemble, it doesn't make sense like it did with the single trees we grew earlier. Ensemble classifiers use “votes” from hundreds of independently sampled trees, each of which has only seen a subset of all the training data and a subset of all available features, to settle on a predicted outcome. So no single tree in the ensemble will necessarily give us a sense of the forest. The main goal here, though, is prediction, so let's do that. Actually, we don't have access to a test set that the models haven't seen before. But we can use the imperfect alternative of out-of-bag (OOB) error estimation. Basically, the method is to repeatedly evaluate predictions from a given tree on held-out data that wasn't available for growing that particular tree (i.e., data that was not in the bootstrap sample used to grow that tree). Here I'll obtain the confusion matrix for each model (incorporating the OOB prediction error), and call my ClassPerform function to get the accuracy measures.
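As a rough sketch of what OOB estimation is doing (continuing the hand-rolled bagging example above, not what cforest computes internally), each passenger is only ever "voted on" by trees whose bootstrap sample happened to leave that passenger out:

set.seed(99)
n.trees <- 25
oob.votes <- matrix(NA, nrow = nrow(train), ncol = n.trees)
for (i in 1:n.trees) {
  idx <- sample(nrow(train), replace = TRUE)
  fit <- ctree(Survived ~ ., data = train[idx, ])
  #rows not drawn into this tree's bootstrap sample are its out-of-bag cases
  oob <- setdiff(seq_len(nrow(train)), idx)
  oob.votes[oob, i] <- as.integer(as.character(
    predict(fit, newdata = train[oob, ], type = "response")))
}
#majority vote using only out-of-bag predictions for each passenger
oob.pred <- ifelse(rowMeans(oob.votes, na.rm = TRUE) > 0.5, 1, 0)
mean(oob.pred == as.integer(as.character(train$Survived)), na.rm = TRUE)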

# OOB classification of true vs. predicted classes
m10.cf <- table(predict(m10, OOB=TRUE), train$Survived)
m11.cf <- table(predict(m11, OOB=TRUE), train$Survived)

# Bagging
ClassPerform(m10.cf)
##    ClassAcc     Kappa
## 1 0.8451178 0.6640437

# Random forest
ClassPerform(m11.cf)
##    ClassAcc     Kappa
## 1 0.8361392 0.6465669

Kaggle submission

In comparison with the repeated 10-fold cross-validated prediction accuracy of my best logistic regression model in the previous post, these aren't any better, and maybe a bit worse. But I can't help but submit one of these models' predictions to Kaggle, to see how the model fares on a truly new dataset. I'll submit the predictions from the ensemble of bagged trees, but now I'll make predictions from the model trained on all of the available data (by setting OOB to FALSE).

Survived <- as.factor(predict(m10, test, OOB=FALSE))
submission <- data.frame(PassengerId = test$PassengerId, Survived=Survived)

#Write to .csv
write.csv(submission, 'bag.csv', row.names = FALSE)

How did I do?

plot of chunk unnamed-chunk-15

It's an improvement on the logistic regression model. Less than 0.5% improvement, but I nonetheless moved up 343 spots on the leaderboard. The margins of success are increasingly small as prediction accuracy gets higher. There is a lot more we can do with ensemble methods in partykit, but there are so many other kinds of predictive models out there with their own R packages. If only there was a way to access them all, and their associated methods, in an integrated framework… oh wait there is a way.

Modeling with the caret package

The caret (Classification And Regression Training) package provides a standardized interface to hundreds of different R functions useful for predictive modeling. The author of the caret package co-authored the book Applied Predictive Modeling, in which caret features heavily.

At the heart of caret is the train function, which can 1) estimate model performance from a training set; 2) use resampling approaches to evaluate the effects of model tuning parameters on model performance; and 3) choose the optimal model across these parameters. So far we've done some of the first, but neither of the latter two. Here is pseudocode for the general train algorithm:

  1. Define model parameters to evaluate.
  2. FOR each parameter:
  3.    FOR each resampling iteration:
  4.       Hold out 'test' data
  5.       Fit model on remainder of 'training' data
  6.       Predict the outcome variable on the test data
  7.    END
  8.    Compute mean accuracy metrics across test sets
  9. END
  10. Determine optimal parameters.
  11. Fit the best model to the full training dataset.
  12. Optionally derive predictions for new test dataset.

Ok so let's continue with the ensemble approach from above. We'll build a random forest of conditional inference trees (i.e., cforest function), but this time we'll use train to systematically evaluate the influence of the parameter 'mtry' (# of randomly selected predictors).

First, we choose a resampling method and define its parameters. To maintain consistency with what I've done so far, I'll use repeated 10-fold cross validation (repetition = 5).

require(caret)

RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=5,
  verboseIter = FALSE, allowParallel = TRUE)

Now we call train, which takes the model formula, data, estimation method, and the train control object as arguments. How does train select which levels of parameters to evaluate? By default, if p is the number of tuning parameters, the parameter grid size is 3^p. The user can also input a custom grid, though, which is what I elected to do here to sample more of the parameter space.

customGrid <-  expand.grid(mtry=seq(3,27,by = 2))
t1 <- train(Survived ~ ., data = train, method="cforest", 
  trControl = RfoldCV,  tuneGrid=customGrid)

Let's examine some of the contents of the returned object, starting with the classification accuracy at different levels of the 'mtry' parameter.

print(t1, details=T)
## Conditional Inference Random Forest 
## 
## 891 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 803, 802, 802, 802, 802, 801, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    3    0.8103584  0.5748504  0.04065178   0.09170079
##    5    0.8202288  0.6095693  0.04218561   0.09216546
##    7    0.8258343  0.6222412  0.04172612   0.09090105
##    9    0.8247132  0.6179529  0.04252162   0.09278510
##   11    0.8262812  0.6202889  0.04386115   0.09686357
##   13    0.8262787  0.6194249  0.04294836   0.09550746
##   15    0.8267382  0.6199570  0.04319338   0.09579875
##   17    0.8271826  0.6202590  0.04398649   0.09809832
##   19    0.8262812  0.6183850  0.04236671   0.09445320
##   21    0.8265033  0.6185649  0.04280261   0.09552688
##   23    0.8240339  0.6134755  0.04264691   0.09466191
##   25    0.8240162  0.6133688  0.04448782   0.09864004
##   27    0.8244782  0.6142466  0.04353985   0.09642757
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 17. 
## 
## ----------------------------------------------------------
## 
## The final model:
## 
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  500 
## 
## Response:  .outcome 
## Inputs:  Pclass2, Pclass3, PassengerId, Sexmale, Age, SibSp, Parch, Fare, EmbarkedQ, EmbarkedS, median.hon.age, median.class.fare, honorificTypeMiss, honorificTypeMr, honorificTypeMrs, honorificTypeNoble, CabinAlphaUnoB, CabinAlphaUnoC, CabinAlphaUnoD, CabinAlphaUnoE, CabinAlphaUnoF, CabinAlphaUnoG, CabinAlphaUnoT, CabinAlphaUnoU, FamSize, Fare.pp, wcFirstyes 
## Number of observations:  891

The optimal forest emerged when several but not all predictors were made available. In addition to the mean accuracy metrics, the train object also contains the individual metrics from each prediction (i.e., number of folds x repetitions) for the optimal model. We can get a sense of the variability (and sampling distribution) of each measure by plotting histograms.

hist(t1$resample$Accuracy, col="grey")

plot of chunk unnamed-chunk-19

hist(t1$resample$Kappa, col="green")

plot of chunk unnamed-chunk-19

We also can inspect performance across the parameter space (in this case, just one parameter) with a ggplot method.

ggplot(t1)

plot of chunk unnamed-chunk-20

Kaggle submission

Our best Kaggle submission so far was a bagged ensemble of trees, but the above plot suggests that with the current set of predictors, a random forest that samples most, but not all, of the predictors is more accurate than one where all predictors are available. Let's make a prediction.

The predict.train function takes a train object as its first argument; the train object already contains only the optimal model, so you should not supply train$finalModel to this argument.

Survived <- as.factor(predict.train(t1, test))
submission <- data.frame(PassengerId = test$PassengerId, Survived=Survived)

#Write to .csv
write.csv(submission, 'rand_forest.csv', row.names = FALSE)

How did we do?

plot of chunk unnamed-chunk-22

Actually, a significant improvement. We used the same set of candidate features that were available in my previous submissions, but this time caret helped us zero in on a better ensemble (random forest) model. This is a substantial jump on the leaderboard, and we're now within the top 25% of submissions. So how to improve? We could perform more comprehensive comparisons of this model with several other models available in R and the caret interface. But actually I'm not convinced that we've extracted the best set of features from the out-of-the-box training data provided by Kaggle. In the next and final post I'll start from scratch from the Kaggle data, engineer a few more features, fit a wider variety of models, and hopefully end with my best submission.

Predictive modeling: Kaggle Titanic competition (part 1)

The Kaggle Titanic competition is a great way to learn more about predictive modeling, and to try out new methods. So what is Kaggle? It is “the world's largest community of data scientists. They compete with each other to solve complex data science problems, and the top competitors are invited to work on the most interesting and sensitive business problems from some of the world’s biggest companies…” Kaggle provides a few “Getting Started” competitions with highly structured data, including this one. The goal here is to predict who survived the Titanic disaster and who did not based on available information.

In this post and a subsequent post I'll share my journey through data preprocessing, feature engineering, data visualization, feature selection, cross-validation, model selection, and a few submissions to the competition. Along the way I relied many times, for ideas and sometimes code, on a few excellent existing R tutorials for the Titanic prediction challenge, including this one and this one.

Data Preprocessing

Before any modeling or predictive analysis, we need the data in a suitable format. Preprocessing includes assessing and potentially changing variable types, dealing with missing data, and generally getting to know your data. Whenever possible I try to write generic functions that take a dataframe argument, because whatever preprocessing I do on the training data I'll eventually do on the testing data.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE)

#Read training data that I downloaded from Kaggle website
#the 'NA' and blank arguments to 'na.strings' ensures proper treatment of missing data
train <- read.csv('train.csv', header=TRUE, na.strings = c("NA",""))

#function to change certain variable types
changeVars = function(df){

#change ordinal variables coded as numeric to factors
df$Pclass <- as.factor(df$Pclass)

#change text strings (coded as factors) to character type
df$Ticket <- as.character(df$Ticket)
df$Cabin <- as.character(df$Cabin)
df$Name <- as.character(df$Name)

#return new data frame
return(df)
}

#Update training dataset
train <- changeVars(train)

Most datasets have missing data. We'll need to identify any missing data, and then decide how to deal with them. Let's check for missing values with the Amelia package 'missmap' function.

require(Amelia)
missmap(train, col=c("green", "black"), legend=F)

plot of chunk unnamed-chunk-1

#Cabin is mostly missing, and Age contains 177 missing values
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

Let's impute the missing age values by looking at the honorific for each passenger, which is embedded between a comma and a period in the Name column. I'll write a function that does the imputation and also returns the original df with a new column of honorifics attached.

require(plyr)
Honorifics = function(df) {
#first remove characters following period
lastnames <- sub('\\..*', '', df$Name)

#then remove characters preceding the comma
honorific <- sub('.*\\,', '', lastnames)

#finally return char. vector w/o leading or trailing whitespace
#and attach back to the data. This might actually be a useful feature later.
df$honorific  <- as.factor(gsub("^\\s+|\\s+$", "", honorific))

#get the median age for each group
tmp <- ddply(df, .(honorific),
  summarize, median.hon.age = round(median(Age, na.rm = T),0))

#if there are sole honorifics, and that person has no age value, 
#we're out of luck so let's just assume the average age of the 
#age column for that dataset.
tmp$median.hon.age[which(is.na(tmp$median.hon.age))] <- round(mean(df$Age, na.rm = TRUE), 0)

#merge data frames
tmp2 <- merge(df, tmp)

#replace NAs with median 'honorific' age
tmp2$Age <- ifelse(is.na(tmp2$Age), tmp2$median.hon.age, tmp2$Age)

#return new dataframe
return(tmp2)
}

#update training data and inspect results of new Age variable
train <- Honorifics(train)
summary(train$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   30.00   29.39   35.00   80.00

The 'Embarked' variable also has a few missing entries, but since there are so few let's just replace them with the most common embarked location: 'S'.

aggregate(Survived ~ Embarked, data=train, FUN=sum)
##   Embarked Survived
## 1        C       93
## 2        Q       30
## 3        S      217
aggregate(Survived ~ Embarked, data=train, FUN=mean)
##   Embarked  Survived
## 1        C 0.5535714
## 2        Q 0.3896104
## 3        S 0.3369565
summary(train$Embarked)
##    C    Q    S NA's 
##  168   77  644    2

ImputeEmbark = function(df) {
  df$Embarked[which(is.na(df$Embarked))] <- 'S'
  return(df)
}
train <- ImputeEmbark(train)

Fare also needs a bit of work due to fares of $0.00. I'll replace $0.00 with the median fare from that class.

summary(train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.30

ImputeFare = function(df){
 tmp <- ddply(df, .(Pclass),
  summarize, median.class.fare = round(median(Fare, na.rm = T),0))

 #attach to data, replace zero fares with the class median, and return new df
 tmp2 <- merge(df, tmp)
 tmp2$Fare <- ifelse(tmp2$Fare == 0, tmp2$median.class.fare, tmp2$Fare)
 return(tmp2)
}

#update training data and inspect results
train <- ImputeFare(train)
summary(train$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   21.00   30.00   29.39   35.00   80.00

A touch of feature engineering

Now let's take a closer look at these honorifics, because this might be a useful feature. In other words there is potential here for feature engineering: creating new features/variables/predictors from the raw data. This is a crucial aspect of predictive modeling, and data analysis generally. No matter how fancy your model, garbage in, garbage out. Here I'll create a new feature containing honorific classes.

boxplot(train$Age ~ train$honorific)

plot of chunk unnamed-chunk-5

sort(table(train$honorific))
## 
##         Capt          Don     Jonkheer         Lady          Mme 
##            1            1            1            1            1 
##           Ms          Sir the Countess          Col        Major 
##            1            1            1            2            2 
##         Mlle          Rev           Dr       Master          Mrs 
##            2            6            7           40          125 
##         Miss           Mr 
##          182          517

#We don't want a feature with single instances so let's subsume the single
#honorifics into sensible existing categories with a function
HonorificFeatureBuild = function(df){

  #create a new column instead of replacing the
  #existing 'honorific' column (for sake of completeness)
  df$honorificType <- df$honorific

  #Function to replace old honorific(s) with new one
  replace.string <- function(df, original, replacement) {
    for (x in original) {
      levels(df$honorificType)[levels(df$honorificType)==x] <- replacement
    }
    return(df$honorificType)
  }

  #Combine 'Ms' and 'Dona', which I noticed occurs in the test set, with 'Mrs'
  df$honorificType <- replace.string(df, c('Dona', 'Ms'), 'Mrs')

  #Combine Madam, Mademoiselle, to equivalent: Miss
  df$honorificType <- replace.string(df, c('Mme','Mlle'), 'Miss')

  #Combine several titles of upper class people into 'Noble'
  df$honorificType <- replace.string(df,
    c('Sir','Capt','Rev','Dr','Col','Don',
      'Major','the Countess', 'Jonkheer', 'Lady'), 'Noble')

  #Remove dropped levels
  df$honorificType <- droplevels(df$honorificType)

  #Return new data frame
  return(df)
}

#Return new dataframe and inspect new variable
train <- HonorificFeatureBuild(train)
sort(table(train$honorificType))
## 
##  Noble Master    Mrs   Miss     Mr 
##     23     40    126    185    517
boxplot(train$Age ~ train$honorificType)

plot of chunk unnamed-chunk-5

Mosaic plots are a good way to visualize categorical data, with the height of the bars representing respective proportions and the width representing the size of the group/factor.

require(vcd)
mosaicplot(train$honorificType ~ train$Survived, shade=F, color=T)

plot of chunk unnamed-chunk-6

It looks like this new feature might be useful; we'll see soon enough. But first a little more work in the preprocessing stage. Let's look at the 'Cabin' variable and see about creating another feature there.

CabinFeatures = function(df){
  df$Cabin[which(is.na(df$Cabin))] <- 'Unknown'
  df$CabinAlpha <- gsub('[[:digit:]]+', "", df$Cabin)

  #some have multiple entries; let's just keep the first one
  df$CabinAlphaUno <- sub("^(\\w{1}).*$", "\\1", df$CabinAlpha)

  #cabins B, D, E look good and U looks pretty bad
  #but do these provide any more information than Pclass?
  print(table(df$Pclass, df$CabinAlphaUno))
  #yes, so let's make GoodCabin and BadCabin features
  df$GoodCabin <- as.factor(
    ifelse((df$CabinAlphaUno=='B' |
      df$CabinAlphaUno=='D' |
      df$CabinAlphaUno=='E'),
      'yes','no'))
  df$BadCabin <- as.factor(ifelse(df$CabinAlphaUno=='U', 'yes', 'no'))
  #return new data frame
  return(df)
}

#Update df with new Cabin information
train <- CabinFeatures(train)

Finally, there are a few more sensible features to add, and preprocessing steps, before we're ready to build models. (Of course we might end up back at this stage later on, for example if we think of a great new feature to extract from the raw training data.)

AddFinalFeatures <- function(df) {

  #Consolidate #siblings (SibSp) and #parents/children (Parch)
  #into new family size (FamSize) feature
  df$FamSize <- df$SibSp + df$Parch

  #Adjust fare by size of family (per person)
  df$Fare.pp <- df$Fare / (df$FamSize + 1)

  #'wcFirst' is for the "women and children first" policy for lifeboats
  #(https://en.wikipedia.org/wiki/Women_and_children_first).
  #I'll assume that teenage boys were treated as adults
  df$wcFirst <- 'no'
  df$wcFirst[which(df$Sex == "female" | df$Age < 13)] <- 'yes'
  df$wcFirst <- as.factor(df$wcFirst)
  return(df)
}

#Update the df
train <- AddFinalFeatures(train)

Last, let's remove any unnecessary columns, which will make it easier to define models that include all variables in the data frame.

CleanDF = function(df){
  delete <- c("honorific", "Name", "SibSp", "Parch", "Ticket", "Fare", "Cabin",
              "median.hon.age", "median.class.fare", "CabinAlpha", "CabinAlphaUno")
  return(df[, !(names(df) %in% delete)])
}

#Here is our preprocessed training data
train <- CleanDF(train)

#Write to disk for later use
write.csv(train, "curated_train_data.csv", row.names = F)

Prepare the test data

Now we're ready to fit models to the training data. Eventually though to submit predictions to Kaggle, we'll need the test data in the same format as our new training data. So here I read the test data and then call all the preprocessing functions (as usual, generic functions are the way to go to avoid repetition).

test <- read.csv('test.csv', header=TRUE, na.strings = c("NA",""))
test <- changeVars(test)
test <- Honorifics(test)
test <- ImputeEmbark(test)
test <- ImputeFare(test)
test <- HonorificFeatureBuild(test)
test <- CabinFeatures(test)
test <- AddFinalFeatures(test)
test <- CleanDF(test)

#Write to disk for later use
write.csv(test, "curated_test_data.csv", row.names = F)

Before we continue, let's make sure there were no issues with the outputs of any of these functions. When I eventually submit to Kaggle, I need to fit my favored model to the test set, which means the test set can't have any variables, or levels of variables, that the model won't have seen in the training data. There should not be any missing data either.
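One quick (hedged) sanity check for the first concern is to confirm that every factor level present in the test set also appears in the training set; this assumes the same columns ended up as factors in both data frames. The missingness map below handles the second concern.

factor.cols <- names(test)[sapply(test, is.factor)]
#TRUE for a column means all of its test-set levels also exist in train
sapply(factor.cols, function(v) all(levels(test[[v]]) %in% levels(train[[v]])))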

missmap(test, col=c("green", "black"), legend=F)

plot of chunk unnamed-chunk-11

Looks good. Now we're ready to build the first model.

Model Building

What's the goal of predictive modeling? To settle on a statistical model of the outcome variable as some function of our available features. In this case the outcome variable is binary and categorical: whether a person survived the disaster or not. This is a binary classification problem, and there are several ways to approach it.

Model 1: Logistic regression (simple model)

A common approach to binary classification is the logistic regression model. Logistic regression is not a binary classification method per se, because it estimates the probability of a binary response–ranging continuously from 0 to 1. But by simply rounding the estimated probability to the closest integer (0 or 1) we can use logistic regression as a binary classifier. To start, let's fit a logistic regression model with 3 features I think will be important.
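Just to spell out that rounding step with some made-up numbers (not model output), the linear predictor is converted to a probability and then thresholded at 0.5:

#hypothetical linear predictors (log-odds)
log.odds <- c(-2.1, -0.3, 0.4, 3.0)
#probabilities between 0 and 1
p <- plogis(log.odds)
#predicted classes: 0 0 1 1
round(p)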

m1.glm <- glm(Survived ~ FamSize + Sex + Pclass, data=train, family='binomial')
summary(m1.glm)
## 
## Call:
## glm(formula = Survived ~ FamSize + Sex + Pclass, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2738  -0.7282  -0.4709   0.5903   2.4869  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.50663    0.23843  10.513  < 2e-16 ***
## FamSize     -0.15038    0.05983  -2.514 0.011953 *  
## Sexmale     -2.77670    0.19498 -14.241  < 2e-16 ***
## Pclass2     -0.84771    0.24661  -3.437 0.000587 ***
## Pclass3     -1.87347    0.21522  -8.705  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  820.12  on 886  degrees of freedom
## AIC: 830.12
## 
## Number of Fisher Scoring iterations: 4

Consistent with my hunch about these features, they're all statistically significant. But to get a better sense of the model and its predictions, let's inspect the predicted/fitted values at all levels of the model's features. We apply the predict function to the model along with the chosen levels of the predictor variables in 'newdata'. The default for 'type' is on the scale of the linear predictors (log-odds: probabilities on logit scale). Here we'll choose the alternative option “response” (the scale of the response variable), which gives the predicted probabilities–a more transparent unit to interpret (at least for me).

newdata <- data.frame(
  FamSize=rep(0:7, 6),
  Sex=rep(sort(rep(c('male','female'),8)),6),
  Pclass=as.factor(sort(rep(c('1','2','3'), 16))))

#Add predicted probs
newdata$PredProb <- predict(m1.glm, newdata, type = "response")

#Plot them at all feature levels
require(ggplot2)
ggplot(newdata, aes(x=FamSize, y=PredProb, color=Pclass)) +
  geom_line(aes(group=Pclass)) +
  geom_point() +
  geom_hline(yintercept = .5) +
  facet_grid(. ~ Sex)

plot of chunk unnamed-chunk-13

The model predicts survivors to be female, in smaller families, and in higher classes. It captures trends in the data, but it wrongly predicts no male survivors when we know that over 100 males did survive. So how well does this model predict the actual outcomes?

First we create a confusion matrix, from which we can compute different measures of classification performance. Two commonly used measures are “Classification (Observed) Accuracy” and Cohen's Kappa (originally conceived for inter-rater agreement).

Accuracy is just how often the classifier is correct. Kappa measures how well the classifier performed compared to how well it would have performed simply by chance. A model has a high Kappa when there is a large discrepancy between the accuracy (how often the classifier correctly predicts the outcome) and the null error rate (how often the model would be wrong if it always predicted the majority class).
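As a toy numerical example of the difference (made-up counts, not the Titanic data), a classifier that always predicts the majority class can have decent accuracy but a Kappa of exactly zero. This is the same chance-agreement correction that the ClassPerform function below implements.

cm <- matrix(c(60, 40, 0, 0), nrow = 2,
             dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))
#observed accuracy: (60 + 0) / 100 = 0.6
accuracy <- sum(diag(cm)) / sum(cm)
#chance agreement from the row and column marginals: 0.6
chance <- sum((rowSums(cm) / sum(cm)) * (colSums(cm) / sum(cm)))
#kappa: (0.6 - 0.6) / (1 - 0.6) = 0
kappa <- (accuracy - chance) / (1 - chance)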

#Attach the model's predicted outcomes to the existing data frame
train$predictedSurvival <- round(m1.glm$fitted.values)

#First we make the confusion matrix containing the proportions of 'yes' and 'no,
#where actual outcomes are rows and predicted outcomes are columns.
m1.cm <- as.matrix(table(train$Survived, train$predictedSurvival))

#Function that returns our accuracy measures
#it takes a confusion matrix as its argument
ClassPerform = function(cm){

  #Classification (observed) accuracy 
  #In signal detection terminology:
  #[h=hits; cr=correct rejections; fa=false alarms; m=misses]
  #(h + cr) / (h + cr + fa + m)
  #Or ... 
  #TruePositive + TrueNegative / Total
  obs.acc <- (cm[1,1] + cm[2,2]) / sum(cm)

  #k = (observed accuracy - random accuracy) / (1 - random accuracy)
  #we already have observed accuracy, but we need random accuracy,
  #the probability of agreement expected by chance from the observed data
  #and the model's predictions

  ObservedNegative  <- (cm[1,1] + cm[1,2]) / sum(cm)
  PredictedNegative <- (cm[1,1] + cm[2,1]) / sum(cm)
  ObservedPositive  <- (cm[2,1] + cm[2,2]) / sum(cm)
  PredictedPositive <- (cm[1,2] + cm[2,2]) / sum(cm)
  RandomACC <- (ObservedNegative*PredictedNegative) + (ObservedPositive*PredictedPositive)
  kappa <- (obs.acc - RandomACC) / (1 - RandomACC)
  #Return both measures
  return(data.frame(ClassAcc = obs.acc, Kappa = kappa))
}

So how well does model 1 predict the actual outcomes in the training data?

m1.performance <- ClassPerform(m1.cm)
print(m1.performance)
##    ClassAcc    Kappa
## 1 0.8002245 0.566665

Not too bad. Kappa ranges between -1 and 1, where scores of zero or lower signify no agreement (random relationship) between classifier and reality. Interpretation depends on the field and problem type, but one offered rule of thumb is > 0.75 excellent, 0.40-0.75 fair to good, < 0.40 poor. Let's keep going, we can do better.

Model 2: Logistic regression (saturated model)

Let's use the same type of model but assume that all our features are good ones that should be included in the model.

#remove the predicted outcomes from before
train$predictedSurvival <- NULL

#Fit a logistic regression model with all the features (excluding passenger ID)
m2.glm <- glm(Survived ~ . - PassengerId, data=train, family='binomial')
summary(m2.glm)
## 
## Call:
## glm(formula = Survived ~ . - PassengerId, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4023  -0.5598  -0.3756   0.5518   2.5263  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          2.906e+01  1.014e+03   0.029  0.97713    
## Pclass2             -5.774e-01  4.103e-01  -1.407  0.15937    
## Pclass3             -1.678e+00  4.136e-01  -4.057 4.97e-05 ***
## Sexmale             -2.790e+01  1.014e+03  -0.028  0.97804    
## Age                 -3.198e-02  9.882e-03  -3.236  0.00121 ** 
## EmbarkedQ           -2.172e-01  3.946e-01  -0.550  0.58203    
## EmbarkedS           -5.316e-01  2.493e-01  -2.132  0.03300 *  
## honorificTypeMaster  1.622e+01  8.827e+02   0.018  0.98534    
## honorificTypeMiss   -1.201e+01  4.983e+02  -0.024  0.98077    
## honorificTypeMr     -1.320e-01  6.015e-01  -0.219  0.82626    
## honorificTypeMrs    -1.107e+01  4.983e+02  -0.022  0.98228    
## GoodCabinyes         8.652e-01  3.850e-01   2.247  0.02463 *  
## BadCabinyes         -4.789e-01  3.943e-01  -1.215  0.22454    
## FamSize             -4.299e-01  9.330e-02  -4.608 4.06e-06 ***
## Fare.pp              1.364e-03  9.228e-03   0.148  0.88249    
## wcFirstyes          -1.314e+01  8.827e+02  -0.015  0.98813    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  715.58  on 875  degrees of freedom
## AIC: 747.58
## 
## Number of Fisher Scoring iterations: 13

The AIC is substantially lower than that of the simpler model, despite this model having several more terms (AIC penalizes model complexity/number of parameters). How well does this model predict the outcomes in the training data?

train$predictedSurvival <- round(m2.glm$fitted.values)
m2.cm <- as.matrix(table(train$Survived, train$predictedSurvival))
m2.performance <- ClassPerform(m2.cm)
print(m2.performance)
##    ClassAcc    Kappa
## 1 0.8372615 0.654238

Better than before. But inspection of the coefficients suggests there may be some uninformative features. One way to get a sense of which ones these might be is to sequentially compare more and more complex models with likelihood ratio tests. This is easy to do with the 'anova' function. But note that the terms are added in the same order as specified in the model, so I need to refit the model, placing the terms in what I think is the order of importance based on my domain knowledge and the exploratory analyses I've conducted so far.

m2.glm.ordered <- glm(Survived~Sex+Pclass+Age+FamSize+honorificType+
 GoodCabin+BadCabin+Embarked+Fare.pp+wcFirst, data=train, family="binomial")
#Analysis of Deviance 
deviance.table <- anova(m2.glm.ordered, test="Chisq")
print(deviance.table)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            890    1186.66              
## Sex            1  268.851       889     917.80 < 2.2e-16 ***
## Pclass         2   90.916       887     826.89 < 2.2e-16 ***
## Age            1   24.176       886     802.71 8.791e-07 ***
## FamSize        1   14.113       885     788.60 0.0001721 ***
## honorificType  4   56.882       881     731.72 1.310e-11 ***
## GoodCabin      1    9.643       880     722.07 0.0019008 ** 
## BadCabin       1    1.290       879     720.78 0.2559915    
## Embarked       2    4.758       877     716.02 0.0926273 .  
## Fare.pp        1    0.017       876     716.01 0.8968401    
## wcFirst        1    0.426       875     715.58 0.5139383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Visualize the reduction in the residual deviance
plot(deviance.table$`Resid. Dev`, type="o")

plot of chunk unnamed-chunk-18

From the deviances and significance levels of each feature, it looks like we're justified in dropping 'BadCabin', 'Fare.pp', and 'wcFirst'.

Model 3: Logistic regression (middle ground model)

Let's fit a simpler 3rd model without these features, with the hope that it represents a middle ground between complexity and predictive power.

m3.glm <- glm(Survived~Sex+Pclass+Age+FamSize+honorificType+GoodCabin+Embarked, data=train, family="binomial")
summary(m3.glm)
## 
## Call:
## glm(formula = Survived ~ Sex + Pclass + Age + FamSize + honorificType + 
##     GoodCabin + Embarked, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3017  -0.5664  -0.3772   0.5486   2.5202  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          15.837211 492.786269   0.032  0.97436    
## Sexmale             -14.886862 492.786370  -0.030  0.97590    
## Pclass2              -0.845554   0.324104  -2.609  0.00908 ** 
## Pclass3              -1.957946   0.306748  -6.383 1.74e-10 ***
## Age                  -0.031807   0.009785  -3.251  0.00115 ** 
## FamSize              -0.432291   0.080284  -5.385 7.26e-08 ***
## honorificTypeMaster   3.124046   0.797886   3.915 9.03e-05 ***
## honorificTypeMiss   -12.096458 492.786161  -0.025  0.98042    
## honorificTypeMr      -0.104168   0.586068  -0.178  0.85893    
## honorificTypeMrs    -11.162869 492.786169  -0.023  0.98193    
## GoodCabinyes          1.069362   0.346830   3.083  0.00205 ** 
## EmbarkedQ            -0.202543   0.395174  -0.513  0.60827    
## EmbarkedS            -0.509562   0.248663  -2.049  0.04044 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  717.53  on 878  degrees of freedom
## AIC: 743.53
## 
## Number of Fisher Scoring iterations: 13

As I suspected, the AIC is lower because of the penalization factor. How well does this 3rd model predict the outcomes in the training data?

train$predictedSurvival <- round(m3.glm$fitted.values)
m3.cm <- as.matrix(table(train$Survived, train$predictedSurvival))
m3.performance <- ClassPerform(m3.cm)
print(m3.performance)
##    ClassAcc    Kappa
## 1 0.8383838 0.656051

Very close to the saturated model, but a touch more accurate. And given the reduced complexity, I favor this model substantially over the saturated one.

Out-of-sample Prediction

So far we've assessed each model's explanatory power by inspecting in-sample model fit. But ultimately we're after out-of-sample prediction! We need to settle on a good model that can best predict something it has never seen before… hence the word 'predict'. Let's use cross-validation to estimate how well each model performs on data it was not trained on, and therefore shouldn't overfit. I think the AIC values we saw earlier should give us a hint about out-of-sample prediction performance, because minimizing the AIC is (asymptotically) equivalent to minimizing at least one type of cross-validation value (Stone 1977). Although there are fast and efficient ways to implement what I do below (which I'll rely on later), it helped me understand the process by coding it (mostly) from scratch.
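As a quick look at the AIC side of that claim, we can line up the three models' AICs and later compare that ordering with their cross-validated accuracies (just a comparison of rankings, not a formal test):

AIC(m1.glm, m2.glm, m3.glm)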

k-fold cross validation

I'll create a function that performs k-fold CV and returns both performance measures we saw earlier: accuracy and kappa. The function takes three arguments: a data frame, the number of folds, and a model object. The basic idea is to split the data into k chunks, and use each chunk once as the validation data. The model trained on the remaining k-1 chunks tries to predict the outcomes in the validation set, and the resulting performance is aggregated across the k chunks.

set.seed(99) # make the results repeatable from here on out
kFoldCV <- function(k, mod){

  # get the model data frame
  df <- mod$model

  #Randomly partition data into k folds
  #Use of sample here means we can repeat this function and get a random 
  #set of folds each time the function is called
  folds <- cut(sample(seq_len(nrow(df))),  breaks=k, labels=F)
  accuracy.mat <- matrix(nrow = k, ncol=2)

  for (i in 1:k) {
  #Create training set from the k-1 subsamples
  trainFold <- subset(df, folds != i)  

  #Create the test set
  validationFold <- subset(df, folds == i)

  #Use the fitted model on the test set
  predValues <- predict(mod, newdata = validationFold, type="response")

  #Classify predicted values into binary outcome
  classifier <- ifelse(predValues < 0.5, 0, 1)

  #Make confusion matrix 
  # (get the name of the outcome variable from the model object)
  y <- validationFold[,names(mod$model)[1]]
  confusion.mat <- as.matrix(table(y, classifier))

  #Store classification accuracy measures for that run using 
  #'ClassPerform' function created earlier
  accuracy.mat[i, ] <- t(ClassPerform(confusion.mat))
 }
  #Return both measures in long format
  return(data.frame(Measure=c('ClassAcc','Kappa'), 
    value=c(mean(accuracy.mat[,1]), mean(accuracy.mat[,2]))))
}

Now let's compute 10-fold CV for each model and inspect the results.

m1.ACC <- kFoldCV(k = 10, m1.glm)
m2.ACC <- kFoldCV(k = 10, m2.glm)
m3.ACC <- kFoldCV(k = 10, m3.glm)

print(m1.ACC)
##    Measure     value
## 1 ClassAcc 0.8002747
## 2    Kappa 0.5673504
print(m2.ACC)
##    Measure     value
## 1 ClassAcc 0.8372534
## 2    Kappa 0.6537724
print(m3.ACC)
##    Measure    value
## 1 ClassAcc 0.838377
## 2    Kappa 0.653931

It looks like model 3 has the slight advantage in both cross validated measures. Ok now before continuing, I was curious about the relationship between the number of folds and predictive accuracy. So I ran several CVs at fold sizes from 2 to 60 for model 3, and plotted the relationships.

numFolds <- data.frame(matrix(nrow=59, ncol=2))
for (f in 2:60){
  numFolds[f-1, ] <- kFoldCV(f, m3.glm)[, 2]
}

Does number of folds influence observed accuracy? Doesn't look like it, but the variance of observed accuracy seems to increase with k.

x <- 2:60
y1 <- numFolds[,1]
y2 <- numFolds[,2]
plot(x, y1, xlab='No. folds', ylab='Accuracy')
lo <- loess(y1 ~ x)
lines(x, predict(lo), col='red', lwd=2)

[Plot: observed accuracy vs. number of folds, with loess fit]

Does number of folds influence Kappa? Actually yes, it looks like the more folds, the lower the kappa, and the higher the variance. Neither relationship suggests using more than 10 folds–at least for these data (of these dimensions).

plot(x, y2, xlab='No. folds', ylab='Kappa')
lo <- loess(y2 ~ x)
lines(x, predict(lo), col='red', lwd=2)

[Plot: Kappa vs. number of folds, with loess fit]

Repeated k-fold cross validation

In k-fold cross validation on a dataset of this size, there are many, many ways to cut the training data into k folds. But we implemented only one of these ways, and it was randomly selected. To account for the random variability introduced by this selection, statisticians recommend repeating the 10-fold CV process several times and averaging the accuracy measures across replicates. Let's repeat the validation 1000 times.

reps = 1000 # number of times to replicate the random k selection process
m1.reps <- do.call(rbind, lapply(X=1:reps, FUN=function(x) kFoldCV(10, m1.glm)))
m2.reps <- do.call(rbind, lapply(X=1:reps, FUN=function(x) kFoldCV(10, m2.glm)))
m3.reps <- do.call(rbind, lapply(X=1:reps, FUN=function(x) kFoldCV(10, m3.glm)))

repeatedPerform <- rbind(m1.reps, m2.reps, m3.reps)
repeatedPerform$mod <- sort(rep(c('m1','m2','m3'), reps*2))

#create data frame with means of repeated CV for each measure and model
#(ddply comes from the plyr package)
require(plyr)
repCVperformance <- ddply(repeatedPerform, .(mod, Measure), summarize, value = mean(value))
#arrange columns in same order as usual
repCVperformance <- repCVperformance[,c(2,3,1)]
print(repCVperformance)
##    Measure     value mod
## 1 ClassAcc 0.8002219  m1
## 2    Kappa 0.5638397  m1
## 3 ClassAcc 0.8372607  m2
## 4    Kappa 0.6515769  m2
## 5 ClassAcc 0.8383829  m3
## 6    Kappa 0.6534456  m3

I placed all the performance measures (from the original in-sample model fits, from the single cross validation, and from the repeated cross validations) into a data frame that we'll use for plotting below.

require(reshape)
inSample <- melt(rbind(m1.performance, m2.performance, m3.performance), variable_name = "Measure")
inSample$mod <- rep(c('m1','m2','m3'), 2)
inSample$Method <- "InSample"

singleCV <- rbind(m1.ACC, m2.ACC, m3.ACC)
singleCV$mod <- sort(rep(c('m1','m2','m3'), 2))
singleCV$Method <- "SingleCV"

repCVperformance$Method <- "RepeatedCV"

# combine all into one df
allMeasures <- rbind(inSample, singleCV, repCVperformance)
allMeasures$mod <- as.factor(allMeasures$mod)
allMeasures$Method <- as.factor(allMeasures$Method)

Let's visualize the variability, across repeated k-fold validations, of both classification performance measures. I'll add vertical lines to each histogram to mark the accuracy measures from the single cross validation run (dashed line), the mean accuracy measures from the repeated cross validation runs (solid line), and the original accuracy measures from the in-sample model fit (red line).

histo = function(measure){
  CVreps <- subset(repeatedPerform, Measure==measure)
  Values <- subset(allMeasures, Measure==measure)

  ggplot(CVreps, aes(x = value)) +
    geom_histogram(colour = "grey", fill = "grey", binwidth = 0.0001) +
    facet_grid(. ~ mod, scales = "free") +
    geom_vline(data=Values[Values$Method=='SingleCV', ], aes(xintercept=value), linetype="dashed") +
    geom_vline(data=Values[Values$Method=='RepeatedCV', ],  aes(xintercept=value)) +
    geom_vline(data=Values[Values$Method=='InSample', ],  aes(xintercept=value), color="red") +
    xlab(measure) +
    theme_bw(base_size = 11)
}

#Histogram of mean observed accuracy after repeated k-fold cross validation
histo("ClassAcc")

[Plot: histograms of observed accuracy across repeated 10-fold CV runs, by model]

#Histogram of mean Kappa after repeated k-fold cross validation
histo("Kappa")

[Plot: histograms of Kappa across repeated 10-fold CV runs, by model]

There is sampling variability in both measures, but more so in Kappa. Judging by the x-axis of the accuracy histograms, the variability there is quite small. Still, the fact that the in-sample, single-CV, and repeated-CV vertical lines don't coincide shows that each method yields a somewhat different estimate. Also notice that for Kappa, the in-sample fit produced inflated estimates relative to the repeated CV estimates… for every model.

The main point of the single and repeated cross validation, however, was to compare different models so we can choose the best model. Here's a simple plot of the performance measures from the candidate models (2 and 3), but on the same scale this time to facilitate cross-model comparison.

betterModels <- subset(allMeasures, mod!="m1")
# reorder (rather than rename) the factor levels so the legend is in a sensible order
betterModels$Method <- factor(betterModels$Method, levels = c("InSample", "SingleCV", "RepeatedCV"))
ggplot(betterModels, aes(x = mod, y=value, group=Method, color=Method)) +
  geom_point() +
  geom_line() +
  facet_grid(Measure ~ ., scales = "free_y") +
  theme_bw(base_size = 11)

[Plot: accuracy and Kappa for models 2 and 3, by estimation method]

Again we see the negligible sampling variability of observed accuracy. Cross validation didn't provide any new information. Kappa varies more, and again we see the inflated estimates from the in-sample model fit. But in this particular case, it looks like every measure points to the same conclusion: model 3 is the best model we have so far. I'll stop this post here then, but not before we see how this model does on the actual Kaggle test data. Time for a submission.

Kaggle Submission

Earlier we prepared the Kaggle test data. So all that's required now is to add a column to 'test' that contains the predicted outcomes according to model 3, remove the other columns (except the index column: PassengerId), write to a .csv file, and drag it onto the Kaggle submission box.

#Predict binary outcomes and attach
test$Survived <- round(predict(m3.glm, test, type = "response"),0)
submission <- data.frame(PassengerId = test$PassengerId, Survived=test$Survived)

#Write to .csv
write.csv(submission, 'model3.csv', row.names = FALSE)

I went to the Kaggle website and dragged my csv file to the submission box. Here's how model #3 fared in predicting the outcomes in the previously unseen test data.

[Image: Kaggle submission result for model 3]

What's the performance metric? The score in the Titanic challenge "is the percentage of passengers you correctly predict." So despite using cross validation during model selection, it looks like my model overfit the training data, as signified by the substantial drop in accuracy on the Kaggle test set. But the eventual winner of a Kaggle competition is judged on her model's performance on a brand new test set that is held out until the end of the competition. So we don't actually know whether this drop from training to test will translate into a drop from training to the real test data.
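For comparison, here's the same metric (proportion of passengers classified correctly) computed on the training data. A minimal sketch, assuming the outcome in the model frame of m3.glm is coded 0/1 as in the confusion matrices above:

# in-sample proportion correct for model 3; expect this to be optimistic
# relative to the Kaggle score
inSampleAcc <- mean(
  ifelse(predict(m3.glm, type = "response") < 0.5, 0, 1) == m3.glm$model[, 1])
inSampleAcc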

In the next post I'll try to improve upon this score by delving into the world of recursive partitioning (classification trees, random forest) and by taking advantage of the versatile caret package.

Power simulation in R: The repeated measures ANOVA

In this post I conduct a simulation analysis in R to estimate statistical power: the probability that a statistical test will reject the null hypothesis when it is false. Rather than simulating the data and obtaining p-values several times in a for loop (the old school way to run a simulation), I'll use the split-apply-combine strategy instead, trying out the new-ish R package dplyr, which is a much faster version of its predecessor plyr.

While the simulation approach (and the code) could be extended to other statistical tests and frameworks, I focus here on an analysis that is widespread in the natural and social sciences: the 2 x 2 repeated measures (or within-subjects) analysis of variance (ANOVA). In the free and excellent G*Power program, analytical power analyses can be computed for repeated-measures ANOVA, where the user inputs the population effect size directly or estimates it from sums of squares. The reason I made this simulation script, however, is that I wanted to estimate power in a repeated measures ANOVA directly from the effect (mu), the variance (sigma), and the population correlation between measures (rho), expressed in whatever units I'm measuring (response time, neural activity, etc.).

Statistical power is a function of sample size (N), the alpha level, and the population effect size. There are several different kinds of power analyses depending on which parameter is allowed to vary and which ones are set by the experimenter. Here I'll conduct a post hoc power analysis, which is not the same as a retrospective power analysis, for the following reason. Let's say you just finished collecting data from a new experiment with an exciting new paradigm. Your approach is so unique that there is no precedent in the literature, and you haven't conducted any pilot studies. A retrospective power analysis in this case would estimate power from your sample size, alpha level, and obtained effect size. The problem is, you have no idea how close the obtained effect is to the true population effect (assuming there is one), because sample effect sizes are noisy and can even be biased estimators of true effects. If, however, a population effect size can be estimated a priori by consulting the literature or previous studies in your lab, then a post hoc power analysis makes sense to do, despite its name, even before you run the study.

Imagine that I'm planning to conduct a semantic priming study, capitalizing on the well known finding that a word is recognized more quickly when it's preceded by a word with a related versus unrelated meaning. Let's assume I'd already conducted a pilot study with new stimuli–where the prime words referred to tasty foods. From this pilot data, and by consulting results of other priming studies, I know that the baseline response time should be about 700 milliseconds, and the priming effect should be a 30 ms reduction in response time. I also estimate the correlation between repeated measures (r = .75), and the standard deviation of the measures (SD = 150 ms). I'll also introduce a second within-subjects variable: time of test. Participants complete the experiment once on a full stomach, and then again one week later after two days of complete calorie deprivation. The prediction is for a standard priming effect (like before) at first test, but that the tasty prime words (e.g., “fries”, “hamburger”) will so distract the famished participants that any effect of priming will be swamped by a distraction effect–they'll actually be slower to respond to the target words. In other words, I predict no main effects but a significant interaction effect (the coveted double dissociation). But before I begin this study, here's what I want to know: how many subjects will I need to run before I have a reasonably good shot at detecting this interaction? Let's find out.
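(As an aside, this is also how the standardized effect size that G*Power asks for could be derived from these same quantities. A minimal sketch for a single paired contrast, the 30 ms priming effect, under the estimates above:)

# SD of the paired difference scores implied by sigma and rho
sigma   <- 150
rho     <- 0.75
sd_diff <- sigma * sqrt(2 * (1 - rho))
# standardized paired effect size (dz) for a 30 ms effect
dz <- 30 / sd_diff
dz  # roughly 0.28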

Define the simulation parameters and build the design matrix

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)

# define the parameters
mu = c(700, 670, 670, 700) # true effects (in this case, a double dissociation)
sigma = 150  # population standard deviation
rho = 0.75 # correlation between repeated measures
nsubs = 25 # how many subjects?
nsims = 5000 # how many simulation replicates?

# create 2 factors representing the 2 independent variables
cond = data.frame(
  X1 = rep(factor(letters[1:2]), nsubs * 2),
  X2 = rep(factor(letters[1:2]), nsubs, each=2))

# create a subjects factor
subject = factor(sort(rep(1:nsubs, 4)))

# combine above into the design matrix
dm = data.frame(subject, cond)

Build Sigma: the population variance-covariance matrix

I have the population standard deviation (sigma) and population correlation (rho) in hand, and I'll assume sphericity of the variance-covariance matrix (as we often do). Therefore the variance-covariance matrix will contain constant covariances on the off-diagonals and constant variances on the diagonals. The variance is just sigma squared, and the covariance can be computed from rho and sigma.

# create k x k matrix populated with sigma
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))

# compute covariance between measures
Sigma <- t(S) * S * rho  

# put the variances on the diagonal 
diag(Sigma) <- sigma^2  

Run the simulation

Now we're ready to initiate the simulation. The empirical power will be the percentage of simulations in which the true effect is detected. I'm using stats::aov to obtain the p-values associated with each of the three effects. Although I usually use ez::ezANOVA for ANOVA analyses in R, a direct call to aov proved about 90% faster in testing.

# stack 'nsims' individual data frames into one large data frame
df = dm[rep(seq_len(nrow(dm)), nsims), ]

# add an index column to track the simulation run
df$simID = sort(rep(seq_len(nsims), nrow(dm)))

# sample the observed data from a multivariate normal distribution
# using MASS::mvrnorm with the parameters mu and Sigma created earlier
# and bind to the existing df

require(MASS)
make.y = expression(as.vector(t(mvrnorm(nsubs, mu, Sigma))))
df$y = as.vector(replicate(nsims, eval(make.y)))             

# use do(), the general purpose complement to the specialized data 
# manipulation functions available in dplyr, to run the ANOVA on
# each section of the grouped data frame created by group_by

require(dplyr)
require(car)
require(broom)

mods <- df %>% 
  group_by(simID) %>% 
    do(model = aov(y ~ X1 * X2 + Error(subject / (X1*X2)), qr=FALSE, data = .)) 

# extract p-values for each effect and store in a data frame
p = data.frame(
  mods %>% do(as.data.frame(tidy(.$model[[3]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[4]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[5]])$p.value[1])))
colnames(p) = c('X1','X2','Interaction')

Calculate and print the empirical power

The empirical power is easy to compute: it's just the percentage of simulation runs where p < .05.

power = apply(as.matrix(p), 2, 
  function(x) round(mean(ifelse(x < .05, 1, 0) * 100),0))

With 25 subjects and 5000 simulation runs, the estimated power for each effect is:
Main effect X1: 5%
Main effect X2: 5%
Interaction: 49%

The Type I error rates for the nonexistent main effects are correct (5%). But the power to detect the interaction is pretty low. Let's look at the distributions of p-values.

Visualize the distributions of p-values

The left plot contains the known effects (mu) with error bars denoting standard errors of the mean. (SEM is not particularly useful for comparing repeated measures, but it'll provide an idea of the variability). The right plot contains three histograms showing the empirical distributions of p-values for the two main effects and the interaction.

# plot the known effects
require(ggplot2)
require(gridExtra)

means = data.frame(cond[1:4, ], mu, SE = sigma / sqrt(nsubs))
plt1 = ggplot(means, aes(y = mu, x = X1, fill=X2)) +
  geom_bar(position = position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin = mu-SE, ymax = mu+SE), 
    position = position_dodge(width=0.9), size=.6, width=.3) +
  coord_cartesian(ylim=c((.7*min(mu)), 1.2*max(mu))) +
  theme_bw()

# melt the data into a ggplot friendly 'long' format
require(reshape2)
plotData <- melt(p, value.name = 'p')

# plot each of the p-value distributions on a log scale
options(scipen = 999) # 'turn off' scientific notation
plt2 = ggplot(plotData, aes(x = p)) +
    scale_x_log10(breaks=c(1, 0.05, 0.001), 
                  labels=c(1, 0.05, 0.001)) +
    geom_histogram(colour = "darkblue", fill = "white") +
    geom_vline(xintercept = 0.05, colour='red') +
    facet_grid(variable ~ .) +
    labs(x = expression(Log[10]~P)) +
    theme(axis.text.x = element_text(color='black', size=7))

# arrange plots side by side and print
grid.arrange(plt1, plt2, nrow=1)

[Plot: known effects with SEM error bars (left); distributions of p-values for each effect on a log scale (right)]

The vertical red lines on the p-value histograms are centered at p = 0.05. If there's an effect in the population, then the power is the percentage of simulation runs in which the null hypothesis was correctly rejected. Of course if there's no effect in the population–as exemplified by the two non-existent main effects in this example–then I expect the “power” to be 5%: the nominal Type I error rate.

Our estimated power here is only 49%, which is far too low, and a common state of affairs in psychological research (Maxwell, 2004). So we clearly need a larger N to achieve 80% power, an arbitrary but nevertheless commonly used lower bound. Let's run the simulation again with twice as many subjects. Here are the results:

[Plot: known effects and p-value distributions with 50 subjects]

This time around, with 50 subjects the estimated power is:
Main effect X1: 5%
Main effect X2: 5%
Interaction: 80%

Doubling the subjects brings us to 80% empirical power, which is about what statisticians recommend as a lower limit when planning a study. In a future study with this sample size we would have a reasonably good chance of rejecting the null hypothesis that the two factors do not interact (of course, assuming we estimated the population parameters accurately). But even with relatively high statistical power we face another lesser known issue.

Sample-to-sample P-value variability

A 2015 Nature Methods article draws attention to a common misunderstanding about the relationship between p-values and statistical power. There can be large variability in the p-value, despite high statistical power:

“Increasing sample size increases statistical power, and thus a progressively greater proportion of P values < 0.05 are obtained. However, we still face substantial variation in the magnitude of the P values returned. Although studies are often planned to have (an estimated) 80% power, when statistical power is indeed 80%, we still obtain a bewildering range of P values (pg. 181).”

Looking at the histogram for the interaction effect above, there really is a “bewildering” range of p-values. And the range of p-values for the underpowered study simulated earlier are even more bewildering!

The power analysis shown in this post can help the experimenter make an a priori decision about the necessary sample size to achieve high statistical power. But unless statistical power is extremely high (which it rarely is, given the myriad practical constraints we face), an identical replication of your study is likely to return a substantially different p-value than the one obtained the first time around. In light of this large sample-to-sample variability, the practice of drawing categorically different conclusions from an effect with p = 0.04 versus another with p = 0.06 seems even more ridiculous than it already did.


Here's the full code for the simulations:

# define the parameters (fill in your own values)
mu = NULL     # vector of population cell means
sigma = NULL  # population standard deviation
rho = NULL    # correlation between repeated measures
nsubs = NULL  # number of subjects
nsims = NULL  # number of simulation replicates

# create 2 factors representing the 2 independent variables
cond = data.frame(
  X1 = rep(factor(letters[1:2]), nsubs * 2),
  X2 = rep(factor(letters[1:2]), nsubs, each=2))

# create a subjects factor
subject = factor(sort(rep(1:nsubs, 4)))

# combine above into the design matrix
dm = data.frame(subject, cond)

# create k x k matrix populated with sigma
sigma.mat <- rep(sigma, 4)
S <- matrix(sigma.mat, ncol=length(sigma.mat), nrow=length(sigma.mat))

# compute covariance between measures
Sigma <- t(S) * S * rho  

# put the variances on the diagonal 
diag(Sigma) <- sigma^2  

# stack 'nsims' individual data frames into one large data frame
df = dm[rep(seq_len(nrow(dm)), nsims), ]

# add an index column to track the simulation run
df$simID = sort(rep(seq_len(nsims), nrow(dm)))

# sample the observed data from a multivariate normal distribution
# using MASS::mvrnorm with the parameters mu and Sigma created earlier
# and bind to the existing df

require(MASS)
make.y = expression(as.vector(t(mvrnorm(nsubs, mu, Sigma))))
df$y = as.vector(replicate(nsims, eval(make.y)))             

# use do(), the general purpose complement to the specialized data 
# manipulation functions available in dplyr, to run the ANOVA on
# each section of the grouped data frame created by group_by

require(dplyr)
require(car)
require(broom)

mods <- df %>% 
  group_by(simID) %>% 
    do(model = aov(y ~ X1 * X2 + Error(subject / (X1*X2)), qr=FALSE, data = .)) 

# extract p-values for each effect and store in a data frame
p = data.frame(
  mods %>% do(as.data.frame(tidy(.$model[[3]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[4]])$p.value[1])),
  mods %>% do(as.data.frame(tidy(.$model[[5]])$p.value[1])))
colnames(p) = c('X1','X2','Interaction')

power = apply(as.matrix(p), 2, 
  function(x) round(mean(ifelse(x < .05, 1, 0) * 100),0))

# plot the known effects
require(ggplot2)
require(gridExtra)

means = data.frame(cond[1:4, ], mu, SE = sigma / sqrt(nsubs))
plt1 = ggplot(means, aes(y = mu, x = X1, fill=X2)) +
  geom_bar(position = position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin = mu-SE, ymax = mu+SE), 
    position = position_dodge(width=0.9), size=.6, width=.3) +
  coord_cartesian(ylim=c((.7*min(mu)), 1.2*max(mu))) +
  theme_bw()

# melt the data into a ggplot friendly 'long' format
require(reshape2)
plotData <- melt(p, value.name = 'p')

# plot each of the p-value distributions on a log scale
options(scipen = 999) # 'turn off' scientific notation
plt2 = ggplot(plotData, aes(x = p)) +
    scale_x_log10(breaks=c(1, 0.05, 0.001), 
                  labels=c(1, 0.05, 0.001)) +
    geom_histogram(colour = "darkblue", fill = "white") +
    geom_vline(xintercept = 0.05, colour='red') +
    facet_grid(variable ~ .) +
    labs(x = expression(Log[10]~P)) +
    theme(axis.text.x = element_text(color='black', size=7))

# arrange plots side by side and print
grid.arrange(plt1, plt2, nrow=1)

Text files to R to MySQL

Suppose you have several text files that each contain some data about a common set of people, words, countries, companies, or any other sampling unit. This post covers a two-step process to first merge these files into a single R data frame, and then write it to a MySQL database. It would be easy enough to write the resulting data frame to disk as a text file or .RData file. But there are several reasons why one might want to store a dataset in a relational database instead. These reasons include security, performance, ease of collaboration, and memory efficiency.

Read text files into R

Several text files (.csv format) exist in my local directory. Across files there is only one shared column name ('Concept'), which contains a set of nouns. Each file contains a different set of these nouns. Each file also has a unique set of named columns corresponding to unique psycholinguistic variables of interest. I want to create a data frame that consists of only those nouns that are shared across all files, but that contains all variables of interest across all files.

# Get the names of all .csv files in the localdir
f = list.files(pattern='\\.csv', full.names=T)
if(length(f) < 1) stop('No files to read')

# Read each file into a data frame, and place them into a list
df.list = lapply(f, function(x) {read.csv(file=x, header=T)})

Merge data frames

Use the reshape package to merge the data frames recursively on the basis of a single shared identifying column: the column, common to all data frames, that contains the sampling unit ('Concept').

library(reshape)
full_df = merge_recurse(df.list)

# Obviously there are several scenarios when you wouldn't want to do this, 
#  but I will remove all non-complete rows (rows with missing data)
df = full_df[complete.cases(full_df), ] 

# Get a snapshot of the contents of the new data frame (sanity check) 
head(df)
##      Concept Familiarity Pain Smell Color Taste Sound Grasp Motion Length
## 5     anchor        6.92 4.36  2.46  2.58  1.54  2.91  2.00   2.19      6
## 7  apartment        7.64 2.32  3.52  3.67  2.16  2.95  1.38   1.32      9
## 9      apron        7.33 1.46  1.90  5.00  1.50  1.16  6.52   1.52      5
## 14       axe        6.90 7.39  2.10  2.91  1.55  3.68  6.62   3.63      3
## 15       bag        7.45 2.75  1.86  3.87  1.63  2.48  7.04   2.38      3
## 16   bagpipe        6.39 2.16  1.77  5.04  1.57  6.50  3.48   2.16      7
##    SUBTLWF LDT_RT SD_RT domain Num_Feats_No_Tax
## 5     7.41    606   744    NLT               13
## 7    83.04    612   687    NLT               17
## 9     2.67    661   700    NLT               13
## 14    4.88    567   668    NLT               14
## 15   94.04    489   667    NLT               12
## 16    0.29    765   792    NLT                8

Read the new data frame into a MySQL database table

First, I created a free account at Amazon Relational Database Service (RDS).

Then I used the free OS X data management program Sequel Pro to access my new RDS instance. Within Sequel Pro I created a new MySQL database that I named 'DB1'. Next I'll populate a new table in DB1 with the data from the R data frame created above.

library(DBI)
library(RMySQL)

# Any potentially sensitive information (e.g., passwords, usernames) is
# stored in a local text file to prevent me having to type them in the code.
priv = read.delim('private.txt', header=TRUE, stringsAsFactors = FALSE)

# Connect to the MySQL database via a DBI (database interface)  
con <- dbConnect(RMySQL::MySQL(),
    dbname = 'DB1',
    username = priv$dbUser,
    password = priv$dbPass,
    host = priv$dbHost
)

# If a table with the same name exists in this db, remove it
if (dbExistsTable(con, 'word_ratings')){
  dbRemoveTable(con, 'word_ratings')
}
## [1] TRUE
# Write the R data frame to a new table called 'word_ratings'
dbWriteTable(con, 'word_ratings', value = df)
## [1] TRUE
# Error checking
if (!dbExistsTable(con, 'word_ratings')){
  stop('dbWriteTable failed to create a table')
}

Query the new database

Done. Let's query the new database to double check the contents. For example, select short words that refer to very familiar objects (where this familiarity measure ranges from 1 to 8).

dbGetQuery(con, 'SELECT * FROM word_ratings WHERE Familiarity > 7.6 AND Length < 5')
##    row_names Concept Familiarity Pain Smell Color Taste Sound Grasp Motion
## 1         17    ball        7.68 4.48  1.96  4.96  1.68  3.50  6.30   6.21
## 2         54    book        7.78 2.67  3.21  3.65  1.64  2.21  7.26   2.50
## 3         60    bowl        7.62 2.58  1.90  3.92  2.38  1.59  7.04   1.38
## 4         61     box        7.66 2.38  1.70  2.60  1.42  1.74  5.36   2.30
## 5         80     bus        7.71 5.35  3.86  5.96  1.79  6.00  1.32   6.64
## 6         99     car        7.64 6.35  3.59  6.14  1.21  6.35  1.44   6.59
## 7        281    lime        7.62 2.88  6.04  6.76  5.80  1.46  7.42   1.65
## 8        295    menu        7.69 2.09  1.96  3.50  1.91  1.58  6.78   1.41
## 9        345     pen        7.68 3.05  1.88  3.90  1.48  2.16  7.62   4.82
## 10       354     pie        7.61 1.82  6.43  4.64  7.50  1.88  6.05   1.63
## 11       398    rock        7.65 6.00  1.70  1.80  1.48  3.17  6.35   1.80
## 12       437    sink        7.70 2.13  2.64  2.91  1.24  3.74  1.75   1.65
##    Length SUBTLWF LDT_RT SD_RT domain Num_Feats_No_Tax
## 1       4  104.96    501   656    NLT                9
## 2       4  176.98    512   621    NLT               16
## 3       4   21.45    511   786    NLT               14
## 4       3   89.75    528   625    NLT               13
## 5       3   74.18    519   611    NLT               16
## 6       3  483.06    472   636    NLT               15
## 7       4    3.29    535   785     LT               12
## 8       4    9.96    551   799    NLT                8
## 9       3   24.73    511   583    NLT               15
## 10      3   28.75    509   667    NLT               10
## 11      4   86.16    530   630    NLT                9
## 12      4   16.92    568   719    NLT               16
# Don't forget to close the connection
dbDisconnect(con)
## [1] TRUE

In an upcoming post I'll read this data back into R to perform some statistical analyses and create some data visualizations.
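As a quick preview, pulling the table back into R in a later session takes only a few lines. A minimal sketch, assuming the same connection details stored in 'private.txt' above:

# reconnect and read the whole table back into a data frame
con <- dbConnect(RMySQL::MySQL(),
    dbname = 'DB1',
    username = priv$dbUser,
    password = priv$dbPass,
    host = priv$dbHost
)
word_ratings <- dbReadTable(con, 'word_ratings')
dbDisconnect(con)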

See these posts for more information about interfacing R and MySQL databases:

WordPress posts with R Markdown

It seems appropriate for my first post to detail how this very post made it here. Most of the posts I'm planning will have at least some connection to data, and those will often rely on R code. So in the spirit of reproducibility, that code will be included in the posts themselves as embedded code chunks. Such reproducibility is perhaps most easily accomplished with dynamic documents, specifically the R flavor of Markdown. Below is a set of code chunks that, taken together, duplicate the contents of the R Markdown document I used to publish this post to WordPress.com directly from the R environment.

Each code chunk begins with three back ticks and chunk options that (in this case) display the code (echo=TRUE) but do not evaluate it (eval=FALSE); a sketch of the full chunk syntax appears after the code below. The first step is to install the RWordPress package from source and then load it.

install.packages('RWordPress', repos = 'http://www.omegahat.org/R', type = 'source')
library(RWordPress)
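For reference, here's a sketch of what that chunk looks like in the .Rmd source, with the delimiters and options described above (the chunk label is arbitrary):

```{r install-rwordpress, echo=TRUE, eval=FALSE}
install.packages('RWordPress', repos = 'http://www.omegahat.org/R', type = 'source')
library(RWordPress)
```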

Next, set some global options to give RWordPress your credentials and blog URL. Obviously I changed what you see below in my own version. I also added the table of contents ('toc') option to the default options of the HTML renderer in the markdown package. I won't use a toc in this post, but plan to in later posts.

options(WordpressLogin = c(your_username = 'your_password'),
        WordpressURL = 'http://your_wordpress_blog.com/xmlrpc.php')

# include a table of contents
library(markdown)
options(markdown.HTML.options = c(markdownHTMLOptions(default = T),'toc'))

Now I'm ready to call the knit2wp function: a knitr function that wraps the RWordPress package, compiling an R Markdown document to HTML and posting it to WordPress.

library(knitr)

# Usage:
# knit2wp(input, title = 'A post from knitr', ..., shortcode = FALSE,
# action = c('newPost', 'editPost', 'newPage'), postid,
# encoding = getOption('encoding'), publish = TRUE)

knit2wp('R2MySQL.Rmd',
  title = 'test post 1',
  shortcode = TRUE,
  action = 'newPost',
  mt_keywords = c('R', 'knitr', 'knit2wp', 'RWordPress', 'WordPress', 'blog'),
  publish = TRUE)

I plan to keep each unique call to knit2wp (new posts or edits to existing posts) in sequential order in my aforementioned R document. This way I have a record of the parameters I used in each subsequent post.
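For example, a later edit to this same post might look like the following sketch; the postid value here is a placeholder for the id WordPress assigns when the post is first created:

# hypothetical follow-up call that updates an existing post
knit2wp('R2MySQL.Rmd',
  title = 'test post 1',
  shortcode = TRUE,
  action = 'editPost',
  postid = 123,  # placeholder: use the id of the post being edited
  publish = TRUE)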

For more information about using RMarkdown to create WordPress posts see: