The Kaggle Titanic competition is a great way to learn more about predictive modeling, and to try out new methods. So what is Kaggle? It is “the world's largest community of data scientists. They compete with each other to solve complex data science problems, and the top competitors are invited to work on the most interesting and sensitive business problems from some of the worldâ€™s biggest companies…” Kaggle provides a few “Getting Started” competitions with highly structured data, including this one. The goal here is to predict who survived the Titanic disaster and who did not based on available information.

In this post and a subsequent post I'll share my journey through data preprocessing, feature engineering, data visualization, feature selection, cross-validation, model selection, and a few submissions to the competition. Along the way I relied many times for ideas and sometimes code, on a few excellent existing R tutorials for the Titanic prediction challenge, including this one and this one.

# Data Preprocessing

Before any modeling or predictive analysis, we need the data in a suitable format. Preprocessing includes assessing and potentially changing variable types, dealing with missing data, and generally getting to know your data. Whenever possible I try to write generic functions that take a dataframe argument, because eventually whatever preprocessing I do on the training data, I'll eventually do on the testing data.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE) #Read training data that I downloaded from Kaggle website #the 'NA' and blank arguments to 'na.strings' ensures proper treatment of missing data train <- read.csv('train.csv', header=TRUE, na.strings = c("NA","")) #function to change certain variable types changeVars = function(df){ #change ordinal variables coded as numeric to factors df$Pclass <- as.factor(df$Pclass) #change text strings (coded as factors) to character type df$Ticket <- as.character(df$Ticket) df$Cabin <- as.character(df$Cabin) df$Name <- as.character(df$Name) #return new data frame return(df) } #Update training dataset train <- changeVars(train)

Most datasets have missing data. We'll need to identify any missing data, and then decide how to deal with them. Let's check for missing values with the Amelia package 'missmap' function.

require(Amelia) missmap(train, col=c("green", "black"), legend=F)

#Cabin is mostly missing, and Age contains 177 missing values summary(train$Age) ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.42 20.12 28.00 29.70 38.00 80.00 177

Let's impute the missing age values by looking at the honorific for each passenger, which are embedded between a comma and period in the Name column. Write a function that returns the original df with a new column of honorifics attached, as well.

require(plyr) Honorifics = function(df) { #first remove characters following period lastnames <- sub('\\..*', '', df$Name) #then remove characters preceding the comma honorific <- sub('.*\\,', '', lastnames) #finally return char. vector w/o leading or trailing whitespace #and attach back to data. This might actually be a useful feature later.s df$honorific <- as.factor(gsub("^\\s+|\\s+$", "", honorific)) #get the median age for each group tmp <- ddply(df, .(honorific), summarize, median.hon.age = round(median(Age, na.rm = T),0)) #if there are sole honorifics, and that person has no age value, #we're out of luck so let's just assume the average age of the #age column for that dataset. tmp$median.hon.age[which(is.na(tmp$median.hon.age))] <- round(mean(df$Age, na.rm = TRUE), 0) #merge data frames tmp2 <- merge(df, tmp) #replace NAs with median 'honorific' age tmp2$Age <- ifelse(is.na(tmp2$Age), tmp2$median.hon.age, tmp2$Age) #return new dataframe return(tmp2) } #update training data and inspect results of new Age variable train <- Honorifics(train) summary(train$Age) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.42 21.00 30.00 29.39 35.00 80.00

The 'Embarked' variable also has a few missing entries, but since there are so few let's just replace them with the most common embarked location: 'S'.

aggregate(Survived ~ Embarked, data=train, FUN=sum) ## Embarked Survived ## 1 C 93 ## 2 Q 30 ## 3 S 217 aggregate(Survived ~ Embarked, data=train, FUN=mean) ## Embarked Survived ## 1 C 0.5535714 ## 2 Q 0.3896104 ## 3 S 0.3369565 summary(train$Embarked) ## C Q S NA's ## 168 77 644 2 ImputeEmbark = function(df) { df$Embarked[which(is.na(df$Embarked))] <- 'S' return(df) } train <- ImputeEmbark(train)

Fare also needs a bit of work due to fares of $0.00. I'll replace $0.00 with the median fare from that class.

summary(train$Fare) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 7.91 14.45 32.20 31.00 512.30 ImputeFare = function(df){ tmp <- ddply(df, .(Pclass), summarize, median.class.fare = round(median(Fare, na.rm = T),0)) #attach to data and return new df tmp2 <- merge(df, tmp) tmp2$Fare <- ifelse(df$Age==0, df$median.class.fare, df$Age) return(tmp2) } #update training data and inspect results train <- ImputeFare(train) summary(train$Fare) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.42 21.00 30.00 29.39 35.00 80.00

## A touch of feature engineering

Now let's take a closer look at these honorifics, because this might be a useful feature. In other words there is potential here for feature engineering: creating new features/variables/predictors from the raw data. This is a crucial aspect of predictive modeling, and data analysis generally. No matter how fancy your model, garbage in, garbage out. Here I'll create a new feature containing honorific classes.

boxplot(train$Age ~ train$honorific)

sort(table(train$honorific)) ## ## Capt Don Jonkheer Lady Mme ## 1 1 1 1 1 ## Ms Sir the Countess Col Major ## 1 1 1 2 2 ## Mlle Rev Dr Master Mrs ## 2 6 7 40 125 ## Miss Mr ## 182 517 #We don't want a feature with single instances so let's subsume the single #honorifics into sensible existing categories with a function HonorificFeatureBuild = function(df){ #create a new column instead of replacing the #existing 'honorific' column (for sake of completeness) df$honorificType <- df$honorific #Function to replace old honorific(s) with new one replace.string <- function(df, original, replacement) { for (x in original) { levels(df$honorificType)[levels(df$honorificType)==x] <- replacement } return(df$honorificType) } #Combine 'Ms' and 'Dona', which I noticed occurs in the test set, with 'Mrs' df$honorificType <- replace.string(df, c('Dona', 'Ms'), 'Mrs') #Combine Madam, Mademoiselle, to equivalent: Miss df$honorificType <- replace.string(df, c('Mme','Mlle'), 'Miss') #Combine several titles of upper class people into 'Noble' df$honorificType <- replace.string(df, c('Sir','Capt','Rev','Dr','Col','Don', 'Major','the Countess', 'Jonkheer', 'Lady'), 'Noble') #Remove dropped levels df$honorificType <- droplevels(df$honorificType) #Return new data frame return(df) } #Return new dataframe and inspect new variable train <- HonorificFeatureBuild(train) sort(table(train$honorificType)) ## ## Noble Master Mrs Miss Mr ## 23 40 126 185 517 boxplot(train$Age ~ train$honorificType)

Mosaic plots are a good way to visualize categorical data, with the height of the bars representing respective proportions and the width representing the size of the group/factor.

require(vcd) mosaicplot(train$honorificType ~ train$Survived, shade=F, color=T)

It looks like this new feature might be useful; we'll see soon enough. But first a little more work in the preprocessing stage. Let's look at the 'Cabin' variable and see about creating another feature there.

CabinFeatures = function(df){ df$Cabin[which(is.na(df$Cabin))] <- 'Unknown' df$CabinAlpha <- gsub('[[:digit:]]+', "", df$Cabin) #some have multiple entries; let's just keep the first one df$CabinAlphaUno <- sub("^(\\w{1}).*$", "\\1", df$CabinAlpha) #cabins B, D, E look good and U looks pretty bad #but do these provide any more information than Pclass? table(df$Pclass, df$CabinAlphaUno) #yes, so let's make GoodCabin and BadCabin features df$GoodCabin <- as.factor( ifelse((df$CabinAlphaUno=='B' | df$CabinAlphaUno=='D' | df$CabinAlphaUno=='E'), 'yes','no')) df$BadCabin <- as.factor(ifelse(df$CabinAlphaUno=='U', 'yes', 'no')) #return new data frame return(df) } #Update df with new Cabin information train <- CabinFeatures(train)

Finally, there are a few more sensible features to add, and preprocessing steps, before we're ready to build models. (Of course we might end up back at this stage later on, for example if we think of a great new feature to extract from the raw training data.)

AddFinalFeatures <- function(df) { #Consolidate #siblings (SibSp) and #parents/children (Parch) #into new family size (FamSize) feature df$FamSize <- df$SibSp + df$Parch #Adjust fare by size of family (per person) df$Fare.pp <- df$Fare / (df$FamSize + 1) #'wcFirst' is for the "women and children first" policy for lifeboats #(https://en.wikipedia.org/wiki/Women_and_children_first). #I'll assume that teenage boys were treated as adults df$wcFirst <- 'no' df$wcFirst[which(df$Sex == "female" | df$Age < 13)] <- 'yes' df$wcFirst <- as.factor(df$wcFirst) return(df) } #Update the df train <- AddFinalFeatures(train)

Last, let's remove any unneccessary columns which will make it easier to define models that include all variables in the data frame.

CleanDF = function(df){ delete <- c("honorific", "Name", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "median.hon.age", "median.class.fare", "CabinAlpha", "CabinAlphaUno") return(df[, !(names(df) %in% delete)]) } #Here is our preprocessed training data train <- CleanDF(train) #Write to disk for later use write.csv(train, "curated_train_data.csv", row.names = F)

## Prepare the test data

Now we're ready to fit models to the training data. Eventually though to submit predictions to Kaggle, we'll need the test data in the same format as our new training data. So here I read the test data and then call all the preprocessing functions (as usual, generic functions are the way to go to avoid repetition).

test <- read.csv('test.csv', header=TRUE, na.strings = c("NA","")) test <- changeVars(test) test <- Honorifics(test) test <- ImputeEmbark(test) test <- ImputeFare(test) test <- HonorificFeatureBuild(test) test <- CabinFeatures(test) test <- AddFinalFeatures(test) test <- CleanDF(test) #Write to disk for later use write.csv(test, "curated_test_data.csv", row.names = F)

Before we continue let's make sure there were no issues with the outputs of any of these functions. When I eventually submit to Kaggle, I need to fit my favored model to the test set, which means the test set can't have any variables, or levels of variables, that it hasn't seen in the training data. There should not be any missing data either.

missmap(test, col=c("green", "black"), legend=F)

Looks good. Now we're ready to build the first model.

# Model Building

What's the goal of predictive modeling? To settle on a statistical model of the outcome variable as some function of our available features. In this case the outcome variable is binary and categorical: whether a person survived the disaster or not. This is a binary classification problem, and there are several ways to approach it.

## Model 1: Logistic regression (simple model)

A common approach to binary classification is the logistic regression model. Logistic regression is not a binary classification method per se, because it estimates the probability of a binary response–ranging continuously from 0 to 1. But by simply rounding the estimated probability to the closest integer (0 or 1) we can use logistic regression as a binary classifier. To start, let's fit a logistic regression model with 3 features I think will be important.

m1.glm <- glm(Survived ~ FamSize + Sex + Pclass, data=train, family='binomial') summary(m1.glm) ## ## Call: ## glm(formula = Survived ~ FamSize + Sex + Pclass, family = "binomial", ## data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.2738 -0.7282 -0.4709 0.5903 2.4869 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 2.50663 0.23843 10.513 < 2e-16 *** ## FamSize -0.15038 0.05983 -2.514 0.011953 * ## Sexmale -2.77670 0.19498 -14.241 < 2e-16 *** ## Pclass2 -0.84771 0.24661 -3.437 0.000587 *** ## Pclass3 -1.87347 0.21522 -8.705 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1186.66 on 890 degrees of freedom ## Residual deviance: 820.12 on 886 degrees of freedom ## AIC: 830.12 ## ## Number of Fisher Scoring iterations: 4

Consistent with my hunch about these features, they're all statistically significant. But to get a better sense of the model and its predictions, let's inspect the predicted/fitted values at all levels of the model's features. We apply the `predict`

function to the model along with the chosen levels of the predictor variables in 'newdata'. The default for 'type' is on the scale of the linear predictors (log-odds: probabilities on logit scale). Here we'll choose the alternative option “response” (the scale of the response variable), which gives the predicted probabilities–a more transparent unit to interpret (at least for me).

newdata <- data.frame( FamSize=rep(0:7, 6), Sex=rep(sort(rep(c('male','female'),8)),6), Pclass=as.factor(sort(rep(c('1','2','3'), 16)))) #Add predicted probs newdata$PredProb <- predict(m1.glm, newdata, type = "response") #Plot them at all feature levels require(ggplot2) ggplot(newdata, aes(x=FamSize, y=PredProb, color=Pclass)) + geom_line(aes(group=Pclass)) + geom_point() + geom_hline(yintercept = .5) + facet_grid(. ~ Sex)

The model predicts survivors to be female, in smaller families, and in higher classes. It captures trends in the data, but it wrongly predicts no male survivors when we know that over 100 males did survive. So how well does this model predict the actual outcomes?

First we create a confusion matrix, from which we can compute different measures of classification performance. There are several ways to assess classification performance. Two commonly used methods are “Classification (Observed) Accuracy” and Cohen's Kappa (originally conceived for inter-rater agreement).

Accuracy is just how often the classifier is correct. Kappa measures how well the classifier performed in comparison to how well it would have performed by chance. A model with a high Kappa reflects a large discrepancy between the accuracy (how often does classifier correctly predict outcome) and the null error rate (how often model would be wrong if it always predicted the majority class).

#Attach the model's predicted outcomes to the existing data frame train$predictedSurvival <- round(m1.glm$fitted.values) #First we make the confusion matrix containing the proportions of 'yes' and 'no, #where actual outcomes are rows and predicted outcomes are columns. m1.cm <- as.matrix(table(train$Survived, train$predictedSurvival)) #Function that returns our accuracy measures #it takes a confusion matrix as its argument ClassPerform = function(cm){ #Classification (observed) accuracy #INn signal detection terminology: #[h=hits; cr=correct rejections; fa=false alarms; m=misses] #(h + cr) / (h + cr + fa + m) #Or ... #TruePositive + TrueNegative / Total obs.acc <- (cm[1,1] + cm[2,2]) / sum(cm) #k = (observed accuracy - random accuracy) / (1 - random accuracy) #we already have observed accuracy, but we need random accuracy, #the probability of agreement expected by chance from the observed data #and the model's predictions ObservedNegative <- (cm[1,1] + cm[1,2]) / sum(cm) PredictedNegative <- (cm[1,1] + cm[2,1]) / sum(cm) ObservedPositive <- (cm[2,1] + cm[2,2]) / sum(cm) PredictedPositive <- (cm[1,2] + cm[2,2]) / sum(cm) RandomACC <- (ObservedNegative*PredictedNegative) + (ObservedPositive*PredictedPositive) kappa <- (obs.acc - RandomACC) / (1 - RandomACC) #Return both measures return(data.frame(ClassAcc = obs.acc, Kappa = kappa)) }

So how well does model 1 predict the actual outcomes in the training data?

m1.performance <- ClassPerform(m1.cm) print(m1.performance) ## ClassAcc Kappa ## 1 0.8002245 0.566665

Not too bad. Kappa ranges between -1 and 1, where scores of zero or lower signify no agreement (random relationship) between classifier and reality. Interpretation depends on the field and problem type, but one offered rule of thumb is > 0.75 excellent, 0.40-0.75 fair to good, < 0.40 poor. Let's keep going, we can do better.

## Model 2: Logistic regression (saturated model)

Let's use the same type of model but assume that all our features are good ones that should be included in the model.

#remove the predicted outcomes from before train$predictedSurvival <- NULL #Fit a logistic regression model with all the features (excluding passenger ID) m2.glm <- glm(Survived ~ . - PassengerId, data=train, family='binomial') summary(m2.glm) ## ## Call: ## glm(formula = Survived ~ . - PassengerId, family = "binomial", ## data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.4023 -0.5598 -0.3756 0.5518 2.5263 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 2.906e+01 1.014e+03 0.029 0.97713 ## Pclass2 -5.774e-01 4.103e-01 -1.407 0.15937 ## Pclass3 -1.678e+00 4.136e-01 -4.057 4.97e-05 *** ## Sexmale -2.790e+01 1.014e+03 -0.028 0.97804 ## Age -3.198e-02 9.882e-03 -3.236 0.00121 ** ## EmbarkedQ -2.172e-01 3.946e-01 -0.550 0.58203 ## EmbarkedS -5.316e-01 2.493e-01 -2.132 0.03300 * ## honorificTypeMaster 1.622e+01 8.827e+02 0.018 0.98534 ## honorificTypeMiss -1.201e+01 4.983e+02 -0.024 0.98077 ## honorificTypeMr -1.320e-01 6.015e-01 -0.219 0.82626 ## honorificTypeMrs -1.107e+01 4.983e+02 -0.022 0.98228 ## GoodCabinyes 8.652e-01 3.850e-01 2.247 0.02463 * ## BadCabinyes -4.789e-01 3.943e-01 -1.215 0.22454 ## FamSize -4.299e-01 9.330e-02 -4.608 4.06e-06 *** ## Fare.pp 1.364e-03 9.228e-03 0.148 0.88249 ## wcFirstyes -1.314e+01 8.827e+02 -0.015 0.98813 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1186.66 on 890 degrees of freedom ## Residual deviance: 715.58 on 875 degrees of freedom ## AIC: 747.58 ## ## Number of Fisher Scoring iterations: 13

The AIC model fit is substantially lower than the simpler model, despite this model having several more terms (AIC penalizes more complexity/parameters). How well does this model predict the outcomes in the training data?

train$predictedSurvival <- round(m2.glm$fitted.values) m2.cm <- as.matrix(table(train$Survived, train$predictedSurvival)) m2.performance <- ClassPerform(m2.cm) print(m2.performance) ## ClassAcc Kappa ## 1 0.8372615 0.654238

Better than before. But inspection of the Coefficients suggests there may be some uninformative features. One way to get a sense of which ones these might be is to sequentially compare more and more complex models with likelihood ratio tests. This is easy to do with the 'anova' function. But note that the terms are added in the same order as specified in the model, so I need to refit the model placing the terms in what I think is the order of importance based on my domain knowledge and exploratory analyses I've conducted so far.

m2.glm.ordered <- glm(Survived~Sex+Pclass+Age+FamSize+honorificType+ GoodCabin+BadCabin+Embarked+Fare.pp+wcFirst, data=train, family="binomial") #Analysis of Deviance deviance.table <- anova(m2.glm.ordered, test="Chisq") print(deviance.table) ## Analysis of Deviance Table ## ## Model: binomial, link: logit ## ## Response: Survived ## ## Terms added sequentially (first to last) ## ## ## Df Deviance Resid. Df Resid. Dev Pr(>Chi) ## NULL 890 1186.66 ## Sex 1 268.851 889 917.80 < 2.2e-16 *** ## Pclass 2 90.916 887 826.89 < 2.2e-16 *** ## Age 1 24.176 886 802.71 8.791e-07 *** ## FamSize 1 14.113 885 788.60 0.0001721 *** ## honorificType 4 56.882 881 731.72 1.310e-11 *** ## GoodCabin 1 9.643 880 722.07 0.0019008 ** ## BadCabin 1 1.290 879 720.78 0.2559915 ## Embarked 2 4.758 877 716.02 0.0926273 . ## Fare.pp 1 0.017 876 716.01 0.8968401 ## wcFirst 1 0.426 875 715.58 0.5139383 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #Visualize the reduction in the residual deviance plot(deviance.table$`Resid. Dev`, type="o")

From the deviances and significance levels of each feature, it looks like we're justified in dropping 'BadCabin', 'Fare.pp', and 'wcFirst'.

## Model 3: Logistic regression (middle ground model)

Let's fit a simpler 3rd model without these features, with the hope that it represents a middle ground between complexity and predictive power.

m3.glm <- glm(Survived~Sex+Pclass+Age+FamSize+honorificType+GoodCabin+Embarked, data=train, family="binomial") summary(m3.glm) ## ## Call: ## glm(formula = Survived ~ Sex + Pclass + Age + FamSize + honorificType + ## GoodCabin + Embarked, family = "binomial", data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.3017 -0.5664 -0.3772 0.5486 2.5202 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 15.837211 492.786269 0.032 0.97436 ## Sexmale -14.886862 492.786370 -0.030 0.97590 ## Pclass2 -0.845554 0.324104 -2.609 0.00908 ** ## Pclass3 -1.957946 0.306748 -6.383 1.74e-10 *** ## Age -0.031807 0.009785 -3.251 0.00115 ** ## FamSize -0.432291 0.080284 -5.385 7.26e-08 *** ## honorificTypeMaster 3.124046 0.797886 3.915 9.03e-05 *** ## honorificTypeMiss -12.096458 492.786161 -0.025 0.98042 ## honorificTypeMr -0.104168 0.586068 -0.178 0.85893 ## honorificTypeMrs -11.162869 492.786169 -0.023 0.98193 ## GoodCabinyes 1.069362 0.346830 3.083 0.00205 ** ## EmbarkedQ -0.202543 0.395174 -0.513 0.60827 ## EmbarkedS -0.509562 0.248663 -2.049 0.04044 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1186.66 on 890 degrees of freedom ## Residual deviance: 717.53 on 878 degrees of freedom ## AIC: 743.53 ## ## Number of Fisher Scoring iterations: 13

As I suspected, the AIC is lower because of the penalization factor. How well does this 3rd model predict the outcomes in the training data?

train$predictedSurvival <- round(m3.glm$fitted.values) m3.cm <- as.matrix(table(train$Survived, train$predictedSurvival)) m3.performance <- ClassPerform(m3.cm) print(m3.performance) ## ClassAcc Kappa ## 1 0.8383838 0.656051

Very close to the saturated model, but a touch more accurate. But given the reduced complexity I favor this model substantially over the saturated model.

# Out-of-sample Prediction

So far we've assessed each model's explanatory power by inspecting in-sample model fit. But ultimately we're after out-of-sample prediction! We need to settle on a good model that can best predict something it has never seen before… hence the word 'predict'. Let's use cross-validation to estimate how well each model performs on data it was not trained on, and therefore shouldn't overfit. I think that the AIC values we saw earlier should give us a hint about out of sample prediction performance, because minimizing the AIC is (asymptotically) equivalent to minimizing at least one type of cross validation value (Stone 1977). Although there are fast and efficient ways to implement what I do below (which I'll rely on later), it helped me understand the process by coding it (mostly) from scratch.

## k-fold cross validation

I'll create a function that performs k-fold CV and returns both performance measures we saw earlier: accuracy and kappa. The function takes three arguments: a data frame, the number of folds, and a model object. The basic idea is to split the data into k chunks, and use each chunk once as the validation data. The model trained on the remaining k-1 chunks tries to predict the outcomes in the validation set, and the resulting performance is aggregated across the k chunks.

set.seed(99) # make the results repeatable from here on out kFoldCV <- function(k, mod){ # get the model data frame df <- mod$model #Randomly partition data into k folds #Use of sample here means we can repeat this function and get a random #set of folds each time the function is called folds <- cut(sample(seq_len(nrow(df))), breaks=k, labels=F) accuracy.mat <- matrix(nrow = k, ncol=2) for (i in 1:k) { #Create training set from the k-1 subsamples trainFold <- subset(df, folds != i) #Create the test set validationFold <- subset(df, folds == i) #Use the fitted model on the test set predValues <- predict(mod, newdata = validationFold, type="response") #Classify predicted values into binary outcome classifier <- ifelse(predValues < 0.5, 0, 1) #Make confusion matrix # (get the name of the outcome variable from the model object) y <- validationFold[,names(mod$model)[1]] confusion.mat <- as.matrix(table(y, classifier)) #Store classification accuracy measures for that run using #'ClassPerform' function created earlier accuracy.mat[i, ] <- t(ClassPerform(confusion.mat)) } #Return both measures in long format return(data.frame(Measure=c('ClassAcc','Kappa'), value=c(mean(accuracy.mat[,1]), mean(accuracy.mat[,2])))) }

Now let's compute 10-fold CV for each model and inspect the results.

m1.ACC <- kFoldCV(k = 10, m1.glm) m2.ACC <- kFoldCV(k = 10, m2.glm) m3.ACC <- kFoldCV(k = 10, m3.glm) print(m1.ACC) ## Measure value ## 1 ClassAcc 0.8002747 ## 2 Kappa 0.5673504 print(m2.ACC) ## Measure value ## 1 ClassAcc 0.8372534 ## 2 Kappa 0.6537724 print(m3.ACC) ## Measure value ## 1 ClassAcc 0.838377 ## 2 Kappa 0.653931

It looks like model 3 has the slight advantage in both cross validated measures. Ok now before continuing, I was curious about the relationship between the number of folds and predictive accuracy. So I ran several CVs at fold sizes from 2 to 60 for model 3, and plotted the relationships.

numFolds <- data.frame(matrix(nrow=59, ncol=2)) for (f in 2:60){ numFolds[f-1, ] <- kFoldCV(f, m3.glm)[, 2] }

Does number of folds influence observed accuracy? Doesn't look like it, but the variance of observed accuracy seems to increase with k.

x <- (2:60) y1 <- numFolds[,1] y2 <- numFolds[,2] plot(x, y1, xlab='No. folds', ylab='Accuracy') lo <- loess(y1~x) lines(predict(lo), col='red', lwd=2)

Does number of folds influence Kappa? Actually yes, it looks like the more folds, the lower the kappa, and the higher the variance. Neither relationship suggests using more than 10 folds–at least for these data (of these dimensions).

plot(x, y2, xlab='No. folds', ylab='Kappa') lo <- loess(y2~x) lines(predict(lo), col='red', lwd=2)

## Repeated k-fold cross validation

In k-fold cross validation on a dataset of this size, there are many many ways to cut up the training data into k folds. But we only implemented one of these ways, and it was randomly selected. To account for the random variability introduced by this selection, statisticians recommend repeating the 10-fold CV process several times and deriving average accuracy measures from each replicate. Let's repeat the validation 1000 times.

reps = 1000 # number of times to replicate the random k selection process m1.reps <- do.call(rbind, lapply(X=1:reps, FUN=function(x) kFoldCV(10, m1.glm))) m2.reps <- do.call(rbind, lapply(X=1:reps, FUN=function(x) kFoldCV(10, m2.glm))) m3.reps <- do.call(rbind, lapply(X=1:reps, FUN=function(x) kFoldCV(10, m3.glm))) repeatedPerform <- rbind(m1.reps, m2.reps, m3.reps) repeatedPerform$mod <- sort(rep(c('m1','m2','m3'), reps*2)) #create data frame with means of repeated CV for each measure and model repCVperformance <- ddply(repeatedPerform, .(mod, Measure), summarize, value = mean(value)) #arrange columns in same order as usual repCVperformance <- repCVperformance[,c(2,3,1)] print(repCVperformance) ## Measure value mod ## 1 ClassAcc 0.8002219 m1 ## 2 Kappa 0.5638397 m1 ## 3 ClassAcc 0.8372607 m2 ## 4 Kappa 0.6515769 m2 ## 5 ClassAcc 0.8383829 m3 ## 6 Kappa 0.6534456 m3

I placed all the performance measures–from the original in-sample model fits, and from the single case and the repeated cross validations, in a dataframe that we'll use for plotting below.

require(reshape) inSample <- melt(rbind(m1.performance, m2.performance, m3.performance), variable_name = "Measure") inSample$mod <- rep(c('m1','m2','m3'), 2) inSample$Method <- "InSample" singleCV <- rbind(m1.ACC, m2.ACC, m3.ACC) singleCV$mod <- sort(rep(c('m1','m2','m3'), 2)) singleCV$Method <- "SingleCV" repCVperformance$Method <- "RepeatedCV" # combine all into one df allMeasures <- rbind(inSample, singleCV, repCVperformance) allMeasures$mod <- as.factor(allMeasures$mod) allMeasures$Method <- as.factor(allMeasures$Method)

Let's visualize the variability, across repeated k-fold validations, of both classification performance measures. I'll add vertical lines to each histogram to signify the accuracy measures from the single cross validation run (dashed line), the mean accuracy measures from the repeated cross validation runs (solid line), and also the the original accuracy measures from the in-sample model fit (red line).

histo = function(measure){ CVreps <- subset(repeatedPerform, Measure==measure) Values <- subset(allMeasures, Measure==measure) ggplot(CVreps, aes(x = value)) + geom_histogram(colour = "grey", fill = "grey", binwidth = 0.0001) + facet_grid(. ~ mod, scales = "free") + geom_vline(data=Values[Values$Method=='SingleCV', ], aes(xintercept=value), linetype="dashed") + geom_vline(data=Values[Values$Method=='RepeatedCV', ], aes(xintercept=value)) + geom_vline(data=Values[Values$Method=='InSample', ], aes(xintercept=value), color="red") + xlab(measure) + theme_bw(base_size = 11) } #Histogram of mean observed accuracy after repeated k-fold cross validation histo("ClassAcc")

#Histogram of mean Kappa after repeated k-fold cross validation histo("Kappa")

There is sampling variability in both measures, but more so in Kappa. Looking at the x-axis of the Accuracy measure, the variability there is pretty small. Overall though the non-zero variance of both measures, from the in-sample model fit, and from the single cross validation run, is apparent in that the vertical lines aren't identical. Also notice that for Kappa, the in-sample fit led to overstated estimates in comparison with the repeated CV estimates… for every model.

The main point of the single and repeated cross validation, however, was to compare different models so we can choose the best model. Here's a simple plot of the performance measures from the candidate models (2 and 3), but on the same scale this time to facilitate cross-model comparison.

betterModels <- subset(allMeasures, mod!="m1") levels(betterModels$Method) <- c("InSample", "SingleCV", "RepeatedCV") ggplot(betterModels, aes(x = mod, y=value, group=Method, color=Method)) + geom_point() + geom_line() + facet_grid(Measure ~ ., scales = "free_y") + theme_bw(base_size = 11)

Again we see the negligible sampling variability of observed accuracy. Cross validation didn't provide any new information. Kappa varies more, and again we see the inflated estimates from the in-sample model fit. But in this particular case, it looks like every measure points to the same conclusion: model 3 is the best model we have so far. I'll stop this post here then, but not before we see how this model does on the actual Kaggle test data. Time for a submission.

# Kaggle Submission

Earlier we prepared the Kaggle test data. So all that's required now is to add a column to 'test' that contains the predicted outcomes according to model 3, remove the other columns (except the index column: PassengerId), write to a .csv file, and drag it onto the Kaggle submission box.

#Predict binary outcomes and attach test$Survived <- round(predict(m3.glm, test, type = "response"),0) submission <- data.frame(PassengerId = test$PassengerId, Survived=test$Survived) #Write to .csv write.csv(submission, 'model3.csv', row.names = FALSE)

I went to the Kaggle website and dragged my csv file to the submission box. Here's how model #3 fared in predicting the outcomes in the previously unseen test data.

What's the performance metric? The score in the Titanic challenge “is the percentage of passengers you correctly predict.” So despite cross validation during model selection, it looks like my model overfit the training data, signified by the substantial decrease in accuracy in the Kaggle test set. But, the eventual winner of a Kaggle competition will be judged on her model's performance on a brand new test set that is held out until the end of the competition. So actually we don't know whether this decrease from training to test will translate into a decrease from training to the real test data.

In the next post I'll try to improve upon this score by delving into the world of recursive partitioning (classification trees, random forest) and by taking advantage of the versatile `caret`

package.