Predictive modeling: Kaggle Titanic competition (part 2)

In a previous post, I documented the data preprocessing, feature engineering, data visualization, and model building that culminated in a submission to the Kaggle Titanic competition. I used a logistic regression model, with a set of features that included some engineered features and some 'out-of-the-box' features, to classify passengers into those who survived and those who perished. That model achieved about 78% classification accuracy, placing me within the top 40% of submissions… not bad, but not great. Let's see if we can do better here by delving into classification trees, ensemble methods like random forests, and the versatile caret package.

Decision tree learning

We've all seen a decision tree before. They often appear as flowcharts, where, for example, a series of yes/no decisions leads to a particular action. In predictive analysis, the goal of decision tree learning is to "grow" a tree that effectively predicts an outcome of interest as a function of input variables (features). Trees are typically learned by a process called recursive partitioning, in which the observations are split into subsets with similar or identical responses based on a "test" derived from a particular feature (e.g., Age > 30?). Each branch of the tree represents one outcome of a test (e.g., Yes/No), and each leaf (terminal node) corresponds to one level of the outcome variable (e.g., survived/perished).

From my background designing and analyzing experiments, one particular feature of decision trees is especially attractive: the ability to capture one or several two-way or higher-order interactions between variables/features. A hypothetical example of an interaction in a sinking-ship disaster: age might matter for males (say, older males are more likely to perish than younger males) but make little difference for females. That is, gender could interact with age. None of the models in the previous post included any interaction terms, two-way, three-way, or higher. Nor did I even examine any interactions. When there are many candidate features and relatively few observations (small n, large p problems), and even in the present case, including and inspecting all possible combinations of higher-order interactions is not feasible. Additionally, the parameters of substantially more complex models with several and/or higher-order interactions often cannot be reliably estimated with ordinary least squares or maximum likelihood estimation.
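For context, here's a hypothetical sketch of how an explicit two-way interaction would be written into a regression model (this is not a model from the previous post, and it assumes the preprocessed 'train' data frame loaded in the next section):

# Hypothetical sketch: an explicit Age x Sex interaction in a logistic regression.
# Age * Sex expands to Age + Sex + Age:Sex; a tree, by contrast, can pick up such
# an interaction without it being specified in advance.
m_int <- glm(Survived ~ Age * Sex, data = train, family = binomial)
summary(m_int)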

Preprocessing

I'll load the preprocessed training and test data I created in the previous post. I'll include some of the original features that I removed before model fitting in the previous post. This time around, I'll rely less on my domain knowledge and hunches about which variables to include, and more on the models. I did, however, remove a few of the original variables that wouldn't make good features, as well as my 'CabinAlpha', 'Good' and 'Bad' cabin features that I'm not happy with. As before, I'll perform identical operations on the training data and the test data provided by Kaggle.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=TRUE)

train <- read.csv('preprocessed_train_data.csv', header = TRUE)
test <- read.csv('preprocessed_test_data.csv', header = TRUE)

CleanDF = function(df){
  delete <- c("honorific", "Name", "Ticket", "Cabin", "CabinAlpha", "BadCabin", "GoodCabin")
  return(df[, !(names(df) %in% delete)])
}

train <- CleanDF(train)
test <- CleanDF(test)

Upon inspecting the new data frame, I found a few variables whose types need to be changed. This function will do the trick.

#function to change some variable types
changeVars = function(df){
  #change ordinal variables coded as numeric to factors
  df$Pclass <- as.factor(df$Pclass)
  #the outcome variable will only exist in the training data
  if("Survived" %in% names(df)){
    df$Survived <- as.factor(df$Survived)
  }
  #return new data frame
  return(df)
}
train <- changeVars(train)
test <- changeVars(test)

Classification trees

Although several different algorithms are available for decision tree learning, I'll use only one type here, called conditional inference trees. In writing this post I relied heavily on an excellent introduction to classification trees and recursive partitioning more generally by Carolin Strobl and colleagues. Let's start by learning some conditional inference trees. The R functions ctree (which implements conditional inference trees for classification and regression) and cforest (which implements random forests of such trees) are both available in the partykit package. As explained in the documentation, the ctree algorithm works roughly like this (a rough illustration of step 1 follows the list below):

1) Test the global null hypothesis of independence between any of the input variables and the response. Stop if this hypothesis cannot be rejected. Otherwise select the input variable with the strongest association to the response. This association is measured by a p-value corresponding to a test for the partial null hypothesis of a single input variable and the response.
2) Implement a binary split in the selected input variable.
3) Recursively repeat steps 1) and 2).
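To give a feel for step 1, here's a rough analogy (my own aside, and only an analogy: ctree actually uses a conditional inference framework based on permutation tests, with adjustment for multiple testing):

# Rough analogy to step 1: test each candidate feature's association with the
# response and note which is strongest. This is not ctree's actual machinery.
pvals <- sapply(c("Sex", "Pclass", "Embarked"), function(v)
  chisq.test(table(train[[v]], train$Survived))$p.value)
sort(pvals)  # the smallest p-value marks the strongest association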

# set a random seed to ensure repeatability
set.seed(99)
require(partykit)

The ctree function takes a control argument, an object returned by the function ctree_control, containing various parameters for tree learning. Let's start simple, with a tree that is forced to have only 1 split (i.e., only 1 input variable will be selected to partition the outcome variable). I'll call it m5 for model 5, continuing the numbering from the logistic regression models in the previous post.

tc_d1 <- ctree_control(maxdepth = 1)
m5 <- ctree(Survived ~ ., data = train, control = tc_d1)
print(m5)
## 
## Model formula:
## Survived ~ Pclass + PassengerId + Sex + Age + SibSp + Parch + 
##     Fare + Embarked + median.hon.age + median.class.fare + honorificType + 
##     CabinAlphaUno + FamSize + Fare.pp + wcFirst
## 
## Fitted party:
## [1] root
## |   [2] CabinAlphaUno in A, T, U: 0 (n = 703, err = 30.3%)
## |   [3] CabinAlphaUno in B, C, D, E, F, G: 1 (n = 188, err = 31.4%)
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2

The printed output shows that the single split was on the variable 'CabinAlphaUno', meaning it exhibited the strongest association with Survival. I crudely created this feature from the existing 'Cabin' variable, by simply taking the first letter of the string. So this tells me that there's probably better feature engineering to be done with the Cabin variable. I'm going to bypass 'CabinAlphaUno' for a moment and grow a tree that doesn't have access to it.

m6 <- ctree(Survived ~ . -CabinAlphaUno, data = train, control = tc_d1)
print(m6)
## 
## Model formula:
## Survived ~ Pclass + PassengerId + Sex + Age + SibSp + Parch + 
##     Fare + Embarked + median.hon.age + median.class.fare + honorificType + 
##     FamSize + Fare.pp + wcFirst
## 
## Fitted party:
## [1] root
## |   [2] honorificType in Master, Miss, Mrs: 1 (n = 351, err = 27.9%)
## |   [3] honorificType in Mr, Noble: 0 (n = 540, err = 16.5%)
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2

Now the strongest association with the outcome variable was the honorifics I extracted from the passenger names. This feature was used to split the outcomes into 2 terminal nodes, with what appears to be mostly men (save a few female noblepersons) on one side of the split, and women plus boys on the other. Let's plot a simple version of this classification tree.

plot(m6, type="simple")

[Figure: classification tree m6, simple view]

The grey terminal nodes provide the number of observations per node, plus the classification error. So about 28% of the women and boys did not survive, whereas only about 17% of the men and "nobles" survived. The default plot helps visualize these ratios, with color representing the outcome.

plot(m6)

[Figure: classification tree m6, default view with survival proportions per node]
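As a quick sanity check on those node error rates (my own aside), we can tabulate survival proportions by honorific directly:

# Survival proportions within each honorific group; the grouped node error
# rates above are weighted combinations of these rows.
round(prop.table(table(train$honorificType, train$Survived), margin = 1), 3)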

It looks to me like the main effect of 'honorificType' is actually capturing an Age by Sex interaction. Let's grow a tree that is forced to include just those two variables.

m7 <- ctree(Survived ~ Sex+Age, data = train)
plot(m7)

[Figure: classification tree m7 (Sex and Age only)]

Yep. This tree represents a statistically significant Age x Sex interaction. Females generally survived regardless of age, but for males, age mattered. Even so, the boys (males under 12) fared worse than the females as a monolithic group; just slightly more than half of the boys survived. So the "women and children first" strategy didn't work completely. Still, my hunch is that if we don't let the tree select either the CabinAlphaUno or the honorific variables, then "wcFirst" will be selected first among the remaining candidates. Let's find out.

m8 <- ctree(Survived ~ .,
  data = train[, !(names(train) %in% c('CabinAlphaUno', 'honorificType'))], control = tc_d1)
plot(m8)

[Figure: classification tree m8]

Indeed, once those two features are removed, the "women and children first" feature is selected for the initial split, making it third in line overall. Ok, curiosity satisfied. Now let's stop restricting the tree, and let the algorithm decide on its own which features to choose, and when to stop growing the tree.

m9 <- ctree(Survived ~ ., data = train)
plot(m9)

[Figure: full classification tree m9]

Now that's a tree. There are several higher-order interactions here, including something that you don't see in linear models with interaction terms (including the experimenter's method of choice: analysis of variance). That is, 'CabinAlphaUno' is split once at the first level, and then again two levels deeper on a branch that stems from itself. This is entirely possible in decision trees, because at each split every variable is considered again.

Does this tree reveal anything interpretable over and above the Age x Sex interaction we already saw (and that is driving the second-level splits here)? For one thing, cabin has a particularly strong association with survival, winning the spot for the first split, despite the fact that over ¾ of the passengers' cabin assignments were unknown ('U'). The other striking relationship I notice is that among the women and children (leftmost branch) holding a third-class ticket, family size matters a lot. For example, among the 45 people belonging to larger families (4+ members), almost none survived.

So how well do this tree's predicted/fitted outcomes capture the observed outcomes? I'll call the ClassPerform function defined in the previous post, which takes a confusion matrix as its argument and returns two measures of binary classification performance: observed accuracy and kappa. Also note that the tree's predicted responses (which we need to build the confusion matrix) are obtained in the usual way with R's predict function.

# Confusion matrix 
m9.cm <- as.matrix(table(train$Survived, predict(m9, train, type = "response")))
ClassPerform(m9.cm)
##    ClassAcc     Kappa
## 1 0.8338945 0.6421279
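For reference, here's a minimal sketch of what a function like ClassPerform might look like, assuming it computes observed accuracy and Cohen's kappa from a square confusion matrix (the actual definition is in the previous post):

# Minimal sketch of a ClassPerform-style function (the real one is defined in
# the previous post): observed accuracy and Cohen's kappa from a confusion matrix.
ClassPerformSketch <- function(cm){
  total <- sum(cm)
  po <- sum(diag(cm)) / total                      # observed accuracy
  pe <- sum(rowSums(cm) * colSums(cm)) / total^2   # chance-expected accuracy
  data.frame(ClassAcc = po, Kappa = (po - pe) / (1 - pe))
}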

This tree fits the observed data no better than the much simpler logistic regression model in the previous post. But does this necessarily mean that its predictive accuracy is no better? No, because a model's fit to the observed data is a measure of its explanatory power (assuming an underlying causal theory, but that's another issue). At the very least, these measures can help sort out which model best describes known data. Prediction and explanation are often conflated or poorly specified, including in the professional scientific literature (e.g., probably in one of my papers). I like this description of each by Shmueli:

In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory. In contrast, predictive modeling seeks to minimize the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision… the “wrong” model can sometimes predict better than the correct one.

We could derive predicted outcomes for a new dataset, like the Kaggle test data here, but there's an issue with single classification trees that decreases their utility as classifiers. What's the problem? Here are Strobl and colleagues:

The main flaw of simple tree models is their instability to small changes in the learning data: In recursive partitioning, the exact position of each cutpoint in the partition, as well as the decision about which variable to split in, determines how the observations are split up in new nodes, in which splitting continues recursively. However, the exact position of the cutpoint and the selection of the splitting variable strongly depend on the particular distribution of observations in the learning sample. Thus, as an undesired side effect of the recursive partitioning approach, the entire tree structure could be altered if the first splitting variable, or only the first cutpoint, was chosen differently because of a small change in the learning data. Because of this instability, the predictions of single trees show a high variability.
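A quick way to see this instability for yourself (my own aside, not part of the original analysis) is to refit the same shallow tree on bootstrap resamples of the training data and compare the chosen splits:

# Instability sketch: fit the same depth-1 tree on two bootstrap resamples;
# the winning split variable and/or its cutpoint can differ across resamples.
set.seed(123)
boot1 <- train[sample(nrow(train), replace = TRUE), ]
boot2 <- train[sample(nrow(train), replace = TRUE), ]
print(ctree(Survived ~ ., data = boot1, control = ctree_control(maxdepth = 1)))
print(ctree(Survived ~ ., data = boot2, control = ctree_control(maxdepth = 1)))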

Bagging and random forests

So single trees are prone to high-variance predictions (i.e., overfitting). Did anybody come up with a better idea? Yes. A better approach involves repeatedly growing trees from randomly selected bootstrapped samples of the training data (samples of the same size, drawn with replacement), and making predictions based on the whole ensemble of trees. This bootstrap aggregating, or bagging, is a popular ensemble approach. There is another method that aggregates even more diverse trees. In addition to growing several trees from bootstrapped samples, each tree is restricted, at each split, to a randomly selected subset of the available features. This ensemble approach is called a random forest. Among the parameters of random forests are the number of trees to grow, and the number of randomly selected features to choose from at each split. Theoretical and empirical evidence suggests that individual classification trees are unbiased (but high-variance) predictors, so these ensemble methods mitigate the high variance by aggregating over many trees.
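To make the bagging idea concrete, here's a rough conceptual sketch (my own illustration; cforest does all of this, and more, internally):

# Conceptual sketch of bagging: grow a handful of trees on bootstrap samples of
# the training data, then classify each passenger by majority vote across trees.
set.seed(42)
votes <- replicate(25, {
  boot <- train[sample(nrow(train), replace = TRUE), ]   # bootstrap sample
  tree <- ctree(Survived ~ ., data = boot)
  as.integer(as.character(predict(tree, newdata = train, type = "response")))
})
bagged_pred <- factor(ifelse(rowMeans(votes) > 0.5, 1, 0))  # majority vote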

Let's build a bagged ensemble, which we can implement in cforest by setting mtry to Inf (i.e., the full set of available features).
We'll also build a random forest, using the default for 'mtry'. We'll grow 500 trees (the default) for each model.

m10 <- cforest(Survived ~ ., data = train, ntree = 500, mtry = Inf)
m11 <- cforest(Survived ~ ., data = train, ntree = 500)

So where's the tree? Well, while there are ways to plot one of the 500 trees that make up the ensemble, doing so doesn't make sense like it did with the single trees we grew earlier. Ensemble classifiers use "votes" from hundreds of independently grown trees, each of which has only seen a subset of the training data and a subset of the available features, to settle on a predicted outcome. So no single tree in the ensemble will necessarily give us a sense of the forest. The main goal here, though, is prediction, so let's do that. We don't have access to a test set that the models haven't seen before, but we can use the imperfect alternative of out-of-bag (OOB) error estimation. Basically, the method is to evaluate each tree's predictions on the held-out data that wasn't available for growing that particular tree (i.e., the observations that were not in the bootstrap sample used to grow it). Here I'll obtain the confusion matrix for each model (incorporating the OOB prediction error), and call my ClassPerform function to get the accuracy measures.

# OOB classification of true vs. predicted classes
m10.cf <- table(predict(m10, OOB=TRUE), train$Survived)
m11.cf <- table(predict(m11, OOB=TRUE), train$Survived)

# Bagging
ClassPerform(m10.cf)
##    ClassAcc     Kappa
## 1 0.8451178 0.6640437

# Random forest
ClassPerform(m11.cf)
##    ClassAcc     Kappa
## 1 0.8361392 0.6465669

Kaggle submission

In comparison with the repeated 10-fold cross-validated prediction accuracy of my best logistic regression model in the previous post, these aren't any better, and are maybe a bit worse. But I can't help but submit one of these models' predictions to Kaggle, to see how the model fares on a truly new dataset. I'll submit the predictions from the ensemble of bagged trees, but now I'll make predictions from the model trained on all of the available data (by setting OOB to FALSE).

Survived <- as.factor(predict(m10, test, OOB=FALSE))
submission <- data.frame(PassengerId = test$PassengerId, Survived=Survived)

#Write to .csv
write.csv(submission, 'bag.csv', row.names = FALSE)

How did I do?

[Figure: Kaggle leaderboard result for the bagged-ensemble submission]

It's an improvement on the logistic regression model. Less than a 0.5% improvement, but I nonetheless moved up 343 spots on the leaderboard. The margins of success become increasingly small as prediction accuracy gets higher. There is a lot more we could do with ensemble methods in partykit, but there are so many other kinds of predictive models out there, each with their own R packages. If only there were a way to access them all, and their associated methods, in an integrated framework… oh wait, there is a way.

Modeling with the caret package

The caret (Classification And Regression Training) package provides a standardized interface to hundreds of different R functions useful for predictive modeling. The author of caret co-authored the book Applied Predictive Modeling, in which the package features heavily.

At the heart of caret is the train function, which can 1) estimate model performance from a training set; 2) use resampling approaches to evaluate the effects of model tuning parameters on performance; and 3) choose the optimal model across these parameters. So far we've done some of the first, but neither of the other two. Here is pseudocode for the general train algorithm:

  1. Define model parameters to evaluate.
  2. FOR each parameter:
  3.     FOR each resampling iteration:
  4.         Hold out 'test' data
  5.         Fit model on remainder of 'training' data
  6.         Predict the outcome variable on the test data
  7.     END
  8.     Compute mean accuracy metrics across test sets
  9. END
  10. Determine optimal parameters.
  11. Fit the best model to the full training dataset.
  12. Optionally derive predictions for a new test dataset.

Ok, so let's continue with the ensemble approach from above. We'll build a random forest of conditional inference trees (i.e., the cforest function), but this time we'll use train to systematically evaluate the influence of the parameter 'mtry' (the number of randomly selected predictors).

First, we choose a resampling method and define its parameters. To maintain consistency with what I've done so far, I'll use repeated 10-fold cross validation (repetition = 5).

require(caret)

RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=5,
  verboseIter = FALSE, allowParallel = TRUE)

Now we call train, which takes the model formula, data, estimation method, and the train control object as arguments. How does train select which levels of the tuning parameters to evaluate? By default, if p is the number of tuning parameters, the parameter grid size is 3^p. The user can also supply a custom grid, though, which is what I elected to do here to sample more of the parameter space.

customGrid <-  expand.grid(mtry=seq(3,27,by = 2))
t1 <- train(Survived ~ ., data = train, method="cforest", 
  trControl = RfoldCV,  tuneGrid=customGrid)
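As an aside (not run in this post), you could instead let train construct the grid itself by specifying how many candidate values to evaluate per tuning parameter:

# Alternative sketch: let caret build the mtry grid itself, evaluating
# tuneLength candidate values instead of a user-supplied tuneGrid.
t1_alt <- train(Survived ~ ., data = train, method = "cforest",
  trControl = RfoldCV, tuneLength = 5)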

Let's examine some of the contents of the returned object, starting with the classification accuracy at different levels of the 'mtry' parameter.

print(t1, details=T)
## Conditional Inference Random Forest 
## 
## 891 samples
##  15 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 803, 802, 802, 802, 802, 801, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
##    3    0.8103584  0.5748504  0.04065178   0.09170079
##    5    0.8202288  0.6095693  0.04218561   0.09216546
##    7    0.8258343  0.6222412  0.04172612   0.09090105
##    9    0.8247132  0.6179529  0.04252162   0.09278510
##   11    0.8262812  0.6202889  0.04386115   0.09686357
##   13    0.8262787  0.6194249  0.04294836   0.09550746
##   15    0.8267382  0.6199570  0.04319338   0.09579875
##   17    0.8271826  0.6202590  0.04398649   0.09809832
##   19    0.8262812  0.6183850  0.04236671   0.09445320
##   21    0.8265033  0.6185649  0.04280261   0.09552688
##   23    0.8240339  0.6134755  0.04264691   0.09466191
##   25    0.8240162  0.6133688  0.04448782   0.09864004
##   27    0.8244782  0.6142466  0.04353985   0.09642757
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 17. 
## 
## ----------------------------------------------------------
## 
## The final model:
## 
## 
##   Random Forest using Conditional Inference Trees
## 
## Number of trees:  500 
## 
## Response:  .outcome 
## Inputs:  Pclass2, Pclass3, PassengerId, Sexmale, Age, SibSp, Parch, Fare, EmbarkedQ, EmbarkedS, median.hon.age, median.class.fare, honorificTypeMiss, honorificTypeMr, honorificTypeMrs, honorificTypeNoble, CabinAlphaUnoB, CabinAlphaUnoC, CabinAlphaUnoD, CabinAlphaUnoE, CabinAlphaUnoF, CabinAlphaUnoG, CabinAlphaUnoT, CabinAlphaUnoU, FamSize, Fare.pp, wcFirstyes 
## Number of observations:  891

The optimal forest emerged when several but not all predictors were made available. In addition to the mean accuracy metrics, the train object also contains the individual metrics from each prediction (i.e., number of folds x repetitions) for the optimal model. We can get a sense of the variability (and sampling distribution) of each measure by plotting histograms.

hist(t1$resample$Accuracy, col="grey")

[Figure: histogram of cross-validated accuracy for the optimal model]

hist(t1$resample$Kappa, col="green")

[Figure: histogram of cross-validated kappa for the optimal model]

We can also inspect performance across the parameter space (in this case, just the one parameter) with a ggplot method.

ggplot(t1)

[Figure: cross-validated accuracy as a function of mtry]

Kaggle submission

Our best Kaggle submission so far came from the bagged ensemble of trees, but the plot above suggests that, with the current set of predictors, a random forest that considers most (but not all) of the randomly selected predictors at each split is more accurate than one in which all predictors are available. Let's make a prediction.

The predict.train function takes a train object as its first argument, and that object already contains only the optimal model. That is, you should not supply the finalModel component of the train object to this argument.

Survived <- as.factor(predict.train(t1, test))
submission <- data.frame(PassengerId = test$PassengerId, Survived=Survived)

#Write to .csv
write.csv(submission, 'rand_forest.csv', row.names = FALSE)

How did we do?

[Figure: Kaggle leaderboard result for the random forest submission]

Actually, a sizeable improvement. We used the same set of candidate features that were available in my previous submissions, but this time caret helped us zero in on a better ensemble (random forest) model. This is a substantial jump on the leaderboard, and we're now within the top 25% of submissions. So how to improve? We could perform more comprehensive comparisons of this model with the many other models available in R through the caret interface. But I'm not convinced that we've extracted the best set of features from the out-of-the-box training data provided by Kaggle. In the next and final post I'll start from scratch with the Kaggle data, engineer a few more features, fit a wider variety of models, and hopefully end with my best submission.
