In my first Kaggle Titanic post and the followup post, I walked through some R code to perform data preprocessing, feature engineering, data visualization, and model building for a few different kinds of models. In this post I'll build a new feature set (including model-based missing data imputation), train and tune different models including random forests and neural networks, assess variable importance, perform feature selection, and submit a few entries to the Kaggle contest along the way. Some of the code and methodology in this post is reused from the previous posts where I provided more detail, so I recommend reading those first. But this post can stand on its own if you're already comfortable with R.

# Preprocessing

I'll start by making a few neccessary changes to the raw data downloaded from the Kaggle Titanic competition website.

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE, collapse=TRUE, error=FALSE) # maintain the same random seed throughout rand.seed <- 99 set.seed(rand.seed) # Kaggle files train <- read.csv('train.csv', header=TRUE, na.strings = c("NA","")) test <- read.csv('test.csv', header=TRUE, na.strings = c("NA","")) changeTypes = function(df){ df$Pclass <- as.factor(df$Pclass) df$Ticket <- as.character(df$Ticket) df$Cabin <- as.character(df$Cabin) df$Name <- as.character(df$Name) if("Survived" %in% names(df)){ df$Survived <- as.factor(df$Survived) } return(df) } train <- changeTypes(train) test <- changeTypes(test) # extract honorifics and surnames NameExtract = function(df) { # remove first names following period lastnames <- sub('\\..*', '', df$Name) # extract honorifics honorific <- sub('.*\\,', '', lastnames) # extract surnames surname <- sub('\\,.*', '', lastnames) # clean up white space etc. df$honorific <- as.factor(gsub("^\\s+|\\s+$", "", honorific)) df$surname <- as.factor(gsub("^\\s+|\\s+$", "", surname)) # we don't need the names anymore df$Name <- NULL return(df) } train <- NameExtract(train) test <- NameExtract(test) # transform honorifics into useable feature honorificBuild = function(df){ # function to replace old honorific(s) with new one replace.string <- function(df, original, replacement) { for (x in original) { if(x %in% levels(df$honorific)) { levels(df$honorific)[levels(df$honorific)==x] <- replacement } } return(df$honorific) } # create ‘types’ of honorifics, try not to mix genders df$honorific <- replace.string(df, c('Dona','Ms','the Countess','Jonkheer','Lady'),'Mrs') df$honorific <- replace.string(df, c('Mme','Mlle'), 'Miss') df$honorific <- replace.string(df, c('Sir','Capt','Rev','Dr','Col','Don','Major'), 'Noble') df$honorific <- droplevels(df$honorific) return(df) } train <- honorificBuild(train) test <- honorificBuild(test)

Here I'll impute a few missing Cabin names based on unique family names, and then retain the letters because they correspond to the deck of the ship. It seems like deck could be a useful feature.

DeckFeature = function(df){ # remove digits from cabin names df$Cabin <- gsub('[[:digit:]]+', "", df$Cabin) # retain single letter (corresponds to the deck) df$Deck <- sub("^(\\w{1}).*$", "\\1", df$Cabin) df$Cabin <- NULL return(df) } train <- DeckFeature(train) test <- DeckFeature(test)

There are a lot of missing 'Cabin' rows. One minor fix we can do is check if only one member of a given family was assigned a cabin. If so, we can assume that the other family members were in the same cabin. But finding members of a given family isn't as simple as picking unique surnames–more than one family might have the same surname. I'll create a variable 'FamID' that attempts to find surnames belonging to a single family. By pasting together the surname, the class, and the point of departure, I hope that these new ID variables accurately pick out unique families.

pickFams = function(df){ df$FamID <- as.factor(paste(df$surname, df$Pclass, df$Embarked, sep="")) return(df) } train <- pickFams(train) test <- pickFams(test)

There are a few cases in the 'train' dataset where at least one member of a family has a deck assigned to them, and at least one member of the same family does not. For example, in the subset of 'train' data where last names are 'Brown'.

subset(train, surname=='Brown')[, (ncol(train) - 3):(ncol(train))] ## honorific surname Deck FamID ## 195 Mrs Brown B Brown1C ## 346 Miss Brown F Brown2S ## 671 Mrs Brown <NA> Brown2S ## 685 Mr Brown <NA> Brown2S

Let's create a dplyr chain that fills in the 'NA' family members with the correct Deck. The 'yes' value for the `ifelse`

function utilizes the`na.locf`

function, which stands for *last observation carried forward*.

library(dplyr) ImputeDeck = function(df){ new_df <- df %>% group_by(FamID) %>% mutate(Deck = ifelse((n_distinct(Deck) > 1 & any(is.na(Deck))), yes = zoo::na.locf(Deck), no = Deck)) %>% as.data.frame() # change remaining NAs to 'U' for Unknown new_df$Deck[which(is.na(new_df$Deck))] <- 'U' # change to a factor new_df$Deck <- as.factor(new_df$Deck) # don't need FamID anymore new_df$FamID <- NULL return(new_df) } train <- ImputeDeck(train) test <- ImputeDeck(test)

Here are those 'Brown' family members again. Looks good.

subset(train, surname=='Brown')[, (ncol(train) - 3):(ncol(train))] ## Embarked honorific surname Deck ## 195 C Mrs Brown B ## 346 S Miss Brown F ## 671 S Mrs Brown F ## 685 S Mr Brown F

Next, ticket numbers might have some useful information. This function cleans them up, and extracts a few features.

TicketFeature = function(df) { # remove punctuation df$Ticket <- gsub('[^[:alnum:]]+', "", df$Ticket) # retain numbers only df$Tix.dig <- as.numeric(gsub('[[:alpha:]]+', "", df$Ticket)) # perhaps the number of digits in the ticket is information # any tickets with less than 4 digits are relatively rare, # so subsume those into one level of this feature df <- df %>% mutate(tix.grp = ifelse(nchar(as.character(df$Tix.dig)) < 4, 3, nchar(as.character(df$Tix.dig)))) %>% as.data.frame() df$tix.grp <- as.factor(df$tix.grp) df$Tix.dig <- NULL df$Ticket <- NULL return(df) } train <- TicketFeature(train) test <- TicketFeature(test)

Rather than replacing missing values of a numeric variable with its mean or median as we did in the previous post, we can treat the missing data problem as another predictive modeling problem. We can train a model to learn a functional relationship between predictors x,y,z and predictor *n*, using the subset of data in which *n* is non-missing. Then we can use this model to predict the missing values in *n*. Let's build a random forest to impute the missing values for 'Age', and the missing values (i.e., '0.00') for 'Fare' using a subset of predictors that I think are relevant.

library(partykit) imputeForest = function(df, missing.var) { #special case for 'Fare' if(missing.var=="Fare"){ df$Fare[df$Fare < 0.01] <- NA } missing.df <- df[is.na(df[ , missing.var]), ] missing.df[, missing.var] <- NULL existing.df <- df[!is.na(df[ , missing.var]), ] # make the response a variable 'y' existing.df$y <- existing.df[, missing.var] existing.df[, missing.var] <- NULL forest <- cforest(y ~ Sex+Pclass+honorific+SibSp+Parch, data=existing.df, ntree = 100) missing.df[ ,missing.var] <- predict(forest, newdata = missing.df) existing.df[, missing.var] <- existing.df$y existing.df$y <- NULL newdf <- rbind(missing.df, existing.df) return(newdf) } train <- imputeForest(train, "Age") test <- imputeForest(test, "Age") train <- imputeForest(train, "Fare") test <- imputeForest(test, "Fare")

Are Age and Fare ready to be added to predictive models? First, there's no need for age to have the precision of several decimal places so let's make Age an integer variable. Infants < 1 year old will be coded as 1.

Age2Int = function(df){ df$Age <- as.integer(df$Age) df$Age[which(df$Age==0)] <- 1 return(df) } train <- Age2Int(train) test <- Age2Int(test)

Second, a quick look at the distribution of Fare reveals that it's highly positively skewed–a few tickets were way more expensive than most.

library(ggplot2) ggplot(train, aes(x=Fare)) + geom_histogram(fill="grey") + theme_light()

Taking the logarithm of Fare transforms it into a less skewed variable that may be more transparently related to survival, as is visible here.

ggplot(train, aes(x=Fare)) + geom_histogram(aes(y=..density..), fill="grey") + geom_density(alpha=.2, fill="red") + scale_x_log10() + theme_light()

A quick way to see the difference in how log(Fare) is related to Survival is to run logistic regression models with both versions of Fare. I included 'Pclass' as well to soak up shared variance.

summary(glm(Survived ~ Fare + Pclass, data=train, family='binomial')) ## ## Call: ## glm(formula = Survived ~ Fare + Pclass, family = "binomial", ## data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9052 -0.7679 -0.7330 1.0807 1.7026 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.052002 0.226587 0.230 0.8185 ## Fare 0.006027 0.002386 2.526 0.0115 * ## Pclass2 -0.288297 0.240812 -1.197 0.2312 ## Pclass3 -1.275935 0.227394 -5.611 2.01e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1186.7 on 890 degrees of freedom ## Residual deviance: 1075.1 on 887 degrees of freedom ## AIC: 1083.1 ## ## Number of Fisher Scoring iterations: 4 summary(glm(Survived ~ log(Fare) + Pclass, data=train, family='binomial')) ## ## Call: ## glm(formula = Survived ~ log(Fare) + Pclass, family = "binomial", ## data = train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.6769 -0.8572 -0.6928 1.0835 1.7778 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -1.1892 0.5227 -2.275 0.022912 * ## log(Fare) 0.4153 0.1223 3.396 0.000685 *** ## Pclass2 -0.1322 0.2521 -0.525 0.599907 ## Pclass3 -0.9671 0.2672 -3.620 0.000295 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1186.7 on 890 degrees of freedom ## Residual deviance: 1071.6 on 887 degrees of freedom ## AIC: 1079.6 ## ## Number of Fisher Scoring iterations: 4

The AIC for the log Fare model, and p-value for log Fare, in comparison to those metrics from the 'raw' fare model, both suggest that this transformation is a good idea. So let's make it a feature in the dataframes. I'll also round it to 2 decimal places.

logFare = function(df) { df$Fare <- round(log(df$Fare), 2) return(df) } train <- logFare(train) test <- logFare(test)

We're almost ready to build models. I'll fill in a few missing 'Embark' values, create a family size variable by adding SibSp and Parch, and remove the surname variable.

FinalTouches = function(df) { # impute 'Embark' by replacing any NAs with most common destination df$Embarked[which(is.na(df$Embarked))] <- 'S' # Family size df$FamSize <- df$SibSp + df$Parch # Get rid of surnames df$surname <- NULL return(df) } train <- FinalTouches(train) test <- FinalTouches(test)

# Decision tree

The first model I want to see is a conditional inference tree grown from the full training data. Since it's only a single tree, it's likely overfitting the data, but I'm interested here in understanding the relationships between the predictors and survival, not prediction (yet).

decision_tree <- ctree(Survived ~ . -PassengerId -Name, data = train) plot(decision_tree, gp = gpar(fontsize=8))

Consistent with the previous post, passenger class and honorific title are key predictors. The two largest terminal nodes in the tree are node 8, which contains 182 women and children in higher passenger classes (of which over 95% survived), and node 13, which contains the 319 men in the lowest passenger class (of which only 11% survived). Besides upper class women and children, the only other group that was more likely to survive than perish was lower class women and children belonging to families of less than four people–but even this group didn't fare that well.

# Random forest

Now on to prediction. We saw in the previous post that an ensemble of decision trees called a random forest was our best predictive model so far. To review, random forests are the products of growing several trees from bootstrapped samples of the training data, where each tree has access to a randomly selected subset of available features. Here we'll use the `caret`

package to fit several random forests, each having access to a different number of features. Based on the forest parameter sweep I did in the previous post, which suggested that several features were better, I'll only sample a small subset of this parameter's range. The 'best' forest is simply that with the highest averaged resampling (e.g., k-fold cross validation) accuracy. If you're wondering why best is in quotations, skip to the end of the post!

detach("package:partykit", unload=TRUE) library(caret) RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10) customGrid <- expand.grid(mtry=seq(16,20,by = 2)) t1 <- train(Survived ~ . -PassengerId, data = train, method="cforest", trControl = RfoldCV, tuneGrid=customGrid) print(t1) ## Conditional Inference Random Forest ## ## 891 samples ## 12 predictor ## 2 classes: '0', '1' ## ## No pre-processing ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 803, 802, 802, 801, 802, 802, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa Accuracy SD Kappa SD ## 16 0.8254549 0.6191893 0.03498533 0.07824130 ## 18 0.8292827 0.6269744 0.03441446 0.07789530 ## 20 0.8278144 0.6238986 0.03396997 0.07640288 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 18.

The best accuracy here is a bit better than my previous random forest, so I submitted this model's predictions on the Kaggle test data. I scored about 76% accuracy, which is worse than several of my previous models. Why? I have a hunch that this model is over-parameterized–there were 28 features that made it into the forest. If we decide to simplify the model, which features should be transformed or dropped? One way to decide is to assess variable importance.

# Variable importance

The importance of a feature in a predictive model can be estimated in several different ways (often depending on the type of model and its assumptions). In general, one can measure loss of prediction accuracy after removing a feature from a model. This version is similar in spirit to the common case of model comparison with the Chi-square test. Another example is linear models, where variable importance can be estimated with the absolute value of the t-statistic (ratio of effect size to standard error). The `caret`

package includes several model-specific variable importance measures, as well as a generic version that doesn't rely on any particular model. Let's estimate variable importance for the random forest.

We'll use the permutation-based measure available in the `party`

package. Each predictor is randomly permuted (shuffled row-wise) to remove its association with the response. Then this new variable along with the remainder of predictors are used to predict the response in the out-of-bag sample. The average drop in prediction accuracy after the variable has been permuted (across all trees) is the measure of importance. Let's see where our predictors stand.

model.imp <- varImp(t1, scale=F, useModel = T, conditional=F) plot(model.imp)

Consistent with the single tree grown above, variables related to socioeconomic status (Pclass, Fare), sex (Sex, honorific), and age (Age, honorific) seem to be the key predictors. What's not important? Notice that 'U' is the only level of Deck that's moderately important–but remember that this just means unknown. So one idea is to transform the deck variable into known vs. unknown. In addition, the ticket group variable seems fairly unimportant–we could just drop it.

One thing to note is that Strobl and colleagues have shown that the standard `cforest`

permutation-based importance measure can be misled by spurious correlations (e.g., inability to recognize that age but not shoe size is the relevant predictor of reading skill). To remedy this concern, they created a conditional importance measure that takes into account the correlation structure among predictors in the permutation scheme. The result is that variable importance is conditional on the other predictors (c.f., semi-partial regression coefficients). This method is implemented by setting conditional to true in the `varImp`

function; it's not run here because it's very time intensive.

Some predictive models don't have an obvious way to measure variable importance, in which case one can compute a 'generic' measure of variable importance that relies on ROC curve analysis. The general idea is to try several different cutoff values (decision thresholds) of each predictor, computing sensitivity and specificity for each (both measures are available from the confusion matrix; see part 1). The results can be plotted as an ROC curve (i.e., true positives plotted against false positives at each of these thresholds). Variable importance is taken as the area under the ROC curve. We can implement this measure by setting 'useModel' to FALSE.

auc.imp <- varImp(t1, scale=F, useModel = F) plot(auc.imp)

Many of the same conclusions can be drawn from this measure. But ticket group seems to be moderately important here, so let's take a look with a mosaic plot.

library(vcd) mosaicplot(train$tix.grp ~ train$Survived, shade=F, color=T)

Perhaps there's something unique about tickets with 5 digits: the majority of those people survived as opposed to every other ticket group. So let's transform deck and ticket group to binary variables that not only may be more predictive, but will reduce model complexity.

Simplify = function(df) { df$Deck <- ifelse(df$Deck=='U', 'unknown','known') df$tix.grp <- ifelse(df$tix.grp==5, 'five','other') df$Deck <- as.factor(df$Deck) df$tix.grp <- as.factor(df$tix.grp) return(df) } train <- Simplify(train) test <- Simplify(test)

Now I'll build another random forest using these features, but this time I'll let `train`

choose the 'mtry' parameters.

set.seed(rand.seed) RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10, allowParallel = TRUE) t2 <- train(Survived ~ . -PassengerId, data = train, method="cforest", trControl = RfoldCV) print(t2, details=T) ## Conditional Inference Random Forest ## ## 891 samples ## 12 predictor ## 2 classes: '0', '1' ## ## No pre-processing ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa Accuracy SD Kappa SD ## 2 0.7947205 0.5243128 0.03712724 0.09209961 ## 9 0.8287313 0.6262203 0.03717407 0.08255746 ## 16 0.8289648 0.6276099 0.03652213 0.08175874 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 16. ## ## ---------------------------------------------------------- ## ## The final model: ## ## ## Random Forest using Conditional Inference Trees ## ## Number of trees: 500 ## ## Response: .outcome ## Inputs: Pclass2, Pclass3, Sexmale, SibSp, Parch, EmbarkedQ, EmbarkedS, honorificMrs, honorificMaster, honorificMiss, honorificMr, Deckunknown, tix.grpother, Age, Fare, FamSize ## Number of observations: 891

Accuracy is similar to the more complex random forest. Let's look again at model-based variable importance.

auc.imp2 <- varImp(t2, scale=F) plot(auc.imp2)

It looks like those new binary features might be adding some predictive power. On the other hand, there are at least three features that aren't adding much to the model. We could just drop the least important variables, and keep fitting models. But there's another way to choose the 'best' subset of features.

# Feature selection

When a prediction problem involves only a handful of possible features, like this Titanic dataset, the modeler might have sufficient knowledge of the problem to hand select the most relevant predictor variables. But sometimes prediction problems come with tens of thousands of candidate predictors, for example in classification problems involving DNA sequences or large text documents. In these cases it is not usually feasible to hand select the relevant features. Automated feature selection techniques present a potential solution, but they come with some problems of their own–at least in their use and interpretation.

Two main classes of feature selection techniques are 'wrapper' techniques and 'filter' techniques. Here I'll focus on wrapper techniques. Most generally, wrapper techniques evaluate different subsets of available variables–sometimes including the interactions between them–to find a “best” model. If you're in the social or natural sciences, chances are you've already heard of or used one such technique–stepwise regression. In my experience, experimental scientists working with smaller datasets almost exclusively use wrapper techniques, and most commonly, forward stepwise regression. In this technique, a model containing a single predictor plus the intercept is compared to the null model containing only the intercept term. If there are 5 predictors, for example, 5 of these models are fitted. The model that explains the most additional variance in the response variable (or sometimes the variable with the lowest p-value) wins. Then this new model with two terms becomes the base model that is compared to each model containing one of each remaining candidate predictors.

Despite their popularity, stepwise regression methods have been criticized heavily for producing biased estimates, overfitting, and generally being a bit of a joke among professional statisticians. The problem may be most dramatic when no validation data is available, no resampling or cross-validation is used during selection, and the researcher reports estimates or predictions from the final selected model with no mention of how it was selected. Just as we saw in model selection, one way to alleviate some of these concerns is to assess out-of-sample prediction. This `caret`

page addresses the importance of external validation not only for model selection, but for feature selection too:

It is important to realize that feature selection is part of the model building process and, as such, should be externally validated. Just as parameter tuning can result in over-fitting, feature selection can over-fit to the predictors (especially when search wrappers are used). In each of the caret functions for feature selection, the selection process is included in any resampling loops"

One such automatic wrapper feature selection technique is called recursive feature elimination (RFE), which has some parallels to stepwise regression methods. Let's see how RFE might help us find a better predictive model. RFE starts in the opposite position of forward selection methods, by first fitting the saturated model (containing all predictors), and then removing variables one at a time, comparing variable importance and model prediction for each smaller model. But the same issue arises that we saw in model selection, where the best set of predictors for a particular training dataset may not be replicated in a new dataset–hence the need for external validation. Caret provides the ability to nest the RFE procedure in an outer cross validation loop. Here's pseudocode for the algorithm:

**FOR**each resampling iteration:- ….Partition training data into test/train subsets
- ….Fit/tune model with all
*k*predictors to partitioned train data - ….Predict model responses on test data
- ….Note model accuracy and variable importances
- ….
**FOR**each subset size*S*, from 1 …*k*-1 predictors: - ……..Keep
*S*most important predictors - ……..Fit/tune model to train data using
*S*predictors - ……..Predict model responses on test data
- ……..(optional: recompute variable importance)
- ….
**END** **END**- Compute performance across the
*S*using the hold-out test data - Determine appropriate number of predictors
- Estimate final set of predictors to keep in full model
- Fit “best” model to original training data using optimal
*S*

The caret `rfe`

function implements the RFE algorithm. Its first two arguments are a dataframe of predictors and vector or factor of responses, so let's create those.

predictors <- select(train, c(-PassengerId, -Survived)) responses <- (select(train, Survived))$Survived

The function also takes a control object that includes the class of model to fit (and from which variable importance measures will be extracted), the method of resampling, number of folds (if you choose CV), and number of repeats. The caret documentation explains how to use any of the available models (or a custom model), but let's use the functions available in the `randomForest`

package by specifying `rfFuncs`

. The variable importance measure available in this implementation of random forests is based on Gini node impurity, which is a measure of the quality of a split in a decision tree. The greater the homogeneity of a node from that split, the lesser the impurity.

library(randomForest) ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv", number=10, repeats=10, verbose=FALSE, allowParallel=TRUE)

The `rfe`

function also takes a sizes argument specifying which subset sizes to include.

Let's run the algorithm including all possible subset sizes, and print the output.

rf.profile <- rfe(predictors, responses, sizes = 3:11, rfeControl = ctrl) print(rf.profile) ## ## Recursive feature selection ## ## Outer resampling method: Cross-Validated (10 fold, repeated 10 times) ## ## Resampling performance over subset size: ## ## Variables Accuracy Kappa AccuracySD KappaSD Selected ## 3 0.7942 0.5578 0.03931 0.08761 ## 4 0.8206 0.6118 0.04196 0.09276 ## 5 0.8367 0.6456 0.03750 0.08336 * ## 6 0.8306 0.6326 0.03751 0.08301 ## 7 0.8284 0.6287 0.03746 0.08241 ## 8 0.8293 0.6322 0.03668 0.07946 ## 9 0.8342 0.6429 0.03544 0.07791 ## 10 0.8341 0.6430 0.03363 0.07382 ## 11 0.8330 0.6403 0.03239 0.07127 ## ## The top 5 variables (out of 5): ## honorific, Sex, Fare, Pclass, Age # NOT RUN: # trellis.par.set(caretTheme()) # plot(rf.profile, type = c("g", "o"))

The best number of predictors is shown with a star in the 'Selected' column and can be found in `rf.profile$bestSubset`

. Notice that a simple model with 5 variables also exhibited fairly high accuracy. Let's also plot the variable importance of each of the 'best' predictors–in this case a measure of node impurity from `randomForest`

.

winners <- data.frame(var = rownames(rf.profile$fit$importance), imp = rf.profile$fit$importance) p <- arrange(winners, -winners[2]) barplot(p[,2], names.arg = p[,1], cex.names = .6, cex.axis = .6)

Consistent with the `cforest`

variable importance measure (and the single tree) from earlier, Pclass, Fare, Sex, Honorific, and Age seem to be the key predictors. Let's see if we can build a model with just these 5 predictors that fares as well as the more complex models. But let's leave the forest and enter the realm of a very different type of predictive model.

# Neural network

Artifical neural networks have been around since the 1950s. The basic idea is that inputs are mapped onto output(s) by one or more layers of processing units (neurons). Each unit's output is expressed as a function of the input, the weights of the connections to that unit, and sometimes a constant bias term. The function itself can take any form, but in practice a sigmoid is often used. The weights are learned from repeated exposure to labeled training data.

A single layer perceptron was among the first artificial neural networks. Consider the simplest case when a single processing unit takes a set of inputs, and learns to produce an output for each input. When the activation function is the nonlinear logistic (sigmoid) function, the output is constrained between 0 and 1 and can be interpreted as the probability that the input belongs to one or another class. This simple one unit “network” basically implements logistic regression. But the power of neural nets becomes apparent when one or more hidden layers of processing units are placed between the inputs and output. In a 3-layer perceptron, a hidden layer of processing units act as intermediaries between the input and output. One way to think of the hidden layer units' role is that they represent the inputs in a new way that the output can use to solve the learning problem… even when it is not linearly separable. The standard example is the non-linearly separable XOR problem. A multilayer perceptron can learn a nonlinear mapping via the hidden layer, whereas a single layer perceptron (and likewise logistic regression) cannot.

Let's load the R package 'nnet', which includes the function `nnet`

to fit a multilayer perceptron. First let's confirm that whereas logistic regression can't solve XOR, a 3-layer perceptron with 2 hidden units can.

xor <- data.frame(x1=c(1,1,0,0), x2=c(1,0,1,0), y=c(1,0,0,1)) xor ## x1 x2 y ## 1 1 1 1 ## 2 1 0 0 ## 3 0 1 0 ## 4 0 0 1 # logistic regression does not predict the correct output log.reg <- glm(y ~ x1 + x2, data=xor, family = 'binomial') round(predict(log.reg, type = 'response'), 1) ## 1 2 3 4 ## 0.5 0.5 0.5 0.5 # a multilayer perceptron does library(nnet) set.seed(rand.seed) mlp <- nnet(y ~ x1 + x2, data=xor, size=2) ## # weights: 9 ## initial value 1.000608 ## iter 10 value 0.996987 ## iter 20 value 0.757799 ## iter 30 value 0.601836 ## iter 40 value 0.002959 ## iter 50 value 0.001166 ## iter 60 value 0.000850 ## final value 0.000088 ## converged round(mlp$fitted.values, 1) ## [,1] ## 1 1 ## 2 0 ## 3 0 ## 4 1

Now back to the Titanic dataset. Let's train a multilayer perceptron to map those 5 most important variables we found above, to the binary response. Since the response (coded as a factor) is binary, the `nnet`

function will build a neural net with an output layer consisting of a single output unit (the default activation function is logistic). We get to decide how many hidden units to include in the hidden layer with the 'size' parameter.

set.seed(rand.seed) n.net <- nnet(Survived ~ honorific+Sex+Fare+Pclass+FamSize, data=train, size=2) ## # weights: 23 ## initial value 597.332481 ## iter 10 value 425.155736 ## iter 20 value 363.864709 ## iter 30 value 349.029447 ## iter 40 value 345.704494 ## iter 50 value 343.522282 ## iter 60 value 342.907737 ## iter 70 value 342.751372 ## iter 80 value 342.642992 ## iter 90 value 342.586077 ## iter 100 value 342.383923 ## final value 342.383923 ## stopped after 100 iterations

We can visualize the structure of the network with `NeuralNetTools`

.

library(devtools) source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r') plot.nnet(n.net, bias = F, node.labs = F, cex.val = .8, circle.cex = 2.5)

The positive weights are black, the negative weights are grey, and thicker lines corresponds to larger weights. Just as in logistic regression, the predicted values from the network's output unit are probabilities between 0 and 1. As we did with logistic regression in the first post, we can create the confusion matrix by rounding the outputs to either 0 or 1. How well does this network fit the training data in terms of observed accuracy?

# confusion matrix cm <- as.matrix(table(train$Survived, round(n.net$fitted.values, 0))) # observed accuracy round((cm[1,1] + cm[2,2]) / sum(cm),2) ## [1] 0.83

Actually it's about the same as the more complex random forest models. But how does this feedforward multilayer neural network do as a predictive model? Let's move back into the `caret`

framework, which supports a number of R neural network packages including `nnet`

. The two parameters under control for `nnet`

are 'size' (# hidden units), and decay (weight decay). The weight decay parameter controls the degree to which large weights are penalized. Shrinking the weights (coefficients) towards zero can prevent overfitting. Therefore we might expect the 'optimal' cross validated model to include at least some weight decay.

set.seed(rand.seed) RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10, allowParallel=TRUE) myGrid <- expand.grid(size=1:3, decay=seq(.01, .1, by=.01)) train.nnet <- train(Survived ~ honorific+Sex+Fare+Pclass+FamSize, data = train, method="nnet", trControl = RfoldCV, tuneGrid = myGrid, trace=F) print(train.nnet, details=TRUE) ## Neural Network ## ## 891 samples ## 12 predictor ## 2 classes: '0', '1' ## ## No pre-processing ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... ## Resampling results across tuning parameters: ## ## size decay Accuracy Kappa Accuracy SD Kappa SD ## 1 0.01 0.8194152 0.6057693 0.03550613 0.08019971 ## 1 0.02 0.8198685 0.6066374 0.03527554 0.07959338 ## 1 0.03 0.8207649 0.6086724 0.03529387 0.07948190 ## 1 0.04 0.8209883 0.6091715 0.03544306 0.07971195 ## 1 0.05 0.8210982 0.6095437 0.03625613 0.08130311 ## 1 0.06 0.8215451 0.6106394 0.03596764 0.08050056 ## 1 0.07 0.8213216 0.6102566 0.03564968 0.07972147 ## 1 0.08 0.8214315 0.6105840 0.03617944 0.08099885 ## 1 0.09 0.8218809 0.6117630 0.03576466 0.07971096 ## 1 0.10 0.8215438 0.6110636 0.03580264 0.07988065 ## 2 0.01 0.8220033 0.6126249 0.03703068 0.08345254 ## 2 0.02 0.8196475 0.6073290 0.03687800 0.08277538 ## 2 0.03 0.8206537 0.6095724 0.03618922 0.08196159 ## 2 0.04 0.8214302 0.6110706 0.03669082 0.08208453 ## 2 0.05 0.8216562 0.6115778 0.03687526 0.08289307 ## 2 0.06 0.8198609 0.6071371 0.03586885 0.08137523 ## 2 0.07 0.8226599 0.6135260 0.03499636 0.07936349 ## 2 0.08 0.8208759 0.6096534 0.03560528 0.07914850 ## 2 0.09 0.8231219 0.6142658 0.03416244 0.07754455 ## 2 0.10 0.8208772 0.6089246 0.03474508 0.07888405 ## 3 0.01 0.8169483 0.6036379 0.03602721 0.08023585 ## 3 0.02 0.8166061 0.6020818 0.03670291 0.08157090 ## 3 0.03 0.8202043 0.6106096 0.03691266 0.08147574 ## 3 0.04 0.8183964 0.6059040 0.03556919 0.07989092 ## 3 0.05 0.8195226 0.6087452 0.03654531 0.08026656 ## 3 0.06 0.8193067 0.6083571 0.03599841 0.08035233 ## 3 0.07 0.8196262 0.6086705 0.03706570 0.08278153 ## 3 0.08 0.8200794 0.6090686 0.03586894 0.08112879 ## 3 0.09 0.8198521 0.6090701 0.03741066 0.08343753 ## 3 0.10 0.8197473 0.6090644 0.03528446 0.07860332 ## ## Accuracy was used to select the optimal model using the largest value. ## The final values used for the model were size = 2 and decay = 0.09. ## ## ---------------------------------------------------------- ## ## The final model: ## ## a 9-2-1 network with 23 weights ## options were - entropy fitting decay=0.09 ## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1 i9->h1 ## -0.63 -0.39 -1.50 -0.02 0.72 -0.08 -0.94 -1.40 2.84 0.85 ## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2 i9->h2 ## -2.77 -0.23 -3.31 -0.91 0.15 2.92 0.05 1.52 -1.21 -0.11 ## b->o h1->o h2->o ## 3.41 -5.05 -6.32 ggplot(train.nnet)

Looking at the parameter plot, there's a general pattern of increasing cross validated accuracy with increasing weight decay, which we might have expected given the negative relationship between parameter penalization and overfitting. It also looks like even 3 hidden units is too many. Let's fit this neural net to the test data and submit the predicted responses to Kaggle.

Survived <- predict(train.nnet, test) submission <- data.frame(PassengerId = test$PassengerId, Survived=Survived) write.csv(submission, 'nnet_5-predictors.csv', row.names = FALSE)

This submission scored 78.9%; better than some models I've submitted so far, but worse than the best random forest from the 2nd post. Remember that this neural net only had the 5 'most important' features to choose from, but according to the feature selection technique results, there were 9 predictors in the best model. Let's tune another set of nets using these 9 best predictors (i.e., all of them except Parch and SibSp), but also expanding the range of the weight decay parameter.

set.seed(rand.seed) RfoldCV <- trainControl(method="repeatedcv", number=10, repeats=10, allowParallel=TRUE) myGrid <- expand.grid(size=1:3, decay=seq(.05, .5, by=.05)) train.2.nnet <- train(Survived ~ . -PassengerId -SibSp -Parch, data = train, method="nnet", trControl = RfoldCV, tuneGrid = myGrid, trace=F) print(train.2.nnet, details=TRUE) ## Neural Network ## ## 891 samples ## 12 predictor ## 2 classes: '0', '1' ## ## No pre-processing ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 802, 802, 802, 802, 802, 802, ... ## Resampling results across tuning parameters: ## ## size decay Accuracy Kappa Accuracy SD Kappa SD ## 1 0.05 0.8283966 0.6262284 0.03666416 0.08169154 ## 1 0.10 0.8296389 0.6289615 0.03638087 0.08101925 ## 1 0.15 0.8304267 0.6308685 0.03662570 0.08165744 ## 1 0.20 0.8306514 0.6313702 0.03670954 0.08176644 ## 1 0.25 0.8305403 0.6314395 0.03722916 0.08277337 ## 1 0.30 0.8298661 0.6301451 0.03761542 0.08372796 ## 1 0.35 0.8297588 0.6299828 0.03735633 0.08332867 ## 1 0.40 0.8295354 0.6296764 0.03798923 0.08459876 ## 1 0.45 0.8289748 0.6285396 0.03792286 0.08443648 ## 1 0.50 0.8289748 0.6287698 0.03819255 0.08504797 ## 2 0.05 0.8291869 0.6270824 0.03895579 0.08725155 ## 2 0.10 0.8287336 0.6257477 0.03533538 0.07927731 ## 2 0.15 0.8304166 0.6284736 0.03759343 0.08497172 ## 2 0.20 0.8306502 0.6299061 0.03626282 0.08130310 ## 2 0.25 0.8321058 0.6334814 0.03711895 0.08284028 ## 2 0.30 0.8312158 0.6317397 0.03844051 0.08602783 ## 2 0.35 0.8305253 0.6308421 0.03672171 0.08233385 ## 2 0.40 0.8321058 0.6346600 0.03758149 0.08403713 ## 2 0.45 0.8301919 0.6306094 0.03731799 0.08322976 ## 2 0.50 0.8300845 0.6308015 0.03805407 0.08470974 ## 3 0.05 0.8258236 0.6197355 0.03714375 0.08402461 ## 3 0.10 0.8269446 0.6220689 0.03564389 0.08152226 ## 3 0.15 0.8300657 0.6280479 0.03795216 0.08591197 ## 3 0.20 0.8313104 0.6315346 0.03727429 0.08440573 ## 3 0.25 0.8320983 0.6331346 0.03785046 0.08506085 ## 3 0.30 0.8291869 0.6274166 0.03656718 0.08187601 ## 3 0.35 0.8297538 0.6290897 0.03825896 0.08519744 ## 3 0.40 0.8290684 0.6282978 0.03595013 0.08035171 ## 3 0.45 0.8292879 0.6285565 0.03736792 0.08317332 ## 3 0.50 0.8289684 0.6283507 0.03901034 0.08642523 ## ## Accuracy was used to select the optimal model using the largest value. ## The final values used for the model were size = 2 and decay = 0.4. ## ## ---------------------------------------------------------- ## ## The final model: ## ## a 14-2-1 network with 33 weights ## options were - entropy fitting decay=0.4 ## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1 ## -0.51 1.06 0.38 0.36 0.09 0.35 -0.31 -0.95 -0.23 ## i9->h1 i10->h1 i11->h1 i12->h1 i13->h1 i14->h1 ## 0.39 1.26 0.18 0.07 -1.40 0.95 ## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 i8->h2 ## 1.08 0.96 -0.76 -1.82 0.06 -0.46 1.19 2.27 0.49 ## i9->h2 i10->h2 i11->h2 i12->h2 i13->h2 i14->h2 ## -1.68 -0.42 -0.60 -0.01 0.10 -0.20 ## b->o h1->o h2->o ## -0.25 -2.51 4.71 ggplot(train.2.nnet)

This time the sweet spot of weight decay appears to be between about .2 and .4. The best network had 2 hidden units and used weight decay of 0.25. It's interesting that even quite large weight decay (up to .5) doesn't seem to impair validated accuracy very much.

Okay, here goes my last Kaggle submission. The results are … that this network's predictions scored a 78.5%; no real change from the simpler network, and worse than my best submission from the previous post. But I don't care, and here's why.

# The best model?

Maybe it would've been nice to end with my *best* submission to Kaggle… but the thing is, maybe I did and I just don't know it. Kaggle holds back two validation sets–one that is used for the leaderboard and one that isn't revealed until the end of the contest. Although it may be tempting to take the Kaggle scores on the public leaderboard as arbiters of pure unbiased classification performance (…the website spins, you take a breath, and then you get a quick hit of feedback…), they are not. If you were to submit 10 models a day, each tuned a little differently to carve out small gains on the leaderboard, then you have almost surely created a biased classifier. That is, you've fostered dependence between the classifier and the test data, and your classifier is more likely to perform poorly on new (previously unseen) validation data. A quick google search reveals a few proposed solutions to the leaderboard issue, including this one. Beyond prediction competitions, the leaderboard bias is related to a more general, and (as far as I can tell) unresolved issue in predictive modeling.

The issue is how to use a subset of data (OLD), to make the most accurate predictions about new data (NEW), where both datasets were generated by the same population (P). In practice, OLD and NEW are often random subsets of an existing dataset. For example the Titanic data we've been working with are OLD, the leaderboard test data are NEW, and both are subsets of the entire passenger list (P). An implicit assumption that I've been making in these posts is that estimating a model's prediction accuracy on OLD by averaging cross-validated out-of-sample accuracy estimates is better than no cross-validation, and repeated k-fold cross-validation is better than a single run of cross-validation. By better, I mean a less biased estimate of a given model's accuracy in predicting P (not just NEW). But these assumptions may not be valid.

Vanwinckelen and Blockeel (2012) showed in simulation experiments on several large datasets that increasing the repetitions of cross-validation (increasing the number of prediction accuracy estimates to aggregate) on OLD does not converge on the true / population accuracy (i.e., using the whole dataset, or P). Although the confidence intervals of the estimates did shrink with increasing repetition (i.e., the variance of the CV estimator decreased), these tighter intervals were shown to be *less* likely to contain the population accuracy. The authors also provide evidence that since the training data is always smaller than the full dataset during cross-validation, the accuracy estimates exhibit pessimistic bias–they underestimate the true accuracy. This is just one paper of many in this area, and their findings may not generalize to all cases. But it highlights the difficulty of answering certain questions. How do I know if my model is the best model? How accurate are my model's predictions? The answer of course is that you can't *know*, you can only estimate and revise.