Machine learning in R with caret tutorial
In this post you'll learn how to use caret to perform stratified sampling, pre-process your data ready for modelling, conduct resampling to get robust performance estimates and build regression and classification models using a range of different algorithms.
Contents
- Transforming our data: centring, scaling and other transformations
- Different methods of resampling to get robust performance estimates
- Different modelling algorithms: glmnet, random forests, xgboost, SVMs, KNN, etc.
- Next steps: hyperparameter tuning, model stacking and feature selection
What is caret?
The caret package was created by Max Kuhn when he was at Pfizer as a way of creating a unified interface for working with all the different modelling packages in R. It contains lots of helpful tools for each section of the modelling process, helping us to build powerful and robust modelling pipelines. The documentation is also extremely thorough which you can check out here.
"The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models." - caret documentation
Today's post is the first in this miniseries on caret. The hope is that by the end of this tutorial you'll be able to build your own powerful regression and classification models using caret in a way that avoids data leakage and gives you a robust measure of your model performance. Topics we'll cover include:
- Data splitting whilst preserving important distributions (stratified sampling)
- Pre-processing our data in a way that avoids data leakage
- Getting robust estimates of model performance using different resampling techniques
- Training and comparing different modelling algorithms
- Calculating and understanding variable importance estimations
The next set of posts will then build on these learnings and focus on how we can try and increase our model performance even more by covering:
- Hyperparameter tuning
- Model stacking and ensembling
- Feature selection
We'll be using the same data set throughout, which is the diamonds data set that ships with the ggplot2 package.
The diamonds data set
Let's go ahead and load all the packages we'll need today. Caret comes with support for some modelling packages built in, but for any extra ones you want to use you'll get a 'would you like to install' prompt the first time you try to run them. We'll be using the 'diamonds' data set that comes as part of the ggplot2 package to create some mock training data. We'll try to predict the price of each diamond based on its physical characteristics. If we call ?diamonds we can see that it is 'A dataset containing the prices and other attributes of almost 54,000 diamonds' as well as descriptions for each of the variables. We can have a quick look at the data by calling head() to print the first few rows and summary() to get a quick summary of each of the columns:
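A minimal sketch of that setup might look like the following (only the packages used in the snippets below are loaded here; the exact set in the original isn't reproduced):

```r
library(caret)    # modelling toolkit used throughout
library(dplyr)    # data manipulation
library(ggplot2)  # plotting - the diamonds data ships with ggplot2

data("diamonds")
head(diamonds)     # first few rows
summary(diamonds)  # quick summary of each column
```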
It looks like we've got a mix of continuous (carat, depth, table, price, x, y and z) and categorical (cut, color, clarity) data and also from our summary it looks like we've got some unrealistic values for some of our diamond sizes. Each of x, y and z are meant to record various diamond dimensions in mm so having values of 0mm should be impossible. It's likely that these values are in fact missing/unrecorded and have been defaulted to 0. We can correct this by replacing any 0mm dimensions with NA.
We also appear to have some very large diamonds which again could be indicative of some data entry errors. A quick bit of googling suggests the largest, clear diamond in the world, the Cullinan I, measures 5.89cm × 4.54cm × 2.77cm, weighing in at 530 carats. So seeing diamonds in our data that are supposedly 5.89cm wide or 3.1cm deep suggests these are likely to be data entry problems. This becomes even more likely when we see that the largest diamond is supposedly only 5 carats and the most expensive is $18.8k.
It's worth mentioning that we're flagging and removing these outliers based on subject matter expertise to identify data entry issues i.e. incorrectly recorded data. This is different to identifying and removing outliers that are genuine data points but just larger/smaller/rarer compared to the rest of our data. It can be tempting to remove these kinds of outliers too but great care should be taken as outliers are data points too! There's a good discussion here about removing outliers. At the very least, removing outliers to try and improve our model counts as pre-processing our data and needs to be done as part of our wider model build and resampling scheme (more on this later).
For the sake of creating some extra missing data to use in this tutorial, let's pretend we know that one of our brokers occasionally uses the size measurements of the larger, uncut diamond by mistake. Therefore any time we see a diamond with a physical dimension over 1cm or a carat size over 3 we know it's a data entry issue and not to be trusted. We could remove these rows but, assuming all the other values (cut, color, etc.) can be trusted, we'll simply set them to NA and replace them with something more sensible later on.
As we'll be imputing (replacing the missing values with actual values) it can be good practice to create some new columns that record which rows had missing data. This can be particularly useful if we think the process that caused the data to be missing might be significant in our model e.g. the broker who entered the measurements from the uncut diamonds might also have recorded the price for the uncut stone which would generally be cheaper than an equivalent cut diamond.
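Putting those cleaning steps together might look something like the sketch below. The 10mm/3-carat cut-offs come from the rules above, but the object and flag column names (diamonds_clean, had_missing_*) are just placeholders:

```r
diamonds_clean <- diamonds %>%
  # 0mm dimensions are impossible, so treat them as missing
  mutate(across(c(x, y, z), ~ na_if(., 0))) %>%
  # suspected broker errors: any dimension over 1cm (10mm) or a carat over 3
  mutate(
    carat = if_else(carat > 3, NA_real_, carat),
    x = if_else(x > 10, NA_real_, x),
    y = if_else(y > 10, NA_real_, y),
    z = if_else(z > 10, NA_real_, z)
  ) %>%
  # record which rows had missing data before we impute later on
  mutate(
    had_missing_carat = as.numeric(is.na(carat)),
    had_missing_x = as.numeric(is.na(x)),
    had_missing_y = as.numeric(is.na(y)),
    had_missing_z = as.numeric(is.na(z))
  )
```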
We can see now that we have a few missing values in our physical dimension features and have four new columns that record which rows have missing data.
As we'll be calculating feature importance estimates later on, let's add some random noise variables to our data to see if our model can correctly identify them as having no importance. They'll also be a good test later on when we try some different feature selection techniques. We'll see during pre-processing how we can also remove variables that have little variation so let's add a column that is 99% populated with 0s too to demonstrate this:
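One way to mock these up is below. The names random1-random4 and low_var are implied by the plots and importance results later on, but the exact generation code here is my own sketch:

```r
set.seed(42)
diamonds_clean <- diamonds_clean %>%
  mutate(
    random1 = rnorm(n()),
    random2 = rnorm(n()),
    random3 = rnorm(n()),
    random4 = rnorm(n()),
    # a near-zero variance column: roughly 99% zeros
    low_var = sample(c(0, 1), size = n(), replace = TRUE, prob = c(0.99, 0.01))
  )
```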
Usually when building a model we'd want to do lots of feature engineering to try and create new variables for the model to use. As this tutorial is more about how to use caret I'm skipping this step but Max has also written an entire book on the subject that's available for free online. Now we've finished mocking up our data, we can start using caret to build our modelling workflow.
Splitting our data
We can split our data into Train and Test using caret's createDataPartition function. We pass it the target column (i.e. price) as a vector, specify the percentage of the data we want to use in our training data set with 'p=' and then add 'list=FALSE' so we don't get a list back. We can then use these indices to subset our initial data set to create a Train and Test set.
One thing worth noting is that caret creates the splits using stratified sampling. This means if we have a categorical target we'd keep the overall class distribution between Train and Test e.g. if our target had 50% of values as 'Yes' and 50% as 'No' we'd want each of our Train and Test to also have a 50:50 split between 'Yes' and 'No'. For continuous targets like we're using, caret converts our values into percentile groups and then samples our chosen proportion from each percentile group. This way we should end up with a matching distribution of our continuous target between Train and Test.
Normally I'd take a 20% or 10% sample to be held out as a Test set and train the model on 80%-90% of the data. For the purpose of this tutorial I'm going to flip that and only have 10% of the data in Train. This is just to speed up the running of things as the data set is >50k rows. We'll still be training on ~5k rows of data which is enough to get a robust and predictive model at the end.
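A sketch of the 10% split and the Train vs Test comparison (the seed and object names are arbitrary):

```r
set.seed(42)
train_index <- createDataPartition(diamonds_clean$price, p = 0.1, list = FALSE)

diamonds_train <- diamonds_clean[train_index, ]
diamonds_test  <- diamonds_clean[-train_index, ]

# check the target distribution is preserved across the split
summary(diamonds_train$price)
summary(diamonds_test$price)
```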
We can see that the quartiles, mean and median are pretty much identical between our Train and Test data sets which is great. If we wanted to check in a more visual way we can create some box plots:
Those look pretty good!
Unfortunately caret is only able to stratify on one column when sampling. If we think there are other important groups in our data that we want to preserve the distribution of we can use dplyr to sample across multiple columns. It's worth noting though that the more columns we sample across (particularly if some contain very small groups) the harder it might be to exactly preserve the underlying distribution of our original target column. Below is an example (which we won't use) of sampling on both the continuous price column (binned into 10 groups) and also the clarity variable.
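Something along these lines would do it, assuming dplyr >= 1.0 for slice_sample() (again, purely an illustrative sketch):

```r
# example only (not used later): stratify on binned price AND clarity
multi_strata_train <- diamonds_clean %>%
  mutate(price_bin = ntile(price, 10)) %>%
  group_by(price_bin, clarity) %>%
  slice_sample(prop = 0.1) %>%
  ungroup() %>%
  select(-price_bin)
```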
Explore the data
We've already had a bit of a sneak peek at some of the distributions in our data when we were removing our data entry issues. Now we've split our data into Train and Test, let's have a more thorough look at how each of the different input variables are distributed. It's important to do this after splitting our data as the Test set is meant to represent unseen, new data and so we can't use it to draw conclusions about our data as a whole as in theory it doesn't exist within our workflow. I like to think of the Test set as representing tomorrow's future data that we'll have when we come to deploy our model but which doesn't exist today.
We can use the skim() function from the skimr package to get a very comprehensive overview of our data. This will help us understand what, if any, transformations we might want to apply to our data further down the line:
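For example (using the diamonds_train object from the split above):

```r
library(skimr)
skim(diamonds_train)
```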
From this we can see that nearly all the columns are complete and even those where we replaced the data errors with NA are still over 99% complete. Our price and size columns (x, y, z) all have tails of larger/more valuable diamonds. We can see this from the histograms and also from the fact that the mean value of the columns is higher than the median (p50). We can also see that our randomly generated variables do seem to be completely random. We can have a more thorough look at our data by creating some more detailed plots. First we'll split the numeric and categorical/binary columns off separately:
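A sketch of that split, with the resulting vectors given placeholder names (categorical_columns, continuous_columns):

```r
categorical_columns <- diamonds_train %>%
  select(where(~ !is.numeric(.)), starts_with("had_missing"), low_var) %>%
  colnames()

continuous_columns <- diamonds_train %>%
  select(-all_of(categorical_columns), -price) %>%
  colnames()
```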
For our categorical columns, we're selecting anything that's not numeric as well as our binary flags that we created earlier. We could turn the binary flags into factors but some modelling packages only accept binary data as numbers so we'll leave them as numeric for now. For our continuous data, we're taking anything not in our categorical data and dropping the price column.
For our categorical data we can use box-plots to compare how the value of our target (price) varies for different categories within each of our columns. We can do this by creating a function that calls ggplot to create a box-plot and then pass the function to purrr to map over our vector of categorical columns:
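A sketch of that pattern; the plotting function name and the column vector are the placeholders from above:

```r
library(purrr)

plot_categorical <- function(column) {
  ggplot(diamonds_train, aes(x = factor(.data[[column]]), y = price)) +
    geom_boxplot() +
    labs(x = column, y = "price")
}

map(categorical_columns, plot_categorical)
```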
There's quite a lot to take in with these plots! Firstly, it looks like there's something odd going on with the relationship between price and factors we'd normally think of as giving a good indicator as to the quality of the diamond. For example, the average value of a 'Fair' cut diamond is higher than one with an 'Ideal' cut. The same happens for colour. It looks like as the colour gets worse, the price goes up and the same for clarity too. We'll dig into this a bit more later on.
For our other factors, we can see our randomly generated low_var variable doesn't seem to show any great difference in average price which is to be expected. We can see as well that our missing flags for carat, x and y are nearly all entirely populated by 0s making them redundant so we'll want to remove those before we train our model. Where they are populated the average price is a lot higher which makes sense as it was the larger diamond sizes that we replaced with NA.
Let's repeat the process for our continuous data by plotting a simple scatter plot. We can add a trend line to our plot to see what possible relationships there might be between our column and target and potentially flag any transformations we might need to do to our data:
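A similar sketch for the continuous columns, using geom_smooth() for the trend line:

```r
plot_continuous <- function(column) {
  ggplot(diamonds_train, aes(x = .data[[column]], y = price)) +
    geom_point(alpha = 0.3) +
    geom_smooth() +
    labs(x = column, y = "price")
}

map(continuous_columns, plot_continuous)
```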
Again we've got a lot of information in these charts. It looks like there's a strong, positive trend between price and carat which makes sense as we'd expect larger diamonds to cost more. The trend is pretty linear to start with but then flattens off when we get to the really big diamonds. For depth there's not a super obvious trend. Maybe for values from 55-60 it's weakly negative and then from 60-65 it becomes slightly positive. Table is similar, looking like a positive trend between 55-60 but then flat after that. Where a variable doesn't have a clear linear trend, or has a combination of different trends across different ranges of values, linear models might struggle, so we might want to try a different type of model to better capture the relationships contained within it.
The size in mm variables 'x', 'y' and 'z' all show very similar trends and all look similar to what we saw with carat. This makes sense as they measure the physical dimensions of the diamond which will then be captured by the carat/overall diamond size. Out of the three dimensions 'z' (actual depth in mm) looks to have the steepest curve suggesting diamonds become more costly per mm gained in depth than they do by getting wider. Finally, our last 4 noise variables show that they truly are random with a flat trend line amongst a random scattering of values.
Having seen the strength of the relationship between carat and price this might explain the strange trends we saw earlier with our categorical data. We'd expect bigger and better quality diamonds to cost more. We've seen that this holds in our data for larger diamonds but the better quality ones seemed to be cheaper which was odd. Could it be that size trumps quality when it comes to valuing diamonds? We can test this by replotting our categorical features and controlling in some way for diamond size. We'll create 4 buckets based on the carat variable and then replot our data for cut, colour and clarity within each of these groups. This will show us how price varies for different levels of our categorical features for diamonds that are a similar size:
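One way to do this, using ntile() for the four carat buckets and facetting by group (a sketch, shown here just for cut):

```r
diamonds_train %>%
  mutate(carat_group = ntile(carat, 4)) %>%
  ggplot(aes(x = cut, y = price)) +
  geom_boxplot() +
  facet_wrap(~ carat_group, scales = "free_y")
# repeat with x = color and x = clarity for the other two plots
```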
These looked more like the trends we expected to see. The carat groups range from 1 (smallest diamonds) to 4 (largest) and we can see that for the smallest and largest carat groups, a better cut does seem associated with a higher price although it's less clear for the groups in the middle sizes. Likewise a better colour (D) seems to make for a more expensive diamond within each of the different carat sized groups. Possibly the most obvious trend is with clarity where there's a positive trend between price and clarity quality within each of the groups.
Now we've solved the mystery of our quality vs price plots, is there anything we can do to help our model with the continuous data? Having seen the distribution of price from our skimr plot we know that most of our diamonds are at the cheaper end of the distribution but we do have a long tail of some more expensive ones. Often with distributions like this it can be helpful to take the log() of the variable to try and give a more normal looking distribution.
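For example, creating a price_log column on both Train and Test (since that's the target used from here on):

```r
diamonds_train <- diamonds_train %>% mutate(price_log = log(price))
diamonds_test  <- diamonds_test  %>% mutate(price_log = log(price))

ggplot(diamonds_train, aes(x = price_log)) +
  geom_histogram(bins = 50)
```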
Although not quite a bell curve, our data now looks a lot more symmetrical which can be helpful to some of the algorithms we'll want to use. Later on we'll see how we can also use caret to transform our input features to make them more appropriate for our models without introducing data leakage into our pipeline. We're ok to transform our target with a log before any resampling as the transformation parameters don't depend on the data they're applied to unlike, for example, centering or scaling our data.
Checking for correlations between inputs
We've seen already that we've got a few different ways of measuring the size of the diamond in our data e.g. carat, x, y and z so it's reasonable to expect there to be a strong correlation between them. Some models can struggle with correlated inputs e.g. it can cause linear regression to have unstable coefficient estimates or reduce the estimated variable importance across the correlated features in a random forest. Caret has the findCorrelation() function: we pass it a correlation matrix, each possible pair of variables is considered and any pair with a high correlation is flagged. For each flagged pair it then compares the mean absolute correlation of the two features and suggests removing the one with the higher overall correlation in the data set. Let's try it on our data to see what it recommends. First we'll create our correlation matrix after removing any missing values and also excluding the target columns:
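A sketch of building that matrix from the numeric columns only:

```r
correlation_matrix <- diamonds_train %>%
  select(where(is.numeric)) %>%
  select(-price, -price_log) %>%
  na.omit() %>%
  cor()

round(correlation_matrix, 2)
```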
We can see that, as expected, x, y and z all have very high correlations to carat. Now we can run caret's findCorrelation() to see which variables it suggests removing:
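For example, with verbose = TRUE so we can see the pairwise decisions described below:

```r
findCorrelation(correlation_matrix, cutoff = 0.9, verbose = TRUE, names = TRUE)
```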
It looks like it flags 'z' (row 6) with 'x' (column 4) to start with and calculates that 'z' has the highest aggregate correlation so flags it for removal. Next it looks at 'x' vs 'y' and recommends removing 'x' before looking at 'y' vs 'carat' and recommends removing 'y'. This is then summarised at the end for us with 'z', 'x' and 'y' all returned as recommended for removal. This is a handy step to run to forewarn us about excessive correlation between inputs. We'll see later how we have the option to automatically remove these as part of any pre-processing steps we run when we train the model.
Create dummy or one-hot encoded variables
As well as looking at the correlation between continuous features, caret also has some helpful tools for working with our categorical data. Whilst some of the modelling algorithms in R are happy to accept factors as inputs (e.g. ranger and cubist) a lot of them require categorical data to be presented as dummy or one-hot encoded features. We can do this easily in caret using the dummyVars() function.
We get a useful summary here that tells us caret has identified 3 factor columns and it'll be these that we create dummy variables for. The 'less than full rank encoding is used' is also relevant as this means we'll get a new column for every value of our factor which might be undesirable. Depending on which model we're using we might want 1 column fewer than we have levels in our data (i.e. a full rank encoding). This is typically the case for regression models with no regularisation. Let's run our dummyVars as is to see what we get returned. We create the dummy variable by calling predict() on our data with our newly created 'create_dummy' object:
Two things are immediately obvious here: 1) our dummy variable names are a bit unintuitive and 2) they're not dummy variables! This is because our factors were ordered and so caret tries to capture this ordinality when it creates the new encodings. You can read more about it in this blog post from Max here.
A less obvious outcome is also that our target column price_log has disappeared too. This is standard behaviour and so we'll need to merge it back on. To keep things simple if we just want a regular one-hot encoded transformation we can remove the ordering from our factor columns. Caret renames the columns with . instead of _ and also keeps any whitespace so we can add a step to remove this too. This time we'll also specify that we want a full rank encoding:
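A sketch of that workflow: the un-ordering of the factors, the fullRank option and the column renaming are as described above, but the object names are placeholders of mine rather than the original code:

```r
diamonds_train_no_price <- diamonds_train %>%
  select(-price) %>%
  # drop the ordering so we get plain one-hot style columns
  mutate(across(c(cut, color, clarity), ~ factor(., ordered = FALSE)))

create_dummy_full <- dummyVars(price_log ~ .,
                               data = diamonds_train_no_price,
                               fullRank = TRUE)

diamonds_train_dummy <- predict(create_dummy_full, newdata = diamonds_train_no_price) %>%
  as_tibble() %>%
  # tidy up the names: dots and spaces become underscores
  rename_with(~ gsub("\\.| ", "_", .x)) %>%
  # the target is dropped by predict() so merge it back on
  mutate(price_log = diamonds_train_no_price$price_log)
```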
This looks more like what we were after! We've got our price_log target, the columns are 0/1 and we can see that we're missing one level per factor e.g. there's no color_D. One consequence of one-hot or dummy encoding is that we can create a lot of new columns so you want to avoid using it on variables with lots of levels e.g. post codes. If you have lots of classes you could consider collapsing them into fewer classes before encoding.
Check for variables with zero or near-zero variance
As well as creating lots of new columns, we can see from the print out that some columns have hardly any 1s in at all and are mostly 0s. Columns that are very imbalanced (where nearly all their values are just one value) are said to have near-zero variance. Columns that have just one value for all rows are said to have zero variance. Zero variance columns can cause some algorithms to crash and a concern for near-zero variance columns is that when we do resampling we could easily end up with a fold or resample where our near-zero variance becomes zero just through bad luck. We can check for zero and near-zero variance features using caret's handy nearZeroVar() function:
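For example, with saveMetrics = TRUE so we get the full table of metrics rather than just the column positions:

```r
nearZeroVar(diamonds_train_dummy, saveMetrics = TRUE)
```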
The 'freqRatio' tells us how imbalanced our data might be by dividing the number of occurrences of the most common value over the number of occurrences of the second most common value. We'd expect this to be close to 1 when we have a good distribution of values/classes in our column. So we can see for 'cut_Good' that one class of our dummy variable occurs roughly 10x more commonly than the other. The 'percentUnique' captures how many unique values there are / number of rows. If the data is highly imbalanced and there aren't many unique values then caret will flag those columns for removal. By default caret flags columns with a freqRatio>19 and percentUnique<10. Let's remove these from our data:
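Dropping the flagged columns might look like this:

```r
nzv_columns <- nearZeroVar(diamonds_train_dummy)   # returns column positions by default
diamonds_train_dummy <- diamonds_train_dummy[, -nzv_columns]
```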
For our data we're removing: all of the 'had_missing' flags which makes sense as these are nearly all 0s, our artificial 'low_var' feature which is what we'd hoped would happen and also the 'clarity_IF' column. The 'color_J' column very nearly meets the threshold so we might expect it to be removed when we do our resampling. As we don't have any zero variance columns we shouldn't need to worry about any of our models crashing and we'll see how we can automatically remove any near-zero variance features that occur from resampling as part of our pre-processing steps when we come to train our models.
Pre-processing and avoiding data leakage
We've already done some pre-processing of our data by logging our target and creating dummy variables. These are all fine to do outside of our model training/resampling process as they don't leak any information. For example the logged value of price doesn't depend on what other data is in the data set. Ideally we'd remove our near zero variance features inside our resampling process so I've jumped the gun a little bit by removing them already. This is because they can interfere with other transformations that I'd like to demonstrate in this section.
Let's now consider what to do about the missing values in our data. Ideally we want to replace them with something sensible that our model can use. A simple way to do this would be to replace the missing data with the mean value of the column. However the precise value of the mean we use will depend on what data we have in our data set at the time. Remember that we take a separate Test set to replicate testing our model on new data i.e. we want to see how it performs on data not included in the model training process. However we also want a way of measuring performance across different models or data transformations before picking our final model.
The usual way to do this is to use resampling, either say using k-folds or bootstrapping, where each held out resample takes on the role of a mini test set. To avoid the confusion of talking about test sets taken from Train, Max introduces the convention of Analysis and Assessment sets when talking about resampling. The resamples of Train that the model is built on are the Analysis sets and the held out resamples for measurement the Assessment sets. Below is an example of how a 5-fold cross validation resampling scheme looks:
To get a robust measure of model performance when resampling we need to replicate testing our model on data that hasn't been in the training process otherwise we're leaking data from our Assessment set into our Analysis set. This means any pre-processing we do has to be done on the Analysis set only which means we can't just blanket apply any transformations to our whole Train data set.
One way to think of it is we can't do any pre-processing upstream on the Train as it's then impossible to separate this out further downstream in the resamples. To make it more obvious what's happening we can recolour our Train to reflect the fact it contains both Analysis and Assessment sets upstream and see how this affects calculating the average value we want to impute with:
The mean value for the whole of Train is 0.6 but we can see that actually the mean for our Analysis set is 0.55 and for our Assessment it's 0.8. If we'd used the 0.6 average value from Train for our imputations it'd be closer to the 0.8 of our Assessment and so our model would likely seem to perform better. However it's only appearing to perform better because we've inadvertently fed it information from the Assessment set which isn't going to be possible when we come to predict on the Test or future data. What we actually want to use for imputing missing values is the 0.55 from the Analysis set only. This might make the model performance appear worse but it gives us a more accurate measure of how we'd expect our model to perform on new data.
"In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all of the steps in the modelling process that could significantly affect the model’s effectiveness...preprocessing or filtering steps (such as PCA signal extraction, predictor correlation filters, feature selection methods) must be part of the resampling process"
- Feature Engineering and Selection: 3.4.7 What Should Be Included Inside of Resampling?
In his book Max also gives a few examples where you can bend these rules (e.g. imputing a few rows amongst many thousands or centring and scaling based just on the Train) and it's a judgement call if the chance/size of the bias introduced is acceptably small relative to the overall size of the data. For now though, we'll process our data in the formally correct way. Thankfully caret makes this easy to do with the preProcess() function. It might be worth calling ?preProcess as there are a lot of different options. Below is a summary of the methods I find myself using most often. You can find a full list with examples in the caret documentation.
I'll run through each of these in turn so we can see how they change our data. We won't actually be overwriting our data with any of these at the moment as we'll wait to incorporate them into our resampling scheme. For now as we've already seen how caret identifies correlated and zero/near-zero variance features let's jump straight to imputing missing values.
Imputing missing values
There are 3 methods for imputing missing data in caret. The quickest and simplest is 'medianImpute' which simply replaces missing numerical data with the median from the column. Next up is 'knnImpute' which takes an arbitrary sample, finds the nearest neighbours to our missing data point and then uses the average value from those neighbours. One thing to note with this method is that to calculate accurate nearest neighbours it automatically calls two other pre-processing steps: centre and scale to normalise the data at the same time. Finally there's 'bagImpute' which creates bagged trees to try and estimate the missing values in the column using all the other columns in the data set.
One thing to note is that caret calculates the imputations for all columns in the data set and not just the ones with missing values. This is helpful insofar as if we had data missing in the Test set we already have our imputation ready to use. The downside is it can take a long time to run especially if you have lots of columns in your data. Let's have a look at each of them in turn:
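A sketch comparing the three, timed with the tictoc package; each preProcess object is then applied with predict():

```r
library(tictoc)

tic("medianImpute")
median_impute <- preProcess(diamonds_train_dummy, method = "medianImpute")
toc()

tic("knnImpute")   # also centres and scales the data automatically
knn_impute <- preProcess(diamonds_train_dummy, method = "knnImpute")
toc()

tic("bagImpute")   # builds bagged trees per column, so much slower
bag_impute <- preProcess(diamonds_train_dummy, method = "bagImpute")
toc()

# apply one of them to fill in the missing values
head(predict(median_impute, newdata = diamonds_train_dummy))
```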
In theory, bagImpute should give us a more accurate estimate of our missing values but as you can see it is significantly slower to run than the other two methods. Remember as well that any transformation has to be calculated per resample which can increase training time significantly. Another option if you have lots of data is to 'spend' some of it by taking a separate sample from Train that is purely used to estimate data transformation parameters and then not used in the subsequent model build process. As the transformation sample doesn't get used in the final build process we avoid leaking any data. Both the resamples and Test also get transformed using parameters from the sample data so we get a fair estimation of performance as well. We'll see an example of how to do this later on.
Transformations
As we've already seen when calculating knnImpute some models prefer to have their data transformed e.g. normalised. The tidymodels team have a useful cheat sheet for which types of models benefit from different pre-processing steps. Caret offers a few different data transformations namely:
- center - subtracts the mean of the column from each row
- scale - divides each row by the standard deviation of the column
- range - scales the data to be within rangeBounds (typically 0-1)
- BoxCox - uses maximum likelihood estimation to estimate a transformation parameter to give the variable a normal shape (only works on values >0)
- YeoJohnson - like Box-Cox but can handle negative and 0 values
The easiest way to see how these transform the data is to plot the before and after results of the transformations.
We can see our data ranges from just below 4 to just over 10 and looks to have two peaks. It also looks to have a long tail of higher values. Let's apply some transformations and see how they change things:
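A sketch of fitting and applying those transformations; I'm using the 'x' column purely as an illustrative feature to compare before and after, and dropping the few incomplete rows just for this demonstration:

```r
demo_data <- na.omit(diamonds_train_dummy)

centre_scale <- preProcess(demo_data, method = c("center", "scale"))
range_01     <- preProcess(demo_data, method = "range")
box_cox      <- preProcess(demo_data, method = "BoxCox")
yeo_johnson  <- preProcess(demo_data, method = "YeoJohnson")

# compare a single feature before and after one of the transformations
summary(demo_data$x)
summary(predict(centre_scale, newdata = demo_data)$x)
```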
Centre and Scale haven't changed the shape of the data but the scales are different. Centre subtracts the column mean from each row and we can see that some values are now negative. The mean value of the column is now centred at 0 so we can see that our data is skewed to the right. The scale function also leaves the shape unchanged but again our scale changes. Scale divides each row by the standard deviation of the column. Often we'll want to apply both these transformations at the same time so that we standardise our data. The sklearn documentation has some great examples of why rescaling data can be so important.
By subtracting the mean and dividing by the standard deviation we ensure every column is on the same scale for our model to use. The chart on the left shows the output from centring and scaling where the mean=0 and each point is expressed in terms of the number of standard deviations it is away from the mean. Again the shape is unchanged. Another way to achieve a similar result is to simply rescale the range of values so they all sit between 0 and 1 which is what the 'range' transformation does on the right. Last up are our Box-Cox and Yeo-Johnson transformations which do change the shape of our data:
The new shapes the transformations create are very similar to each other as both transformations are trying to do similar things i.e. transform our data to be more 'normal' in shape. We can see our data does look more symmetrical now rather than having the long tail of higher values like before. Box-Cox tends to be quicker to run but can't handle negative or 0 values whereas Yeo-Johnson can. We can also see that the scales are slightly different between them.
Principal Component Analysis (PCA)
Principal component analysis (PCA) takes our initial features and replaces them with linear combinations of themselves that capture as much information as possible from the original data. The new principal components are also uncorrelated with each other which makes them handy for machine learning. As they can capture and combine highly correlated and also lower variance features they can be a useful alternative to removing such features.
The same number of principal components are created as we have features in our data but the first component captures most of the variance in our data, the second component the second most and so on. This diminishing return of information means we can usually capture most of the information in our data with fewer principal components than we had features. As such, principal component analysis is often referred to as a 'dimensionality reduction' (fewer features) technique.
As PCA uses linear combinations of features our data needs to be scaled first. Caret does this automatically for us by running 'center' and 'scale' pre-processing whenever PCA is used. The downside to PCA is we lose some interpretability from our model as our principal components are a mix of lots of different features and have names like PCA1, PCA2, etc. We can get a rough idea of what each of our principal components is representing by looking at the factor loadings which shows the correlations between our initial features and each of the new principal components.
When performing PCA in caret, we have two options for telling caret how many principal components to bring back. Either a set number using 'pcaComp' or however many are needed to hit a certain variance explained threshold using 'thresh':
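For example (the 95% threshold and the 5 components are arbitrary choices, and the incomplete rows are dropped for this illustration):

```r
# keep however many components are needed to explain 95% of the variance
pca_thresh <- preProcess(na.omit(diamonds_train_dummy), method = "pca", thresh = 0.95)

# or ask for a fixed number of components
pca_fixed <- preProcess(na.omit(diamonds_train_dummy), method = "pca", pcaComp = 5)

pca_thresh
```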
We can access the principal component loadings from our created object with '$rotation'. These tell us how correlated each of our original features is with each of the newly created principal components:
'Carat' looks to have a strong correlation with PC1 so maybe it's captured something to do with the size of the diamond? We know that all of 'x', 'y', 'z' and 'carat' had strong correlations so we might expect them all to have strong correlations with PC1. We can check this more easily by plotting the loadings:
I'll just post the first 2 plots here to save space. It does indeed look like PC1 captures diamond size for us with all of 'x', 'y', 'z' and 'carat' having positive correlations with it. For PC2 it's a bit less obvious but it looks like it might be to do with 'higher quality' characteristics insofar as things like 'Premium' and 'Very Good' for 'cut' and 'color_E' have stronger, positive correlations but then 'Ideal cut' (the best possible) has a strong negative correlation! Maybe this is capturing 'higher quality but not the best' type relationships. This is one of the challenges of interpreting the outputs when performing PCA.
Training models using train()
Now we're ready to train our first model! As mentioned at the start, the main rationale for the caret package is to provide a consistent and simple interface for working with all the different machine learning algorithms available in R. Caret does this via the train() function. We simply pass it our data and any of the other options we want to use e.g. pre-processing steps, resampling methods, which loss metric to use, etc. and caret does the rest. Let’s go ahead and build a really simple linear regression model to start with. We won’t do any pre-processing at this point and we'll just use all the default options from caret for model assessment.
There are 3 arguments we need to pass to train() to get it to run. The first is our model equation: this can either be in the formula style of y ~ x1 + x2 etc. or we can pass 'y=' as a vector of our target column and 'x=' as a data frame of our input columns. The formula specification apparently creates some overheads behind the scenes but I prefer it so will be using it in this post. Next we tell caret which data set to use by specifying 'data=' and finally we tell caret what modelling algorithm to call using 'method='. As we’re building a simple linear model we’ll tell caret to use ‘lm’. The set.seed() option before the train just ensures reproducibility. Ok, let’s try running our first model!
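Carrying on with the hypothetical diamonds_train_dummy data from the earlier sketches, a first attempt might look like this:

```r
set.seed(42)
first_lm <- train(price_log ~ .,
                  data = diamonds_train_dummy,
                  method = "lm")
# this errors because of the missing values - see below
```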
Oh, well that was anticlimactic. This is because we've got missing data and by default caret errors when it finds missing data (it does the same when predicting on new data too). We'll see how we can impute missing data later as part of our pre-processing steps but for now we'll tell caret to omit/remove any rows with missing data as we've only got a few.
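Adding na.action to drop the handful of incomplete rows (sketch):

```r
set.seed(42)
lm_model <- train(price_log ~ .,
                  data = diamonds_train_dummy,
                  method = "lm",
                  na.action = na.omit)
lm_model
```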
Success! Congratulations, you’ve just run your first model in caret. Looking at the output, it looks pretty good too. The R-squared of 0.97 suggests our model’s predictions correlate really well with the target and our MAE suggests our predictions are within about 12% of the actual diamond prices. MAE is handy to use when we’ve logged our target as for lowish values of MAE it can roughly be interpreted as the mean absolute percentage error of the original, unlogged data. There's a nice stack overflow answer with a bit more maths if you're interested too.
What else can we see from the results? We can see we ran a linear regression which is what we wanted. The model had 5,395 rows of data and 27 features to work with. We didn't do any pre-processing but caret automatically performed some resampling for us to generate the accuracy measures. One of the cool things about caret is that not only does it train the model for us, it also takes steps to ensure that we have robust model estimates as well. We can see that it used 'Bootstrapping with 25 reps'. What does this mean?
The bootstrapping refers to the default resampling method caret uses to measure model performance. It takes a random sample with replacement from our Training data to create a new Analysis set of the same size. Any data not included in the Analysis sample becomes our Assessment set. A model is trained on the Analysis set and its error computed on the Assessment set. This process is repeated 25 times across different bootstrap samples. The final model performance estimate reported is the average performance across all the resamples (in this case the 25 bootstrap samples).
So although we only told caret to build 1 model it has in fact trained 26 for us: the 25 resamples + 1 final model. The advantage of this is that the final model is trained on 100% of the training data but we also get a robust measure of expected performance via the resampling. Likewise all of our model coefficients and variable importance scores are calculated from the whole training set. We’ll see later how we can use different resampling techniques e.g. k-fold cross-validation.
If we have a look at the structure of our train object we can see that it's a pretty complicated list. I've truncated the print out as it's pretty long:
Helpfully caret has a number of helper functions that allow us to easily extract and work with each of the different elements. I’ll run through the main ones I know (there are probably plenty I'm not aware of) but sometimes it seems there's no alternative but to work with the object directly so it's good to be aware of what's inside. We'll start with summary():
This shows us information about the final model such as its coefficients. We can see that most of our features are significant apart from 'random' 1, 2, & 4 and also 'depth' which if you remember from the plots didn't look like it had an obviously linear relationship with price so that makes sense.
Interestingly, 'random3' is significant with a p-value <0.05 even though we know it was randomly generated. How do we interpret this? It might be that, although randomly generated, the noise just happens, through bad luck, to have a positive association with the target (at least in our Training data). It's unlikely this relationship would carry over to the Test set as we've not uncovered a meaningful relationship between a randomly generated feature and the price of diamonds. Rather, we've had some bad luck with a spurious correlation.
This highlights one of the challenges of feature engineering and selection. In real life it's possible we could have multiple features like this that look significant through unfortunate correlations but without the benefit of knowing that they were in fact randomly generated. This is why domain expertise can be so important to check whether the associations make sense. We'll see in another post how we can get caret to perform feature selection for us to try and weed out features like this whilst still getting accurate and robust models.
We can extract any useful bits from summary to work with directly, such as the coefficients or the average resample performance estimates:
As well as the final, averaged estimate from across the resamples we can have a look at how our model performed on each resample. This can be a good way to check how variable the different performances are. For example, if our model generalises well, we'd hope the performance on each resample is fairly similar. If however we see a big variation this could be a sign our model isn't as stable as we'd like even if it averages out well:
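The per-resample results live in the $resample element of the train object, and averaging them should match the headline figures (sketch):

```r
lm_model$resample %>% head()

# the average across resamples matches what train() reports in $results
lm_model$resample %>%
  summarise(across(c(RMSE, Rsquared, MAE), mean))
```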
We can see that our final model performance is indeed the same as if we calculate it ourselves by taking the average performance across resamples. We can also see that our model performance is pretty stable on each resample at around 0.12 MAE which is good. We can plot the different resample estimates to get an idea of how much our error rate varied for each of the resamples:
Finally, another really useful feature of train() is that it calculates variable importance for us. By default, these are displayed as scores relative to each other (the most important will always be 100) although we have the option to display them in their raw form too. We can also easily pass them to ggplot to visualise. You can read more about exactly how the scores are calculated here.
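For example (varImp() works directly on the train object and has a ggplot method):

```r
lm_importance <- varImp(lm_model)  # scaled 0-100 by default; use scale = FALSE for raw scores
lm_importance

ggplot(lm_importance)
```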
Some models create a variable importance score automatically as part of the model build whereas others don't and for these caret uses a generic filter method. It's worth noting that caret will always give variable importance measures but this doesn't mean they're always useful or at least shouldn't be used without reflection. For example, a bad, inaccurate model will still have important variables (and the top one will still score 100) even though it's unlikely they say anything significant about our data given how poor the model is.
For linear models, caret uses 'the absolute value of the t-statistic' from the model to rank features in terms of importance. From our summary() above this means we'd expect to see the feature 'carat' closely followed by 'color_J' which is indeed what we can see in our plot. It's always worth thinking about how the variable importance scores are generated as decisions we make in our workflow can greatly impact how they are calculated or displayed.
For example, some algorithms (glmnet) use the model coefficients to calculate importance rankings but these will be influenced by the scale of the underlying data so ideally we'd want to normalise our features first. Some implementations of random forest (ranger) also require us to specify that we'd like the variable importance scores to be calculated as part of the training process.
Resampling and k-fold cross-validation
We’ve already seen that caret gives us robust performance measures for our model automatically by creating 25 bootstrap samples as part of the training process. We also have a range of different resampling options available to us which you can read about by calling ?trainControl and looking at the different method options. You can read Max's thoughts on bootstrapping vs k-fold cross-validation here. I’m going to cover k-fold cross-validation, leave one group out validation (for when you have really large data) and also leave one out cross validation (for when you have really small data).
One thing to mention is why do we do resampling? The resampling itself doesn’t directly help improve the training of the final model as all the interim models trained on the resamples are discarded at the end of the process. The interim models might even have different coefficients or features selected to the final model. This is fine as during resampling what we’re measuring is less ‘how will this exact model with these exact parameters perform’ but rather ‘if I were to use this process to build a model, how is that model likely to perform when it sees new data’. This then allows us to pick between different models we’ve built e.g. different algorithms, different combinations of hyperparameters or feature sets.
The value of resampling is that it allows us to test our model workflow before we commit to running it on the Test set. We only try one, final model on the Test data as if we test more than one and pick the best we've started to optimise our model to the Test and so lose the ability to get an unbiased estimate of its general performance.
Instead of the default 25 bootstrap samples, let’s go ahead and run our linear model with repeated k-fold cross-validation instead. To do this we add a new option to our train() call, 'trControl=', which tells caret how we want to do our resampling. We specify all the resampling options using the trainControl() function. We can either specify these options every time we run our train() or it can be easier to save them into their own object that we can then just refer to. In the below example we create a new resampling schema and save it as an object called 'repeatedcv':
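A sketch of that, reusing the same placeholder data as before:

```r
repeatedcv <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(42)
lm_model_cv <- train(price_log ~ .,
                     data = diamonds_train_dummy,
                     method = "lm",
                     trControl = repeatedcv,
                     na.action = na.omit)
lm_model_cv
```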
With 10 folds repeated 5 times, caret will build 50 models(!) each scored on a different Assessment set. It'll measure the average performance across all 50 and then build the final model (so 51 models in total). Reassuringly, our performance estimate of 0.119 MAE is very close to our original bootstrap estimate. This should be expected as both approaches are providing robust measurements of how we think our model will perform. Once again, we can plot the results for each fold to see not only the average performance but also how large the spread is.
Next up is ‘leave one out’ cross validation (LOOCV) which can be useful when you have really small data. This approach builds a model on 'all observations - 1' with the 1 left out as the Assessment set. This repeats until all observations have had a go at being the left out Assessment set and then the average performance is calculated. This means that we build as many models as we have observations in our data + 1 final model. Even though we sampled our data at the start, I’m not keen to run 5k+ linear models so I’ll sample the data again to a nice round 1,000 rows.
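A sketch of LOOCV on a 1,000 row sample (complete cases only, so the fold counts line up):

```r
set.seed(42)
small_sample <- diamonds_train_dummy %>% na.omit() %>% slice_sample(n = 1000)

loocv <- trainControl(method = "LOOCV")

lm_model_loocv <- train(price_log ~ .,
                        data = small_sample,
                        method = "lm",
                        trControl = loocv)
lm_model_loocv
```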
We can see that we have 999 rows in each of our samples (with 1 being left out as the Assessment set). The value of MAE is a little bit higher this time at 0.122 but still in the same ballpark as our earlier methods.
Finally, we’ll look at ‘leave group out cross validation’ (LGOCV). This can be a useful approach when you have really large data. It essentially takes one sample from Train as the Assessment set and then uses the remaining data as the Analysis set. As it only takes 1 Assessment set it only actually builds 2 models (1 LGOCV resample + the final model). This makes it a lot faster to run but can be at the expense of confidence in how robust/stable our performance measure is.
For example, when we looked at the 25 bootstrap resamples before we saw the model scored better on some and worse on others but that they were all within a good tolerance of the average. If we only take 1 resample we lose the ability to see if we just got lucky/unlucky with our resample. This is why LGOCV works best when you have large data where it can be impractical to take repeated resamples and taking 1 random sample is still likely to be representative of our underlying data.
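A sketch with a single 90/10 leave-group-out split; 'p' and 'number' are the knobs to change:

```r
lgocv <- trainControl(method = "LGOCV", p = 0.9, number = 1)

set.seed(42)
lm_model_lgocv <- train(price_log ~ .,
                        data = diamonds_train_dummy,
                        method = "lm",
                        trControl = lgocv,
                        na.action = na.omit)
lm_model_lgocv
```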
We do have the option to take repeated Assessment sets but these are likely to end up correlated with each other as there's nothing to stop the different resamples from overlapping like there is when we run k-fold cross-validation. Again our MAE of 0.121 is similar to the performance estimate from the other resampling techniques so it looks like it's working well. Ultimately which resampling approach you choose will often be guided by practical considerations such as size of data and training time.
Another handy feature of caret is that it allows us to save each of the predictions that our interim models made during the resampling. This way we not only get a performance summary for how the process performed per resample but we can also see where our model performed well or not. For example, we might be able to check if a certain data point is an outlier/data entry issue or whether our model struggles with certain classes e.g. it might do worse on cheaper or more expensive diamonds.
To access these predictions we need to tell caret to save the resampled predictions by specifying 'savePredictions = "final"' in our trainControl(). One thing to note is that if you have lots of data or lots of resamples this can cause the final train() object created to become quite large as caret needs to save a copy of every row with the actual and prediction for every iteration of resampling. We can get a good idea of how our model is performing on the resamples by plotting the results:
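A sketch of saving and plotting the held-out predictions (the $pred element holds one row per held-out observation):

```r
repeatedcv_preds <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                                 savePredictions = "final")

set.seed(42)
lm_model_preds <- train(price_log ~ .,
                        data = diamonds_train_dummy,
                        method = "lm",
                        trControl = repeatedcv_preds,
                        na.action = na.omit)

# held-out predictions vs actuals from the resamples
ggplot(lm_model_preds$pred, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(colour = "blue")
```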
That looks pretty good! The blue line is if our predictions vs the actual prices matched 100% so we can see that most of the observations are close to the line. It looks like our model might be struggling a bit at the tails - it seems to underestimate the price of the cheaper diamonds but conversely it seems to overestimate the more expensive ones. This could quickly get expensive if we were to use this to try and trade diamonds as any bargains would be at the cheaper end but we'd risk overpaying for more expensive ones. We'll try some different algorithms later on to see if we can improve upon this.
Up until now we’ve been having caret create our resamples for us as part of the model training process. We can also create our own resampling indexes in advance and then pass them to caret. This can be useful if we want to ensure that every model we train uses exactly the same resamples. We can use createMultiFolds() to create indexes for performing repeated k-fold cross-validation and save these into their own object. We can then pass these to caret in trainControl() using the 'index=' option:
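For example (the seed matters here so that every model sees identical folds):

```r
set.seed(42)
cv_folds <- createMultiFolds(diamonds_train_dummy$price_log, k = 10, times = 5)

repeatedcv_index <- trainControl(method = "repeatedcv", index = cv_folds)
```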
One final and very useful resampling technique available in caret is the groupKFold() function which allows us to split off entire groups of data into their own folds. This is very useful when we have repeated measurements of the same subjects at different times in our data. For example, say our data set recorded the price sold of the same 100 diamonds measured over a 5-year period. We might still want to train a model to predict the selling price of a diamond as a function of its physical characteristics but we need to be a bit more careful about how we assess our model.
For example, if we just take random splits in the data, diamond 1, year 1, might go into the Analysis set but diamond 1, year 2, might go into the Assessment set. This will likely lead the model to be overly optimistic in its ability because the relationship it learns between price and physical characteristics for diamond 1, year 1, will likely translate very well for diamond 1, year 2, with a bit of noise from whatever the diamond market was like in year 2. This might be fine if we were only ever going to use our model to predict those 100 diamond prices but what if we want to try it on new diamonds?
Let’s try running this with some dummy made up data and see what happens. We’ll take a random sample of 100 diamonds and then to simulate multiple years of prices, we’ll create a new ‘year’ column and simply bind the 5 samples together and increase the price by 5% per year to factor in some dummy inflation in prices over time. We'll run a model with normal k-fold cross-validation and one with groupKfold and compare the results:
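A sketch of that comparison; the diamond_id and year columns, the 5% inflation adjustment (added on the log scale) and the use of a plain linear model are all just illustrative assumptions:

```r
set.seed(42)
diamonds_100 <- diamonds_train_dummy %>%
  na.omit() %>%
  slice_sample(n = 100) %>%
  mutate(diamond_id = row_number())

# five years of repeated measurements of the same 100 diamonds
diamonds_repeated <- map_dfr(1:5, function(yr) {
  diamonds_100 %>%
    mutate(year = yr,
           price_log = price_log + log(1.05) * (yr - 1))  # ~5% inflation per year
})

# folds that keep every measurement of a diamond in the same fold
group_folds <- groupKFold(diamonds_repeated$diamond_id, k = 10)
grouped_cv <- trainControl(method = "cv", index = group_folds)
normal_cv  <- trainControl(method = "cv", number = 10)

set.seed(42)
lm_grouped <- train(price_log ~ . - diamond_id - year, data = diamonds_repeated,
                    method = "lm", trControl = grouped_cv)
set.seed(42)
lm_random <- train(price_log ~ . - diamond_id - year, data = diamonds_repeated,
                   method = "lm", trControl = normal_cv)

lm_grouped$results$MAE
lm_random$results$MAE
```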
Although the drop isn’t huge, the model with groupKFold appears to perform slightly worse. However this lower measure of performance is more likely to be a truer reflection of how we would expect our model to perform on new data. By using groupKFold no same diamond is split across an Analysis and Assessment set and so we avoid leaking data about the same diamonds just measured at different times. What we measure instead is the model's ability to predict prices on wholly new diamonds which is what we wanted.
More modelling algorithms!
Up until now we’ve been running a basic linear regression but there are literally hundreds of models available to use in caret that we can call in pretty much exactly the same way. One of the challenges with caret is not only picking what type of model you want to use e.g. ‘a random forest’ but also which underlying package you want to use to implement it e.g. ‘ranger’. Below I’ve tried to include 10 examples of some of the most commonly used algorithms and picked the packages that get used as the default by the tidymodels package. The notable exception is there’s no neural network as it requires a couple of extra options to run and benefits from extensive tuning so I'll save that for the next post on hyperparameter tuning and model stacking in caret.
I’ve left each model as its own call to the function so you can run as many or as few as you like. You can ignore all the tic() and toc() calls in the function as these just log the time taken for each model to run. Have a look in the print out to see how long each algorithm takes as there’s quite a range! If you know your computer might struggle I’d suggest running 'lm' and 'glmnet' first and checking the time before running some of the slower ones like 'SVM' and 'cubist'. Hopefully the function demonstrates the power of the caret package, with just the 6 lines of caret code we can create a function that allows us to run models from linear regression to random forest and support vector machines by only changing one option 'method=' in our code:
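The function might look something like the sketch below. The method strings shown ("lm", "glmnet", "ranger", "rpart", "earth", "svmRadial", "knn", "xgbTree", "cubist") are all valid caret method names, though the exact set of ten used in the original isn't reproduced here:

```r
run_model <- function(method_name) {
  tic(method_name)
  set.seed(42)
  model <- train(price_log ~ .,
                 data = diamonds_train_dummy,
                 method = method_name,
                 trControl = repeatedcv,
                 na.action = na.omit)
  toc()
  model
}

lm_model2    <- run_model("lm")
glmnet_model <- run_model("glmnet")
rf_model     <- run_model("ranger")
tree_model   <- run_model("rpart")
mars_model   <- run_model("earth")
svm_model    <- run_model("svmRadial")
knn_model    <- run_model("knn")
xgb_model    <- run_model("xgbTree")
cubist_model <- run_model("cubist")
```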
Looking at the run times we have everything from <1 second for glmnet all the way to 368 seconds for the cubist model. I hadn't actually heard of the cubist model until I started digging into caret a bit more but it turns out it's a really powerful and elegant model that can be used for regression problems. You can actually see Max giving a talk about it here from when he was at Pfizer. We also got a warning from our decision tree that in one of the resamples it couldn't find a good split, so we're missing a performance estimate for that resample.
Once we’ve run all our models, we can start making use of some of caret’s helper functions to see which are performing the best. Using the resamples() function we can pass it a list of all our newly created model objects and easily plot their performance. The visualisation uses lattice plots so it's not as pretty as ggplot but can still be incredibly informative:
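For example, using whichever of the hypothetical model objects above you actually ran:

```r
model_resamples <- resamples(list(lm = lm_model2,
                                  glmnet = glmnet_model,
                                  ranger = rf_model,
                                  rpart = tree_model,
                                  cubist = cubist_model))

dotplot(model_resamples)
summary(model_resamples)
```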
The dotplot() function shows us how our models performed on our resamples with the average performance and also a 95% confidence interval across 3 different assessment metrics (MAE, RMSE and R2). We can see that the decision tree is by far and away the worst performer and also the most unstable with the widest confidence interval. The other models all look to have much more stable performances which is good to see.
The cubist, random forest and linear kernel SVM are the top 3 performing models but our original linear model actually compares pretty favourably too! We can see precisely how it compares by returning a summary() of the resamples and inspecting the median MAE for each algorithm (I've shortened the print out to save space):
Our linear model scores ~0.12 MAE vs the best performing cubist model of ~0.09. At this point, if this were a real-world scenario and depending on the context, we might decide that our linear model is ‘good enough’ for our needs. It trains in a matter of seconds (compared to minutes taken by the cubist model) and still performs comparably well. It has the added real-world advantages that we can easily interpret its coefficients to see what’s driving performance and is often easier to explain to stakeholders. We could also quickly implement the scoring equation outside of R if required, by extracting the coefficients, into say SQL or even excel.
On the other hand, we know from our earlier resample plot that it struggles a bit more in predicting the more expensive diamonds which might be a cause for concern. A lot of the other algorithms we’ve tried are also highly tuneable or benefit from having their data transformed in certain ways. We know we can get a 0.03 improvement in MAE with a cubist model ‘out of the box’, which is a relative performance gain of around 25%. Maybe we can do even better?
Pre-processing data within resamples
For this section we'll take some of the best performing models so far and create some different pre-processing scenarios for them to see if we can make them perform even better. We can use the impute methods to replace the missing values in our data and use the recommended pre-processing steps from tidymodels to test out a few different pre-processing scenarios.
To run the pre-processing, we specify the 'preProcess =' option inside train() and then pass it a vector of all the different transformations we want to do. This way we ensure all of our pre-processing is done within the resampling process and so our performance estimate at the end is unbiased and free of any data leakage.
Let's try running a few different pre-processing workflows. For imputing our missing data I'll use the k-nearest neighbour method as, in theory, it's more powerful than using a simple median value and it's a lot faster than bagImpute. Processing speed becomes more of a concern when we're doing all of our steps within resamples, as each step has to be repeated for every resample (in our case 50 of them). Because knnImpute works on distances, caret also centres and scales the data automatically as part of the imputation, which is not necessarily a bad thing either, e.g. neural networks and support vector machines can benefit greatly from having their data normalised and tree-based models aren't too fussy either way. We can also try creating some principal components or applying a Box-Cox/Yeo-Johnson transformation for those models that like their data to be more normally distributed and symmetrical.
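As a sketch, that first scenario might look something like this (the choice of SVM, the object names and the MAE metric are carried over from the earlier assumptions):

```r
# knnImpute handles the missing dimensions (and brings centring/scaling with it);
# YeoJohnson makes the predictors more symmetrical for models that care about that.
set.seed(42)
model_svm_pp <- train(
  log_price ~ .,
  data       = train_data,
  method     = "svmRadial",
  preProcess = c("knnImpute", "YeoJohnson"),
  trControl  = train_ctrl,
  metric     = "MAE",
  na.action  = na.pass
)
```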
The next group of models are all tree-based so don't need the Box-Cox transformation, which makes them a bit quicker to run:
Finally let's try running all of the models but this time with principal components to hoover up any correlated or near-zero variance features. If any of the models have been slow to train for you so far I'd skip this step as I found it took significantly longer again.
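A sketch of the principal components version for one of the models (random forest via 'ranger' here, purely as an example):

```r
# "nzv" drops near-zero variance columns and "pca" replaces the remaining
# numeric predictors with principal components (which also centres and scales).
set.seed(42)
model_rf_pca <- train(
  log_price ~ .,
  data       = train_data,
  method     = "ranger",
  preProcess = c("knnImpute", "nzv", "pca"),
  trControl  = train_ctrl,
  metric     = "MAE",
  na.action  = na.pass
)
```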
Let's update our resample object with all the new results and plot them to see if there's been much change:
We can see from our plot that the cubist models are still the top performers, with maybe a slight boost from the pre-processing. For most models the pre-processing hasn't led to huge changes in performance, apart from the SVM with the radial kernel and the MARS model. It looks like using principal components results in worse performance too, which probably makes sense given our data, i.e. we've not got loads of features and not many that are highly correlated.
Looking at the various runtimes we can see the large increase in time required to do each of the pre-processing steps. This could become impractical if we had really large data. One way round this is to reduce the number of resamples but then we risk sacrificing measurement accuracy.
Another option we touched on earlier is to 'spend' some data by taking a sample from Train, calculating all the pre-processing transformations on that sample, and then applying them to both the remaining Train data and to Test ahead of the model building. This way we only have to calculate our pre-processing steps once, rather than in every resample, but we still avoid leaking data as our pre-processing calculation sample isn't used in the subsequent training process. This approach works best when we have large data and so still have a large Training set even after we take our sample. Let's try it now and see how much faster it can make things for us. I'll use glmnet for this example just because it naturally trains quickly:
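A minimal sketch of that workflow, with a 20% pre-processing sample and the usual assumed object names:

```r
# Take a sample of Train purely for estimating the pre-processing steps
set.seed(42)
predictors <- setdiff(names(train_data), "log_price")   # exclude the outcome from the transformations
pp_rows    <- sample(nrow(train_data), size = floor(0.2 * nrow(train_data)))

pp <- preProcess(train_data[pp_rows, predictors],
                 method = c("knnImpute", "center", "scale"))

# Apply the stored transformations to the remaining Train data and to Test
train_prepped <- predict(pp, train_data[-pp_rows, ])
test_prepped  <- predict(pp, test_data)

# Train as normal, with no preProcess argument needed inside train()
set.seed(42)
model_glmnet_sep <- train(
  log_price ~ .,
  data      = train_prepped,
  method    = "glmnet",
  trControl = train_ctrl,
  metric    = "MAE",
  na.action = na.pass
)
```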
Looking at the times, including all the pre-processing within the resamples took ~20x longer than computing them separately! This makes sense as each step had to be calculated 50x (10 folds x 5 repeats). If we look at the MAE of the models, they're very similar too, showing that, on this occasion, the separate pre-processing sample approach offers us a legitimate way of speeding up our model training process without sacrificing too much performance.
Speeding things up with parallel processing
Further gains in training time can be made by utilising parallel processing. Depending on how many cores we have available, we can use the doParallel package to register multiple parallel workers (essentially extra R sessions) that caret can use to run the training process in parallel. For example, we have 10 folds repeated 5x as part of our resampling. Each fold is essentially its own separate instance of training the model, so there's nothing to stop us (in theory) having 10 workers each training their own model concurrently.
The downside with this is that we need to keep multiple copies of the data in memory and we'll be using a lot more system resources (keep an eye on your system temperatures). I've found the process can make my RStudio session a bit more unstable, so I tend to only use it when I know I've got something that will take a long time to run (see the post on feature selection in caret for example).
Let's have a go at running our strongest but slowest cubist model using parallel processing. First we'll run a check to see the maximum number of cores we have available to use. You generally don't want to use all of them as you'll need at least one for your operating system. In this example we'll create a cluster of 8 workers and register it. This effectively means we have 8 copies of our project all running at once. From here we just run our train() code like we normally would and caret knows to utilise the available workers to train the model in parallel.
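Here's roughly what that looks like, assuming the same cubist set-up as before (adjust the number of workers to suit your machine):

```r
library(doParallel)

parallel::detectCores()        # see how many cores are available

cl <- makePSOCKcluster(8)      # create a cluster of 8 worker processes
registerDoParallel(cl)         # register it so caret/foreach can use it

set.seed(42)
cubist_parallel <- train(
  log_price ~ .,
  data       = train_data,
  method     = "cubist",
  preProcess = c("knnImpute", "nzv"),   # assumed pre-processing, as discussed above
  trControl  = train_ctrl,
  metric     = "MAE",
  na.action  = na.pass
)
```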
Using our 8 workers and running the model in parallel reduced our training time from 445 seconds to just over 70, which is ~6x faster. Once we've finished our parallel processing we can deregister the cluster and tell R to go back to running things sequentially:
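```r
stopCluster(cl)      # shut down the worker processes
registerDoSEQ()      # tell foreach/caret to run sequentially again
```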
Predicting on new data
Let's take the cubist model that we just ran and finally try it out on the Test set to see how it performs. As we included pre-processing as part of the model build, our 'cubist_parallel' object contains the imputation, centring, scaling and near-zero variance removal steps, and these will automatically get applied to Test. Before we can do this though, we need to add the log of price to Test and then create our dummy variables. I also like to create a copy of my Test set and call it something like 'scoring_set' so that if I accidentally drop a column or corrupt it somehow I can recreate it easily:
Another thing I like to do is create a new object called 'final_model' that is a copy of my final model object. This way I can refer to this one object in all subsequent code rather than going through and updating every occurrence of the model name, which saves time when moving between projects.
To get predictions from our model we use the predict() function with the model object we want to use and the name of the new data we want to create predictions on. What we get from this is a vector of predictions that we can then merge back onto our data.
Just like when we used train() we also need to specify 'na.action = na.pass' to stop it automatically throwing an error when it encounters missing data. Another thing to note is that caret needs to see all of the columns in our Test data that it had access to in train() even if they weren't used in the model e.g. 'price' wasn't used in the training process but still needs to be present in the Test data or caret will error.
We can then simply cbind() this to our Test data to compare our predicted values vs our actual values:
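Pulling those steps together, a hedged sketch might look like the following; the column names ('price', 'log_price') are assumptions carried over from earlier, and the dummy-variable step is left as a placeholder since it depends on how Train was encoded:

```r
final_model <- cubist_parallel                     # refer to one object from here on

scoring_set <- test_data                           # work on a copy of Test
scoring_set$log_price <- log(scoring_set$price)    # same target transformation as Train
# (create the dummy variables here in the same way as for Train)

# Score the copy; na.action = na.pass lets rows with missing values through
predictions <- predict(final_model, newdata = scoring_set, na.action = na.pass)

# Bind the predictions back on so we can compare them to the actuals
scored <- cbind(scoring_set, predicted_log_price = predictions)
head(scored[, c("log_price", "predicted_log_price")])
```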
We can then use the postResample() helper function to calculate different error metrics. This function works by taking in two vectors (in our case predictions and actuals) and calculating the error metric on them. We can then see how our in-training resample performance estimate compares to our held out Test set:
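Using the names from the sketch above:

```r
# Held-out Test performance vs the resampling estimate
postResample(pred = scored$predicted_log_price, obs = scored$log_price)
getTrainPerf(final_model)
```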
Those look pretty close! The fact our model performs well on the Test set, which is completely unseen, new data, should give us confidence that we have indeed built a good model. We can also see that there's only a marginal difference between the MAE on Test and the resampling performance estimate, which means our model workflow was built in a robust manner. Let's plot the actuals vs predictions on the Test set to get a better idea of how our model performs on new data:
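A quick way to draw that plot, using the scored data from the sketch above:

```r
library(ggplot2)

ggplot(scored, aes(x = predicted_log_price, y = log_price)) +
  geom_point(alpha = 0.2) +                       # dense areas show where most diamonds sit
  geom_abline(colour = "red") +                   # perfect-prediction line
  labs(x = "Predicted log(price)", y = "Actual log(price)")
```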
This also looks pretty good. We can see there's only a 0.0038 difference between our estimated and actual MAE, which is great. The dense, black groupings on our plot show that most diamonds sit close to our predicted price, and our error rate looks pretty symmetrical apart from maybe at the extreme ends of both tails. This could in part be due to taking just a 10% sample of data to train the model on, i.e. it didn't get to see many very cheap/very expensive diamonds when training.
In reality we'd usually take 80%-90% of our data for Train, which increases the chances of the model having access to the fullest range of values within our data to learn from. We'll see in the next part of this caret series how we can try and improve model performance even more by tuning the various hyperparameters of our models and also stacking/ensembling multiple models into a super-learner. For now though, we should be pleased to have constructed a robust model that, with little to no feature engineering or tuning, was able to get us to within ~8% of actual diamond prices, improving on the ~12% from our simple linear model. Well done!
Classification models in caret
So far all of the examples have been about regression but we can reuse a lot of the same steps to easily build a classification model. Let’s very quickly fudge our data to create a mock classification problem to demonstrate this. If we have a quick look at our original data again we can see that the top 25% of diamonds have a price of $5,324+.
Let's create a new target called 'expensive' which will represent these high value diamonds. For classification problems like this, caret prefers its target encoded as a factor, and the level names need to be valid R names (so "Yes"/"No" rather than 1/0) if we want class probabilities later on. We'll go with a "Yes"/"No" factor for our target column:
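A sketch of how that target could be built, assuming the Train and Scoring objects from earlier still contain the original 'price' column (I also drop the price columns so the classifier can't simply read the answer off its own source):

```r
library(dplyr)

make_target <- function(df) {
  df %>%
    mutate(expensive = factor(ifelse(price >= 5324, "Yes", "No"),
                              levels = c("Yes", "No"))) %>%
    select(-price, -log_price)     # assumed column names; drop the target's source
}

train_class   <- make_target(train_data)
scoring_class <- make_target(scoring_set)
```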
As this is just a dummy example reusing our Train and Scoring data, we can jump straight in and train our models. We can reuse our existing k-fold cross-validation resampling scheme and the only change needed within train() is to tell caret to use one of its classification performance metrics. We'll stick with the default of 'Accuracy' for now, but later we'll see how we can get caret to compute ROC for us. I'll use glmnet for the algorithm simply because it's quick to train and works for both regression and classification problems:
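A sketch of the classification call, reusing the assumed resampling object:

```r
set.seed(42)
model_class <- train(
  expensive ~ .,
  data      = train_class,
  method    = "glmnet",
  trControl = train_ctrl,        # same repeated k-fold cross-validation as before
  metric    = "Accuracy",
  na.action = na.pass
)
```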
That’s it! By just changing which performance metric we want to use we’ve successfully adapted our code to run a classification model. That's the power of the caret package. We can use our getTrainPerf() and varImp() helper functions in exactly the same way too:
As well as accuracy we might want to calculate ROC, which requires a bit more effort (but not much). This time we do need to change our trainControl() a little to add the options 'classProbs = TRUE' and 'summaryFunction = twoClassSummary'. We can then simply call train() again with 'metric = "ROC"':
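In code, that might look like this (10 folds repeated 5 times, matching the earlier resampling scheme):

```r
roc_ctrl <- trainControl(
  method          = "repeatedcv",
  number          = 10,
  repeats         = 5,
  classProbs      = TRUE,             # needed so caret keeps the class probabilities
  summaryFunction = twoClassSummary   # reports ROC, sensitivity and specificity
)

set.seed(42)
model_class_roc <- train(
  expensive ~ .,
  data      = train_class,
  method    = "glmnet",
  trControl = roc_ctrl,
  metric    = "ROC",
  na.action = na.pass
)
```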
Once we've trained our model we can score up our Test data in much the same way as before, except now there are two options for extracting predictions from our model. We can specify 'type = "raw"' to get the predicted class (Yes/No) for each row in our data, or we can ask for 'type = "prob"' which gives the predicted probabilities that a row belongs to each class. Let's see what both outputs look like and merge them onto our Test set. We'll also load the pROC library which we'll need for computing the ROC on our Test data:
To calculate ROC on the Test data we can use the roc() function from the pROC package. We pass it a vector of the actual classes and a vector of our predicted probabilities for the positive class:
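A sketch of the scoring and ROC calculation, treating "Yes" as the positive class and reusing the assumed object names:

```r
library(pROC)

# type = "raw" returns the predicted labels, type = "prob" the class probabilities
pred_class <- predict(model_class_roc, newdata = scoring_class,
                      type = "raw",  na.action = na.pass)
pred_prob  <- predict(model_class_roc, newdata = scoring_class,
                      type = "prob", na.action = na.pass)

scored_class <- cbind(scoring_class,
                      pred_class = pred_class,
                      prob_yes   = pred_prob$Yes)

roc_test <- roc(response = scored_class$expensive, predictor = scored_class$prob_yes)
auc(roc_test)
```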
We can see that this gives us an ROC of 0.993 on Test compared to 0.993 from Train, showing that once again we've been able to build a powerful predictive model that generalises well when encountering new data. Well done!
Next steps: hyperparameter tuning and model stacking
In this tutorial we've covered splitting and transforming our data as well as running a few different 'out the box' (untuned) models. We've seen how to pre-process our data in a way that avoids leaking information and how caret can be used to easily move between building regression and classification models.
In the next tutorial we'll see how we can try and improve our model's performance even more by tuning hyperparameters and also creating 'super learners' by stacking/ensembling different models together. I'll also do a third post on how we can use caret's feature selection methods to select the best features for our model.