Scikit-learn tutorial for machine learning in Python
In this post you'll learn how to use the scikit-learn package to split your data, pre-process it ready for modelling, create pipelines to avoid data leakage and perform cross-validation to get robust performance estimates. Finally, you'll build and compare a whole range of models, including some from outside scikit-learn such as XGBoost, LightGBM and CatBoost.
Contents
Introduction to scikit-learn and the diamonds data set
The scikit-learn package, also known as sklearn, is one of the most popular packages for doing machine learning in Python. The documentation is incredibly comprehensive and the site has lots of tutorials. For this post I'm going to focus on how all the different functionality of scikit-learn can be combined to tackle a real-world project. We'll look at some of the different data pre-processing options available, options for cross-validation, different modelling algorithms and how to combine everything into robust modelling pipelines. In the next post we'll look at how to improve our models even more with hyperparameter tuning, feature selection and model stacking.
In this tutorial we'll be using the 'diamonds' data set that comes as part of the ggplot2 package in R to create some mock training data. We'll try to predict the price of the diamond based on its physical characteristics. There are a few size dimensions like carat, x (length in mm), y (width in mm) and z (depth in mm) and some categorical features regarding the quality of the diamond e.g. color, clarity and cut. A full list of features can be found here. Let's go ahead and load the data and run a couple of simple quality checks that are always useful whenever we start a new modelling project.
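Something along the lines of the sketch below works; here I'm loading seaborn's copy of the diamonds data set (the column names match the ggplot2 original):

```python
import pandas as pd
import seaborn as sns

# Load the diamonds data (seaborn ships a copy of the ggplot2 data set)
diamonds = sns.load_dataset("diamonds")
print(diamonds.shape)

# Quality checks: completely duplicated rows and missing values
print("Duplicate rows:", diamonds.duplicated().sum())
print(diamonds.isna().sum())
```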
So we can see that our data doesn't have any completely duplicated rows, which is good, and we don't have any missing data. So far the data looks pretty clean. Let's use the describe() function to quickly get some summary statistics on the numerical columns. For the categorical ones, we can do a quick count of the number of diamonds in each category and also what percentage of diamonds are in each category:
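A minimal version of those checks:

```python
# Summary statistics for the numerical columns
print(diamonds.describe())

# Counts and percentages for each categorical column
for col in ["cut", "color", "clarity"]:
    summary = pd.concat(
        [diamonds[col].value_counts(),
         (diamonds[col].value_counts(normalize=True) * 100).round(1)],
        axis=1, keys=["count", "percent"]
    )
    print(summary)
```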
Our describe() function has surfaced some unrealistic values for some of our diamond sizes. Each of x, y and z is meant to record a diamond dimension in mm, so values of 0mm should be impossible. It's likely that these values are in fact missing/unrecorded and have been defaulted to 0. We can correct this by replacing any 0mm dimensions with NA.
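One way to do that, assuming the diamonds data frame from above:

```python
import numpy as np

# A 0mm dimension is physically impossible, so treat it as missing
for col in ["x", "y", "z"]:
    diamonds[col] = diamonds[col].replace(0, np.nan)
```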
We also appear to have some very large diamonds which again could be indicative of some data entry errors. A quick bit of googling suggests the largest clear diamond in the world, the Cullinan I, measures 5.89cm × 4.54cm × 2.77cm, weighing in at 530 carats! So to get some diamonds in our data that are supposedly 5.89cm wide or 3.1cm deep suggests these are likely to be data entry problems. This becomes even more likely when we see that the largest diamond is supposedly only 5 carats and the most expensive is $18.8k.
For the sake of creating some extra missing data to use in this tutorial, let's pretend we know that one of our brokers occasionally uses the size measurements of the larger, uncut diamond by mistake. Therefore, any time we find a diamond with a physical dimension over 1cm or a carat size over 3 we know it's a data entry issue and not to be trusted. We could remove these rows but, assuming all the other values (cut, color, etc.) can be trusted, we'll simply set them to NA and replace them with something more sensible later on.
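A sketch of that clean-up, using 10mm and 3 carats as the cut-offs described above:

```python
# Treat suspiciously large values as data entry errors and set them to missing
for col in ["x", "y", "z"]:
    diamonds.loc[diamonds[col] > 10, col] = np.nan
diamonds.loc[diamonds["carat"] > 3, "carat"] = np.nan
```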
It's worth mentioning that we're flagging and removing these outliers based on subject matter expertise to identify data entry issues i.e. incorrectly recorded data. This is different to identifying and removing outliers that are genuine data points but just larger/smaller/rarer compared to the rest of our data. It can be tempting to remove these kinds of outliers too but great care should be taken as outliers are data points too! There's a good discussion here about removing outliers. At the very least, removing outliers to try and improve our model counts as pre-processing our data and needs to be done as part of our wider model build and cross-validation scheme (more on this later).
As we'll be calculating feature importance estimates later on, let's add some random noise variables to our data to see if our model can correctly identify them as having no importance. Let's add a couple of randomly generated features and also one that's low variance with 99% of its values being 0.
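A sketch of how these could be generated; the feature names, random seed and exact number of noise columns are my own choices, ending up with 5 new columns in total:

```python
rng = np.random.default_rng(42)
n = len(diamonds)

# Four pure-noise features plus one low-variance feature (~99% zeros)
diamonds["random_1"] = rng.uniform(size=n)
diamonds["random_2"] = rng.uniform(size=n)
diamonds["random_3"] = rng.uniform(size=n)
diamonds["random_4"] = rng.uniform(size=n)
diamonds["low_var"] = rng.binomial(1, 0.01, size=n)
```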
We can now see we've got a few missing values where we've removed our data entry issues and have 5 new features. Usually when building a model we'd want to do lots of feature engineering to try and create new variables for the model to use. I'm going to skip this step and dive straight into using scikit-learn but a great resource on feature creation, that's also free to read online, is Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson. Max Kuhn also created the caret and tidymodels packages in R and has published several books on machine learning.
As a final check, let's plot the distribution of our target variable as sometimes if it's skewed this can make it harder for our model to work with. There's a good discussion of the trade-offs between predictive power and interpretability of transforming the target here and an example from the scikit-learn docs about how it can improve model performance. Let's plot the original, untransformed price feature and also with a log transformation.
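A quick matplotlib sketch of the two histograms:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(diamonds["price"], bins=50)
axes[0].set_title("price")
axes[1].hist(np.log(diamonds["price"]), bins=50)
axes[1].set_title("log(price)")
plt.show()
```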
We can see that the original price data is strongly skewed to the right and has a very long tail. This tells us that although the bulk of diamonds are less valuable, there are a few much more valuable ones. The log transformation on the right looks a bit more normally distributed and so might be a good transformation to apply when we come to build our model. We'll see later how we can easily incorporate a target transformation into our modelling pipeline.
Creating Train and Test splits
To get an accurate final measurement of our model we need to create a Train and Test split. The training data is what we build our different modelling pipelines on and then once we have our final model, we test its performance on the Test set which is kept separate until the very end. The idea is that the Test set mimics the unseen future or new data that our model will encounter when it goes into production.
To do this in scikit-learn we can use the train_test_split() function. We simply pass it a dataset of our input features, our target column and then specify what number or proportion of observations we want to keep for our Test set. Normally I'd take a 20% or 10% sample to be held out as a Test set and train the model on 80%-90% of the data. For the purpose of this tutorial I'm going to flip that and only have 10% of the data in Train. This is just to speed up the running of things as the data set is >50k rows. We'll still be training on ~5k rows of data which is enough to get a robust and predictive model at the end.
We can also set a 'random_state' which makes the whole process repeatable i.e. when the data is randomly sampled to create Train and Test, the random state ensures the sampling is the same each time we run the function. Two other options I've left at their defaults are 'shuffle' and 'stratify'. 'Shuffle', which defaults to 'shuffle=True', means scikit-learn randomly shuffles the order of the data before splitting. This is generally advisable: if our data was loaded in a certain order, e.g. oldest records first, and we just took the top 10% of the data, we'd end up with a skewed sample of older records to train on.
'Stratify' is used when we have a categorical target and makes sure that when we sample we preserve the distribution of the target column across both our Train and Test data sets. For example, if we had a categorical target where 50% of values were 'Yes' and 50% 'No', stratified sampling would make sure we'd have a 50:50 split between 'Yes' and 'No' in both Train and Test. Scikit-learn doesn't offer a way to stratify continuous targets unfortunately but we could always make our own by bucketing the continuous target and then stratified sampling on that instead.
The output of the train_test_split() function creates 4 data sets: the input data for Train, the inputs for Test, the target values for Train and the target values for Test. We can go ahead and create these data sets and give them names that make it obvious what each is doing. We can also check their shape to see if we get all the rows and columns that we're expecting:
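A sketch of the split, with my own choice of random_state:

```python
from sklearn.model_selection import train_test_split

X = diamonds.drop(columns="price")
y = diamonds["price"]

# Deliberately small Train set (10%) to keep the examples quick to run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```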
Let's double check that the distribution of the target column is similar across the Train and Test set.
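For example:

```python
# Compare summary statistics of the target across the two splits
print(pd.concat([y_train.describe(), y_test.describe()], axis=1, keys=["train", "test"]))
```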
It looks like the diamonds in our Training data are slightly more expensive on average but the mean and different quartile prices are close enough that we should be able to train a good model that generalises well on new data.
Visualising the data and checking for correlations
Now we've created our training data let's have a look at the distribution of each of the variables our model has got to work with. We've already seen our target column is right skewed with more lower value diamonds and a long tail of more valuable ones. Let's start by looking at our numerical features:
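For example:

```python
# Histograms of the numerical features in Train
X_train.select_dtypes("number").hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()
```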
It looks like the distribution of 'carat' is similar to our price target with a right skew. Most diamonds look to be on the smaller side but there are a few larger ones. Reading about diamonds online, it's suggested that carat is one of the strongest determinants of price so it makes sense that the two distributions are similar. The 'depth' and 'table' features have a more normal-looking distribution and there seems to be a more even, but still right-skewed, spread of sizes across the different length/width/depth ('x'/'y'/'z') dimensions. We can also see that our randomly generated columns have a flat distribution which is what we'd hope.
As well as looking at the individual features we can look at the relationship between pairs of features using seaborn's pairplot() method. To make the plot easier to interpret I've removed the random and low_var features:
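A sketch of the pairplot call, keeping only the genuine numerical features:

```python
plot_cols = ["carat", "depth", "table", "x", "y", "z"]
sns.pairplot(X_train[plot_cols])
plt.show()
```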
We can see that the relationships between the different dimension features ('x', 'y' and 'z') and 'carat' are strongly correlated although it looks like the relationship tails off slightly for larger carat diamonds. The relationship between 'x' and 'y' is almost perfectly linear although the relationship between 'x'/'y' and 'z' is linear but a bit noisier. We can use pandas' corr() function to check the pairwise correlation between the features:
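For example:

```python
# Pairwise correlations between the numerical features
print(X_train[plot_cols].corr().round(2))
```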
We can see that, as expected, 'x', 'y' and 'z' all have very high correlations to 'carat' whereas 'table' and 'depth' do not. There is almost a perfect correlation between 'x', 'y' and 'z' as well which suggests that we have quite a few redundant features in our model. Some models can struggle with lots of correlated features. We can perhaps try some pre-processing steps later, like principal component analysis, to remove these.
Scikit-learn conventions
The scikit-learn documentation has some good resources on the conventions used in the package and how the different modelling algorithms all follow the same design. We'll have a quick look at a couple before seeing them in action when we start pre-processing our data.
A lot of the scikit-learn objects that we'll use will be estimators: "An estimator is an object that fits a model based on some training data and is capable of inferring some properties on new data...All estimators implement the fit method". An estimator could be a model we want to train or a pre-processing step e.g. PCA or a rescaling step that needs to learn its parameters from our data. We can then take our fitted/trained estimator and use it on new data e.g. predict on the Test set or rescale new data using the parameters learnt from our training data.
To train our estimator we use the fit() function which updates our estimator object with the information learnt from the dataset we call fit on. If we want to then apply this to new data e.g. our Test set we can use the transform() function. Most of the time we want to use fit() and transform() separately to avoid data leakage e.g. we fit() on the training data and then transform() or predict() on our Test data. For instances where it's ok to use both at once, scikit-learn also offers fit_transform() which combines the fitting and transforming into one step.
Our estimator object before it has been fitted can be thought of as a specification e.g. it contains all the steps we need to do something like create a Linear Regression or a rescaling transformation but only once it's fitted does it have the necessary information to do so. When we fit our estimator it gains additional information and attributes (called "estimated attributes") and the scikit-learn convention for these is that they "must always have a name ending with trailing underscore" e.g. attribute_.
Let's see a quick example of each of the above. We'll use the SimpleImputer() from scikit-learn which as the name suggests simply replaces any missing values in our data (imputing) with the average of the feature's value. Let's import the estimator and then create/instantiate an instance of the class. Once it has been fitted, it has a "statistics_" attribute which contains "The imputation fill value for each feature". Let's try and access this before we fit it to see what happens:
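A sketch of that check (the exact error message will depend on your scikit-learn version):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer()  # strategy="mean" by default

# The estimated attribute doesn't exist until fit() has been called
try:
    print(imputer.statistics_)
except AttributeError as e:
    print(e)
```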
We can see that we get an error message. Let's try calling fit() on our estimator and pass it some data from our Train data set. Notice that we are updating our estimator directly by calling fit(), there's no need to create a new instance of imputer and overwrite it. After fitting, let's try accessing the statistics_ again:
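For example:

```python
# fit() updates the imputer in place with the means learnt from Train
imputer.fit(X_train[["x", "y", "z"]])
print(imputer.statistics_)
```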
Success! We can now see the three different values (one for each of 'x', 'y' and 'z') which show the mean value of each feature as found in the Train data set. It's these values that will then be used to replace any missing values in their respective columns. We can now use transform() to apply these to our Test data.
The reason we fit() on Train and transform() on Test is that our Test data is meant to represent future/unseen data. We shouldn't use its values to calculate any averages or parameters as in theory that data doesn't exist. By using fit() on Train and transform() on Test we avoid leaking information about our future Test data back into our modelling pipeline.
Imputing missing values
Now we've seen how scikit-learn works with fit() and transform() to estimate parameters let's look at how we can apply these to our data. We've already seen a quick example of the SimpleImputer but scikit-learn has other methods of replacing missing values, with a full guide here, and we'll have a look at each of these in turn:
With our SimpleImputer we can choose which approach scikit-learn takes to replacing missing values. By default it uses the mean of the column. Let's quickly run through again each of the different steps for fitting and transforming our data.
We can check that scikit-learn has correctly calculated the column averages and then applied them in the transformation by manually calculating the parameters and comparing them to our transformed result:
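A sketch of that check, using the 'carat' and 'z' columns:

```python
carat_z = ["carat", "z"]
simple_imputer = SimpleImputer(strategy="mean")
imputed = simple_imputer.fit_transform(X_train[carat_z])

# The learnt parameters should match the column means calculated by pandas
print(simple_imputer.statistics_)
print(X_train[carat_z].mean().values)
```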
It's a match! We can see that scikit-learn correctly calculates the mean value of carat and z and then successfully imputes the previously missing values of carat in Train with the calculated average. As we're just looking at how the different transformations work and not actually overwriting any data we can use fit_transform() to speed things up a bit. Let's compare using the mean to impute the carat column versus the median:
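For example:

```python
for strategy in ["mean", "median"]:
    imp = SimpleImputer(strategy=strategy).fit(X_train[["carat"]])
    print(strategy, imp.statistics_.round(2))
```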
We can see the median value of 0.71 is slightly lower than the mean of 0.80 which suggests our data is still slightly skewed to the right i.e. there are a few extremely large values causing the mean to be higher than the median.
When replacing missing values it can be helpful to record the fact that we've done so. If our data is not missing at random it could be informative to the model to know which observations had missing data. In our fictional diamond example maybe it's a certain supplier who tends to supply cheaper diamonds that has worse data entry standards and so is more likely to mistype or leave blank certain fields. By adding a flag for which observations and features had missing data the model could then learn this trend that "missing data = more likely to be a lower price". We can do this easily in scikit-learn with the 'add_indicator' option in our imputer:
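A sketch using the columns we made missing earlier:

```python
# add_indicator=True appends a 0/1 flag column for each feature with missing values
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed_flags = indicator_imputer.fit_transform(X_train[["carat", "z"]])
print(indicator_imputer.get_feature_names_out())
print(imputed_flags.shape)
```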
Here we can see that not only do we get the imputed values returned but also 2 new features which are 0/1 flags to say if the observation had a missing value.
So far we've just looked at the simplest way of replacing missing values e.g. mean, median or most common. Scikit-learn also has some more advanced approaches. The IterativeImputer "models each feature with missing values as a function of other features, and uses that estimate for imputation". It basically builds a model to try and predict what the missing values are so different observations in the same column can end up with different imputed values depending on what their other features values are. It also runs multiple iterations per imputation before returning a final, single imputation value. The hope is that by running multiple imputation iterations for the same observation we get a more realistic and generalisable result. More details can be found in the user guide.
The benefit of using a model to predict missing values is that hopefully it's more accurate than just using the global average or most common value. It also means different observations with missing values in the same column can get different imputed values. This makes sense as if we had two diamonds with missing carat size and one had large values of x, y, z and the other had smaller values we'd expect the first diamond to have a bigger carat size and the second a smaller value. Let's try it and see how it differs to just using the mean like before:
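A sketch of the iterative approach (note the extra 'experimental' import it still requires):

```python
# IterativeImputer is still experimental so it needs this explicit enable import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

num_cols = ["carat", "depth", "table", "x", "y", "z"]
iterative_imputer = IterativeImputer(random_state=42)
iterative_imputed = iterative_imputer.fit_transform(X_train[num_cols])

# Look at the values it filled in for the rows where carat was missing
missing_carat = X_train["carat"].isna().to_numpy()
print(iterative_imputed[missing_carat, 0].round(2))
```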
We can see here that the iterative imputer returns different values for each of our missing carat data observations. The values are all similar though and on the larger side as we know from the simple imputer that the average carat is about ~0.8. This makes sense as when we created our dummy missing data we only applied it to observations where the carat size was >3. So the model has spotted that these probably are larger diamonds and given them a sensible imputation value.
Another way of finding sensible replacements for missing values is to use KNNImputer. This is where "values are imputed using the mean value from n_neighbors nearest neighbors found in the training set." So like the Iterative Imputer it's another model based approach that specifically uses the k-nearest neighbours algorithm to find sensible replacements for missing values.
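A minimal sketch:

```python
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)
knn_imputed = knn_imputer.fit_transform(X_train[num_cols])
print(knn_imputed[missing_carat, 0].round(2))
```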
We can see in this instance that our diamonds all get the same imputed value although that isn't guaranteed. Again it is likely due to the fact that the observations with missing carat sizes are all located in a similar neighbourhood i.e. larger diamonds due to how we created the missing data.
One thing to note with the KNNImputer is that, as it uses the k-nearest neighbours algorithm, it will be sensitive to the underlying scale of the data when it comes to calculating the neighbours. For example, a feature that runs from 0-100 will seem like it has a greater distance between points than one that runs from 0-10. To counteract this and have the distance between observations standardised across features we can rescale our data before we run the KNNImputer. In fact, there are lots of modelling algorithms (e.g. neural networks, SVMs, etc.) that also perform better with rescaled data. Let's look at some of the transformations scikit-learn has that can help with this.
Scaling and transforming data
Scikit-learn has a lot of different functions for rescaling and transforming data with a full list available in the documentation. According to the scikit-learn docs: "Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance."
The tidymodels team also have a handy summary table of which models benefit from the different types of pre-processing. Here we'll look at some of the main ones such as standardising, binning and transforming that all work with continuous data. Later we'll look at transformations for categorical data as well as dimensionality reduction using PCA.
The best way to see how each of these transformations behaves is to visualise the change in the underlying data after the transformation. Let's use the 'x' feature from Train and remind ourselves what the raw data distribution looks like:
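For example:

```python
X_train["x"].hist(bins=30)
plt.title("Distribution of 'x' (length in mm)")
plt.show()
```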
We can see that the values range from about 3 to 10 with peaks around 4 and 7. We can confirm this by calling describe() on it and getting some summary statistics:
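For example:

```python
print(X_train["x"].describe().round(1))
```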
We can see the average value is 5.7 with a standard deviation of 1.1 which is quite different to the mean = 0 and standard deviation = 1 that a lot of models prefer. An easy way to achieve this is with the StandardScaler function which subtracts the mean from each observation and divides by the standard deviation. Let's create an instance of the function and fit it on some data. Whereas before we created an instance and then in another line called fit(), we can do it all in one line to save a bit of room.
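A sketch of that one-line fit, reusing the numerical columns from earlier:

```python
from sklearn.preprocessing import StandardScaler

# Instantiate and fit in a single line
scaler = StandardScaler().fit(X_train[num_cols])
print(scaler.feature_names_in_)
print(scaler.mean_.round(2))
print(scaler.scale_.round(2))
```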
We can see that after fitting the transformer we have some estimated attributes such as the names of the features we've passed to it as well as the calculated mean and standard deviation. These have been learnt from the Train data set and will be used to rescale our data when we call transform. We can see that the mean and standard deviation for 'x' is the same as when we called describe() earlier.
Let's go ahead and transform our data and to make it obvious why we might want to rescale our data and the effect of StandardScaler, let's create a boxplot of our features before and after the transformation:
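A sketch of the comparison plot:

```python
scaled = pd.DataFrame(scaler.transform(X_train[num_cols]), columns=num_cols)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
X_train[num_cols].boxplot(ax=axes[0])
axes[0].set_title("Original scale")
scaled.boxplot(ax=axes[1])
axes[1].set_title("After StandardScaler")
plt.show()
```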
We can see that our data on the left has quite a few different scales with the 'depth' and 'table' features having much larger values than the others. If we look at our rescaled data on the right, we can see that now they are all on a much more comparable scale and centred around 0. We can actually learn some more useful information about our data too, such as that the 'depth' feature has a wider spread of values with likely outliers that are many standard deviations away from the mean.
The scikit-learn docs have a nice example of how each of the different transformations affect the scale of the data so we'll just have a quick look at them here. For each transformation we'll look at the mean, standard deviation, min and max values and also plot the transformed data. We'll see that some transformations such as StandardScaler or MinMaxScaler simply rescale our data but leave the distribution/shape unchanged whilst the QuantileTransformer and PowerTransformer affect both:
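A shortened sketch of that comparison, applied to the 'x' feature (nan-aware summaries are used as 'x' still contains the missing values we created):

```python
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, RobustScaler)

transformers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "QuantileTransformer (uniform)": QuantileTransformer(output_distribution="uniform"),
    "QuantileTransformer (normal)": QuantileTransformer(output_distribution="normal"),
    "PowerTransformer (Yeo-Johnson)": PowerTransformer(method="yeo-johnson"),
    "PowerTransformer (Box-Cox)": PowerTransformer(method="box-cox"),
}

for name, transformer in transformers.items():
    x_new = transformer.fit_transform(X_train[["x"]])
    print(f"{name}: mean={np.nanmean(x_new):.2f}, std={np.nanstd(x_new):.2f}, "
          f"min={np.nanmin(x_new):.2f}, max={np.nanmax(x_new):.2f}")
```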
We can see that the StandardScaler, MinMaxScaler and the RobustScaler simply rescale our data but the shape is unchanged. The MinMaxScaler simply makes sure all our values are between 0-1 and the RobustScaler is similar to the StandardScaler except it uses the median and quantile ranges to rescale the data to make it more robust to outliers.
The two versions of the QuantileTransformer and the PowerTransformer change the distribution of our data in an attempt to make it look more normal with mean=0 and std=1. According to the documentation, the QuantileTransformer "transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers". The PowerTransformer applies a "power transform featurewise to make data more Gaussian-like".
We can see that our transformed data with the quantile-normal and two power transformations does appear to be more normally distributed after the transformations. We can also see that the data has mean=0 and standard deviation=1 after the transformations. Scikit-learn also has a useful summary of how the PowerTransformer and QuantileTransformer behave when applied to other types of distributions. The main takeaway seems to be that the QuantileTransformer can force any distribution to become normal provided it has enough training samples (1,000s) but is harder to interpret than the power transformers and can overfit on smaller (100s) data sets. Which is the better transformation to use isn't going to be obvious in advance, so we might want to try pipelines with different transformations and compare the results, which we'll see how to do later.
As well as transforming our continuous data we have the option to convert it into a categorical feature by binning the values. Generally we want to avoid binning our data but I'll quickly cover the options offered by scikit-learn with the KBinsDiscretizer function. There's also a handy user guide from scikit-learn on how to use it. Essentially we need to pick a strategy for how we want to create our bins and then also how we want the result encoded, i.e. whether we want the resulting categorical data to be one-hot encoded (a column per bin) or a single ordinal feature. The three different 'strategies' available are:
- ‘uniform’: All bins in each feature have identical widths.
- ‘quantile’: All bins in each feature have the same number of points.
- ‘kmeans’: Values in each bin have the same nearest center of a 1D k-means cluster.
By default the function uses 5 bins with a quantile strategy and a sparse one-hot encoding. Let's see what the output looks like if we bin our 'x' feature with just 3 bins and ask for a dense one-hot encoding.
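A sketch of that call (KBinsDiscretizer can't handle missing values, so I drop them here for the illustration):

```python
from sklearn.preprocessing import KBinsDiscretizer

binner = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile")
x_binned = binner.fit_transform(X_train[["x"]].dropna())
print(x_binned[:5])
print(x_binned.shape)
```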
We can see that our 'x' feature has been transformed into 3 new features. The dense one-hot encoding means we get a regular array (rather than a sparse matrix) where each feature is either 0 or 1. What about if we specify that we want to encode our data into a single, ordinal column?
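For example:

```python
ordinal_binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
x_ordinal = ordinal_binner.fit_transform(X_train[["x"]].dropna())
print(np.unique(x_ordinal))
```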
Here we see we only get 1 feature and the different bins are encoded as a range of integer values from 0-2 to reflect the fact we have 3 bins. We can see how the different binning strategies create different cut off values for each bin by fitting 3 different binning functions and extracting the 'edges' with the bin_edges_ attribute:
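For example:

```python
for strategy in ["uniform", "quantile", "kmeans"]:
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binner.fit(X_train[["x"]].dropna())
    print(strategy, binner.bin_edges_[0].round(2))
```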
We can see that the uniform bins are all about 1.9 units wide which is to be expected as scikit-learn tries to find equal width bins. To get the same number of observations per bin the quantile strategy has got a narrower range for the first bin and a wider range for the final bin reflecting the fact that there are more smaller diamonds and fewer, larger diamonds. The bin cut-offs for the k-means strategy are pretty similar to the quantile strategy in this example.
Categorical data transformations
So far we've been transforming our continuous/numerical features. Scikit-learn also has a range of transformations that we can apply to our categorical data as well. We've actually seen an example of two of them already when we binned our data and created either one-hot encoded or ordinally encoded bins. Both these transformations are available via the OneHotEncoder or the OrdinalEncoder function. A more recent development is the TargetEncoder which replaces each value of our categorical data with the average target value for that category value. The scikit-learn docs have a nice example that compares the three approaches.
The OrdinalEncoder is probably the simplest transformation. It simply replaces each category with an integer. We get the same number of features out as we pass in as we can see below:
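A minimal sketch on our three categorical columns:

```python
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["cut", "color", "clarity"]
ordinal_encoder = OrdinalEncoder()
encoded = ordinal_encoder.fit_transform(X_train[cat_cols])

print(X_train[cat_cols].head(3).to_numpy())
print(encoded[:3])
```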
We can see that we pass 3 features in and get 3 features out, the difference being our string values have been replaced with corresponding integer values. We can see that the 'Premium' category from 'cut' gets replaced with the number 3 and the letter 'F' from 'color' has been replaced with the number 2.
We've got a couple of extra options available to us around what to do when the function encounters either missing values (by default these don't get replaced by a number) or a new category value that wasn't present in the data fit() was called on. For example, in our Test data we might have a 'cut' category of "Bad" that we'd still want to replace with a number. Scikit-learn lets us specify how to handle any new values and bucket them all into a single catch-all value:
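A sketch of that option, with -1 as the catch-all value:

```python
# Unknown categories seen after fitting all get mapped to the catch-all value -1
unknown_safe_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
unknown_safe_encoder.fit(X_train[["cut"]])

new_cuts = pd.DataFrame({"cut": ["Ideal", "Premium", "Bad", "Really bad!"]})
print(unknown_safe_encoder.transform(new_cuts))
```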
Here we can see how the last two values "Bad" and "Really bad!" which weren't in our Training data both get bucketed into the -1 value that we specified as the catch all for any new values. We can do something similar for missing data which by default isn't assigned an integer. Creating a unique integer for missing categorical data can be a nice way to remove missing values whilst also creating a flag for the model that records which observations had missing data.
We also have the option to collapse infrequently occurring categories into a single category with the 'min_frequency' option. We simply specify the minimum fraction of observations that a category must have to not be classed as infrequent. For example, we can collapse any category level that has 10% or fewer observations in it:
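A sketch using the 'cut' column (the min_frequency option on OrdinalEncoder needs a fairly recent scikit-learn, 1.3+):

```python
# Any category covering 10% or fewer of the rows is collapsed into a shared level
min_freq_encoder = OrdinalEncoder(min_frequency=0.1)
min_freq_encoder.fit(X_train[["cut"]])
print(min_freq_encoder.infrequent_categories_)
```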
We can see that the two infrequently occurring categories have been collapsed into the same integer value during the transformation. We also have the option to cap the total number of category levels with the 'max_categories' option. This sets the upper limit on how many unique integers/category levels we want in our output, not including the new and missing levels that we might specify separately. Let's see how this works in practice:
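For example, capping 'cut' at 3 levels:

```python
# max_categories caps the number of distinct encoded levels, including the
# shared 'infrequent' level that absorbs the smallest categories
max_cat_encoder = OrdinalEncoder(max_categories=3)
encoded_cut = max_cat_encoder.fit_transform(X_train[["cut"]])
print(np.unique(encoded_cut))
```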
We can see that the three smallest categories have all been grouped together. One challenge with encoding categorical data ordinally is that the integer encodings begin to look like continuous numerical data and our machine learning algorithms will use them in that way. This can be less of an issue for tree-based models but a problem for linear ones. For example, in the above example a 'Fair' cut diamond gets a value of 2 but a 'Premium' cut diamond a value of 1. A linear model will interpret this encoding as meaning the 'Fair' diamond is worth twice the 'Premium' one, but that relationship doesn't really make sense.
One way to avoid this is to create separate columns for each level of the category with each taking the value 0/1 depending on which category the observation belonged to. This is called 'one-hot encoding' and is done in scikit-learn by the OneHotEncoder function. By default it creates a sparse output but we can turn it into a dense one (a regular array rather than a sparse matrix) with the option 'sparse_output=False':
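A minimal sketch (sparse_output replaced the older 'sparse' argument in recent scikit-learn versions):

```python
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(X_train[cat_cols])

print(one_hot_encoder.get_feature_names_out())
print(one_hot_encoded.shape)
```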
Since each column only takes a 0 or 1 it gets around the problem of certain categories getting a higher value than others. The main downside is that we can end up creating a lot of extra columns. One way around this is to use the 'min_frequency' option like we did with the OrdinalEncoder. The code below compares how many columns are created during our transformation with and without collapsing any categories with less than 10% of observations:
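Something like:

```python
full_encoding = OneHotEncoder(sparse_output=False).fit(X_train[cat_cols])
collapsed_encoding = OneHotEncoder(sparse_output=False, min_frequency=0.1).fit(X_train[cat_cols])

print("All categories kept:", len(full_encoding.get_feature_names_out()))
print("Infrequent (<=10%) collapsed:", len(collapsed_encoding.get_feature_names_out()))
```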
We get a reduction of 4 features by collapsing some of the more infrequent categories. Like with the OrdinalEncoder, we can also set up the transformation to handle new category levels that it might encounter after training:
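For example:

```python
# Unseen levels are routed to the existing 'infrequent' bucket at transform time
robust_one_hot = OneHotEncoder(sparse_output=False, min_frequency=0.1,
                               handle_unknown="infrequent_if_exist")
robust_one_hot.fit(X_train[["cut"]])
print(robust_one_hot.transform(pd.DataFrame({"cut": ["Bad"]})))
```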
We saw that the main drawback of the OrdinalEncoder was that some categories get arbitrarily larger values than others. This is solved with the OneHotEncoder but with the downside that we get a lot more features which can make training our model slower. At this point it's worth thinking about why we want to include the categorical data in the first place.
Presumably we think that they contain useful information and that certain categories have different and informative relationships with the target that the model can learn. One way to do this, that also gets around having lots of features and some of them having arbitrary values, is to simply encode the average value of the target in the place of each category. For our diamonds data set we can simply replace each category in 'cut', 'color' and 'clarity' with the average price of all the diamonds in the category. This is what the TargetEncoder does.
On the face of it this might seem like a sure-fire way to leak data into our training workflow, and if we simply took the average of our target for each category and fed it into our training data it definitely would be. Scikit-learn mitigates this by using cross-validation (which we'll go into more detail on later) to avoid leaking data when it does the encoding.
Essentially the process splits all the observations into groups and then the category value for a given group of diamonds is replaced with the average price for that category of the other groups of diamonds. This continues in a round-robin way until all the groups have had their categories replaced with the average value from the other groups.
This way the model never has information about the specific price of the diamond it is considering, but only information about other diamonds that belonged to the same category and so we avoid directly leaking data about the target for any observation. A further check to aid the generalisability of the process is that any categories that have very few observations get their category average replaced with something that is closer to the overall feature average to try and prevent overfitting.
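A minimal sketch (TargetEncoder needs scikit-learn 1.3+; note that fit_transform() is what applies the cross fitting described above):

```python
from sklearn.preprocessing import TargetEncoder

target_encoder = TargetEncoder(random_state=42)
encoded_cats = target_encoder.fit_transform(X_train[cat_cols], y_train)
print(encoded_cats[:3].round(0))
```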
We can see that we passed three categorical features in and got three features returned that are all numerical which is much more efficient than one-hot encoding. Any differences in values between the encoded categories are also meaningful as they will be reflective of the average difference in value between the categories, which is an advantage over ordinal encoding. The same category value will end up with slightly different encoded values as a result of the cross-validation, e.g. by default there are 5 folds so there can be up to 5 different encoded values per category.
We've now looked at how to transform numerical data and encode categorical data as part of our pre-processing. To finish off this section we'll have a quick look at how we can reduce the number of features (dimensionality reduction) using principal component analysis.
Principal Component Analysis (PCA)
Principal component analysis (PCA) takes our initial features and replaces them with linear combinations of themselves. The idea is that the new principal components capture as much (linear) information as possible from our original data and can capture and combine highly correlated and also lower variance features which saves us from having to remove them.
The principal components are created so that the first component explains the most variance in our data set and then each subsequent component explains less and less. This means it can be used as a tool for reducing down the number of dimensions in our data as we might choose to discard some of the later components without losing much information from our data. A handy benefit of PCA is that the principal components we create are also uncorrelated with each other which is preferred by some machine learning algorithms.
There's a great StatQuest that explains the technical side of how PCA works and a 3Blue1Brown video on the linear algebra behind it here. There's also this site which has an interactive visualisation of how PCA works and lets you play around with the data points to see how they change the principal components.
To perform PCA in scikit-learn we can use the PCA() function. One thing to note is that before we do PCA, we need to impute any missing values and standardise our data. This is because PCA uses the variance within the data to decide on the loadings for the various components and so features with larger scales will be artificially inflated and skew our PCA results. PCA also only works on numerical data so in the below example we drop the categorical features before imputing and scaling.
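A sketch of those steps:

```python
from sklearn.decomposition import PCA

# PCA needs complete, standardised, numerical data
numeric_train = X_train.select_dtypes("number")
numeric_scaled = StandardScaler().fit_transform(
    SimpleImputer(strategy="mean").fit_transform(numeric_train)
)

pca = PCA().fit(numeric_scaled)
```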
Now we've fit our PCA object we can extract the various components from it. We can also see how much variance each component accounts for, either individually or cumulatively, to help us decide on a threshold after which we discard some of the components:
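For example:

```python
print(pca.components_[0].round(2))                        # loadings of the first component
print(pca.explained_variance_ratio_.round(2))             # variance explained per component
print(np.cumsum(pca.explained_variance_ratio_).round(2))  # cumulative variance explained
```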
The array of the first component is the loading of each feature on the principal component. These are the weights against each feature that are multiplied and then summed to create the linear combination that is the first principal component. We can see that the first component accounts for 37% of all the variance in our data set and the second component accounts for 11%. The cumulative variance shows that after just 7 components we capture 94% of the variance so we could possibly discard the final 4 components as they're not adding much information and might even just be noise.
Interpreting principal components is difficult as they are a weighted combination of all the features; however, what can be instructive is to see which features have the strongest weighting for each component. The sign of the loading (positive/negative) isn't relevant as the components capture variance, which would remain the same if we flipped the sign of each of the loadings. Let's plot the loadings of the first 3 components that capture 58% of the total variance in our data.
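A sketch of that plot:

```python
loadings = pd.DataFrame(
    pca.components_[:3].T,
    index=numeric_train.columns,
    columns=["PC1", "PC2", "PC3"],
)
loadings.plot.bar(subplots=True, figsize=(10, 8), legend=False)
plt.tight_layout()
plt.show()
```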
The first principal component, which we saw accounts for 37% of the variance in our data, looks to be made up of the different size features e.g. 'carat', 'x', 'y' and 'z'. This makes sense as we know from earlier investigations that they are all strongly correlated. The second component appears to combine 'depth' and 'table' and is potentially then capturing the overall shape of the diamond after size has been accounted for by PC1. Finally the third component looks like it has combined all the random noise variables that we created.
Potentially we could just use the first 2 components as these capture the variance from the genuine features and probably a lot of the remaining variance in our data is actually being driven by the random noise features which wouldn't be useful for our model anyway. The challenge is that in real world applications where our data is noisy it usually isn't as obvious which features are noise vs genuine.
Pipelines
In the PCA example we saw how we needed to combine an imputation, transformation and rescaling step before we applied PCA. This involved creating lots of interim objects and applying the transformations sequentially. For larger sets of steps this could end up looking messy with lots of code and objects and introduce the risk of copy and paste errors.
Thankfully scikit-learn provides an easy way to chain different steps together into pipelines. As well as making it easier to combine lots of steps, we'll see how these really come into their own in the next section when we look at cross-validation. According to the scikit-learn docs "The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters."
To create our pipeline we simply call the Pipeline() function and pass it a "List of (name of step, estimator) tuples that are to be chained in sequential order." The ability to name the individual steps is useful as we start to create more complicated pipelines and want to refer to or access specific elements of the different steps. Let's go ahead and remake our PCA pre-processing steps into a pipeline:
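A sketch of that pipeline, reusing the steps from the PCA example:

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA()),
])
print(pipe)
```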
We can see from the print out that our 'pipe' object is a list of steps but nothing has actually been executed yet. Just like with our earlier pre-processing objects we can then fit and transform using our pipeline. Calling fit() will update each of the steps in our pipeline ready for us to call transform() and actually implement it on our data:
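For example:

```python
pipe.fit(numeric_train)
transformed = pipe.transform(numeric_train)
print(transformed[:3].round(2))
```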
Scikit-learn provides an alternative way to construct pipelines with the make_pipeline() function which "is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically."
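For example:

```python
from sklearn.pipeline import make_pipeline

# Same steps, with names generated automatically from the class names
auto_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), PCA())
print(auto_pipe.named_steps.keys())
```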
When we called transform() on our data you might have spotted that we passed in a pandas data frame but the output is an array. This is because scikit-learn likes to work with arrays. However, in newer versions of scikit-learn, the feature names from pandas data sets are carried through into the scikit-learn pipelines and can be accessed with the get_feature_names_out() function:
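For example:

```python
print(pipe.feature_names_in_)        # the pandas column names the pipeline was fitted on
print(pipe.get_feature_names_out())  # the names of the transformed output columns
```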
This can make it very easy to convert our array back into a data frame if we want to:
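For example:

```python
transformed_df = pd.DataFrame(transformed,
                              columns=pipe.get_feature_names_out(),
                              index=numeric_train.index)
print(transformed_df.head())
```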
One of the limitations of pipelines is that they can only work on data of the same type. For example, all of the steps above were performed on numeric data. If we tried to include the categorical data we'd get errors. Thankfully scikit-learn has another function that allows us to combine different pipelines and feature pre-processing steps into one. This is done with the ColumnTransformer which "allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space."
It works similarly to how we created our pipelines, i.e. we pass a list of (name, pipeline, columns) tuples, where each pipeline is itself a list of (name, transformation) tuples. In the below example we create two objects that split our training data into numerical and categorical feature subsets. We can then construct two pipelines which contain separate pre-processing steps for the numerical and categorical data. The output of these steps is then combined by the ColumnTransformer to create our final data set. The option "remainder = 'passthrough'" means we keep any columns that haven't been mentioned in the pipeline steps. We could also set this option to 'drop'.
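A sketch of that set-up, roughly matching the pre-processing described later in the post (IterativeImputer plus a Yeo-Johnson transformation and scaling for the numerical features, and one-hot encoding for the categorical ones):

```python
from sklearn.compose import ColumnTransformer

numerical_features = X_train.select_dtypes("number").columns
categorical_features = X_train.select_dtypes(["category", "object"]).columns

numerical_pipeline = Pipeline(steps=[
    ("impute", IterativeImputer(random_state=42)),
    ("yeo_johnson", PowerTransformer(method="yeo-johnson")),
    ("scale", StandardScaler()),
])

categorical_pipeline = Pipeline(steps=[
    ("one_hot", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, numerical_features),
        ("categorical", categorical_pipeline, categorical_features),
    ],
    remainder="passthrough",
)
print(preprocessor)
```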
Printing out our ColumnTransformer object we can see each of the steps and which features are included in each of them. If we go ahead and fit() our object on Train we can then extract the feature names from our fitted object:
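For example:

```python
preprocessor.fit(X_train)
print(preprocessor.get_feature_names_out())
```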
Notice how our feature names are now prefaced by their pipeline and a double underscore "__". This is because the features now count as nested parameters and the double underscore is scikit-learn's convention for accessing them.
The feature names count as nested as they belong to a step in the pipeline which is itself a step in the ColumnTransformer. Within our ColumnTransformer, for example, we could reuse the same features in multiple pipelines e.g. perhaps experimenting with a power transformation of carat in one and then binning it in another to create some additional features. If we did this it wouldn't make sense to refer to 'carat' anymore as there would be two instances, each one tied to its own pipeline. This is why the feature names get prefaced with their pipeline name and need to be accessed with __.
Hopefully this section has served as a useful introduction to pipelines. We'll see how we can create even more complicated ones in the next tutorial that combines data pre-processing, feature selection and hyperparameter tuning all into one pipeline. For now we'll see why pipelines are so invaluable when we want to combine multiple pre-processing steps with robust model performance estimation with cross-validation.
Cross Validation
At the very start we created a Train and Test split so we could build our model on Train and accurately measure its likely performance on unseen data with Test. However building a model is an iterative process. We might want to try different feature combinations, different pre-processing steps and different modelling algorithms. The Test set though is only there to measure the performance of our final model: if we tested every model iteration on it we'd start to optimise our model towards it and so lose its ability to provide an unbiased estimate of our final model's performance on new data. However, we also can't measure each model's performance just on Train as we'd have no way of telling if a model is actually better or just really overfitting our training data.
The usual way round this problem is to do some form of cross-validation. This is where we create separate partitions within our Train dataset for the purpose of measuring interim model performances during the building and training process. These extra partitions are normally referred to as 'Validation' or 'Assessment' sets. The idea is we train our interim models on some of Train i.e. Train-excluding-the-Validation-set and then use the left out Validation data to measure the interim model's performance. This way we can see how our model performs on unseen data without having to use the Test set. There's a good explanation of it in Max Kuhn's book on feature engineering here and it's also worth reading the scikit-learn guide on cross-validation.
A challenge with using a Validation set is that we lose some of the data for our model training which can affect performance. How we sample our data to create the Validation set could also impact the measurement accuracy if we don't have much data or some categorical values are rare. K-fold cross-validation aims to get around these issues by creating multiple Train-Validation splits and then repeating the train and holdout process so we get multiple estimates of our model's performance that we can then average to try and get a more robust result. Once we have these results and are able to identify which of our interim models performed best, we can retrain a final model using those parameters on 100% of Train before measuring the performance on Test.
The diagram below from the scikit-learn docs gives a good illustration of how this works in practice:
In the above example the data is split into 5 equally sized 'folds' and the model is trained on 4 of the folds with the 5th 'held out' as a Validation set. The process repeats until every fold has been a holdout fold. The downside with k-fold cross-validation is that it requires us to build a lot of models. If we have 5 folds that means we build and measure 5 models (one per holdout fold) and then a final 6th model on 100% of Train. If we have lots of data this could make the training process too slow so there are other options available such as ShuffleSplit which can be configured to only take one Validation set similar to how we created the Test set.
It's worth noting as well that the models built during cross-validation do not learn from each other or directly help improve performance. Each model is discarded once we've measured its performance on the holdout fold. We might also get very different models with different performances and features selected across folds. This is completely fine as at this point we're not interested in building the best or final model but rather seeking to answer "if I were to train a model with these parameters e.g. this algorithm, with these hyperparameters and pre-processing steps, how would it perform on new data?"
What we're testing with cross-validation is different modelling schemas or pipelines, building lots of models defined by them on different snapshots of data and then measuring their average performance to try and find out which schema is likely to perform the best. Once we've found our best modelling pipeline via cross validation then we train our final model and measure its performance on the Test set.
Let's see how it works in practice by creating a modelling pipeline. We'll reuse the pre-processing pipeline from before and this time add a linear regression model into it as our estimator:
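A sketch of that pipeline:

```python
from sklearn.linear_model import LinearRegression

model_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LinearRegression()),
])
```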
The huge benefit of building our pipeline with scikit-learn is that it takes care of all the data splits for us and avoids leaking any data. For example, up until now we've been applying our pre-processing steps on 100% of Train to help visualise and understand the different transformations. But just as we wouldn't call fit() on Test or a combination of Train and Test, when we're using cross-validation we can't pre-process 100% of Train in advance anymore. This is because our Validation set is drawn from Train and is meant to mimic our Test set. We need to fit() on Train-without-Validation and then transform() the Validation data. Thankfully scikit-learn takes care of all of this for us.
Let's create a cross-validation object that stores the specification of how many folds we want. The 'shuffle = True' option means that rather than process our data by row it'll randomly shuffle the row order before creating the folds. This is generally preferable to avoid unintentionally creating folds based on chronology e.g. older data might occur at the top of our dataset. The 'random_state' just means that we'll get the same random sample result each time we call the object so we know our folds will be the same across different model runs.
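For example:

```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
```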
Now we've got our modelling pipeline and created our cross-validation object we can tell scikit-learn to run and measure our modelling pipeline. The easiest way to do this is the cross_val_score() function which takes in our pipeline, the training data and our cross-validation object and then trains and assesses the model on each Validation set for us automatically. We can also specify what type of accuracy measure we want to use. By convention, scikit-learn tries to maximise model performance so our error metric is the negative RMSE i.e. by maximising the negative RMSE it looks for the value closest to 0.
To get a positive value of RMSE we can simply wrap the process in abs(). As we have 5 folds we'll get 5 measurements of our pipeline's performance. This can be useful to see how much variability there is in scores. Ideally, if our model is generalising well, we'd hope for them to all be very similar to each other. We can then take the mean() of the results and also calculate the standard deviation:
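A sketch of that scoring step:

```python
from sklearn.model_selection import cross_val_score

scores = abs(cross_val_score(model_pipeline, X_train, y_train,
                             cv=kfold, scoring="neg_root_mean_squared_error"))
print(scores.round(0))
print("Mean RMSE:", scores.mean().round(0), "Std:", scores.std().round(0))
```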
So it looks like our pipeline is getting an average RMSE of around $1,737 with a standard deviation of $18 across folds. Some of that variability will be down to data variance across samples. If we wanted to we can actually get scikit-learn to repeat the process of taking k-folds multiple times i.e. run repeated versions of k-fold cross-validation where each repeat samples different observations when creating the folds.
This can give us extra confidence in the robustness of our results but comes at the additional cost of creating even more models. We also need enough data to be able to create meaningfully different folds each time we repeat the process. Let's try it now using the RepeatedKFold() function:
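For example:

```python
from sklearn.model_selection import RepeatedKFold

repeated_kfold = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
repeated_scores = abs(cross_val_score(model_pipeline, X_train, y_train,
                                      cv=repeated_kfold,
                                      scoring="neg_root_mean_squared_error"))
print("Mean RMSE:", repeated_scores.mean().round(0))
```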
In the above example we created 5 lots of 5 folds so 25 models in total! For really big data it might not be feasible to run the entire pipeline multiple times. If our data is big enough we might simply take a single 20-25% Validation set and use that to measure our interim pipeline's performance. To do that in scikit-learn we can use ShuffleSplit():
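For example:

```python
from sklearn.model_selection import ShuffleSplit

# A single 25% Validation split rather than multiple folds
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
shuffle_scores = abs(cross_val_score(model_pipeline, X_train, y_train,
                                     cv=shuffle_split,
                                     scoring="neg_root_mean_squared_error"))
print("RMSE:", shuffle_scores.round(0))
```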
We can see that for each of the 3 approaches we get similar RMSE scores of $1,737 for k-fold, $1,729 for repeated k-fold and $1,761 for our single shuffle split. Each score is similar which is reassuring. It's worth highlighting though that the different scores do not mean that a particular cross-validation method somehow created a better model but rather will be down to sampling variance in how the Validation set was created each time. Remember cross-validation is about having accurate measurement of the training process rather than increasing model performance directly.
For example, the single Validation ShuffleSplit result of $1,761 is similar to the first k-fold result of $1,761. However we can see that the results from the other 4 folds are all <$1,750 which suggests we might have just gotten unlucky with the sampling that created the first fold or the ShuffleSplit. This is why repeating the process is so important. If we wanted to pick an estimate for our final model pipeline I'd choose the $1,729 for the repeated k-fold as that measures performance across the greatest number of Validation sets and should incorporate the effect of lots of different sampling biases, averaging out at something closer to the truth, not because it got the best RMSE!
Models
So far we've run a simple linear regression model on our data with a pipeline and measured its performance using cross-validation. We can easily swap out our linear model for a different algorithm and there is a wide selection available to choose from with scikit-learn, with a full list available here.
Part of the fun of writing these tutorials is trying out new things so I've tried to make the following as comprehensive a list as possible to test out below. If it's too much or takes too long to run on your system, feel free to comment out any you're not interested in. The large-scale learning and robust-to-outlier models work for our regression task but are really designed for other applications so are unlikely to be top performers.
If you're not sure which to pick I would say go for: LinearRegression() as a good baseline, ElasticNetCV() as a linear model that can perform feature selection for you (the CV version also does some hyperparameter tuning by default but we'll cover that in the next post). The DecisionTreeRegressor() is a classic decision tree so will handle correlated features, non-linear trends, interactions and feature selection and serve as a nice comparison to the linear models. More powerful but less interpretable models are the ensemble models: RandomForestRegressor() and HistGradientBoostingRegressor(). Finally you might want to try the MLPRegressor() which is a neural network.
After creating our list of modelling algorithms we want to try, we can simply loop through each one using scikit-learn to create a pipeline that takes our pre-processing pipeline and then a model from our list. We can use our same k-fold cross-validation object to measure performance and then append the results into a data frame to compare performance.
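A shortened sketch of that loop, using a reduced list of candidate models rather than the full list from the post:

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

models = [
    LinearRegression(),
    ElasticNetCV(),
    DecisionTreeRegressor(random_state=42),
    RandomForestRegressor(random_state=42),
    HistGradientBoostingRegressor(random_state=42),
    MLPRegressor(random_state=42),
]

results = []
for model in models:
    candidate = Pipeline(steps=[("preprocessing", preprocessor), ("model", model)])
    rmse = abs(cross_val_score(candidate, X_train, y_train, cv=kfold,
                               scoring="neg_root_mean_squared_error")).mean()
    results.append({"model": type(model).__name__, "mean_rmse": rmse})

results_df = pd.DataFrame(results).sort_values("mean_rmse")
print(results_df)
```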
Nice! So out of the 25 models we tried it looks like the ExtraTreesRegressor() performed the best. It looks like the top spots are all claimed by tree-based models which suggests our data might have lots of non-linear trends or interactions in it that these models are better at capturing.
Another advantage of using scikit-learn is that as it's such a well-established package, newer models developed outside of it are designed to use a lot of the conventions and will work out the box with our existing pipelines. To demonstrate this, let's import 3 of the most powerful machine learning packages for tabular data: XGBoost, CatBoost and LightGBM to see how we can easily integrate them into our modelling workflow:
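A sketch of how they slot in; each package needs to be installed separately, and their regressor classes follow the scikit-learn estimator API so they work with the same pipeline and cross-validation objects:

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

boosted_models = [
    XGBRegressor(random_state=42),
    LGBMRegressor(random_state=42),
    CatBoostRegressor(random_state=42, verbose=0),
]

for model in boosted_models:
    candidate = Pipeline(steps=[("preprocessing", preprocessor), ("model", model)])
    rmse = abs(cross_val_score(candidate, X_train, y_train, cv=kfold,
                               scoring="neg_root_mean_squared_error")).mean()
    print(type(model).__name__, round(rmse))
```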
We can see that all 3 of the new models performed well with CatBoost taking the top spot. This makes sense as all 3 are tree-based gradient boosting models. Perhaps one of the reasons our linear models didn't perform well is that our pre-processing steps weren't quite as optimal as they could be? Let's try creating some additional pre-processing pipelines using a mixture of the different transformations we saw earlier to see if we can tease out some extra performance.
Comparing different pre-processing pipelines
Our current pipeline uses the IterativeImputer, performs a Yeo-Johnson transformation to give our numerical data a more normal shape, one-hot encodes our categorical data and then standardises everything to get it all on the same scale. Let's try applying some different transformations and using PCA and then we'll create some separate tree-only pipelines that don't worry about transforming our data but compares one-hot vs ordinal encodings.
We'll add our 3 new boosted tree models into the original list of models and then create a separate list of tree-based models to try with our tree pipeline. We'll add a new column to our results table that not only records the model and its performance but also the pre-processing pipeline:
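A shortened sketch of that set-up, with a simplified 'no transformation' pipeline and a tree-friendly ordinal pipeline rather than the post's full set; the loop that runs them follows after the next paragraph:

```python
# A pipeline that leaves the shape of the numerical data untransformed, and a
# tree-friendly pipeline that swaps one-hot encoding for ordinal encoding
no_transform_preprocessor = ColumnTransformer(transformers=[
    ("numerical", SimpleImputer(strategy="mean"), numerical_features),
    ("categorical",
     OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
     categorical_features),
])

tree_preprocessor = ColumnTransformer(transformers=[
    ("numerical", SimpleImputer(strategy="mean"), numerical_features),
    ("categorical",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_features),
])

all_models = models + boosted_models
tree_models = [DecisionTreeRegressor(random_state=42),
               RandomForestRegressor(random_state=42),
               HistGradientBoostingRegressor(random_state=42)] + boosted_models
```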
Let's run our loops and append the performance to our data frame. As we've got 28 models x 2 pipelines and a further 9 models x 2 for our tree pipelines (in addition to our original 28 results!) I've added a step that'll rank the best performance for each modelling algorithm and then only return its best performing pipeline:
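And a sketch of the loop itself, keeping only the best pipeline per model:

```python
experiments = (
    [("original", preprocessor, m) for m in all_models]
    + [("no_transform", no_transform_preprocessor, m) for m in all_models]
    + [("tree_ordinal", tree_preprocessor, m) for m in tree_models]
)

pipeline_results = []
for prep_name, prep, model in experiments:
    candidate = Pipeline(steps=[("preprocessing", prep), ("model", model)])
    rmse = abs(cross_val_score(candidate, X_train, y_train, cv=kfold,
                               scoring="neg_root_mean_squared_error")).mean()
    pipeline_results.append({"pipeline": prep_name,
                             "model": type(model).__name__,
                             "mean_rmse": rmse})

# Rank by RMSE and keep only each model's best performing pipeline
best_per_model = (pd.DataFrame(pipeline_results)
                  .sort_values("mean_rmse")
                  .drop_duplicates("model"))
print(best_per_model)
```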
We can see that the CatBoost model is still number 1 and one of our new pipelines managed to increase its performance from around 647 to 629 RMSE. In fact a lot of the top models preferred our new pipelines which generally left the shape of our data untransformed. A lot of the tree-based models preferred the ordinal encoding to one-hot encoding too. A lot of our linear models actually performed best with our original pipeline which is reassuring at least.
We saw at the start how the distribution of our target was right-skewed, with most diamonds being cheaper but a long tail of more expensive ones, and that a simple log transformation could make it look more like a normal distribution. Let's try re-running our pipelines with different target transformations to see if changing the target helps our models make better predictions.
The code below creates a list of target transformations to try: a simple log, a StandardScaler, a QuantileTransformer with normal output and a Yeo-Johnson transformation. I also create a separate data frame to store the results in. Rather than run every single pre-processing pipeline against every target transformation, I just take the current best performing pre-processing pipeline for each model and try that with each target transformation. Finally, to save space, I only keep the best performing target transformation for each model.
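A minimal sketch of that loop is below, using TransformedTargetRegressor to apply the target transformation before fitting and invert it when predicting; best_setups is an assumed dictionary mapping each model name to its estimator and best pre-processing pipeline from the previous step.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import (FunctionTransformer, PowerTransformer,
                                   QuantileTransformer, StandardScaler)

target_transformers = {
    "log": FunctionTransformer(func=np.log, inverse_func=np.exp),
    "standard_scaler": StandardScaler(),
    "quantile_normal": QuantileTransformer(output_distribution="normal"),
    "yeo_johnson": PowerTransformer(method="yeo-johnson"),
}

target_results = []
for model_name, (model, prep) in best_setups.items():
    for transform_name, transformer in target_transformers.items():
        # the target is transformed before fitting and the predictions are
        # automatically inverse-transformed back to dollars
        wrapped = TransformedTargetRegressor(regressor=model, transformer=transformer)
        pipe = Pipeline([("preprocessing", prep), ("model", wrapped)])
        scores = -cross_val_score(pipe, X_train, y_train, cv=kf,
                                  scoring="neg_root_mean_squared_error")
        target_results.append({"model": model_name,
                               "target_transform": transform_name,
                               "rmse": scores.mean()})

# keep only the best performing target transformation for each model
target_results_df = (pd.DataFrame(target_results)
                     .sort_values("rmse")
                     .groupby("model", as_index=False)
                     .first())
```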
We can see that CatBoost is still top, followed by a lot of the same tree-based models. However, have a look at the performance of the linear models: they've improved hugely! The DummyRegressor, a simple baseline that just predicts the average value from the training data for every observation, is now well and truly at the bottom of the table, whereas before it was outperforming a couple of the support vector machines. This just shows what an impact transforming our target has had for these models.
To make it even more obvious, let's merge our two sets of results so we can see how each model, with its preferred pre-processing pipeline, performs with and without transforming the target:
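One way to do that merge, assuming the two results data frames from the sketches above (best_per_model without a target transformation and target_results_df with one):

```python
comparison = best_per_model.merge(
    target_results_df,
    on="model",
    suffixes=("_no_target_transform", "_with_target_transform"),
)

# positive values mean the target transformation reduced the RMSE
comparison["rmse_reduction"] = (comparison["rmse_no_target_transform"]
                                - comparison["rmse_with_target_transform"])
print(comparison.sort_values("rmse_with_target_transform"))
```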
We can see that most models performed better with some form of target transformation. Our previously best performing CatBoost model shows a slight improvement, but the linear models, neural network and support vector machines show huge gains! In some cases we've more than halved our error.
Our ElasticNetCV has gone from an RMSE of 2657 to around 749, which is nearly as good as our best tree-based models. This is with the same features and pre-processing steps as before, so all the gains come purely from correctly transforming the target. Let's take the best performing models overall and re-rank them to find our final best models.
Looking at the re-ranked table, most of the models performed better with some form of target transformation, and a straightforward log transformation more often than not performed the best. This is useful to know as a log transformation doesn't need to learn any parameters from our data, so it's one less thing to train and it removes a potential source of data leakage.
Our top performing model overall is still the CatBoost, now with a Yeo-Johnson target transformation. The top 7 models are all some form of tree-based model and the best performing non-tree model is the neural network (MLPRegressor) at number 8, followed by the LassoCV at number 9 and the ElasticNetCV rounding out the top 10.
Interestingly, the linear models with inbuilt feature selection (both LassoCV and ElasticNetCV use L1 regularisation, which allows feature coefficients to be set to 0) outperform the RidgeCV and regular LinearRegression, which are required to use all of the features in the input set. This makes sense as we know there are at least 5 useless features that we created at the start.
In terms of picking a final model, I have a preference for quick and interpretable linear models, but in this instance the boosted tree models have much better performance. We'd have to transform the target anyway for a linear model, which makes it harder to interpret, so we'd be better off going for the extra predictive power of our CatBoost model. The only change I'd potentially consider is checking whether a vanilla log transformation of the target works nearly as well as the power transformation, as it's an easier pre-processing step to maintain and explain.
Now we've tested all our models and pipelines, let's take our final model, fit it on 100% of Train and score up our Test data. Once we've done that, we can see how to get the feature importance scores from our final model.
Predicting on Test and calculating Feature Importance
Now we've got our best performing model and pipeline, we can finally see how it performs on the Test set. If we've made a robust model, one that generalises well to new data and isn't overfitting, and our pipeline doesn't have any leaks, we should expect our model to perform similarly well on the Test set as it did on the holdout folds.
To predict on the Test set we create one final pipeline of our best model and transformations, and this time we fit it on all of the training data, i.e. no holdout folds or Validation sets. Once we've trained the model on 100% of Train, we can call the predict() function on our Test data. To calculate the RMSE from the predictions, we can wrap them in the mean_squared_error() function and set squared=False, which returns the RMSE of our final model on the Test data.
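Putting that together, a sketch of the final fit and Test-set scoring might look like this, assuming X_test and y_test from the original split and the winning CatBoost set-up from the comparisons above (its best pre-processing pipeline is called best_prep here as an assumption, paired with a Yeo-Johnson target transformation):

```python
from sklearn.metrics import mean_squared_error

final_model = Pipeline([
    ("preprocessing", best_prep),
    ("model", TransformedTargetRegressor(
        regressor=CatBoostRegressor(random_state=42, verbose=0),
        transformer=PowerTransformer(method="yeo-johnson"))),
])

# fit on 100% of Train - no holdout folds this time
final_model.fit(X_train, y_train)
test_preds = final_model.predict(X_test)

# squared=False returns the RMSE rather than the MSE
test_rmse = mean_squared_error(y_test, test_preds, squared=False)
print(round(test_rmse, 2))
```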
Our RMSE on the Test set of $583.97 is close to our value of $622.49 on the holdout folds and is actually a slight improvement. This could be down to the model having slightly more training data to work with on the final fit, i.e. 100% of Train instead of 80% when we trained with 5 folds. It could also be down to using a relatively small sample for training, which makes measuring performance harder and more liable to sampling variance.
Now that we've fit our model on 100% of the Train data, we can calculate the relative importance of each feature used in our final model. As we've used a tree-based model, we can use the .feature_importances_ attribute of our fitted model:
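For the final pipeline sketched above, the importances can be pulled out like this; the feature names come from the pre-processing step (available in recent scikit-learn versions) and the fitted CatBoost sits on the TransformedTargetRegressor's regressor_ attribute.

```python
# names of the features after pre-processing (one-hot columns, PCA components, etc.)
feature_names = final_model.named_steps["preprocessing"].get_feature_names_out()
# the fitted CatBoost lives inside the TransformedTargetRegressor
importances = final_model.named_steps["model"].regressor_.feature_importances_

importance_df = (pd.DataFrame({"feature": feature_names,
                               "importance": importances})
                 .sort_values("importance", ascending=False))
print(importance_df.head(10))
```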
We can see that CatBoost found the 'y' feature (width in mm) the most useful, followed by 'carat', 'z' (depth in mm), 'x' (length in mm) and 'clarity'. These all make sense as they essentially record the size of the diamond followed by how blemished it is. Reassuringly, our randomly generated features are all ascribed a low importance.
If we'd picked a linear model as our final model, we could use the .coef_ attribute to get the coefficient of each feature as another way of determining importance. One thing to note when interpreting the coefficients of linear models is that they are affected by the scale of the underlying feature, which is why it's important to standardise all the feature scales in a pre-processing step.
We can see that this time we get a lot more features in our final model due to the categorical features being one-hot encoded. I've wrapped the coefficients in abs() so they're all positive, which makes comparing their relative sizes easier. We can see that 'carat' has the largest coefficient, suggesting that the LassoCV model found it to have the largest impact on the price of the diamond. We could also run the code without the abs() to see which features have the largest impact and in which direction, i.e. a price increase or decrease.
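As a hypothetical sketch, if the LassoCV pipeline (called lasso_pipe here, assumed to be fitted with the same step names and target transformation as the final pipeline above) had been chosen instead, its coefficients could be pulled out and ranked like this:

```python
coef_df = pd.DataFrame({
    "feature": lasso_pipe.named_steps["preprocessing"].get_feature_names_out(),
    "coefficient": lasso_pipe.named_steps["model"].regressor_.coef_,
})

# absolute values make the relative sizes easier to compare;
# the sign (kept in 'coefficient') shows the direction of the effect
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()
print(coef_df.sort_values("abs_coefficient", ascending=False).head(10))
```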
Next steps: feature selection, hyperparameter tuning and model stacking
Well done on making it this far! In this post we've seen how to use scikit-learn to split our data into Train and Test, and the different pre-processing options available to us, including transforming our target variable to improve model performance. We've seen how to avoid data leakage by using pipelines and how to accurately measure performance using cross validation. Finally, we saw how to easily build and compare a multitude of different models before picking our final one, measuring its performance on the Test set and calculating feature importance.
So far we've run nearly all of our models with the default hyperparameters and the full list of features even though we know some of them are just random noise. In the next post we'll have a look at how we can tune the different hyperparameters of our models to increase their performance even more, perform feature selection to weed out any uninformative features and also combine multiple models into a super-learner!