LightFM tutorial for creating recommendations including for cold-start users
In this post we're going to be using the LightFM package to create wine recommendations in Python. We'll create recommendations for users, calculate item-item similarities, and use item and user features to solve the cold-start problem.
Contents
Hybrid Matrix Factorisation with LightFM
LightFM, according to the original paper, is a "hybrid matrix factorisation model representing users and items as linear combinations of their content features’ latent factors". There's also a talk from Maciej, the package author, on how it works here. Essentially LightFM uses collaborative filtering and matrix factorisation to learn embeddings for users and items but also for their features e.g. user age/gender or item colour/price, etc. This is what makes it a hybrid recommender as it uses collaborative filtering to learn the embeddings but the final representations of users and items are created from the sum of embeddings that correspond to the user/item features i.e. their content/metadata.
This makes it very powerful and flexible as it tries to marry the advantages of collaborative filtering with a content-based approach to making recommendations. As the embeddings for the features are learnt via collaborative filtering, the hope is that they are more informative than just using item features alone. For example, with wine, the same grapes can have different names depending on their country of origin e.g. 'Syrah' and 'Shiraz'. A purely content-based model would look at these features and say the items are not similar i.e. they have different features. LightFM however can see that although they have different features, when it comes to learning the embeddings, it might discover that users who purchase 'Syrah' also purchase 'Shiraz' and so the representations it learns for the two features will actually be very similar.
The feature data can also add additional information for the model to use which is particularly helpful when our data is very sparse i.e. we don't have a lot of interactions between users and items for it to learn from. For example, we might have an item with a few interactions but if we also know that it is a "New Zealand Sauvignon Blanc" we can recommend it with a high degree of confidence to users who purchase other items tagged as being New Zealand Sauvignon Blancs.
We have a couple of options when it comes to using feature data in LightFM. We'll look at examples using no additional features, just item features and then finally user and item features. Including feature data allows LightFM to solve the cold-start problem (creating recommendations for entirely new users or items) which traditional collaborative filtering models can struggle with. With LightFM, for new items and users that we don't have any interaction data for, we simply sum up the embeddings associated with the features we do have data for to create meaningful representations of our new item or user. An example of how this works is below. For the first wine, 'Bogle Merlot', we have interaction data so have an embedding for the wine as well as its other features. For the second wine, we don't have any interaction data but we can still create a representation for it from summing the embeddings of the features associated to it.
If we choose not to use features at all, our LightFM model works like a standard collaborative filtering, matrix factorisation model. As we still need to represent users and items in terms of their features though, LightFM simply represents each item and user as being their own unique feature i.e. the representation for User 1 is simply the embedding for a feature 'User 1' that is uniquely associated to them. LightFM calls these 'identity features' and creates them by default even when we include additional features. This way, when LightFM learns all the embeddings to create the final representations, there is a unique embedding for each item/user included, and the recommendations it creates are unique for each user and item even if they share other features.
For example, if we include the additional features that User 1 is 30 years old and female then their final representation would be the sum of the embeddings 'User 1' + 'age 30' + 'female'. The identity features are helpful as they allow the model to learn the individual preferences or behaviours for each user and item. For example, if we had User 5 who was also 30 years old and female, their final representation would still be unique to them as it would be the embedding for 'User 5' + 'age 30' + 'female'. We might expect the two users to share some recommendations as they share features but the model can still tailor the list to their individual preferences.
We can actually tell LightFM to not create identity features so User 1 and User 5 would both simply be represented as 'age 30' + 'female' and as such would get the same recommendations. This can be helpful in scenarios where lots of items or users are either new or anonymous but at the cost of severely restricting the expressiveness of the model e.g. we go from being able to create recommendations uniquely for each user/item to only making recommendations for the N combinations of user/item features. All the examples in this tutorial make use of identity features for users and items.
Now we know how LightFM creates the different representations for our users and items, let's go ahead and load our data:
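Something like the below should do the trick (a minimal sketch assuming the standard Instacart CSV file names from Kaggle; adjust the paths to wherever you've saved the files):

```python
import pandas as pd

# one row per order, with eval_set telling us if it's a 'prior', 'train' or 'test' order
orders = pd.read_csv("orders.csv")
# the items purchased in each 'prior' and 'train' order
order_products_prior = pd.read_csv("order_products__prior.csv")
order_products_train = pd.read_csv("order_products__train.csv")
# product, aisle and department lookups so we can filter down to the wine aisles
products = pd.read_csv("products.csv")
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")
```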
We'll use orders tagged as 'prior' to form our training data as these are the earliest purchases by customers. We'll then use the 'train' data as our actual test data as these are the most up-to-date transactions that we have data for. The original Instacart challenge on Kaggle was to predict the re-ordering of items so the Test data in the data set is blank in terms of what items were actually bought. Now we've loaded our data, let's join it all and have a quick look at what the most commonly purchased wines are:
We can see that the top wine is 'Sauvignon Blanc' and purchased over 8k times! This quickly tails off though with the 10th most popular wine 'Red Blend' being purchased just over 1.7k times. The inclusion of the grapes in the product names is helpful as we can use these to create some item features later. For now, let's prep our data ready for LightFM.
LightFM is designed to work with implicit feedback which is when users interact with an item e.g. clicked on it, viewed it, bought it, etc. rather than explicit feedback where a user might interact with a product and then rate it as 'good' or award it 5 stars. The LightFM author actually did a whole talk on why you want to use implicit data for recommenders. Essentially there's more information, in terms of learning users' preferences, contained in what they have or haven't interacted with across all items rather than just comparing the rating of items within the subset of items they have interacted with. For example, let's say I buy five different Malbec wines. I rate one highly, one badly and don't review three of them. The key takeaway here is that I obviously really like Malbecs, not that I had one good one and one bad one.
As LightFM uses implicit data we only need to pass it a list of user + item interactions rather than worry about scores or ratings. We do have the option to weight certain interactions i.e. make the model focus more attention on correctly predicting those interactions which we'll experiment with later. To create the weights I count how many times a user buys a specific wine and then cap it at 5 so the most an interaction can be upweighted by is 5x even if it's bought more often. As the Test set is just for scoring we don't need weights for it. We'll also create a unique list of user IDs and product IDs which we'll pass to LightFM for it to create its internal mappings of users/items to indices from.
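As a rough sketch of the weighting step (assuming a train_df data frame with one row per user_id/product_id purchase from the 'prior' orders):

```python
import numpy as np

# count how many times each user bought each wine
weights_df = (
    train_df
    .groupby(["user_id", "product_id"])
    .size()
    .reset_index(name="purchase_count")
)
# cap the weight at 5 so frequently repurchased wines don't dominate training
weights_df["weight"] = np.minimum(weights_df["purchase_count"], 5)

# unique lists of user and product IDs for LightFM to build its mappings from
all_users = weights_df["user_id"].unique()
all_items = weights_df["product_id"].unique()
```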
The other thing to note is that although LightFM can create predictions for new/cold-start items and users we don't actually want to include any IDs for these in the Training or Test data. This is because of how LightFM creates and uses the identity features. If we included IDs for cold-start users in Train they'd have no interactions. If we include them in Test and not Train then LightFM will error when it sees the new ID as it won't have a corresponding identity feature for the new item/user. We'll see how we can get round this later but for now we'll remove any items/users from Test that aren't in Train.
We also have a slightly interesting/unusual decision to make about how we measure our recommender results. Normally we'd want to find new items that users haven't interacted with to recommend them. However for something like grocery retail, recommending items that users have purchased before is probably a legitimate, if not very original, option. Let's create a Test set that includes repurchased items and one that only has new-to-user items i.e. items they haven't purchased before. We can then compare how our model performs on both.
The way LightFM works behind the scenes is that it maps each item, user and all of the different features to unique indices that are then used to lookup those users/items/features in the various matrices that get used to create the recommendations. Thankfully LightFM provides some helper functions for creating the lookups for us.
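A minimal sketch of this, using the Dataset class and the all_users/all_items lists from the previous step:

```python
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=all_users, items=all_items)

# mapping() returns 4 dictionaries:
# user id -> index, user feature -> index, item id -> index, item feature -> index
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
```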
One thing to note about the above is that even though we haven't passed in a list of user or item features this time, we still get 4 mappings back (user id, user features, item id, item features). This is because LightFM creates separate matrices for users and items and their associated features. In our case our feature matrices are just the identity features i.e. each user/item is their own feature which is why we get feature mappings even though we only passed in users/items. The ID and feature mappings will be the same length but we'll see later on when we start to add in extra features that the feature mappings get longer.
If we have a look at the mappings, we can see how our user IDs have been mapped to continuous indices that start at 0 e.g. user ID 4 maps to index 0, user ID 17 maps to index 1, etc. LightFM uses these internal mappings to construct our interaction matrices and also lookup matrices of user and items to their feature representations.
As well as creating our mappings to LightFM's internal indices, we might want to create inverse mappings so we can get back to actual users and products when we come to make recommendations.
Once we've got our mappings between users and items and their LightFM indices, we can use the build_interactions() function to create our matrix of user-item interactions for LightFM to learn from. The function takes an iterable e.g. a list, tuple, dictionary, etc of user and item interactions and an optional weight column that tells LightFM how much more or less important that specific interaction is when it comes to training. If we don't pass a weight column, LightFM assigns every interaction the value of 1 by default. Let's go ahead and create our interaction matrix:
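Something like this, assuming the weights_df data frame from earlier:

```python
# build_interactions() returns two sparse matrices: the interactions and the weights
(train_interactions, train_weights) = dataset.build_interactions(
    (row.user_id, row.product_id, row.weight) for row in weights_df.itertuples()
)
print(repr(train_interactions))
print(repr(train_weights))
```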
We can see that what we get out are 2 sparse matrices. The first is our interactions matrix which records user-item interaction and is 1 row per user and 1 column per item with a 1 where an interaction took place. The other sparse matrix is our weight matrix. If we didn't pass weights to the function this would be identical to our interactions matrix but as we did use weights this matrix is of the same shape but records the individual weights for the interactions. Let's convert them to dense types and see how they differ.
So we can see how the matrices both record user-item interactions. Our interaction matrix just has a 1/0 to record whether an interaction took place. Our weight matrix records the same interactions but also the weight for them e.g. on row 2 column 3 we can see an interaction took place and then on the same position in the weight matrix we can see that interaction has the value of 4. Let's now create our interactions matrices for our test data. We don't need to worry about weights for these as they are only used in training although note that LightFM still creates a weight matrix by default.
Our first LightFM Model
Now we've got our interaction matrices we can create our first model! To start with we'll just use a vanilla matrix factorisation approach without any additional features. First we define our model by calling LightFM() and setting the different parameters. I've pretty much left them on the default options but written them out anyway to give an idea of the options we have. Once we've defined our model we can call fit() to train the model and pass it our interactions matrix. The default value only trains the model for 1 epoch so I've set it to 20.
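As a sketch (the exact values here, particularly the choice of the WARP loss, are my assumptions rather than anything special):

```python
from lightfm import LightFM

model = LightFM(
    no_components=10,             # size of the embeddings
    learning_schedule="adagrad",
    loss="warp",                  # a ranking loss; 'logistic' is the default
    learning_rate=0.05,
    item_alpha=0.0,               # no regularisation to start with
    user_alpha=0.0,
    random_state=42,
)
model.fit(train_interactions, epochs=20, num_threads=4)
```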
Assessing recommenders is slightly different to normal regression/classification problems. We're using binary did/didn't interact data as our target so it seems like a classification problem. However most users don't interact with most items so a measure like Accuracy isn't suitable as we'd probably get 99%+ Accuracy just by predicting 0 for everyone. We might also have some users who are highly likely to buy lots of wines but equally we'll have some who are very unlikely to buy anything at all. For normal classification we'd want to predict 0 for users who are unlikely to buy anything. However in a recommender setting this isn't an option. If we have 10 recommendation slots on the webpage to fill for each user, we can't just leave them blank because we don't think they're likely to buy anything, we still need to show them something.
The best way to think of recommendation problems is that they're ranking problems i.e. we need to show 10 items to each user so what we want is the best 10 items for that user even if we think there's a low chance they'll buy any. As such, instead of traditional classification metrics such as Precision or Recall we need to adapt them to account for the fact that we have to surface a set number of recommendations for each user. Typically how we do this is to use a 'metric @ k' where 'k' is the number of slots we need to fill or recommendations we need to surface for each user. LightFM has a few different built in assessment metrics and we'll be using Precision and Recall @ 10 i.e. what's the average Precision/Recall across users for the top 10 highest ranked recommendations for each user. We can calculate it for our Train, Test and Test-with-only-new-items data sets:
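A sketch of the scoring, assuming test_interactions and test_new_interactions were built with build_interactions() in the same way as the training matrix:

```python
from lightfm.evaluation import precision_at_k, recall_at_k

for name, matrix in [("Train", train_interactions),
                     ("Test", test_interactions),
                     ("Test-new", test_new_interactions)]:
    print(name,
          "precision@10:", precision_at_k(model, matrix, k=10).mean(),
          "recall@10:", recall_at_k(model, matrix, k=10).mean())
```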
So our first model gets an average Precision @ 10 of 0.21 on the Training data but this drops quite a lot to 0.10 on the Test data and even further to 0.047 for predicting which new items users might go on to purchase. We can see though that Recall @ 10 is 0.36 which means we're still capturing over 1/3 of the products users do go on to buy in our recommendations, it just looks like most customers aren't buying many wines in general.
To use the predict() function in LightFM we need to pass it a list of User IDs and Item IDs in a slightly idiosyncratic format. Referring to the documentation "if you wish to generate the score for a few items (e.g. [7, 8, 9]) for two users (e.g. [0, 1]), a proper way to call this method would be to use lfm.predict([0, 0, 0, 1, 1, 1], [7, 8, 9, 7, 8, 9]), and _not_ lfm.predict([0, 1], [7, 8, 9]) as you may initially expect". So essentially we need a repeated value of User ID to pair against each item ID we want a prediction for. To get all predictions for all users at once we can do some list building before passing them to predict.
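For example, something like the below scores every item for every user in one go (fine for a catalogue of this size; for larger ones you'd want to batch it):

```python
import numpy as np

n_users, n_items = train_interactions.shape
scores = model.predict(
    np.repeat(np.arange(n_users), n_items),   # [0,0,...,0, 1,1,...,1, ...]
    np.tile(np.arange(n_items), n_users),     # [0,1,...,n-1, 0,1,...,n-1, ...]
).reshape(n_users, n_items)                    # 1 row per user, 1 column per item
```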
In the output above we've got 1 row per User ID and 1 column per Item ID in the order of their LightFM mapping indices e.g. LightFM User Index 0 is our first row and LightFM Item ID Index 0 is our first column. The actual scores of the predictions are meaningless apart from as a means of creating the rankings i.e. the prediction results are not probabilities and are not comparable across users.
We can also extract the embeddings for users and items directly from the model and calculate the predictions manually. LightFM has get_user_representations() and get_item_representations() functions that helpfully take care of multiplying the various feature embeddings associated with an item/user by their respective weights in order to create the final representation which we can then extract. The final prediction is simply the dot product between the user and item embeddings plus their respective biases. The biases tend to take on the role of encoding how popular an item is which allows the embeddings to hopefully capture the underlying nature of the user or item.
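A sketch of recreating the scores by hand:

```python
# with no extra features these are just the per-user and per-item representations
user_biases, user_embeddings = model.get_user_representations()
item_biases, item_embeddings = model.get_item_representations()

manual_scores = (
    user_embeddings @ item_embeddings.T    # dot product of the embeddings
    + user_biases.reshape(-1, 1)           # add each user's bias to their row
    + item_biases.reshape(1, -1)           # add each item's bias to its column
)
```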
Let's convert the recommendations into something a bit more intelligible by extracting the top 10 recommended items for each user. We'll bring through the top 10 recommendations including and excluding previous purchases as well as the previous purchases themselves so we can see if our new recommendations are a good match with their historical preferences.
We can see that our user had previously purchased the Petite Syrah, Malbec and Merlot amongst others and that LightFM would have re-recommended all of them in the top 10 when making predictions. Let's try removing any previously purchased wines from the recommendations to see how they change.
One easy way to do this is to reuse our training interactions matrix which records all previous purchases as a 1, multiply it by a large number and then simply subtract it from our scores to artificially downweight the scores for all previously purchased lines. Let's try it now.
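A minimal sketch, reusing the scores matrix from the predict step:

```python
# 1 where a user has previously purchased the item, 0 otherwise
already_bought = train_interactions.toarray()
# subtract a large constant so previously purchased items fall to the bottom of the ranking
new_item_scores = scores - (already_bought * 10_000)
```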
This time we get a list of recommendations that are completely new to the user. We can see that the Pinot Noir that was previously 3rd on the list is now at the top. We can see that some of the other top selling wines e.g. the Chardonnay and Sauvignon Blanc also make it onto the list. Finding that the model ends up recommending top sellers can be quite common. Although it's not necessarily a bad thing, if we want to try and recommend more unusual or less popular items there is a quick fix we can try with LightFM.
As well as user and item representations the model learns user and item biases too. Commonly these do the job of capturing how popular an item is and then boost that item's score in the final prediction. To make predictions without any notion of popularity we can simply redo our dot product without the biases:
Although the Petite Syrah is still at the top, the other recommendations look a lot more obscure. We can also overwrite our model's biases (or make a copy of it and then do it!) with 0s and then use LightFM's predict() and evaluation functions as normal. Let's see how much of a drop off in performance we get by setting the biases to 0.
We can see there's quite a big drop in performance, particularly when finding entirely new wines for users. So for now we'll leave the biases in our future models.
Calculating item-item similarities
Since we've already extracted our item representations (from the with-biases model) to manually create predictions, we can also use them to find similarities between items. To do this we use cosine similarity:
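For example, with sklearn (using the item_embeddings array we pulled out of the model above):

```python
from sklearn.metrics.pairwise import cosine_similarity

item_similarities = cosine_similarity(item_embeddings)           # n_items x n_items
# indices of the 10 most similar items for each item (column 0 is the item itself)
top_similar = item_similarities.argsort(axis=1)[:, ::-1][:, :10]
```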
The top association for each wine looks ok e.g. for the Sauvignon Blanc it's another Sauvignon Blanc but after that the associations are less clear e.g. Malbec to Sherry? Let's try some hyperparameter tuning of our model to see if we can eke out some extra performance.
Hyperparameter tuning with LightFM and Optuna
We'll use the Optuna package which will try to automatically find the optimal set of hyperparameters for us from our search space. It does this by conducting repeated trials and modelling LightFM's performance as a function of the different hyperparameters and values that we gave Optuna to use.
To use Optuna we first define an objective function which sets out our hyperparameter search space, trains a model on the data sets we want to use and returns our assessment metric; we then create a 'study' to run the trials. To avoid repeatedly using our Test data we'll split our Train into a smaller Train and Validation set using LightFM's random_train_test_split() function. One thing to note with this is that as the data is split randomly it doesn't preserve the chronology of purchases like our actual Train-Test data does. There is also the chance that as our data is so sparse we might have some users where all of their interactions fall into either the Train or Validation set. There's no possibility for repeat purchase data so our tuning setup most closely resembles the Train and Test-new split.
Another great feature of Optuna is we can pass in our original hyperparameter values to give it a 'warm start' in terms of values to explore and a baseline performance that it needs to beat when running the optimisation. Although we didn't use any regularisation on the original model, to keep the parameters on the same log-scale as the trial values, we'll give it the bare minimum.
I've set my study to run for 50 trials. Feel free to try more or less depending on how quickly it trains for you. Another nice feature of Optuna is that the best parameters from all the trials seen so far are kept so if you interrupt it you don't lose all of the learnings up to that point. Once the study is finished we can print out the best hyperparameter values.
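A sketch of what the whole tuning loop might look like (the exact search space is my own assumption):

```python
import numpy as np
import optuna
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k

train_small, valid = random_train_test_split(
    train_interactions, test_percentage=0.2, random_state=np.random.RandomState(42)
)

def objective(trial):
    params = {
        "no_components": trial.suggest_int("no_components", 10, 200),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.5, log=True),
        "item_alpha": trial.suggest_float("item_alpha", 1e-9, 1e-3, log=True),
        "user_alpha": trial.suggest_float("user_alpha", 1e-9, 1e-3, log=True),
        "loss": trial.suggest_categorical("loss", ["warp", "bpr"]),
    }
    model = LightFM(random_state=42, **params)
    model.fit(train_small, epochs=20, num_threads=4)
    return precision_at_k(model, valid, k=10).mean()

study = optuna.create_study(direction="maximize")
# warm start with (roughly) the original model's values
study.enqueue_trial({"no_components": 10, "learning_rate": 0.05,
                     "item_alpha": 1e-9, "user_alpha": 1e-9, "loss": "warp"})
study.optimize(objective, n_trials=50)
print(study.best_params)
```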
We can see that the best model had quite a high number of components (the number of dimensions in the embeddings) and quite a low level of regularisation. Optuna actually has a function that attempts to measure how important each hyperparameter was in terms of contributing to the final performance of the model. It uses a random forest and the hyperparameter values at each iteration to try and predict the trial-model performance for that iteration. Let's see which hyperparameters Optuna thinks had a bigger impact on our final model's performance.
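That would look something like:

```python
# random-forest (fANOVA-style) importances fitted on the completed trials
importances = optuna.importance.get_param_importances(study)
print(importances)
```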
So it looks like the loss value had the biggest impact and then a distant second was item_alpha. Let's now try training the final model on 100% of Train and see how it performs on our Test data.
Our tuned model shows a marginal improvement on the Test-new data so it looks like it's been successful. If we wanted to we could run more trials in the hope that the performance continues to improve. Let's extract the user and item embeddings from the new model and see if our similar items make more sense now.
The similar items don't actually look that similar at this point and we seem to be getting a mix of red and white wines which we wouldn't really want. It looks like it's the same top selling lines e.g. cabernet sauvignon and chardonnay that are appearing in each list. This is probably due to the current sparsity of our data i.e. without much information to draw upon LightFM found it a sensible strategy to recommend top selling lines. This works for generating recommendations but for our item-item associations, recommending a cabernet sauvignon to someone who has just bought a sauvignon blanc doesn't feel ideal.
Weighting interactions
At the start we created some user-item weightings to reflect that users buy some items more often than others. Our initial models have just been treating all interactions equally but let's now try running it with the weights. As well as upweighting more important interactions we could have downweighted less important ones. This is one way that's suggested to deal with very popular items to stop them from always being recommended and try to make our recommendations more diverse.
To use weights with random_train_test_split() we need to pass them in separately along with the same random_state to ensure the splits happen in the same place. We can then pass whether or not to use weights as an extra hyperparameter to Optuna to see if it finds any benefit from their inclusion.
So it looks like Optuna found that using the interaction weights (or at least the ones we created at the start) didn't improve performance. This is probably not surprising given we upweighted any items regularly purchased by users but it seems like most users only buy a few wines so the difference was always going to be marginal. For completeness, let's train our newest model on Train and see how it does.
Still performing at around the 5% mark for Test-new precision at 10. Maybe we can try adding in some extra item features to try and boost performance.
Creating item features
Interestingly when I was reading up on LightFM there are quite a lot of examples online of people reporting that including additional features actually made their models worse. This seemed surprising as we'd normally expect having access to additional data to be a good thing in machine learning. Reading more on the subject it seemed like there were two main reasons for this.
The first is that by including extra features we actually restrict the expressiveness of the model which is mentioned in a note in the documentation here. This makes sense as we go from having 1 feature per user/item whose job is just to individually represent that user/item in the most useful possible way to combining it with more generalised features that are shared across items/users. These more generalised features can be a good thing as our final representations are more general and so less likely to overfit and can work better in sparser/cold-start scenarios but there is a balance to be struck which leads to reason number two.
There's a good discussion on github around including features and how we need to be careful to only include meaningful/useful features as "if you add lots of uninformative features they will degrade your model by diluting the information provided by your good features". Essentially when creating the final representation LightFM goes user/item = sum(features * weights) so if we put loads of uninformative features into the model the final representations will be largely uninformative too. This means we actually need to practice the slightly old-school data science skill of feature engineering! Another option we'll explore later is adjusting the feature weights so that they have less impact on the final representations.
For now let's try and create some useful features that should be helpful when making wine recommendations. To do this I created ngrams out of all the wine names (extract the individual words or word sequences) and did a count to see which were the most common. I then used these to make features that I thought would be most useful which are broadly wine colour, country of origin, grape types, style, etc.
The script below first tidies up multiple spellings and cases where the same feature has different names e.g. porto and port both refer to Port Wine. Some googling also showed that certain styles of wine or certain regions are linked to certain countries so I was able to extract a bit more country of origin data from those too. This is by no means a comprehensive list so if you want to try adding your own feel free. I also subsequently learnt that the protected term of 'champagne' can actually be used for a limited selection of Californian wines whereas normally it'd indicate an item is from the Champagne region in France.
Now we've got our list of item features created, let's see which are the most common ones.
So it looks like most of our wines are red or white with a few sparkling. The most common country is USA with 148 wines and then France and Italy with around 40 each. One thing to note when making features for LightFM is that as we later create index lookups for them, each feature needs to be uniquely named. We'll also go ahead and remove any completely blank columns for features that didn't match to any wines.
Now we've created our item features we can remake all of our mappings.
We can see from the above that our item ID mapping (index to item ID) is now shorter than our item feature mapping, which has an index for each item ID + each feature. Let's create our inverse mappings and build our interactions. For feature data LightFM likes to have a list of (user/item id, [feature1, feature2]) or (user/item id, {feature1: feature1_weight, feature2: feature2_weight}). Since we're not using weights yet we'll just create a list of items and their features.
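A sketch of the rebuilt mappings and the item feature matrix, assuming all_item_feature_names is the unique list of tags and item_feature_lookup maps each product_id to its list of tags (e.g. ['red', 'usa', 'cabernet sauvignon']):

```python
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(
    users=all_users,
    items=all_items,
    item_features=all_item_feature_names,   # every possible tag
)
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()

# (item id, [feature1, feature2, ...]) pairs, no weights yet
item_features_matrix = dataset.build_item_features(
    (item_id, tags) for item_id, tags in item_feature_lookup.items()
)
```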
By default LightFM normalises the feature weights for each user/item in the weight matrix (makes sure they sum to 1). This is generally advisable as since we sum up all the embeddings for each feature to create the final representations we want our final representations to roughly all be on the same scale. For example, if an item with 3 features had a final representation 3x the size of an item with 1 feature then this would potentially skew things when we calculate the dot product as that is sensitive to the underlying size of the embeddings e.g. an item with lots of features could get a boosted score simply from having lots of features. This is what the weight matrix and the normalisation helps avoid. For our 3 feature item, its final representation would instead be (1/3 * feature1) + (1/3 * feature2) + (1/3 * feature3) instead of (feature1 + feature2 + feature3).
We can see some examples of the mappings and how they seem to be working pretty well. The products are all picked out as sparkling wines with 'brut' and 'rose' as additional characteristics. We can also see one of the 'Californian Champagnes' causing issues with the country of origin assignment! One thing to note is that we only include features that the users/items do have as opposed to recording them as not having those specific features.
Let's now try running Optuna but this time when we fit our model we can pass in our list of items and their features.
Let's take the best parameters and train our final model and see how it performs.
Performance is still around 5% precision @ 10 so at least we didn't make our model worse! Hopefully the added benefit of including the item features is that our suggested item-item recommendations are improved too. Let's see if that's the case.
Now these look a lot better! A large part of this will be due to the fact that our item tags will force items that share tags to at least be partially similar i.e. any shared tags between items means they'll also share the embedding for that tag in their final representation. This is why the model with metadata is less expressive as we're constraining the final representations to be more generalised i.e. rather than each item getting a bespoke embedding they're now the sum of their own bespoke embedding + more general embeddings of any tags that might be shared across products.
In theory this can stop the model overfitting although often people report it negatively impacting model performance. However there are a couple of powerful upsides to including metadata that might make a slight drop off in predictive power worthwhile. One of the main ones that we'll look at later is we can now make recommendations for new or cold-start products. For example, if we have a new French Cabernet Sauvignon, we don't have any user interactions for the product but we can still create a representation of it by summing up the already learnt embeddings for 'France' + 'Cabernet Sauvignon'. We can then either find similar items or predict which users might like it based on those features. We'll see how to do this with LightFM in a bit.
The other benefit that we can see above is that our item-item recommendations make a lot more sense and are easier to understand. A lot of this is because we're forcing items with shared features to share large proportions of their final representations but the hope is that even if it's the sharing of features that drives our top recommendations, each item still has its own bespoke identity embedding that we can learn from. For example, the top suggestion for buyers of the 'Cabernet Sauvignon' is the 'Cabernet Sauvignon, North Coast, 2011' which is just ahead of the 'Cabernet Sauvignon, North Coast, 2012'. This tells us that even amongst the 'red wine' + 'cabernet sauvignon' shared tags of the top wines the North Coast is the best fit and it can even pick out which vintage is most appropriate since each year is attached to a separate item.
It's also useful to keep an eye out for other similar items that share fewer tags as this is the model telling us that although we tagged the items differently, the collaborative filtering exercise tells us that customers view the features (or the items attached to them) as actually being very similar. For example, looking at the top 10 suggested items for the Malbec we actually quite quickly move into Shiraz wines which tells us that a lot of Instacart users are buying both Malbecs and Shirazs.
Adjusting feature weights with tf-idf
Before we move on to looking at user features let's try adjusting our weights for the items. At the moment the item and its features are weighted equally so if an item has 3 features, the final representation of that item is: 1/4 item identity embedding + 3/4 feature embeddings. One way to tip the weightings in favour of a more expressive model whilst still retaining the benefit of making cold-start predictions and sensible item-item suggestions is to downweight the features in the final representations so more of the representation comes from the bespoke item identity embeddings. We could experiment with a few different weighting schemes and then use Optuna to find the best one. For now I'm going to use sklearn's tf-idf functionality to downweight common tags e.g. 'red wine' in the hope that it allows the individual item or more unusual features e.g. 'shiraz' to come to the fore.
To do this, we first need to create a data frame that has, for each item, all of the unique features associated to it in the form of a long text string.
These look pretty good. We can also see on row 4 the challenge with keyword searches where we've got a product tagged as belonging to both the USA and France! Now we've processed our text data we can call TfidfVectorizer() to create our weightings of each of the tags for each of the items. The code below creates a pandas data frame with a row for each product and a column for each item feature e.g. 'sparkling'. The value of the column is the associated tf-idf weight for that product and item feature. We can then loop through each row and return a dictionary of each item feature and its weight, filtering for where weights are >0:
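A sketch of this, assuming item_text_df has a product_id column and a features_text column holding each item's tags as one space-separated string (multi-word tags may need joining with underscores so the vectoriser treats them as single tokens):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer()
tfidf_matrix = vectoriser.fit_transform(item_text_df["features_text"])

# one row per product, one column per tag, values are the tf-idf weights
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=vectoriser.get_feature_names_out(),
    index=item_text_df["product_id"],
)

# {product_id: {tag: weight}} keeping only the non-zero weights
item_feature_weights = {
    product_id: {tag: weight for tag, weight in row.items() if weight > 0}
    for product_id, row in tfidf_df.iterrows()
}
```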
We can now see for our first item 'Mirabella Rose Brut' that it has 3 features 'Rose', 'Brut' and 'Sparkling'. We can see that 'Rose' receives the highest weight which makes sense. The fact the item is a sparkling wine is definitely important, but the fact it's also a rose is probably more so. Let's now try running Optuna with our tf-idf item matrix as one of the possible hyperparameters:
In this case the original evenly split weightings perform better (at least for the number of trials we ran). That keeps things nice and simple and actually makes the final model easier to explain which is a plus! Let's train the best model from our trial and see how it does overall.
Marginally worse but still around the 5% precision at 10 mark. So far we've seen how the item features can improve our item-item recommendations by encouraging products with similar features to be scored more closely. Let's try a more visual representation of this by reducing our embeddings down using t-sne.
Plotting associations with t-sne
Our hyperparameter 'no_components' controls the size of the embeddings that are learnt for our items and users. Usually these are too big to plot so we need a way of reducing down the dimensions. We can use t-sne to do this. I'll create a couple of categorical columns that summarise some of the characteristics about the products and then we can plot the associations to see how wines with different attributes group together.
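A sketch of the plot, assuming the item_features_matrix from earlier and a wine_colour label for each product (seaborn/matplotlib used here purely for plotting):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# final item representations = sum of each item's weighted feature embeddings
_, item_representations = model.get_item_representations(features=item_features_matrix)

coords = TSNE(n_components=2, random_state=42).fit_transform(item_representations)

sns.scatterplot(x=coords[:, 0], y=coords[:, 1], hue=wine_colour)
plt.title("t-SNE of item representations, coloured by wine colour")
plt.show()
```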
This is pretty cool. Each dot on the plot represents a product and the distance between them is t-sne's best attempt at condensing down the embedding dimensions into 2D. If we then colour each dot by its respective wine colour we can see that there is a divide between red and white wines, suggesting users tend to stick to a particular colour, and that sparkling and rose wines sit somewhere in the middle. Let's do the same plot but just for red wines with their grape type.
Here we can see that the cabernet sauvignon wines tend to cluster together along with the merlots. This makes sense as often Merlot and Cabernet Sauvignon are blended together. We can see the blue cluster of products is all the Pinot Noirs and then there's a big group of unknown wines with the Syrahs and Malbecs in the middle.
So far we've looked at the associations between the items and seen how the feature data can help create more intuitive item-item recommendations. A nice perk of how LightFM works, learning separate embeddings for everything, is that we can also look at associations not just between items and users but also between their features.
Associations between item features
The LightFM documentation has an example of this but essentially we can calculate the cosine similarity between features in exactly the same way we do for items. We simply extract the feature embeddings directly from the model (they're stored in model.item_embeddings, one row per item feature) rather than going through get_item_representations(). We can then see what other features are related to each other.
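A sketch of this, assuming a 'bordeaux' tag exists in our item feature mapping:

```python
from sklearn.metrics.pairwise import cosine_similarity

# model.item_embeddings has one row per item feature (identity features + tags)
feature_similarities = cosine_similarity(model.item_embeddings)

bordeaux_idx = item_feature_map["bordeaux"]
# indices of the 10 features most similar to 'bordeaux'
most_similar = feature_similarities[bordeaux_idx].argsort()[::-1][:10]
```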
For the feature 'Bordeaux', a famous wine region in France we see the most similar features are other French and European wine regions. For 'Cava', a sparkling wine made in Spain, we see it has a strong association to 'Rioja', a Spanish red wine, as well as some Italian features. Interestingly 'Malbec' and 'Argentina' also feature on the list. The feature embeddings can provide useful insight into the category. For example, although users find cava and prosecco similar, the relationship with champagne (another sparkling wine) obviously isn't as strong.
Recommendations for cold-start items
It's actually the embeddings for the item features that also allow us to make recommendations for new or cold-start items. These items don't have any user interactions so we can't create embeddings for them directly. However what we can do is express the item in terms of the features that we do have embeddings for. For example, let's say we're launching a new red wine from Bordeaux that's a merlot-cabernet sauvignon blend. We don't have an identity feature for the wine as it hasn't been interacted with yet but we can still create a representation of it by summing the embeddings for each feature. First, let's get the item feature indexes for each of our attributes.
We can create weights for each of these. To keep things simple we'll just assign them all the same weight which will be 1 / the number of features. The next part is to create a lookup row for our item that mimics the normal item feature matrix that LightFM is used to receiving. We create an array of all 0s that matches the length of the pre-existing item features. We then overwrite at each index for our feature that 0 with our feature weight. As a check we can sum the row to make sure our weights add up to 1.
Now we've created our cold-start item feature row we can convert it into a sparse matrix and pass it to LightFM. We can use the get_item_representations() function from LightFM to calculate the sum(weights*embeddings) for our item and then we can calculate the cosine similarity between it and other items.
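Putting the whole cold-start item step together might look something like this (the tag names are assumptions; use whatever is in your item_feature_map):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

new_item_tags = ["red", "france", "bordeaux", "merlot", "cabernet sauvignon"]
tag_indices = [item_feature_map[tag] for tag in new_item_tags]

# one row, one column per known item feature, equal weights that sum to 1
new_item_row = np.zeros(len(item_feature_map))
new_item_row[tag_indices] = 1 / len(tag_indices)
assert np.isclose(new_item_row.sum(), 1)

# LightFM sums the weighted feature embeddings into the item's representation
_, new_item_embedding = model.get_item_representations(features=csr_matrix(new_item_row))

# similarity of our cold-start item to every existing item
_, existing_item_reps = model.get_item_representations(features=item_features_matrix)
similarities = cosine_similarity(new_item_embedding, existing_item_reps)
```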
It looks like LightFM has picked up on the fact our wine is a cabernet sauvignon blend and found us other cabernet sauvignons that it thinks are similar. This way we can find users that bought those items and recommend them our cold-start one on the basis that it's similar so those users should like it too.
If we want to create recommendations for users directly we can do this too. We simply take our cold-item embeddings and calculate the dot product against the user embeddings to create a recommendation score for our new item. We can then append that to all of the predictions we made previously and re-rank them to find users for whom the item ranks highly. Note that we calculate ranks per user rather than just take the highest score for the cold-item as the actual scores in LightFM only have meaning relative to each user as a means of ranking items and not between users.
So it looks like our new item would actually be a very good candidate for a number of users! Nice. Let's double check this by looking at the previous purchases of the top user to make sure a cabernet sauvignon-merlot blend from France would make sense as a recommendation.
Looking at their previous purchases it makes complete sense why our cold-start item would be a good recommendation for them! Now we've looked at adding in item features to our model, let's try adding in some user features too.
Adding in user features
Adding in user features works in exactly the same way as item features. First we need to create our features and map them back to customers (with the option to include their weights). Since we don't have any obvious user features to hand e.g. age, gender, etc. let's create some using the other categories users have previously interacted with. If we were doing this properly we'd want to pick just a few of the categories we think are most important and weight them accordingly. To keep things simple we'll just add every other category a user has interacted with and weight them all equally.
This sort of blanket feature creation is probably where we run the risk of diluting down our useful features. On the other hand our data is so sparse (i.e. few users buy more than 1 or 2 wines) that the extra data might still be beneficial in this instance. First we'll get a list of all the non-wine categories our users have interacted with and then keep a unique list of categories to serve as our list of possible user features.
Now we'll create all of our mappings and interactions. The code at the end converts our list of user ID and aisle-shopped data frame into a list of user ID + aisles-shopped list that we can pass to LightFM to create our user feature matrix from.
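A sketch of that last step, assuming user_aisles_df has one row per user_id/aisle combination and all_aisle_names is the unique list of non-wine aisles:

```python
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(
    users=all_users,
    items=all_items,
    user_features=all_aisle_names,
    item_features=all_item_feature_names,
)

# collapse to one row per user with a list of the aisles they've shopped in
user_aisle_lists = user_aisles_df.groupby("user_id")["aisle"].apply(list).reset_index()

user_features_matrix = dataset.build_user_features(
    (row.user_id, row.aisle) for row in user_aisle_lists.itertuples()
)
```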
So we can see for User 21 all of the other categories they have shopped in as features associated to them. The hope is that which other categories users shop in contains some information about the types of wines they go on to purchase e.g. users that bought fish might prefer white wine to go with it, users buying organic products might prefer organic or natural wine, etc. Let's go ahead and train and tune our model with item and user features.
Again we get around 5% precision at 10 so it looks like we've not made our model worse by adding in user features. We could probably improve the performance even more by dropping some of the less relevant or useful category behaviours.
Recommendations for cold-start users
Now we've added user features we can make predictions for cold-start users just like we did for cold-start items. We simply create a custom user feature row that has 0s for all the user identity features but populate it with the relevant weights for the categories our new user has previously shopped in. We can then use LightFM to create the user representation which we can pass to predict.
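A sketch of this, mirroring the cold-start item example (the aisle names are assumptions):

```python
import numpy as np
from scipy.sparse import csr_matrix

new_user_aisles = ["fresh fruits", "yogurt", "specialty cheeses"]
aisle_indices = [user_feature_map[aisle] for aisle in new_user_aisles]

# one row, one column per known user feature, equal weights that sum to 1
new_user_row = np.zeros(len(user_feature_map))
new_user_row[aisle_indices] = 1 / len(aisle_indices)

# LightFM sums the weighted aisle embeddings into a user representation
_, new_user_embedding = model.get_user_representations(features=csr_matrix(new_user_row))

# score every item for our cold-start user (the user bias only shifts all scores equally)
item_biases, item_reps = model.get_item_representations(features=item_features_matrix)
cold_user_scores = new_user_embedding @ item_reps.T + item_biases
```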
It's tricky to know if these seem sensible without testing it against some cold-start users but at least we can see that the process of creating our cold-start user matrix is successfully returning different recommendations for each type of user we created. Just like we did for the item features, we can get LightFM to return the final representation of our cold-start users for us and we can then do our own dot product to create predictions.
That about wraps things up. Hopefully this tutorial has been a useful introduction to the LightFM package. We've seen how we can create recommendations using traditional matrix factorisation approaches and then try and boost performance and tackle the cold-start problem by including item and user features. We also explored tuning the hyperparameters and adjusting the various weight matrices LightFM uses to create its predictions. Well done!