Market Basket Analysis tutorial in R with dplyr
In this post you'll learn how to perform market basket analysis using dplyr and understand from first principles how the associations are calculated. You'll learn how to calculate common metrics such as Support, Confidence and Lift and understand some of the considerations and watch-outs along the way.
What is market basket analysis?
Market basket analysis is a type of affinity analysis that aims to find associations between products based on how often they occur together in the same basket. It uses the idea that if things occur together more often than we'd expect, then we can assume they're related in some way. We can then use this association to help us make decisions such as which products we put next to each other in store or gain insight into how customers are shopping.
For example, let's say we see that customers who put eggs in their basket are 5x more likely to also put in flour. This might give us reason to think that flour and eggs are related in some way. We might hypothesise that it's due to customers doing home baking. This might then lead us to position the eggs and flour near each other in store in the hopes of encouraging more customers to cross shop between the products.
The 'basket' doesn't even need to be a basket of goods in the traditional sense either. It can be whatever unit of analysis we're interested in e.g. we might find that a customer who applies for a mortgage is 3x more likely to subsequently take out home insurance a few months later. Again this makes sense as someone buying a home will need to insure it. This could lead us to then offer home insurance products to new mortgage customers or offer them as a bundle. Market basket analysis is also often used on e-commerce sites for the 'customers who bought X also bought Y' recommendations. It's a really quick and powerful way to uncover important associations and the dplyr package in R makes this really easy to do.
Instacart Market Basket Analysis
For this tutorial we'll be using the Instacart data from the 'Instacart Market Basket Analysis' competition on Kaggle. You can download the data here. It's split across a few separate files that we'll need to load and merge. The files contain over 3 million orders so we'll take a 10k sample of customers just so things don't take too long to run.
Instacart also helpfully provide the order in which the items were added to the basket. We'll look to incorporate this into our analysis later on but ignore it for now to keep things simple. To start with, we'll be measuring the association between pairs of products e.g. 'eggs and flour' as opposed to a sequence of products e.g. 'eggs then flour'. It looks like the data doesn't contain any returns/refunded products. This is helpful as these can cause issues when finding associations. We'll see some examples of this later on too. For now, let's go ahead and read in the data and create our master table. Although we'll pretty much exclusively be using dplyr for the data manipulation, I load the entire tidyverse library so we get other useful packages like ggplot too:
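A minimal sketch of the load-and-merge step (the file and column names follow the Kaggle data dictionary; the seed and object names are my own choices):

```r
library(tidyverse)

# read in the separate Instacart files
orders_raw  <- read_csv("orders.csv")
order_lines <- read_csv("order_products__prior.csv")
products    <- read_csv("products.csv")
aisles      <- read_csv("aisles.csv")

# take a 10k sample of customers so things don't take too long to run
set.seed(2023)
sampled_users <- orders_raw %>%
  distinct(user_id) %>%
  slice_sample(n = 10000)

# build the master table: one row per product per order
orders <- orders_raw %>%
  inner_join(sampled_users, by = "user_id") %>%
  inner_join(order_lines, by = "order_id") %>%
  left_join(products, by = "product_id") %>%
  left_join(aisles, by = "aisle_id") %>%
  select(user_id, order_id, product_id, product_name, add_to_cart_order, aisle)
```

Using left joins for the product and aisle lookups means any unmatched product IDs come through with missing names rather than being silently dropped, which is one way the missing product names we'll see shortly could arise.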
The user_id links orders back to a single user, order_id gives us our unique identifier for the transaction and then we have our product information. The aisle tells us which category the product belongs to, which can be helpful for sense checking any associations we find. Now we've got our slimmed-down table, we've got everything we need to calculate our associations. Let's have a quick look at our data to get an idea of what it'll be like to work with. First we'll see how many different orders and products we've got in our sample:
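Counting distinct orders and products is a one-liner with `n_distinct()`:

```r
orders %>%
  summarise(total_orders   = n_distinct(order_id),
            total_products = n_distinct(product_id))
```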
We've got a large number of both baskets and products to work with. Let's understand a bit more about the composition of those baskets now by seeing how many different items there are in them:
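One way to do this is to count the items per order and plot the distribution (the plot settings here are my own):

```r
orders %>%
  count(order_id, name = "items_in_basket") %>%
  ggplot(aes(x = items_in_basket)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Items in basket", y = "Number of baskets")
```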
So it looks like most baskets have maybe 7-8 products in, there are some very large baskets and the most common number of products is actually 1. This means we could end up with lines where the strongest association might be to 'nothing'. Although this might be a useful insight, a recommendation that 'customers who bought this also bought nothing' isn't very helpful so we'll restrict our associations to be between pairs of actual products. Let's have a look at what the most popular products and aisles are:
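Since our master table has one row per product per order, counting rows per product gives us the number of baskets each product appeared in:

```r
orders %>%
  count(product_name, sort = TRUE) %>%
  head(10)
```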
The top two products both look to be bananas followed by strawberries, spinach and avocado. It looks like we've also got quite a few occurrences of missing product names. If we were doing this for real and thought these were genuine records just with missing product names then we could try and use the product ID instead or see if we could find an alternative source for product names to supplement our data. For this tutorial though, we'll just remove these products. Let's do that and also check what some of the least popular products are:
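A sketch of the clean-up, followed by a look at the bottom of the table:

```r
# drop rows with missing product names
orders <- orders %>%
  filter(!is.na(product_name))

# least popular products
orders %>%
  count(product_name) %>%
  arrange(n) %>%
  head(10)
```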
It looks like we've got quite a few products that have only ever occurred in 1 basket. This isn't ideal as it means we don't have much evidence to back up any associations we might find. For products that have only ever occurred in 1 basket, any other product in that basket will have an association of 100% to it but we wouldn't really want to say this is meaningful. We'll see how we can deal with this later. For now, let's see which aisles are the most popular.
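Same pattern as before, just counting by aisle instead of product:

```r
orders %>%
  count(aisle, sort = TRUE) %>%
  head(10)
```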
Once again it looks like fresh fruit and vegetables are the most commonly purchased from aisles. This makes sense as fresh produce will often have a shorter shelf life and so need to be bought more often compared to something like laundry detergent which can be bought in bulk and last a lot longer.
Defining 'more likely'
Before we have a look at the data, it's worth having a think about what something happening 'more often than we'd expect' or being '5x more likely' means in practice. We're essentially comparing how often something has occurred with how often we'd expect it to.
For some things, it's easy to calculate the expected rate. For example, if you were to toss a fair coin twice, the chance of getting two tails is 1 in 4. We know this because our coin has two sides (so two outcomes), each with an equal probability of being landed on, and each toss of the coin is independent of the other. This gives us an expected outcome of 1 in 2 for tails on the first coin toss and 1 in 2 for tails on the second which, multiplied together, gives us our expected rate of 1 in 4 for two tails occurring.
We can use the same approach to work out the chance of rolling a 6 on a fair die, followed by getting a tail on a fair coin toss where the probabilities are independent. We'd expect this set of events to occur, on average, every 1 in 12 attempts (1/6 for rolling the 6 x 1/2 for getting the tails). If we decided to put this to the test and found that actually we were consistently rolling a 6 and getting a tails more often, say 6 in every 12 attempts, we'd think something was up. We'd say that what we're seeing is happening 6x more often than we'd expect from chance alone. We'd then say something is causing the events to happen together more often than we'd expect e.g. maybe the person is cheating.
This works well for games of chance where the probabilities are known in advance but what about when we have no way of knowing the odds in advance? For example, how often should we expect someone to buy eggs and flour? Well, although we can't know the odds in advance like we can with the coin toss, we can use the data we have available to us to do a similar job.
For example, we can easily see how many baskets contain eggs (we already did this when looking at the most popular products). We can also see how many baskets have flour in. If we convert these numbers to percentages i.e. 'what percent of baskets have eggs/flour in' suddenly this looks very similar to our coin and dice scenario. Say 10% of baskets have eggs in (1 in 10) and 5% of baskets have flour (1 in 20). Assuming they're independent, like our coin and our dice, then we'd expect to see them occur together in roughly 1 in every 200 baskets (1/10 for eggs x 1/20 for flour) which corresponds to 0.5%. If then when we do a count of baskets with eggs & flour we see that they occur together in 1.5% of baskets then we can say they're 3x more likely than expected to occur together.
This is the clever and elegant approach behind market basket analysis. We use what we know about how often products occur individually to work out what their expected rate of co-occurrence would be if they were independent. We can then calculate the actual co-occurrence and compare it to the expected rate to see whether they occur together more or less often than we'd expect. If the actual vs the expected is different we can then assume that there's something about how customers are shopping that's causing the association. We can use this knowledge to improve our decision making as a business. Let's now see how we can do this with our data.
Creating our base tables
There are packages in R which will calculate the various metrics needed for market basket analysis for you but it's very easy to do it yourself from scratch. This has the advantage of not only giving you a better understanding of how things are calculated but you can also iterate more easily and carry over the knowledge to another language like SQL or Python. A lot of other measures of association (like Yule's Q) also use the same base tables we're about to create.
Remember when we were defining what it means for something to be 'more likely' we talked about how often things occur on their own, expressed as a percentage and also how often they occur together as a percentage? To calculate these and then calculate our associations, we only actually need 3 base tables. Between them, these three tables give us everything we need:
- The number of baskets each product occurred in (it doesn't matter if it was on its own or with other lines).
- The total number of baskets there are in our universe i.e. how many baskets there were in total.
- The number of baskets that had both our products in i.e. the number of baskets that had both eggs & flour in, or the number of baskets that had eggs & flour & butter if we want to look at associations between larger groups of products.
Let's calculate each of these now, starting with the easiest, the total number of baskets:
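For example:

```r
total_baskets <- orders %>%
  summarise(n = n_distinct(order_id)) %>%
  pull(n)

total_baskets
```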
Pretty straightforward: we can see that there are just under 154k baskets in our data set. Let's now get the number of baskets each item was in. We've actually already done this when we looked at the top sellers:
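This time we keep the product ID alongside the name, as we'll need it later on:

```r
# one row per product per order, so row counts = basket counts
baskets_per_item <- orders %>%
  count(product_id, product_name, name = "baskets_x")

baskets_per_item %>%
  arrange(desc(baskets_x)) %>%
  head()
```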
The last table is the trickiest (and takes the longest to run) but we essentially want to create something that will ultimately allow us to count how many baskets both our products occurred in. We can do this by taking our 'orders' table and joining it back to itself using the 'order_id'. This means for each order we have one column of products showing what's in the basket and then a second column which is every other product that was also in the basket. In our first column, each item row gets duplicated to correspond to a row for every item in the second column.
This join can quickly make our data set quite large. We know most baskets had around 7 products in, which means that each of those baskets will now have 49 rows of data as every product will be associated with every product in the basket (including itself). If this sounds confusing don't worry, we'll see an example in a bit.
If we then group by these two columns and count the number of orders, we get for each pair of products the total number of baskets they were both in. We can also filter out any rows where the product is associated to itself. If we wanted to look at 3 product combinations e.g. milk & eggs & flour we'd simply have an extra join so we'd have 3 columns (and the rows per basket would be cubed e.g. 7x7x7=343). Let's see what this output looks like:
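A sketch of the self-join (this is the slow step, so expect it to take a while):

```r
baskets_xy <- orders %>%
  select(order_id, product_id, product_name) %>%
  inner_join(orders %>% select(order_id, product_id, product_name),
             by = "order_id") %>%          # dplyr adds .x and .y suffixes
  filter(product_id.x != product_id.y) %>% # drop products paired with themselves
  count(product_id.x, product_name.x,
        product_id.y, product_name.y,
        name = "baskets_xy")

baskets_xy %>% arrange(desc(baskets_xy)) %>% head()
nrow(baskets_xy)
```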
There are two things to note here. The first is that the table's pretty big at ~9.5m rows. The second is that we have a lot of equivalent associations. For example, 'Bag of Organic Bananas & Organic Hass Avocado' has the same number of co-occurrences as 'Organic Hass Avocado & Bag of Organic Bananas', which makes sense as they show the same thing i.e. the number of times the two items occurred together.
Equivalent in this context means the associations contain the same information. For example 'X & Y' has the same information as 'Y & X'. Later we'll see an example where we can use the 'add_to_cart_order' column to introduce a time element i.e. 'Bag of Organic Bananas then Organic Hass Avocado' which would measure something different from 'Organic Hass Avocado then Bag of Organic Bananas'.
For now though, to help us get a better idea of what happens with the previous join, we can see how it works for just one basket and without summarising:
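Here's the same join on a single basket, with the self-pairs left in so the full cross-join is visible:

```r
one_basket <- orders %>%
  filter(order_id == first(order_id)) %>%  # pick any single basket
  select(order_id, product_name)

one_basket %>%
  inner_join(one_basket, by = "order_id")
```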
We can see that what happens is for every product in the basket we get a new column which is also every product in the basket. This requires us to duplicate out the rows for the first column to match. What this then gives us is 'every product x every other product in the basket'. We're essentially doing a cross-join but within baskets only. From here we can just group by the two columns and count the number of rows, which gives us how many baskets both products co-occurred in.
As we're only looking at co-occurrence for now, we can make use of the fact that 'X & Y' = 'Y & X' and the fact that we have a product number to shrink our table and remove some of the duplicated equivalences. Since we have a numeric identifier to go with our product name we can use a clever filter of 'product_id.x>product_id.y' to halve the size of our table. We arbitrarily tell it to only keep rows where the product_id in the first column is greater than the value in the second. This means we'll lose half our records but this is ok because for every row we remove where X<Y, there'll be an equivalent row where X>Y, so we preserve all the information we need:
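In code it's just one extra filter:

```r
baskets_xy_dedup <- baskets_xy %>%
  filter(product_id.x > product_id.y)

nrow(baskets_xy_dedup)
```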
This table is a bit more manageable at around 4.8m rows. We can see that we still have our top association of 'Bag of Organic Bananas & Organic Hass Avocado' but we no longer have the equivalent association just written the other way around. This can be helpful if you only want to look at symmetrical measures such as support and lift where A -> B = B -> A. However, there is another association measure we'll see later called confidence that isn't symmetrical, so for now we'll continue to use the full table of associations. Now we've created our 3 base tables we're ready to calculate our associations.
Support: how often something occurs
There are a few standard metrics that get calculated as part of market basket analysis and unfortunately most of them don't have very intuitive names. We'll run through each in turn starting with 'support'. Support basically just means how often something occurs. So the support for 'eggs' would just be 'baskets with eggs / total baskets' and the support for 'flour' would be 'baskets with flour / total baskets'. We calculate the support for our pairs of products in exactly the same way e.g. the support of 'eggs & flour' as an associated pair would just be 'number of baskets with eggs & flour / total number of baskets'. Let's calculate the support for each item and also for each of our associated pairs below:
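Both supports reuse the base tables we've already built:

```r
# support per item: baskets with the product / total baskets
support_per_item <- baskets_per_item %>%
  mutate(support_x = baskets_x / total_baskets)

# support per pair: baskets with both products / total baskets
support_xy <- baskets_xy %>%
  mutate(support_xy = baskets_xy / total_baskets)

support_per_item %>% arrange(desc(support_x)) %>% head()
support_xy %>% arrange(desc(support_xy)) %>% head()
```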
We can see that the most popular product is the Banana and it occurs in 14.4% of all baskets. We can see that our most common co-occurrence is the Hass Avocado and Bag of Bananas and these occur together in roughly 2% of baskets, which isn't a lot. This is likely due to the wide range of products that Instacart sells and also the number of single item baskets that we saw earlier. We've actually got everything we need here to work out which of our associations are happening more often than we'd expect. Before we do, however, let's look at the second commonly calculated measure called 'confidence'.
Confidence: given the presence of X, what is the likelihood that Y will also be in the basket?
With support we calculate how often our item combinations are occurring in the general population. With confidence we focus specifically on baskets that have already got our first item in. For example, if the support of eggs and flour tells us how often people are buying eggs and flour, the confidence of eggs and flour tells us that given we know the basket has eggs in what is the likelihood that it will also have flour in. This is the same as working out what proportion of egg baskets also have flour in. To calculate this we simply reuse our already calculated support of X and Y and change our denominator to the number of baskets with X e.g. 'number of baskets with eggs & flour / number of baskets with eggs'.
Unlike support, confidence is not symmetrical. For example, let's say there are 10 baskets containing both eggs and flour, so 'eggs and flour' is equivalent to 'flour and eggs'. However, what if there are 100 egg baskets overall but only 50 flour baskets? The confidence of 'eggs and flour' would be 10/100 or 10% whereas the confidence of 'flour and eggs' would be 10/50 or 20%.
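To calculate it, we join the per-item basket counts onto our pairs table and divide. To see the asymmetry in action we can pull out both directions of our top pair (the product names below are as they appear in the data):

```r
confidence_xy <- support_xy %>%
  inner_join(baskets_per_item %>% select(product_id, baskets_x),
             by = c("product_id.x" = "product_id")) %>%
  mutate(confidence_xy = baskets_xy / baskets_x)

confidence_xy %>%
  filter(product_name.x %in% c("Bag of Organic Bananas", "Organic Hass Avocado"),
         product_name.y %in% c("Bag of Organic Bananas", "Organic Hass Avocado"))
```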
Before we saw that only 2% of baskets had both a Bag of Bananas and a Hass Avocado which didn't seem like many. Now however, we can see that nearly 17% of Bag of Bananas baskets also have a Hass Avocado. The relationship is even stronger in the opposite direction with over 30% of Hass Avocado baskets also having a Bag of Bananas in. Confidence can be helpful as two items might have low support because they're not widely shopped products in general but if we see they have high confidence this might suggest there is some association between them.
Something to watch out for though is that very common items will tend to have high confidence to other items just from the fact that they occur in a lot of baskets in general e.g. if we had a product that was in 80% of customers' baskets then the expected confidence between other lines and it would be around 80% from chance alone. To work out whether the association is higher than we'd expect we need some way of accounting for how often we'd expect things to co-occur naturally and we'll see how we can do this by calculating the 'lift' next.
Where confidence can be handy is as a measure for either setting targets or measuring how well current cross sell activity is performing. For example, if we were a coffee shop we might have a target to sell a cake along with every 3rd coffee that we sell. We could then see what the confidence of 'coffee and cake' is to see if we're hitting that 33% target. Alternatively, we might report on the confidence of 'coffee and cake' by different locations and see that in some stores the confidence is much higher and we could then try and find out what they're doing differently to try and improve our cross-sell rate in other stores. For example, Instacart may well already promote 'Bag of Bananas' on the Hass Avocado page so they might even be disappointed at a confidence of 30%.
Lift: is the co-occurrence happening more or less often than we'd expect?
So far we've seen that the most commonly bought-together products, 'Hass Avocado' and 'Bag of Bananas', are only actually bought together in 2% of baskets and that 30% of Hass Avocado baskets have a Bag of Bananas in them (and 17% vice versa). What we'd really like to know is 'are these two products more likely to be bought together?' and this is what 'lift' does for us. Lift essentially compares the actual rate of co-occurrence vs the expected rate. This makes it more useful than relying on confidence alone as it factors in how popular the products are in general. There are two ways of calculating lift and I'll start with the one I find the most intuitive first.
As discussed previously, the expected rate of co-occurrence, if the likelihood of purchasing either product is independent, is the support(Hass Avocado) x support(Bag of Bananas). Multiplying how often each product is bought in general gives us how often we'd expect them to occur together just through chance alone. To get the lift, we simply divide the actual rate of co-occurrence, support(Hass Avocado & Bag of Bananas), by this expected rate. By dividing the actual by the expected, it factors in whether the products are frequently purchased or low selling products, meaning the lift calculation avoids the problem of high selling products always scoring highly like we saw for the confidence score.
A lift>1 means that the co-occurrence is happening more often than we'd expect from chance alone, whereas a lift<1 means the products are occurring less often together than we'd expect. A lift=1 or close to 1 means the products are occurring together as often as we'd expect and so it's unlikely there is any association between them.
To calculate lift, we can take our 'support_xy' table and join the 'support_per_item' table to it twice. Once to get the support of X and a second time to get the support of Y. From there we can multiply through support(X) x support(Y) to get our expected rate and then compare the actual support of X & Y to get our lift:
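Something like this (the second `select()` renames `support_x` to `support_y` on the way in):

```r
lift_xy <- support_xy %>%
  inner_join(support_per_item %>% select(product_id, support_x),
             by = c("product_id.x" = "product_id")) %>%
  inner_join(support_per_item %>% select(product_id, support_y = support_x),
             by = c("product_id.y" = "product_id")) %>%
  mutate(expected_support = support_x * support_y,  # rate if independent
         lift = support_xy / expected_support)      # actual vs expected

lift_xy %>% arrange(desc(baskets_xy)) %>% head()
```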
We can see now that 'Hass Avocado' and a 'Bag of Bananas', as well as being the most commonly bought together combination of items, also have a lift of 2.51 which means customers are putting both in their baskets 2.51x more often than we'd expect. This suggests that there is an association between them i.e. customers who buy Hass Avocados are more likely to buy a Bag of Bananas at the same time and vice versa. This is useful to know as we could build on this association by cross-promoting the products on each other's landing pages or creating a 'did you forget this?' prompt when the customer comes to check out.
Another way to calculate lift
In the previous example we used the support of each product individually to calculate the expected rate of the products occurring together. Another way to calculate lift is to use the confidence(X & Y) in conjunction with the support(Y). If we go back to our dice and coin toss example, the chances of rolling a 6 and getting a tails is 1/12. However, what if we've already rolled a 6? Then the chance of getting a 6 and a tails is 1/2 i.e. just the probability of tossing tails. If we know that one of the outcomes has already happened i.e. rolling a 6, then the probability of both outcomes occurring is simply the general probability of the second event occurring i.e. tossing tails = 1/2.
So for any product X, given we know it's already in the basket, the expected confidence between X and Y is just how often Y occurs in general i.e. support(Y). For example, we know that the chance of any basket having a 'Bag of Bananas' in is 12% i.e. support(Bag of Bananas) = 12%. Therefore, we'd expect the confidence between any product and a 'Bag of Bananas' to be 12% as that's just how often the Bag of Bananas occurs.
We can use this expected rate of occurrence to calculate lift by comparing it to what actually happens i.e. lift = confidence(X & Y) / support(Y). For example, the expected confidence between a Hass Avocado and a Bag of Bananas should be 12% as that's how often a Bag of Bananas occurs. If the actual confidence between a Hass Avocado and a Bag of Bananas is >12% then we know they're occurring together more often than we'd expect, so the lift would be >1. Let's see how this works in code form:
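Reusing the `confidence_xy` table from earlier:

```r
lift_xy_alt <- confidence_xy %>%
  inner_join(support_per_item %>% select(product_id, support_y = support_x),
             by = c("product_id.y" = "product_id")) %>%
  mutate(lift = confidence_xy / support_y)  # actual confidence vs expected

lift_xy_alt %>% arrange(desc(baskets_xy)) %>% head()
```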
We can see that we get the same results as previously which is reassuring. This second way of calculating lift can be useful when we start to consider the co-occurrence of items that have a time element e.g. flour then eggs. We'll look at this in more detail in the next section but for now let's have a look at what some of the most popular associated products are.
When looking at items with the highest lift we normally also want to apply a support based filter too. If we're recommending products we want a good mix of a strong association but also not too niche/rarely occurring products. The more support a product has also suggests that the association is a true one i.e. based off lots of purchases/evidence rather than a fluke. Let's see what the top associations are for our Hass Avocado and apply a condition that the other product needs to be in at least 1% of baskets:
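Filtering our lift table to one product (using the full product name as it appears in the data):

```r
lift_xy %>%
  filter(product_name.x == "Organic Hass Avocado",
         support_y >= 0.01) %>%   # other product in at least 1% of baskets
  arrange(desc(lift)) %>%
  head(10)
```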
It looks like our avocado is often bought with other salad items and in fact looking at the top products together e.g. lemon, tomato, and eggs we can almost start to see what recipes our customers might be buying it for e.g. smashed avocado on toast. We can see as well that all the top products are organic and scratch cook lines so we get extra insight into the types of shopping missions or customers who might be buying the line.
Let's have a look at some of the top associations from across all the products and we can plot the results to make them a bit more obvious:
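A sketch of the plot, again with a 1% support filter on both products and keeping one row per pair (the number of pairs shown and the chart styling are my own choices):

```r
top_pairs <- lift_xy %>%
  filter(support_x >= 0.01, support_y >= 0.01,
         product_id.x > product_id.y) %>%   # one row per pair for plotting
  slice_max(lift, n = 15) %>%
  mutate(pair = paste(product_name.x, "&", product_name.y))

ggplot(top_pairs, aes(x = lift, y = reorder(pair, lift))) +
  geom_col() +
  labs(x = "Lift", y = NULL, title = "Top product associations")
```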
We can see that our top association is strawberries with frozen blueberries. Having a fresh line with a frozen one is interesting. Perhaps people are using them to make a smoothie or maybe there isn't a fresh blueberry line in the assortment? This is something we could check as, if only a frozen version is currently stocked, it could be a suggestion for the buying team to go out and source a fresh line. Alternatively if there is a fresh version we could compare the price of the fresh and frozen blueberries to see if the frozen are significantly cheaper which might be driving the association (and give us insight around the pricing of fresh blueberries).
Other associations all make sense such as the different flavoured sparkling waters. We could offer suggestions to customers buying one type to try the others or offer them in a multibuy deal. The cilantro and jalapeno is interesting and, looking at some of the other associations they're involved with e.g. red onions and limes, again we can start to build a picture of what customers might be buying and cooking to drive those associations. Could we try offering recipe cards or an 'add to basket' for all 4 items if we think people are specifically shopping those lines for a specific recipe?
What about X then Y?
So far we've been looking at the co-occurrence of lines without worrying about whether X or Y was added to the basket first. Sometimes it can be useful to include the time element e.g. X then Y to understand how sequences of products can behave differently. For example, we might have determined that flour & eggs are associated but it might be useful to know whether it's eggs that makes the adding of flour more likely or vice versa.
To do this with our existing code, all we need to do is modify our query that counts how often both products are bought together. When we join the two tables to get all the other items that were also in the basket, we can now use the 'add_to_cart_order' column which records the order in which products were added to the basket. So now after joining, we can filter out products to only keep those in the second column that were added to the basket after the product in the first column:
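The only change from our earlier self-join is the filter on 'add_to_cart_order':

```r
baskets_x_then_y <- orders %>%
  select(order_id, product_id, product_name, add_to_cart_order) %>%
  inner_join(orders %>%
               select(order_id, product_id, product_name, add_to_cart_order),
             by = "order_id") %>%
  filter(add_to_cart_order.x < add_to_cart_order.y) %>%  # keep 'X then Y' only
  count(product_id.x, product_name.x,
        product_id.y, product_name.y,
        name = "baskets_xy")

baskets_x_then_y %>% arrange(desc(baskets_xy)) %>% head(10)
```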
The most common combination is still a 'Bag of Bananas' and the 'Hass Avocado' but we can see that bananas first then avocado is the top combination whereas the reverse, avocado then bananas, is only number 10. We can then take our new 'bought both' table and plug it into our existing framework to calculate the lift for each combination:
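Plugging it into the same lift calculation as before:

```r
lift_x_then_y <- baskets_x_then_y %>%
  mutate(support_xy = baskets_xy / total_baskets) %>%
  inner_join(support_per_item %>% select(product_id, support_x),
             by = c("product_id.x" = "product_id")) %>%
  inner_join(support_per_item %>% select(product_id, support_y = support_x),
             by = c("product_id.y" = "product_id")) %>%
  mutate(lift = support_xy / (support_x * support_y))

lift_x_then_y %>%
  filter(support_x >= 0.01, support_y >= 0.01) %>%
  arrange(desc(lift)) %>%
  head(10)
```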
Quite a few of the top combinations are the same as before but there are also a few new ones.
Considerations: Time periods, causation, returns and substitutes
This last section just runs through some common watch-outs or considerations when running your market basket analysis. The first and hopefully obvious one is to consider the time period you're using when looking for associations. Too short a period and you might not have enough data to give confidence to your findings. Conversely, too long a period might mean you capture associations that no longer hold or find them between products that are no longer stocked.
The second watch out is to use your subject matter expertise to understand what might be driving the associations and be aware of any activities that might skew your results. For example, if you know that some products are offered on multibuy promotions or sit near each other in store, it's likely they'll have a strong association but this is down to decisions already taken by the business rather than uncovering something new.
The final watch-out is around returns data and substitutable products. We've been fortunate that our data doesn't look to include any returns, and returns aren't too much of a problem when looking at products occurring together in the same basket. Where they can cause issues is if we widen our analysis to look at, say, items bought by the same customer across multiple baskets or visits.
Returns can cause issues with market basket analysis as it's possible to end up recommending the same or a near identical product to customers. For example, if I buy a TV only for it to arrive broken, I'll return it and order a replacement. Analysis of my purchases will then suggest that customers who buy a TV are more likely to also buy a subsequent TV. Or customers might order the same item of clothing in different sizes or colours and then return any they don't like. This can lead to recommending the same dress or trainer to customers who have already bought it just in a different colour or size.
Ideally you want to try and identify and remove any returns data before running your market basket analysis. Another option is to filter any recommendations generated by the associations to say that the products must be from a different category or fulfil a different mission e.g. you can't recommend more trainers to someone who has just bought trainers but you can recommend socks.
If you'd like to learn about another technique for finding associations between products, there's a tutorial on Yule's Q available here.