
Finding associations in R with Yule's Q and hierarchical clustering

In this tutorial you'll learn about the Yule's Q measure of association and how it can be used to understand the associations between products in a category. We'll also see how we can express these relationships visually using hierarchical clustering.

[Figure: the coloured dendrogram of tea products we'll build in this tutorial]

What is the Yule's Q measure of association?

Yule's Q is a measure of association between two dichotomous variables (categorical variables with only 2 levels) e.g. 'yes/no' or 'bought/didn't buy'. Like market basket analysis, it's often used as a way of finding associations between products or activities based on customers' behaviours. Unlike market basket analysis, which might look at the strength of association between pairs or sets of products, Yule's Q can help us understand the association between all the products in a category and how they relate to one another. We'll see how we can combine it with hierarchical clustering to easily visualise these associations too.

 

As Yule's Q measures the association of two events, each with two possible outcomes, we can represent all the possible outcomes in a 2x2 matrix like the one below. Each letter, A-D, records the observed frequency of each outcome. In our example matrix, the two events are the customer buying or not buying Item 1 (English Breakfast Tea) and buying or not buying Item 2 (Green Tea).

[Figure: the 2x2 matrix of outcomes A-D for Item 1 and Item 2]

The green oval that includes A and D captures the number of times customers treated Item 1 and Item 2 the same i.e. either buying or not buying both.

  • A = the number of times customers bought both Item 1 and Item 2

  • D = the number of times customers didn't buy Item 1 and didn't buy Item 2

The red oval records the number of times customers treated the products differently i.e. they bought one of the items but not the other:

  • B = the number of times customers bought Item 1 but didn't buy Item 2

  • C = the number of times customers didn't buy Item 1 but did buy Item 2

From these tallies of A, B, C and D we can then compute our Yule's Q using the formula:

Q = (AD - BC) / (AD + BC)

In other words, Q = (agreement - disagreement) / (agreement + disagreement), where agreement = AxD and disagreement = BxC.
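To make the formula concrete, here's a minimal sketch of how it could be computed in R. The yules_q function name and the example counts are just illustrative:

```r
# A minimal sketch: Yule's Q from the four cell counts of the 2x2 matrix
yules_q <- function(A, B, C, D) {
  (A * D - B * C) / (A * D + B * C)
}

# e.g. 40 customers bought both teas, 10 bought only one or the other
# each way, and 50 bought neither
yules_q(A = 40, B = 10, C = 10, D = 50)  # ~0.9, a strong positive association
```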

Yule's Q can be thought of as a measure of the 'agreement' vs 'disagreement' between pairs of dichotomous events. For example, a customer buying or not buying both English Breakfast and Green Tea shows 'agreement' i.e. customers treat both products in the same way (both bought or both avoided). This is then compared to the 'disagreement' e.g. customers buying one but not the other i.e. treating the products in a different way. Products that have a higher rate of agreement than disagreement can then be said to be positively associated in some way.

If you're familiar with the Odds Ratio, Yule's Q is actually a transformation of it (Q = (OR - 1) / (OR + 1)) so it scales between -1 and 1, which makes it easier to work with. The rescaling is helpful because 1 becomes a perfect positive association, -1 a perfect negative association and 0 no association at all. This makes it easier to compare the associations across all products within a category as they all sit on the same scale, whereas for something like market basket analysis or the odds ratio, different pairs of products could have very different values e.g. one pair of products might have Lift=2 and the same Item 1 with a different Item 2 might have Lift=25.

The idea that customers not buying both products somehow contributes to a positive association between them can seem strange at first but does make sense. For example, we might have a few products in the range that are much more expensive than the rest and so whilst only a few customers might buy both, we see lots of customers buying neither, which suggests there's something about both SKUs that customers don't like i.e. they're both very expensive. Likewise, a vegetarian will avoid both chicken and beef and so we can infer from their avoiding both that the products are similar in some way i.e. they share some property that makes both undesirable to the customer.

This is one of the useful properties of Yule's Q and gives it an advantage over something like market basket analysis which only considers the co-presence of item pairs as a positive association between them. In market basket analysis, we can only ever have a Lift > 1 for products if they are often bought together whereas Yule's Q can infer associations from co-avoidance.

Another advantage of Yule's Q compared to market basket analysis is that it only considers how often the products making up our pairs are/aren't bought together rather than how often they are bought in general. For example, a line that is bought by 50% of customers can only ever have a maximum Lift of 2 with another product in market basket analysis. A less commonly bought product however can have a much higher Lift. For Yule's Q though, any lines that are nearly always bought together will have a score close to 1 regardless of how often they are bought in general.

One of the downsides with Yule's Q is that, as we multiply AxD and BxC, if any of the four possible outcomes (A-D) records a frequency of 0 then the score will default to either 1 or -1 regardless of the other values. For example, if 0 customers buy both products but 100 buy neither, when we calculate AD we get (0 x 100) = 0 agreement so our score will default to -1. This means Yule's Q tends to work best with popular or high-selling lines and you need enough data so that each of A-D has some observed frequencies.

Now we've discussed how to calculate Yule's Q and some of its advantages and disadvantages, let's have a look at how we can apply it to some real data.

Instacart grocery data from Kaggle

For this tutorial we'll be using the Instacart data from the 'Instacart Market Basket Analysis' competition on Kaggle. You can download the data here. It's split across a few separate files that we'll need to load and merge. We can load the data and then do some of the pre-processing before calculating our associations:
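Here's a sketch of the load-and-merge step. The file names are as they appear in the Kaggle download, and you'll need to adjust the paths to wherever you've saved the data:

```r
library(dplyr)
library(readr)

# File names as they come in the Kaggle download
orders               <- read_csv("orders.csv")
order_products_prior <- read_csv("order_products__prior.csv")
products             <- read_csv("products.csv")
aisles               <- read_csv("aisles.csv")

# Merge into one table of orders x products with the aisle names attached
data <- order_products_prior %>%
  inner_join(orders, by = "order_id") %>%
  inner_join(products, by = "product_id") %>%
  left_join(aisles, by = "aisle_id")
```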

The 'orders' table has an ID for the order and user as well as some information around the time and date the order was made. This links to the 'order_products_prior' table which records which products were added to the order and in which order. We've then got a couple of product and category lookup tables which have some more detail about what the products are called and the names of the categories they're in.

 

We've got a couple of choices in how we want to work with this data. We've already looked at the associations between products that occur in a basket together in the market basket analysis tutorial. Yule's Q tends to be most helpful in understanding how all products in a category relate to one another as a group. To do this, rather than look at products that occur in a basket together, we'll instead look at products that are bought by the same customer.

 

The idea behind this is that products that are very similar may in fact rarely occur in a basket together. For example, I might buy strawberry yogurt one week and raspberry yogurt the week after. I obviously like both types of yogurt but I actually rarely buy them together because by putting one in my basket it removes the need for the other. Therefore, to understand how I as a customer shop the yogurts category, we need to look at my full purchase history as a whole rather than on a basket by basket basis.

Prepping the data

As we're trying to understand how products relate to one another across customers' total purchase histories, rather than just within baskets, we'll need to make sure we pick customers who have at least shopped the category a few times. For example, if I only shop once and all I buy is strawberry yogurt, it'll look like the product isn't related to anything else. However if we have at least a few of my shops we can see that I buy strawberry or raspberry yogurt and can infer they must be related in some way.

As mentioned earlier, one of the challenges with Yule's Q is making sure we have values for each possible combination of behaviours for each of our item pairs, so we also need to filter out niche or not very commonly bought products. Another consideration is that later on we'll want to visualise the associations using hierarchical clustering. This technique is sensitive to any noise in our data so we want to make sure any associations we find are based on a large enough sample size to be robust. For example, we might have two pairs of products both with a Yule's Q score of 0.8, but one pair might be based on 1000s of customers shopping the lines whereas the other might be from only 10s.

 

Let's start off by having a look at how many customers we have and how often they shop to inform some of these decisions.
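Assuming the merged table from the load step is called data, a quick count might look like:

```r
# How many unique customers are in the data?
n_distinct(data$user_id)
```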

It looks like we've got over 200k customers in our data set which is quite a lot so hopefully most products will have enough customers purchasing them. Let's have a look at how often these customers are shopping:
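A sketch of the plot, using ggplot2 on the orders table (which has one row per order):

```r
library(ggplot2)

# Distribution of how many orders each customer has placed
orders %>%
  count(user_id, name = "num_orders") %>%
  ggplot(aes(x = num_orders)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Number of orders", y = "Number of customers")
```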

[Figure: histogram of the number of orders per customer]

Most customers appear to have shopped multiple times (that spike at 100 orders suggests Instacart might have truncated any orders past that point) which is great. By having multiple visits, rather than just one or two, we'll get a better idea of any themes or trends in how customers shop a category.
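To pick a category to work with, we can count how many unique customers have shopped each aisle. A sketch, again assuming our merged data table:

```r
# Count how many unique customers have bought from each aisle
data %>%
  group_by(aisle) %>%
  summarise(customers = n_distinct(user_id)) %>%
  arrange(desc(customers))
```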

So it looks like the most popular aisle doesn't have a name. That's unhelpful! The fact every customer appears to have 'shopped' it suggests it might be a fixed fee like a delivery charge so we can ignore it for now. The other aisles all look like they have high customer numbers as well. Let's pick one and create a subset of all the data for it. I've gone for 'tea' but you can pick another one if you like. I'll save the selection into a new object that then gets used in a filter() in the code.

 

The reason for calculating the associations one category at a time is that later on we'll need to cross join all the products in the category, which can make the table rather large. Let's apply our filter and see how many customers and products we're left with:
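A sketch of the filter and counts (the chosen_aisle and tea object names are just illustrative):

```r
# Save the selection into an object so it's easy to swap categories later
chosen_aisle <- "tea"

tea <- data %>%
  filter(aisle == chosen_aisle)

# How many customers and products are we left with?
n_distinct(tea$user_id)
n_distinct(tea$product_id)
```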

It looks like we've got 53k customers who between them bought 894 different types of tea. These are good volumes again so we can now apply our conditions for the minimum number of customers shopping the lines. There isn't a set rule on how many customers need to have purchased a line but I try to find a number that is large enough to give me confidence that any associations found aren't down to noise whilst also keeping a good number of the products from the range.

 

For now I'll apply a filter that says customers need to have had at least 3 transactions in the tea category and that each product needs to have been bought by at least 100 of these customers to be included in the analysis. Another way to do it, if you have sales data, is to just keep the products that account for say 80%+ of the sales in the category so you know they are popular lines and your analysis covers the most important ones.
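A sketch of how those two filters could be applied (the thresholds are easy to tweak):

```r
# Keep customers with at least 3 separate orders containing tea
frequent_users <- tea %>%
  group_by(user_id) %>%
  summarise(num_orders = n_distinct(order_id)) %>%
  filter(num_orders >= 3)

# Keep products bought by at least 100 of those customers
popular_products <- tea %>%
  semi_join(frequent_users, by = "user_id") %>%
  group_by(product_id) %>%
  summarise(customers = n_distinct(user_id)) %>%
  filter(customers >= 100)

tea_subset <- tea %>%
  semi_join(frequent_users, by = "user_id") %>%
  semi_join(popular_products, by = "product_id")

n_distinct(tea_subset$user_id)    # customers remaining
n_distinct(tea_subset$product_id) # products remaining
```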

Our exclusions mean we now have 17k customers and 194 lines. This is quite a big drop and might be more than we want. If we had sales data we could see how much sales coverage the 194 lines give us in the category. Having used the Instacart data for other tutorials, it does look like it has a very long tail of infrequently purchased items so I'm happy we've probably still captured the most important ones.

As we're interested in customers and which items they've purchased, we can do a final few bits of data prep to shrink the size of some of our tables. First up we can get a unique list of customers and the products they bought. What we'll be counting is the number of customers who have bought the line at any point during the time period rather than how many times customers have bought it. This ties back to Yule's Q working with dichotomous variables (e.g. did buy/didn't buy) rather than continuous ones (e.g. how many they bought).
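A minimal sketch of that de-duplication step, assuming the filtered table is called tea_subset:

```r
# One row per customer x product, regardless of how many times they
# bought it - Yule's Q only needs did buy / didn't buy
user_products <- tea_subset %>%
  distinct(user_id, product_id, product_name)
```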

Creating our base tables

 

Now we've prepared our data we can create the base tables we need for calculating our Yule's Q score. If you've read the market basket analysis tutorial you'll probably find these familiar and a lot of other measures of association also use the same base tables we're about to create.

  1. For each product, the total number of customers who purchased it

  2. The total number of customers who shopped the category

  3. The number of customers who bought both products in the pair

  4. Every combination of possible pairs of products in the category

 

Let's calculate each of these now, starting with the last one first. What we need is essentially 'every product x every other product in the category'. This forms our base table of every possible pair of products in the category, onto which we can then join each of the different 2x2 A-D fields as we calculate them.

We can do this by converting a full_join() from dplyr into a cross join by adding the 'by = character()' option. If you'd like to know more about joining data with dplyr there's a separate tutorial here.
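Here's a sketch of the cross join (product_list is just an illustrative name for the de-duplicated product lookup):

```r
# De-duplicated product lookup to cross join with itself
product_list <- user_products %>%
  distinct(product_id, product_name)

all_combinations <- full_join(
  product_list %>% rename(product_id_1 = product_id, product_1 = product_name),
  product_list %>% rename(product_id_2 = product_id, product_2 = product_name),
  by = character()  # an empty 'by' turns the full join into a cross join
)

nrow(all_combinations)  # 194^2 = 37,636 pairs
```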

We can see that our 194 products now give us 37,636 (194^2) pairs of products. Let's calculate some of the other fields we need for our Yule's Q score and add them to our table. The only one of these we need to calculate directly is A (the number of customers who buy both products). All the others we can derive from our base tables with a bit of clever subtraction. Let's go ahead and calculate A now though:
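A sketch of how that could look (users_both is an illustrative name for the result):

```r
# A = the number of customers who bought both products in the pair
users_both <- all_combinations %>%
  # join on every customer who bought the first product in the pair
  inner_join(user_products %>% select(user_id, product_id),
             by = c("product_id_1" = "product_id")) %>%
  # join on user_id too, so we only keep those who also bought the second
  inner_join(user_products %>% select(user_id, product_id),
             by = c("product_id_2" = "product_id", "user_id" = "user_id")) %>%
  group_by(product_id_1, product_1, product_id_2, product_2) %>%
  summarise(bought_both = n_distinct(user_id), .groups = "drop")
```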

How this works is we take our base table of every product combination and then, for the first product in the pair, we join on all the customers who bought that product. In the next join, we join on all the customers who bought product 2 i.e. the second product in the pair, and we also join on 'user_id' so that we only keep customers who bought product 2 and also bought product 1. From there we can group by the pair of products and count how many individual customers bought both.

If that's a bit much all in one go, try running just the two joins without the final group_by() and summarise(): what you get is essentially a list of every customer that has ever bought both products in the pair, i.e. product 1 and product 2.

Now we've got A calculated, we can create our next two base tables and then join them all to calculate the other fields with some clever subtraction. Thankfully, these two are much more straightforward to create. First, for each product, we need a count of the total number of customers that bought it:
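A sketch of the per-product counts, matching the 'user_per_product' name used later on:

```r
# For each product, the total number of customers who bought it
user_per_product <- user_products %>%
  group_by(product_id, product_name) %>%
  summarise(total_users = n_distinct(user_id), .groups = "drop")
```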

Finally we just calculate the total number of customers in the category:
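A sketch, producing a one-row table:

```r
# The total number of customers who shopped the category
total_users <- user_products %>%
  summarise(total_users = n_distinct(user_id))
```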

With these 4 base tables we've now got all of the data we need to create our Yule's Q score.

Calculating Yule's Q

 

With our 4 base tables, we've got everything we need to derive the different elements of the Yule's Q calculation. I'll go through each part in stages as it looks a bit complicated all in one go. Firstly, we need to join all of our tables using our every-pair-of-products 'all_combinations' table as the base. Next up, we join on the customers-per-product count from the 'user_per_product' table. We actually join this table twice in succession: the first time to the first product in the pair i.e. product 1, and the second time to the second product in the pair i.e. product 2.

The result is that, for every product combination, we now have the total number of customers who bought the first product in the pair and the total that bought the second. Users that bought both product 1 and product 2 will be counted in both columns, which is why we next join on the total count of customers that did buy both products. Let's add this step to our existing query:
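Putting the joins together, a sketch of the combined query (I've called the result yules_base):

```r
yules_base <- all_combinations %>%
  # total customers for the first product in the pair...
  left_join(user_per_product %>% select(product_id, users_product_1 = total_users),
            by = c("product_id_1" = "product_id")) %>%
  # ...and for the second product in the pair
  left_join(user_per_product %>% select(product_id, users_product_2 = total_users),
            by = c("product_id_2" = "product_id")) %>%
  # the count of customers who bought both products (A)
  left_join(users_both %>% select(product_id_1, product_id_2, bought_both),
            by = c("product_id_1", "product_id_2")) %>%
  # pairs nobody bought together come through as NA, so set them to 0
  mutate(bought_both = coalesce(bought_both, 0L))
```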

So now we have the total customers buying product 1, the total buying product 2 and the total buying both for every pair of products. You might be able to guess where this is going in terms of how we can use these to calculate B and C but don't worry if not. Let's add on our final base table which is just the count of all customers in the category. We don't even need to merge this on but rather can just use mutate() with pull() to extract the value out of the table directly:
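A sketch of that step:

```r
# pull() extracts the single value from the one-row total_users table
# so we can add it as a constant column without a join
yules_base <- yules_base %>%
  mutate(total_customers = pull(total_users, total_users))
```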

We've now got all the fields we need to calculate A-D and then our Yule's Q. I'll put the formula below again so we can see how each part relates to the final calculation.

[Figure: the 2x2 matrix of outcomes A-D]

Q = (AD - BC) / (AD + BC)

We know we've already got A, the number of customers that bought both products in the pair. With that and the total number of customers per product and the total number of customers, we can derive B-D as follows:

  • B = all customers that bought product 1 - A i.e. all purchasers of 1 - those that bought 1 & 2 = those that only bought 1

  • C = all customers that bought product 2 - A i.e. all purchasers of 2 - those that bought 1 & 2 = those that only bought 2

  • D = all customers - A - B - C i.e. anyone not accounted for in A-C must be in D

 

From this we can calculate our Yule's Q score. Let's see how this works when we code it up:
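A sketch of the full calculation, including the rescaled column we'll use later:

```r
yules_q_scores <- yules_base %>%
  mutate(
    A = bought_both,                 # bought both products
    B = users_product_1 - A,         # only bought product 1
    C = users_product_2 - A,         # only bought product 2
    D = total_customers - A - B - C, # bought neither product
    Yules_Q = (A * D - B * C) / (A * D + B * C),
    Yules_rescaled = 1 - Yules_Q     # 0-2 'distance' version for clustering
  )

head(yules_q_scores)
```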

Looks like we've successfully calculated the Yule's Q score for each pair of products. Well done! We can see that 'Unsweetened Premium Iced Tea' has a perfect association with itself, which is reassuring, and also a weakly positive one with another 'Unsweetened Iced Tea'. There's what looks like an Egyptian Liquorice tea with which it has a fairly strong negative association, which makes sense as they sound like quite different products.

 

You'll notice there's an extra column on the end called 'Yules_rescaled' which we'll be using in a bit for our clustering. All it does is '1 - Yule's Q', which has the handy property of converting our -1 to 1 Yule's Q scale to one between 0-2 instead, with 0 a perfect positive association and 2 a perfect negative one. The reason for doing this is that clustering algorithms work off a notion of 'distance' between elements and won't work with negative values for distance.

Visualising associations with hierarchical clustering

A nice way to visualise all the different association scores is to use hierarchical clustering with a dendrogram (the tree-like structure at the start of this post). The method we'll be using is 'bottom up' or 'agglomerative', which means each product starts as its own cluster, the closest two are then combined into a cluster, the next nearest product/cluster is combined, and this carries on until everything is in one big cluster. At each point the average distance of the cluster members from the cluster centroid is calculated, so we can see how closely associated different products are to each other as well as to other clusters.

 

For example, we might end up with a cluster of all the different iced teas which would make sense. However by using a bottom-up approach we can specifically see which iced tea products are most similar to each other first and then which other products they're most similar to next, and then we might see that the 'iced tea' products as a cluster sit closer to a 'kombucha' cluster than to, say, a 'breakfast tea' grouping.

 

By seeing how the products relate to one another we can gain useful information not just about our products but also our category. For example, products that are closely related might indicate they are substitutes, or it may indicate they're complementary in some way. This could then help us with a cross-promotional strategy or a merchandising decision to site the products next to each other on the shelf or website.

 

As it is an unsupervised method, i.e. we have no predefined labels or categories and only group products according to how customers shop the category, we might discover new relationships or product groupings. For example, there might be some brands we consider to be premium in our pricing hierarchy but that actually come in a larger pack size, so their relative price is cheaper and they're actually bought with more mid-tier products.

Before we can run our clustering algorithm we need to do a little bit more data prep to present our data in a way that the stats package in R will accept. This involves converting our table to a matrix, adding in row names and removing any punctuation just to make the output a bit easier to read:
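A sketch of that prep, using tidyr's pivot_wider() to reshape the pair-level scores into a product x product matrix:

```r
library(tidyr)
library(tibble)

yules_matrix <- yules_q_scores %>%
  # strip punctuation from the product names so the output is easier to read
  mutate(across(c(product_1, product_2), ~ gsub("[[:punct:]]", "", .x))) %>%
  select(product_1, product_2, Yules_rescaled) %>%
  # reshape so every product is a row compared against every product as a column
  pivot_wider(names_from = product_2, values_from = Yules_rescaled) %>%
  column_to_rownames("product_1") %>%
  as.matrix()

yules_matrix[1:3, 1:3]  # peek at a corner of the matrix
```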

I've only printed a corner of the matrix as the full table is rather large, but you can see that what we have is a matrix that compares every product as a row to every product as a column and uses the rescaled Yule's Q as the distance values. From this we can now run our clustering. For the distance between the different products we'll use euclidean distance, which calculates the distance as the length of a straight line between two points. For our clustering algorithm we'll use Ward's method, which aims to minimise the total within-cluster distance of the points from the cluster centres and tends to give more balanced clusters.
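A sketch of the clustering itself, using base R's dist() and hclust() ('ward.D2' is the hclust implementation of Ward's method that works on euclidean distances):

```r
# Euclidean distances between the rows of our matrix, then Ward's method
distances <- dist(yules_matrix, method = "euclidean")
clusters  <- hclust(distances, method = "ward.D2")

plot(clusters, cex = 0.5)
```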

[Figure: the plain dendrogram of the tea products]

That looks pretty good! One challenge with dendrograms is they can get rather large. I'll post a zoomed-in version of some of the clusters below so we can look at them in more detail. In terms of interpreting our clusters for now though, each vertical line represents a cluster and the height of the line represents how far apart the objects it joins are from each other. The horizontal lines connect the different clusters as they ladder up the hierarchy.

 

We can see how each individual product starts as its own cluster at the bottom and then gets joined to the nearest product or cluster with a flat line, and this keeps happening until, all the way at the top, we have one cluster with everything in it. We can read the dendrogram 'bottom-up' to see which products sit closest to other products, or we can go top-down to get an idea of how many broad groupings we think we might have. A useful way to do this is to imagine a horizontal line cutting across the data, moving from the top to the bottom. Each vertical line it cuts across then tells us how many clusters we have at that level. As the line gets closer to the bottom of the chart we'll have more clusters, but the average distance between the members will be smaller.

[Figure: the same dendrogram with a horizontal 'cut' line drawn across it]

I've just drawn this one on the image but we can add it to our plot in R and also make it a bit more visually appealing. It looks like we've probably got 3-4 larger clusters. This is where subject matter expertise is really valuable in investigating the different groupings and seeing if they make sense. We can see that if we move the line down more we'll have 4 clusters: 2 on the left, 1 large one in the middle and 1 on the right. It looks like the cluster on the left that splits into 2 would actually make for 2 quite small clusters, and they both sit quite far apart from the centre and right-hand clusters, so we might decide to keep them together and stick with 3 clusters overall.

Let's say we were happy with our 3 main clusters. We can use the dendextend package to add in a dashed line for us and also recolour our dendrogram elements to make the different clusters more obvious:
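A sketch using dendextend (the cut height for the dashed line is illustrative - read it off your own plot):

```r
library(dendextend)

dend <- as.dendrogram(clusters) %>%
  color_branches(k = 3) %>%   # colour the branches by cluster
  color_labels(k = 3) %>%     # match the label colours to the branches
  set("labels_cex", 0.5)      # shrink the labels so they fit

plot(dend)
abline(h = 20, lty = 2)  # dashed cut line - adjust the height to your data
```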

[Figure: the dendrogram with the 3 clusters coloured and a dashed cut line]

This looks a bit nicer and we can see each of the main clusters much more clearly now. As well as looking at the dendrogram, there are a couple of more statistical techniques we can use for finding the 'optimal' number of clusters. I say 'optimal' as we're doing unsupervised learning so we don't have a clear error rate we can use to measure performance. They can act as useful decision aids, just like the dendrogram, but often with clustering the use case will determine what number of clusters is optimal.

 

For example, there's no point picking 20 clusters if we can only have say 10 subcategories listed on the website. Equally, in our example there's a case that could be made for a 4th cluster, but it'd be quite small so we need to consider whether the trade-off in the extra admin and complexity of that cluster will be repaid in the value it can generate. Maybe if the 4th cluster was full of top-selling lines or unearthed a previously overlooked subgroup of customer activity it'd be worth splitting off. If it turns out it's a slightly more premium version of an already premium cluster then it's probably less useful.

The factoextra package offers us two ways of trying to identify the optimal number of clusters. We'll be using the fviz_nbclust() function with two different options. The first method is 'wss', which stands for 'within-cluster sum of squares' and creates a plot of the total within-cluster sum of squares for each number of clusters we could choose. The within-cluster sum of squares is the squared distance of each cluster member from its cluster centre, summed up, and this is then totalled across all clusters. It essentially measures how compact our clusters are.

 

The total distance will naturally go down the more clusters we have, so again it's up to us to pick a suitable trade-off point. Usually the total distance decreases quite a lot with the first few clusters before gradually tapering off as we end up splitting clusters that are already pretty compact. Often (but not always) there can be an inflection point referred to as the 'elbow' where we can see that adding an extra cluster doesn't reduce the distance by much, and so we might pick the inflection point as our trade-off point and best number of clusters.

The second method uses what's called a silhouette score, which compares how closely associated an object is with its own cluster vs other clusters. Ideally we'd have objects that are closely associated with their own clusters (compact clusters) and not associated with other clusters (good separation between clusters). The silhouette score is expressed between -1 and 1 with 1 being the highest/best score. Let's try running both of these measures and have a look at the results below:
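A sketch of both calls (hcut is factoextra's hierarchical-clustering helper):

```r
library(factoextra)

# Elbow plot: total within-cluster sum of squares for each candidate k
fviz_nbclust(yules_matrix, hcut, method = "wss")

# Average silhouette width for each candidate k
fviz_nbclust(yules_matrix, hcut, method = "silhouette")
```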

[Figure: elbow plot - total within-cluster sum of squares by number of clusters]
[Figure: average silhouette width by number of clusters]

We can see that the within-cluster sum of squares approach doesn't really give us a clear 'elbow' but maybe suggests that 4 clusters gives the best trade-off between decreasing distance vs the additional complexity of having more clusters. This makes sense as we can see in our dendrogram that the vertical lines get a lot shorter past the 4-cluster mark. Interestingly, the silhouette score suggests that 2 clusters gives us the best ratio of compactness and separation. Again this makes sense from our dendrogram, as the left clusters split quite early and sit further away from the middle/right-hand clusters.

Both approaches are useful to factor into our decision making and it's not uncommon to get slightly conflicting or indeterminate results like this on real-world, messy data. This is why subject matter expertise and usability often have the final say when determining the 'optimal' number of clusters.

 

As well as the broad, macro groupings, a lot of value can be gained from investigating the more granular clusters to understand which products sit close to each other and what factors might be driving this. For example, when I was looking at the clusters I saw a group that nearly all had 'Yerba Mate' in the name. According to Wikipedia it's a 'plant species...native to South America' that's often used as a health food.

[Figure: zoomed-in view of the Yerba Mate cluster in the dendrogram]

We can see that nearly all the Yerba Mate products sit closely together in their own group, with a couple of other green teas (maybe also in the group for their health connotations). This is interesting as there's obviously a group of customers purchasing different Yerba Mate products to drive this association. Looking at the Instacart webpage for the Safeway supermarket, what's interesting is that some of the teas are obviously popular enough, or have enough funding behind them, to appear as 'sponsored' products, but the webpage for 'tea' lacks the ability (at the time of writing) to easily filter for these lines:

[Figure: the Instacart/Safeway 'tea' page with sponsored Yerba Mate products]

In fact there appear to be only a handful of ways to filter for different types of tea: brand, bottled, leaf or powdered. A good use of the clustering would be to identify other small but significant groups of products, e.g. green tea, that could be added as filters on the website to make navigating it easier.

The last little bit of code for now: once we've finally decided how many clusters we'd like, we can ask for the final assignments to be made and extract the cluster for each product using the cutree() function:
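A sketch of the final assignment step:

```r
# Cut the tree into our 3 chosen clusters and get each product's assignment
assignments <- cutree(clusters, k = 3)

# Turn the named vector into a data frame we can join back onto our data
cluster_lookup <- data.frame(
  product = names(assignments),
  cluster = assignments,
  row.names = NULL
)

head(cluster_lookup)
```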

This gives us each of our objects (in this case the product name) and its final cluster assignment. We could then merge these back onto our original data that we read in to calculate metrics such as how many customers shop the clusters, what the overlap between them is and what other categories these customers buy.

Congratulations!

That concludes this tutorial on how to calculate product associations using Yule's Q and visualise them with hierarchical clustering. Hopefully you've found it useful, and feel free to try running the code on some different categories to see if you can find any more interesting associations. If you'd like to learn more about another form of association mining, market basket analysis, there's a tutorial on that here.
