top of page

Joining data using dplyr tutorial

In this post you'll learn how to merge data with dplyr using standard joins such as inner, left and full join and some tips and ticks for common challenges such as merging multiple tables with different join keys. We'll also look at dplyr's filtering joins, set operations and binding functions.

dplyr BB8.PNG

dplyr and the starwars data set

 

For this tutorial we're going to run through the different types of joins that are available in the dplyr package. If you're not familiar with dplyr there's a longer post all about data wrangling and manipulation with dplyr here. RStudio also have a handy cheat sheet and Hadley Wickham, the package author, has also written a book that's available for free online that has a section on joining data.

We'll be using the starwars data set that comes with dplyr for all the examples today. We'll create a couple of filtered tables from it so we can practice joining them together to see how each of the different joins in dplyr behave. So let's go ahead and load the package and create our data sets:

To keep things nice and simple we've got two main data sets we'll be using. I like to give my data sets names that help me remember what I've done to them so apologies they're not very original. The first is the 'humans_droids' table, which is a collection of all the human and droid characters which we can see have 41 rows and 2 columns. The second data set, which I've just called 'humans', is just the human characters which is 35 rows and 2 columns.

The syntax for joins in dplyr works the same as for other verbs. You can either use the form: verb(data1, data2) or we can use the pipe: data1 %>% verb(data2). I tend to use the latter when coding but will use the former for this tutorial as it looks a bit neater for showing the examples. We can see that they both work in the same way below:

Just like other dplyr functions, we can join combine joins with other verbs with the pipe %>% function:

Joining best practice

Before we have a look at how the joins works in dplyr it's worth mentioning two bits of best practice that apply to joining data in

most languages:

  1. Don't join on missing data: missing values will join to all other missing values so it's very easy to end up with one-many or many-many joins unintentionally. If the missing data occurs differently across tables you can also end up joining to a different row than the one you intended.

  2. Avoid many to many joins: this is where the column you're joining on across tables has duplicate row values in both tables. As each row matches to every other row you end up with a table much larger than your original e.g. if you had 10 duplicates in your join column in both tables your final table will be 10x10=100 rows.

There will be rare occasions when you actually do want to do one of these but more often than not they'll be done by accident and have the potential to give you strange results or depending on the size of your data, crash your session or server. You can see in the many-many join our table has gone from 87 rows all the way to 1,335 rows. Now imagine if we had millions or rows in our original join!

 

You can also see in the output how dplyr handles columns from different tables having the same name. If they're not specified in the 'by=' then dplyr keeps both sets and gives them a '.x' or '.y' suffix to show which table they came from. The first table in the join is given .x and the second .y. We'll see how we can change these if we want to later on.

Finally, unlike in SQL where you need to explicitly list which columns are to be used to make the join, dplyr is able to infer them for us if they have the same name in both data sets. This is a neat ability but I'd always recommend manually specifying them in the 'by=' as it's easy to unintentionally join on more or different columns than what you intended if you leave it up to dplyr to decide. You can see which columns dplyr picked to join on in the log for the second example:

Mutating joins: inner, right, left, full and cartesian joins

These joins are calling 'mutating joins' as usually you'll be adding columns to your original data set just like how the mutate() verb in dplyr is used for creating new columns. If you've ever used joins in SQL these are basically their equivalents in dplyr. Mutating joins keep all columns from both data sets and join based on matching rows. We'll run through each of these in turn starting with inner_join. Inner join only keeps rows that occur in both tables. We can see how this works with our example data sets:

Let's work through each stage of the join in turn. First up, inner_join() tells dplyr to only returns rows that are a match in both tables (but we haven't yet told it how to find matches). We then tell dplyr the two tables we want to use and in which order. The first table, sometimes also called the 'left table' or 'table x' is 'human_droids' and the second table i.e. the 'right table' or 'table y' is 'humans'. The 'by="name"' tells dplyr how we want to join the two tables i.e. we tell dplyr which columns the tables have in common and it's between these that it finds matching rows. The 'by=' column is sometimes called the 'join key'.

As the 'humans' data set is essentially a smaller subset of 'humans_droids' when we inner_join() them, we get a data set that has all the rows from 'humans', since all the rows from 'humans' are in 'humans_droids', and we gain an extra column 'species'. As an inner join only returns rows that occur in both tables, our new columns is complete/fully populated i.e. it has no missing values. The Venn diagram below shows a a more visual way to understand what's going on.

inner join.PNG
inner join text.PNG

Next up is left_join(), this keeps all of the rows and columns from the first data set and adds any new columns from the second data set. Where there is a match on our join key, these new rows will be populated with values from the second table. Where there isn't a match, the row from the first table is still kept and the new columns will simply be given a missing/NA value to show that there was no match in the second table for that row. Let's see how this looks with our data:

We can see our data set this time is 41 rows as it's essentially a copy of our first/left-hand table 'human_droids'. We gain the 'species' column from the second/right-hand table and where they had matching values for 'name' the corresponding 'species' value is brought over with the new column. Where there wasn't a match we just get an NA value. The Venn diagram for our left_join() looks like this:

left join.PNG
left join text.PNG

The black box around the left-hand table is there to emphasise that it's kept in its entirety after the join and info from the right-hand table is only appended where there is a match.

 

Having seen how the left_join() works in terms of keeping the first table and appending data from the second, what happens if we do our left_join() again but this time switch the order in which we join the tables?

This time we get a copy of the 'humans' table with a new 'homeworld' column from our second/right-hand table. We saw with our inner_join() how every row in 'humans' matches a row in 'humans_droids' and so every row of the new 'homeworld' column is populated with a value from our right-hand table on this occasion.

 

As well as left_join(), which keeps a copy of the first table and appends data from the second, there is its opposite, the right_join(). It doesn't tend to get used much as you can always make a right_join() into a left_join() by switching the table order but for completeness let's see it in an example. In fact we can rewrite our previous left_join() to work as a right_join():

right join.PNG
right join text.PNG

The final mutating join is full_join(). Whereas until now we've been only keeping rows that matched in some way a full_join(), as the name suggests, combines both tables fully so we keep all the rows and all the columns from both tables. As all the data from both table is kept it doesn't matter which is the left table or right table:

full join.PNG
full join text.PNG

A handy extra functionality of full_join() is that it can be converted into a cartesian or cross join. To do this we use our full_join() as normal, but it in the 'by=' we specify 'character()':

These types of joins are helpful if you ever need to compare all rows from one table against all other rows from a second table. For example, say we wanted to compare the similarity of each character's name to every other characters'. We've seen from the previous example that a cross join gets us every possible combination of character names (all 1,225 of them!). From this we can then calculate the edit distance between then, dedupe our logical pairs and sort our results:

 

Just like that we can see the top 10 unique name pairings that are the most similar to each other.

Tips and Tricks

 

The next section just runs through some examples of challenges that can drop up when joining data out in the wild and some with some hints and tips on how to overcome them.

What if my tables have names in common but I'm not joining on them?

We saw earlier that dplyr handles column names that are common across table but not in the 'by=' by using a '.x' and '.y' suffix for them. Let's remind ourselves of this by creating a new table 'humans_flag' that has a new column called 'species' that instead of recording the species of the character just takes the value 'yes/no' depending on whether the character is human. If we then join this table we can see what happens when we have two columns with the same name that we aren't joining by:

If we don't want to use these default suffixes we can supply our own with the 'suffix = ' option. We can pass a vector of suffixes, one for the first table and one for the second table. This way, rather than just recording the fact there were two columns with the same name in the join and which table they came from, we can try and make them more informative:

That looks a bit neater now and we can see that our two species columns are doing slightly different jobs for us. The first is a more general column that records the exact species of the character whereas the second is a binary indicator for whether the specific character is human or not.

What if my join keys have different names in each table?

For this we've got two options. We could use dplyr's rename() function to change the name of the join key so they match but this could get annoying if we've got lots of tables to join. An easier way is to deal with it in the join by telling dplyr what the different names of the join keys are. We do this by passing a vector to our 'by=' in the form of 'by=c(name1=name2)'. Let's see how this works in practice by changing the 'name' to 'character_name' and saving it in a new object:

What if I need to join on more than one column at a time?

So far all our joins have just been joining on one column at a time. Sometimes, particularly if we've got aggregated data, we might need to use multiple columns in our joins. Let's make some dummy data that records the average height and mass for characters but we'll group by 'species' and 'homeworld'. We can then also make a copy of the original starwars table but only keep the columns we need for the example:

If we now want to join our aggregated data onto our second data set we need to make sure we use both 'species' and 'homeworld' in our join otherwise we'll end duplicating out rows. For example, we've got two different averages for droids as some are from 'Naboo' and some from 'Tatooine'. We can't just join on 'species' otherwise every droid would get both of these averages which wouldn't be right.

 

We get round this in a similar way to how we did with the differently named join keys. When we join our data we pass a vector of column names in our 'by=' which tells dplyr we want to join on multiple columns at the same time:

What if I need to join on multiple columns and they have different names?

If we've got multiple columns we need to join on and they've got different names we can combine the two previous approaches. We still pass a vector to our 'by=' in the join but this time we pass a vector of join key pairs:

What if I want to join more than two tables at a time?

The easiest way to join lots of tables in one go is to pipe together all the different joins. For example, let's say we wanted to take our 'humans' data set, join on the 'homeworld' column from 'humans_droids' and then inner_join() against our average table:

Watch-out: Joined tables inherit the group_by() from the first table

This is a slightly unusual one but can be relevant if you're doing aggregating before or after joining your tables or if you've got everything in a long pipeline. Any groupings from group_by() that are on the first/left table are inherited to the table created by the join. It doesn't affect the mechanics of the join itself but obviously any subsequent aggregations you perform on the table after the join will be.

We can see that the first table has now got the group 'species' and our second table doesn't have any groups. Let's try joining them now and then checking for groups again. We'll use inner_join() and try swapping the order of the tables. The join results are the same each time but in the first example, the group_by() is inherited (as it comes from the x table) but in the second it is not (as it belongs to the y table in that instance):

Filtering joins: semi_join() and anti_join()

So far we've looked at mutating joins which modify the first table by adding columns from the second table. Next up are filtering joins. These work slightly differently in that the columns from the first table are left unchanged and no new ones are added. The join is more used as a way of filtering data i.e. removing rows based on how they match to another table.

 

For example, we might have a list of dates, going back years, with the sales for the day recorded in one table. In another table we might have 50 columns, one of which is the current financial quarter dates. To just get the sales for the latest quarter we could inner_join() the tables but then we'd also bring over 49 extra columns we don't need from the second table. We could drop these with a select() as part of our join or we could use a filtering join instead.

 

In this example, we'd want to use a semi_join() which only keeps rows in the left table that match the right table. Let's compare how it works vs an inner_join():

dplyr semi join.PNG
semi join text.PNG

You can see the difference as with the inner_join() we get the extra 'species' column from the right table whereas for the semi_join() we don't.

 

Another neat feature of filtering joins is they don't mind if you have duplicate values in your join keys. They'll never duplicate out rows if they find one-many or many-main joins, they only ever keep or drop rows. If we already have duplicate rows in our left table these will be preserved, we just won't get any more from joining to our right table.

The second type of filtering join available is called the anti_join(). Whereas the semi_join() only returns rows that match in both tables, the anti_join() returns rows from the left table that don't match rows in the right table. This can be helpful if say you've got a list of exclusions in one table that you need to apply to another. For example, let's say we've still got our sales by day table but we want to remove public holidays as we think these might skew our data. If we have a second table of public holiday dates we can anti_join() against this to only keep dates in our first table that aren't in the public holiday table.

Let's see how we can use it to remove all of the human characters from our 'humans_droids' data set:

dplyr anti join.PNG
anti join text.PNG

Set operations: union_all(), union(), intersect() and set_diff()

I tend not to use the set operations in dplyr all that much but I'll include them here for completeness. Apart from union_all(), they all require you to have the same columns in both data sets. They can be useful for things like version control or finding differences in tables after updates or refreshes. The other thing to note about the set operations is that the three of intersect(), union() and setdiff() remove duplicates from both tables. At the end of this section I'll do a comparison between how mutating/filtering joins handle duplicates compared to set operations. For now, let's start with the one set operation that doesn't remove duplicates: union_all().

So far we've joining tables side-by-side i.e. left and right. But what if we wanted to join top to bottom i.e. append new data to our table? This is where we can use union() and union_all(). The key difference is that union_all(), as the name suggests, keeps all the rows and columns from whichever tables we join. This means if we have the same row in both tables, we'll have that row twice after our join. The way union_all() works is it also keeps all of the columns from both data sets so you can end up with a block of NA values if the columns don't match:

The new data set is 76 rows long which is a combination of 35 rows from 'humans' and 41 from 'humans_droids'. Since the 'homeworld' column from the second table isn't present in the first, it is created but just filled with NA for all rows that came from the first table. Likewise, observations from the second table will have NA for the 'species' column from the first table. If we do the join again but this time sort the data we can see how the data has been appended and see how we have duplicated rows for characters that were in both tables:

union all.PNG
union all text.PNG

If we want to append data but without having any common rows occurring twice, then we can use the union() function instead. It works in exactly the same way as union_all() except it removes any duplicate rows from the final data set. To do this though it imposes the condition that the columns for both data sets have to be the same. If we try to either union() tables with different column names or with a different number of columns, we'll get an error message:

We can see that the first table has now got the group 'species' and our second table doesn't have any groups. Let's try joining them now and then checking for groups again. We'll use inner_join() and try swapping the order of the tables. The join results are the same each time but in the first example, the group_by() is inherited (as it comes from the x table) but in the second it is not (as it belongs to the y table in that instance):

This time we only get 41 rows of data as any duplicates have been removed. Out of all of dplyr's set functions, I tend to use union() the most as it's handy to be able to append and dedupe data. Rather than use union_all() I'd simply bind the rows which we'll how to do in the next section.

union.PNG
Union text.PNG

The last two set function I hardly ever use as they're very similar to the filtering joins we've just seen except they're more restrictive. Whereas the filtering joins allowed us to remove rows from 'table x' based on rows in 'table y', the two set functions intersect() and set_diff() do the same but also require us to have the same columns in both data sets. As a bonus though they do remove all duplicates, even those already present in table x before the join, unlike filter joins which only ignore/remove duplicates that'd normally be caused by the join.

Let's have a look at intersect() to start with which only returns rows that match across both tables:

inner join.PNG
intersect text.PNG

The last set function is set_diff() which like anti_join() only keeps rows in the left table that aren't in the right table. Again it requires both tables to have the same columns to work and will remove duplicates it finds:

dplyr anti join.PNG
set diff text.PNG

I'll end this section with one final demonstration of how mutating, filtering and set joins handle duplicate values differently. Mutating joins don't remove any duplication, either in the original tables or any caused by the joins. Filtering joins remove/ignore duplicates that would normally be caused by the join but not from the original table. Set functions (apart from union_all()) remove duplicates from all the data. We can see this in the example below where the 'humans' data set is union_all()'d and then joined against itself:

Binding tables with bind_rows() and bind_cols()

Last up, we'll look at two ways of joining data using bind_rows() and bind_cols(). These are essentially dply's slightly faster versions of base R's rbind() and cbind() functions. These two functions allow us to append data, either rows or columns, without having to worry about join keys.

 

First let's look at bind_rows() which functions pretty much the same as union_all(). From the help file we're told 'When row-binding, columns are matched by name, and any missing columns will be filled with NA.' This means we get extra rows appended on the end of our data set, any duplicates are preserved and any columns not present in both tables are carried over to the new table with the rows populated with NA:

The output looks just like our union_all() example. In fact, we can use the setequal() function to test whether the outputs of the two functions in this scenario are completely identical:

Next up is bind_cols(), which allows us to bind extra columns to our data set without having to specify a join key. If we look at the help file entry, we can see that the bind_cols() function uses the row number/index to perform the join: 'When column-binding, rows are matched by position, so all data frames must have the same number of rows'.

 

This is handy when working with the output of something like the predict() function which generates a vector of model predictions which you might want to then join back to your test data set so you can measure performance. In the example below we'll split our original 'starwars' into two tables and then bind_cols() them back together:

We can then use the setequal() function again to show that the bound version accurately recreates the original data set:

As well as needing to have the same number of rows in both tables, as the join takes place on row position, it's important the tables are also in the same order. For example, if we recreate the previous example but add an interim sorting step, the tables will be in different orders but the bind_cols() will still run without any warnings or errors:

We can see that even if with the tables out of order, the initial bind_cols() still runs and unless you know your data really well it isn't super obvious the join hasn't worked as we intended. We can see though that our setequal() now returns FALSE.

That concludes this tutorial on joining data with dplyr. If you'd like a more detailed tutorial on manipulating data with dplyr you can find one here or you can see how these joins can be put to use in a project with this post on market basket analysis using dplyr.

dplyr
Filtering joins
Set operations
Tips and tricks
Mutating joins
Best practice
Binding rows and columns
Copyright © 2024 Step by step data science. All Rights Reserved
bottom of page