Data Wrangling in R with dplyr tutorial
In this post you'll learn how to use the dplyr package in R to manipulate and summarise your data with dplyr's 5 main verbs: select, arrange, filter, mutate and summarise.
dplyr: part of the tidyverse
dplyr belongs to a collection of packages known as the 'tidyverse'. Its main author is Hadley Wickham who is the Chief Scientist at RStudio. The tidyverse provides a range of packages for data science that all operate in a consistent and coherent way and benefit from ongoing support and development from RStudio.
"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures." - tidyverse.org
This common underlying design and philosophy, in my opinion, removes one of the main barriers to learning R. Often in R there are so many different ways to solve a problem that it can feel like there's an overwhelming number of packages and coding styles you need to learn. Thankfully the tidyverse helps with this.
So if we go ahead and load the tidyverse, we can see all the packages it includes along with dplyr. You might recognise some of the other names, such as ggplot2 or stringr, as along with dplyr these are some of the most commonly used packages in R.
The dplyr package also comes loaded with some example data sets and we'll be using the 'starwars' data set today. After loading the tidyverse we can simply call the data set 'starwars' to get a print out in the console of the first few rows:
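# loading the tidyverse also loads dplyr; printing a tibble shows just the first few rows
library(tidyverse)
starwars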
We can see that our data set is a 'tibble', a modern version of a traditional R data frame, with 87 rows and 14 columns.
dplyr syntax
In dplyr, the first argument of any verb e.g. select, filter, etc. is the data set name and the second argument is the action you want to perform i.e. verb(data, action). For example:
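filter(starwars, species == "Human")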
This takes the data set "starwars" and then applies a filter to the species column to only bring back records that are equal to the text string "Human". Notice how, after telling the verb which data set you want to use, you can refer to the column by name (like you would in SQL) without the need for $ or "quotes" like you would in base R. The dplyr filter() also automatically drops rows with NA/missing values, unlike base R. To do the equivalent filter in base R would require:
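# base R needs the $ and quotes, and the NAs have to be handled manually
starwars[starwars$species == "Human" & !is.na(starwars$species), ]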
We can start to see already how the tidyverse makes using data in R a bit simpler. This means we can spend less time worrying about our code and more time doing our analysis.
The pipe %>% operator
One of the key features of the tidyverse is the pipe operator which is written as %>% and essentially means "and then". The pipe allows you to chain together different bits of code and operations into one readable code block. As everything is executed sequentially it means the operations read much like they would in a written sentence. For example:
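starwars %>%
  filter(species == "Droid") %>%
  arrange(height)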
We can read this as "filter the starwars data set for all rows that have their species as 'Droid', and then sort the data set by height so the shortest droid appears at the top", which is exactly what we see in the output.
We can make use of the 'and then' functionality of the pipe to rewrite our verbs so that we declare the data set we want to use in advance and then apply the verb to it. For example both of these bits of code do exactly the same thing:
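# without the pipe
filter(starwars, species == "Human")

# with the pipe
starwars %>% filter(species == "Human")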
In the second example we no longer need our data set inside the verb as we've already referenced it and then passed it via the pipe to filter(). You can write your piped code all on one line but for longer pieces it's generally considered best practice to have each new action on a new line:
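# one action per line makes longer pipes much easier to read
starwars %>%
  filter(species == "Human") %>%
  arrange(height)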
To make it even easier to use there's a handy keyboard shortcut for creating pipes in RStudio: Ctrl + Shift + M. Knowing this when I first started learning R would have saved me a lot of unnecessary typing!
Selecting columns with select()
We can use dplyr's glimpse() function to get a quick look at our data:
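glimpse(starwars)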
glimpse() is a handy way to see the structure of the data we're working with: we get the dimensions of the data set, the data type of each column, and a print out of the first few values. We can see that there are 87 rows and 14 columns in the data set with a mix of data types.
The last 3 columns (films, vehicles and starships) are a special type of column called 'list columns'. As discussed in the introduction to R post, data frames/tibbles in R are built on top of lists, which is how they can hold data of different types. Lists are also quite happy containing other lists, which allows for the creation of list columns. List columns allow us to hold nested data in data frames but can be quite confusing to work with, so for the purpose of this post we're going to remove them. Which is where select() comes in!
The select() verb lets you select columns from your data frame by name, each separated by a comma, just like you would using SQL. For example, we can declare our data set, pass it to select with the pipe and simply type out the columns that we want to bring back by hand:
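# typing out the 11 non-list columns by hand
starwars %>%
  select(name, height, mass, hair_color, skin_color, eye_color,
         birth_year, sex, gender, homeworld, species)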
Note that we only have 11 columns now rather than the original 14 and that the columns appear in the order we specified them. The number of rows is unchanged.
We can also tell dplyr to bring back all the columns apart from the ones we don't want by using a - sign in front of them:
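starwars %>%
  select(-films, -vehicles, -starships)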
We can also bring back all the columns between a range of column names by specifying our start and end columns separated by a colon:
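# every column from name through to species
starwars %>%
  select(name:species)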
Rename your columns with rename()
Whilst we're selecting columns we might also want to rename some of them. We can do this using the rename() verb which takes the following form:
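# the new name goes on the left, the old name on the right
rename(data, new_column_name = old_column_name)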
Say for instance we wanted to change the column "height" in the starwars dataset to "height_cm" to show the unit:
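starwars %>%
  rename(height_cm = height)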
Sort your data with arrange()
The arrange() verb changes the sort order of your data set, similar to how order by works in SQL. By default, it sorts in ascending order (smallest to biggest). If we look at the original starwars data set we can see that Luke Skywalker is the first character to appear:
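# Luke Skywalker appears in the first row
starwars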
However, say we wanted to see who the shortest character is (any guesses for who this might be?), we could write:
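starwars %>%
  arrange(height)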
It's Yoda! If we want to change the sort order to be descending (highest to lowest) we can use desc() within our arrange(). So to get the tallest character:
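starwars %>%
  arrange(desc(height))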
We can also sort on multiple columns at the same time. The below example sorts by homeworld ascending by default (so A-Z) and then by each character's height, which we specify to be descending (tallest to shortest):
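starwars %>%
  arrange(homeworld, desc(height))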
Select rows based on their position using slice()
The slice() function uses the row numbers of the data set to filter the data. It can be a handy way to bring back the top/bottom observations when combined with arrange(). For example if we just want the first 3 rows of the data set we can use:
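starwars %>%
  slice(1:3)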
But if we wanted the shortest 3 characters we can combine it with arrange():
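starwars %>%
  arrange(height) %>%
  slice(1:3)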
To filter on more than just row numbers however we need to use the filter() verb.
Subset your data with filter()
The dplyr filter() verb allows us to subset our data based on values in the rows. We can use filter() with the standard comparison operators in R and have already seen how it takes care of dropping rows with NA for us when applying these conditions:
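# the cut-off values here are just for illustration
starwars %>% filter(height > 200)        # taller than 200cm
starwars %>% filter(mass <= 100)         # 100kg or lighter
starwars %>% filter(species != "Human")  # everyone who isn't human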
In our earlier example we filtered on 'species' to bring back all the human characters. Say now we want to bring back characters who are both Human and from Tatooine. We can do this by using filter() and combining our conditions with the & operator:
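starwars %>%
  filter(species == "Human" & homeworld == "Tatooine")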
The & operator can also be written as a comma, which can look a bit neater if you have lots of different conditions but is potentially less clear if you start mixing it with other operators:
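starwars %>%
  filter(species == "Human", homeworld == "Tatooine")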
What about if we wanted characters that are either Humans or Droids? For this we use | to represent 'or':
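starwars %>%
  filter(species == "Human" | species == "Droid")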
When filtering with multiple 'or' conditions from within the same column it can be easier to pass a vector of conditions to the filter using %in% rather than writing them all out separated by | each time. In the previous example we were after characters whose species was Human or Droid. We can rewrite this using %in% as follows:
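starwars %>%
  filter(species %in% c("Human", "Droid"))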
The %in% operator also works with vectors of numbers. For example say we want to bring back characters whose birth year is one of 15, 19 or 21:
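starwars %>%
  filter(birth_year %in% c(15, 19, 21))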
Making robust filters with pull()
Where possible, it's good practice to avoid typing out values manually or 'hard coding' them as this can make your code vulnerable to typos or changes in the underlying data.
Say for example we had some old code that we wrote back when the original star wars film was released in 1977. Let's say at the time we wanted to filter for all the droid characters. As there weren't that many Droids in the first film we decided it was easy enough to type out all their names by hand:
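# hard coding the droids from the original film by name
starwars %>%
  filter(name %in% c("C-3PO", "R2-D2", "R5-D4"))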
This can be a nice way to construct a filter as it makes it very obvious exactly what data is being returned and meets our goal of bringing back all the droid characters. However it also creates a dependency where for each subsequent movie that introduces new characters we have to go back to the code and add in each new character's name.
This could quite quickly get tiresome and leave room for unintentional errors. A better approach would be to make our code more generic and robust in the first place. We know that the original purpose of the code was to filter for droid characters. So rather than type them all out by name we can use a more generic filter on species:
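starwars %>%
  filter(species == "Droid")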
What happens though if the underlying data can be even more variable? Say for example we have a script that runs each week to bring back the physical attributes (height and mass) for whichever characters are trending on twitter. These might all be completely different characters with no common characteristics like species for us to filter on. This is where dplyr's pull() function can come in handy.
We use pull() by passing it a data set and a column and it 'pulls' all the values into a vector. We can then either save this vector as a new object or combine it with %in% as part of our filter to create a dynamic filter condition.
In the example below I create a dummy data set 'trending' that represents the different character names that might be trending on twitter that week. In practice this file might come from a separate system that you read into R but for now I'll take 10 characters at random using sample_n(). We can then use the pull() function to bring back all the physical characteristics for these characters (if you run the code you might get different characters back as it takes a different random sample each time it runs).
Whereas before we were able to declare the data set and then pull the column, we're now using two different data sets, 'starwars' and 'trending'. When we write our pull() we need to pass it the 'trending' data set so it knows it's specifically pulling from that data set rather than from starwars:
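# dummy data: 10 random characters standing in for whoever is trending that week
trending <- starwars %>%
  sample_n(10) %>%
  select(name)

starwars %>%
  filter(name %in% pull(trending, name)) %>%
  select(name, height, mass)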
If we run the above code again we'll get a different 10 random characters but notice how our filter() query is now robust enough to cope with this and doesn't need to be updated!
Create new columns with mutate()
The mutate() verb can be used for creating new columns in your data set. It's maybe the least obviously named of all the dplyr verbs and people can sometimes get it confused with summarise(). I find the easiest way to remember the difference is that mutate makes more columns whereas summarise shrinks your data set.
Let's say we wanted to add a new column to our data set that records the BMI for each character. We do this by calling mutate, writing out the name of the new column we want to create e.g. BMI and then = with the formula or value we want to use in creating our new column:
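# BMI = weight (kg) / height (m)^2, and height is recorded in cm
starwars %>%
  mutate(BMI = mass / (height / 100)^2)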
We can create multiple new columns at once by separating each one with a comma:
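# the new columns here are just examples
starwars %>%
  mutate(height_m = height / 100,
         mass_lbs = mass * 2.2)  # approximate kg to lbs conversion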
We can also create columns using conditional logic with ifelse() and case_when(). Another great feature is that mutate works sequentially so any column we create with mutate is immediately available to use in the next line of code. Say for example we want to convert height from cm into feet and then create a new binary indicator that shows which characters are over 6ft tall. As this is a simple yes/no test i.e. taller than 6ft or not we can do this with ifelse():
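starwars %>%
  mutate(height_ft = height / 30.48,                     # 1ft = 30.48cm
         over_6ft = ifelse(height_ft > 6, "Yes", "No"))  # height_ft is usable straight away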
If we want to use multiple logical conditions in our mutate e.g. taller than 6ft, shorter than 5ft or somewhere in between we can do that with case_when():
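starwars %>%
  mutate(height_ft = height / 30.48,
         height_group = case_when(
           is.na(height_ft) ~ "Missing",
           height_ft > 6    ~ "Taller than 6ft",
           height_ft < 5    ~ "Shorter than 5ft",
           TRUE             ~ "Between 5ft-6ft"   # the catch-all for everyone else
         ))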
The final 'TRUE ~' on the last row of the case_when acts as a catch-all, like the ELSE statement in a CASE WHEN from SQL. What it does is tell dplyr that for any remaining values in the column that haven't been covered off by the preceding conditions, we just give our new column some default value that we specify. Since we know that missing, >6 and <5 cover off most possible heights, the only values not covered are for those between 5-6ft tall.
We can see what happens if we don't include the catch-all below. For our new column height_group for all the values that aren't covered in the case_when these now default to NA:
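# same as before but without the final 'TRUE ~' catch-all
starwars %>%
  mutate(height_ft = height / 30.48,
         height_group = case_when(
           is.na(height_ft) ~ "Missing",
           height_ft > 6    ~ "Taller than 6ft",
           height_ft < 5    ~ "Shorter than 5ft"
         ))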
Window functions with mutate()
Another really powerful feature of mutate() is that we can create new columns that contain summaries of data from within the table whilst preserving the original data set. These are called "window functions". Let's say we want to add a new column that contains the average height from all characters in the data set:
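starwars %>%
  mutate(avg_height = mean(height, na.rm = TRUE))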
Notice how the number of rows in the data set stays the same at 87 and we just add a new column which contains the same value for each row showing the average height from all characters in the data set.
This might not seem that useful at first but suppose we wanted a way of easily flagging which characters are tall or short. We can do this easily using the column we've just created. We know our new column gives us the population average height repeated for each row and we can then create a new column that compares the individual height of each character against the population average to work out who is taller or shorter than average.
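starwars %>%
  mutate(avg_height = mean(height, na.rm = TRUE),
         height_index = height / avg_height)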
Now we've got 2 new columns. One shows, as a ratio, how much taller or shorter a character is than the population average. So any character with a height_index > 1 we know is taller than average and anyone with a height_index < 1 is shorter.
We can then use this to create a rule that says any character who is 20%+ taller than average i.e. height_index>=1.2 we flag them as 'Tall' and any character who is 20%+ shorter than average i.e. height_index<=0.8 we'll flag as 'Short'. For all other characters in between or for those with missing heights we'll simply flag them as 'Average' to keep things simple.
A nice feature of dplyr and mutate is that rather than having to create a separate column that records the average population height to use later on, we can just use the calculation directly as part of our height_index formula. This saves us from having lots of extra columns as part of our output:
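starwars %>%
  mutate(height_index = height / mean(height, na.rm = TRUE),
         height_flag = case_when(
           height_index >= 1.2 ~ "Tall",
           height_index <= 0.8 ~ "Short",
           TRUE                ~ "Average"   # includes characters with missing heights
         ))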
Aggregating data using summarise()
The summarise() verb aggregates or 'rolls up' your data set to calculate summary statistics. Whereas with mutate() when we calculated the average height we got 1 value repeated for each row in the data set, when we call summarise we get 1 row of data and 1 column i.e. the average height of the data set:
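starwars %>%
  summarise(avg_height = mean(height, na.rm = TRUE))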
Notice how we still need to include 'na.rm=T' when calculating the mean as, unlike when we were filtering, dplyr doesn't remove missing values automatically for us. Like with mutate() we can create multiple summary statistics at once by specifying each one we want separated by a comma:
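starwars %>%
  summarise(avg_height  = mean(height, na.rm = TRUE),
            avg_mass    = mean(mass, na.rm = TRUE),
            median_mass = median(mass, na.rm = TRUE))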
Notice that the mean mass is higher than the median? This suggests we've got a particularly heavy outlier that is skewing our data (any guesses who this could be?). We'll explore this at the end when we put all our new dplyr skills to the test.
dplyr + base R = conditional sums
We can also do conditional counts and sums like you would in SQL by combining some base R functionality with summarise(). Let's say we want to count how many characters have their species as 'Human'. We could apply a filter to the whole data set and then summarise, but what if we also want to count some other characteristic from the non-human characters at the same time?
Conditional counts/sums allow us to include a filter within our summary statistic whilst keeping all of the original data available for other summary statistics to use. So rather than have less data and then count all of it like we would if we filtered first, we instead keep all the data but only count the stuff we're interested in when we summarise. As we're combining dplyr with base R it can look a little messy so the easiest way to demonstrate it is to build it up in stages.
Firstly, remember that in base R, to select a column from a data set we use the $ sign which brings back a vector with all the values in:
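starwars$species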
Once we have our column we can apply a logical condition to it which turns our vector into a logical one i.e. it returns TRUE when the condition we applied is true and FALSE when it's not:
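starwars$species == "Human"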
Notice however in the above example that we also have some NAs in our vector. This is because we have some NAs in our underlying data. Hadley Wickham describes NAs as 'contagious' as an NA value will nearly always return NA for any operation that is applied to it. This is why it can be a good idea to deal with any missing values at the start of your script depending on how you plan to use your data. For now though let's see how we can deal with missing values in our underlying data using base R.
Whereas dplyr's filter() automatically removes NAs, in base R we have to deal with them manually. Since we're trying to find out how many characters have species = 'Human' we can replicate the filter() verb by defaulting NAs to FALSE. We do this by adding an extra condition to our vector that checks which rows have species = "Human" and aren't missing:
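starwars$species == "Human" & !is.na(starwars$species)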
Now we've got our logical vector showing just TRUE/FALSE we can use that to filter our data set.
To do this in base R we take another vector using our $ and then use square brackets [ ] to subset it with our logical vector. The easiest way to think of this is that the vector outside the [ ] are all the possible rows we could bring back and the [ ] is R's way of asking "which rows should I keep?". We put the logical vector inside the [ ] which quite literally tells R TRUE/FALSE to bring back the row or not.
The cool thing about this approach is that the vector outside the [ ] can be any column in our original data set. This is because the whole process is R saying 'here's a load of possible rows, which do I keep?' Our TRUE/FALSE vector tells it which rows to keep but it doesn't care where the vector outside the [ ] came from. Below we change our filter to bring back the names of the human characters instead of their species:
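starwars$name[starwars$species == "Human" & !is.na(starwars$species)]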
Coming to R from more of an SQL background, it took me a while to understand how R knew which values to bring back without some sort of merge or join key. This is because R matches on indexes (row/column number), so if I have a vector with values c("Red", "Blue", "Green") and subset it using a logical vector of c(TRUE, FALSE, TRUE), this is the equivalent of saying return the 1st and 3rd elements of the vector i.e. "Red" and "Green". The filter works on any vector I might pass it though, as it'll always just ask for the 1st and 3rd elements.
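In code that looks like:

colours <- c("Red", "Blue", "Green")
colours[c(TRUE, FALSE, TRUE)]
# [1] "Red"   "Green"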
This is very powerful and, pushed to the extreme, you could use a logical vector created from one table to filter a vector from a different data set entirely. Be careful though: this can cause issues when the tables have a different number of rows, as R recycles vectors of different lengths, which can lead to unintended results.
Now we've seen how to filter our data set in base R, and returned a vector of human character names, we just need to add an extra bit of code to actually count how many characters have species='Human'. An easy way to do this is to use the length() function from base R which counts how many elements are in our vector.
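For example:

length(starwars$name[starwars$species == "Human" & !is.na(starwars$species)])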
As well as using base R to conditionally count our data we can use the same principle to conditionally sum it too. In the same way as before where we used our logical vector to filter for the names of characters we can use it to filter for character's heights and then sum or average these as normal:
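# average height of just the human characters
mean(starwars$height[starwars$species == "Human" & !is.na(starwars$species)], na.rm = TRUE)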
When we come to combine our base R filters with dplyr we can take advantage of dplyr's ability to declare the data set at the start of our code and then pipe %>% it to the subsequent functions. This saves us from having to keep writing out starwars$ as dplyr passes the data set to the base R functions automatically.
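For example:

starwars %>%
  summarise(humans = sum(species == "Human", na.rm = TRUE),
            avg_human_height = mean(height[species == "Human" & !is.na(species)], na.rm = TRUE))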
Aggregate by groups with group_by()
Up until now we've been creating summary statistics across the whole table with mutate and summarise but what if we want to calculate these for different groups within the data set? This is where group_by() comes in. Again, it's very similar to the 'group by' function in SQL although there are a couple of important differences about how it behaves that are worth noting.
A 'group by' in SQL is used along with an aggregation in the select statement e.g. select sum() or count() and shrinks the data set down to the different groups. However, with dplyr, the group_by() doesn't shrink the table on its own but rather leaves a note on the data set that tells other dplyr functions how to behave. This note stays with the object until it is either overwritten with another group_by() or removed with the ungroup() function.
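For example, let's add a group_by() to starwars:

starwars %>%
  group_by(species, gender)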
The group_by() we added can be seen in the top left of the print out, 'Groups: species, gender', which tells us that a group by has been added to the table. However we still get the same number of columns and rows returned. We can remove this group by using ungroup():
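starwars %>%
  group_by(species, gender) %>%
  ungroup()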
To check if a table has a group_by() already applied to it we can use the group_vars() function:
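starwars %>%
  group_by(species, gender) %>%
  group_vars()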
It's good practice to remove any group bys once you're finished with them as it's easy to forget they're there and for them to give confusing results further down the line.
Let's start by calculating the average height for each of the different species in our data set:
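starwars %>%
  group_by(species) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))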
If we want to use multiple groups at the same time we can just specify them all with a comma in between:
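starwars %>%
  group_by(species, gender) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))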
We can also use group_by() with mutate() to create new columns with summary statistics by groups. We've already used mutate() to calculate the average height across the data set and then used it to see which characters are taller or shorter than average. This might have been slightly unfair to some characters, as some species might be naturally taller or shorter than others. A fairer comparison might have been to see which characters are tall or short for their species. With group_by() we can do this:
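starwars %>%
  group_by(species) %>%
  mutate(avg_species_height = mean(height, na.rm = TRUE),
         height_index = height / avg_species_height)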
We can combine different group_by() statements in the same pipe as each one called overwrites the previous one. For example we could create a new column that stores the average height for each species and then overwrite that with a group_by(homeworld) to create another column for average height by homeworld:
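starwars %>%
  group_by(species) %>%
  mutate(avg_height_species = mean(height, na.rm = TRUE)) %>%
  group_by(homeworld) %>%   # overwrites the species group by
  mutate(avg_height_homeworld = mean(height, na.rm = TRUE))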
The slice() function that we've already been using also works with group_by(). As slice() returns rows by their row position we can combine it with group_by() to easily get the tallest 3 characters per homeworld i.e. it takes the first 3 rows within each group:
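starwars %>%
  group_by(homeworld) %>%
  arrange(desc(height)) %>%
  slice(1:3)   # first 3 rows within each homeworld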
Apply functions across multiple columns using across()
Up until now we've been performing calculations on one column at a time e.g. take the mean of mass or take the mean of height, etc. With across() it's possible to call the same function to run across multiple columns all at the same time. Say for instance we wanted to count the number of missing values in each column. We could write it out by hand:
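# counting missing values one column at a time quickly gets repetitive
starwars %>%
  summarise(name_missing   = sum(is.na(name)),
            height_missing = sum(is.na(height)),
            mass_missing   = sum(is.na(mass)))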
This involves a lot of copying and pasting, and if we wanted to calculate it for all 14 columns we'd need to write it out 14 times! With across() we can apply a function to all columns, a subset of columns, or even to columns that meet a certain condition. Let's see what the previous example looks like when we use across():
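starwars %>%
  summarise(across(c(name, height, mass), ~ sum(is.na(.x))))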
We can see that by using across() we save ourselves a lot of duplicated typing (even if we have to use more brackets). Rather than write out a sum function per column, we call across and pass it a vector of columns and then tell it which function to apply. How we call the function gets changed a little bit to make it work with across but it's nothing too major. The structure for across() is:
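# the general form
across(columns_to_go_across, function_to_apply)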
The columns_to_go_across can be a vector of column names like we've already seen or it can be any from a range of 'selection helpers' i.e. some form of condition that selects only certain columns. For example, the helper 'everything()' tells R to select all columns:
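starwars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))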
Or we can use the 'starts_with()' helper to pick all the columns starting with 's':
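starwars %>%
  summarise(across(starts_with("s"), ~ sum(is.na(.x))))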
The function_to_apply part of across() is simply where we tell R which function to use on our selected columns; it just looks a little unusual. The ~ at the start of our function is borrowed from the purrr package and is used to tell R 'get ready for a function'. After that we simply type out our function, but instead of giving it a column name we pass it .x as a placeholder instead. The combination of the ~ and the .x is what allows dplyr to run the function across all the columns we specified in our across. Across works by essentially telling R: this verb (summarise), across these columns (starts_with 's'), apply this function (~sum(is.na(.x))).
Below are some more examples of how we can use and combine different selection helpers to pick which columns we want to call our function on:
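# all numeric columns
starwars %>% summarise(across(where(is.numeric), ~ sum(is.na(.x))))

# columns whose names contain "color"
starwars %>% summarise(across(contains("color"), ~ sum(is.na(.x))))

# helpers can be combined: numeric columns whose names end with "year"
starwars %>% summarise(across(where(is.numeric) & ends_with("year"), ~ sum(is.na(.x))))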
There's one final option in across() that can come in handy and that's .names. This gives us the option to specify the names of the columns that get created. We can either pass it just a normal string e.g. .names='my_new_name' or we can use some functionality from the glue package to combine the original column name with our string. To do this we use {.col} to tell dplyr we want to use the column name and then combine it with our string. For example:
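# creates height_missing, mass_missing, birth_year_missing
starwars %>%
  summarise(across(where(is.numeric), ~ sum(is.na(.x)), .names = "{.col}_missing"))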
So far we've been calling one function at a time in our across() but dplyr is actually powerful enough to allow us to stack multiple functions into a list and call them all in the same across(). We do this simply by putting list() in front of our function call and then giving each separate function its own name. This use of list() also has the handy side effect of adding our function name as a suffix to the outputted columns:
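starwars %>%
  summarise(across(where(is.numeric),
                   list(missing = ~ sum(is.na(.x)),
                        avg     = ~ mean(.x, na.rm = TRUE))))
# creates height_missing, height_avg, mass_missing, mass_avg, etc.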
The automatic renaming with the list can be handy even if we're only calling one function but want to preserve rather than overwrite the columns we're calling our function on. For example we could use mutate() to replace missing values in our numeric columns but we might want to also keep the originals at the same time. We can use list() to do this for us:
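# replacing NAs with the column average is just an illustrative choice
starwars %>%
  mutate(across(where(is.numeric),
                list(clean = ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))))
# keeps height, mass, birth_year and adds height_clean, mass_clean, birth_year_clean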
Case Study: picking players for a basketball tournament
Now we've been through all the different dplyr functions let's try putting them all together in one pipe. Suppose we've been asked to help pick teams for a star wars universe basketball tournament. Both the droids and humans have enough players to field a whole team each whereas for the other species we'll make an "All-Star 5". As it's basketball we'll want to pick the tallest players first and then for any players tied on height we'll use mass to pick the lighter player. Here's how we might write a dplyr pipeline to achieve this:
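# one way to write the pipeline (the team labels are our own)
teams <- starwars %>%
  mutate(team = case_when(
    species == "Droid" ~ "Droids",
    species == "Human" ~ "Humans",
    TRUE               ~ "All-Star 5"   # everyone else, including missing species
  )) %>%
  group_by(team) %>%
  arrange(desc(height), mass) %>%   # tallest first; the lighter player wins any ties
  slice(1:5) %>%
  select(name, team, species, height, mass)

teams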
If we have a look at the players selected we can see that the droids might struggle as they've got 3 players all under 1m. The humans are all pretty consistent around 1.9-2m, but it looks like our All-Star 5 are probably the pre-tournament favourites with all their players 2.2m+ and the inclusion of two Wookiees.
To help our understanding of which team might be favourites, we can pass the output from our dplyr pipe to another tidyverse package called ggplot2 to better visualise the data:
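teams %>%
  ggplot(aes(x = height, y = mass, colour = team)) +
  geom_point(size = 3)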
The plot looks to confirm what we already thought. The All-Star 5 team has all of the tallest players and they don't appear to be significantly heavier than the human team either. The Droids look to have the shortest team and so are likely to struggle the most.
Case Study: finding the outlier
Remember when we were looking at summarise() and we saw that mean mass was higher than the median?
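As a reminder:

starwars %>%
  summarise(avg_mass    = mean(mass, na.rm = TRUE),
            median_mass = median(mass, na.rm = TRUE))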
The median mass is the value from the character in the middle of the data set when it's sorted. The fact the mean is different from the median tells us that the distribution is not symmetrical. Since the mean is higher this tells us that it's likely we've got either a particularly heavy character or set of characters that is skewing our data.
A common way to identify outliers is to find observations whose value is more than 3 standard deviations from the mean. We can use this same approach in a dplyr pipeline and plot the results:
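# flag any character whose mass is more than 3 standard deviations from the mean
starwars %>%
  mutate(avg_mass = mean(mass, na.rm = TRUE),
         sd_mass  = sd(mass, na.rm = TRUE),
         outlier  = ifelse(abs(mass - avg_mass) > 3 * sd_mass, "Outlier", "OK")) %>%
  ggplot(aes(x = height, y = mass, colour = outlier)) +
  geom_point(size = 3)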
From our plot we can see that our pipeline successfully flagged 1 observation as an outlier for mass and from the plot we can see just how much of an outlier Jabba the Hutt really is!
Bonus: reshaping data with pivot from tidyr
I've called this section a bonus as, technically, the pivot functions come from the tidyr package rather than dplyr, but they're very handy so I thought I'd add them in. Pivoting essentially allows you to reshape your data and in tidyr we do this with the helpfully named pivot_wider(), which takes long data (lots of rows) and makes it wide (lots of columns), and pivot_longer(), which takes wide data (lots of columns) and converts it into long data (lots of rows).
We'll start with pivot_wider(). There are lots of options you can read about on the reference page but the main ones are:
- id_cols - the values you still want to have as rows in your pivoted data e.g. name or some form of identifier
- names_from - where the names of the newly created columns should come from
- values_from - what columns you want to use to populate the newly created columns
The easiest way to understand how it works is to see it in practice so let's create some 'long' data and then repeat the process but with a pivot_wider():
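# the species picked here are illustrative; we just need groups with both sexes
long_data <- starwars %>%
  filter(species %in% c("Human", "Twi'lek", "Kaminoan"),
         sex %in% c("male", "female")) %>%
  group_by(species, sex) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))

long_data   # 6 rows: one per species and sex combination

long_data %>%
  pivot_wider(id_cols = species, names_from = sex, values_from = avg_height)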
You can see what's happened is that our pivoted data set has got fewer rows (3 vs 6) and whereas before we had 'species', 'sex' and 'avg_height' as the columns, we've now got 'species', 'female' and 'male'. The 'sex' column has disappeared but the values from it ('male'/'female') have become our new column names which is the 'names_from=sex' at work. The values of those new columns are taken from 'avg_height' which is what the 'values_from=avg_height' does.
We can do the inverse pivot which takes wide data and makes it longer using pivot_longer(). The main options to be aware of here are:
- cols - this is where we tell tidyr which columns we want to convert into rows
- names_to - this names the column that stores the values of the columns we are pivoting
- values_to - this is the column name that records the values from the columns we're pivoting
Let's see it in action:
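# 'wide' data: one column per variable, counting its missing values
wide_data <- starwars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

wide_data %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "number_of_missing")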
In this case our 'wide data' is a count of missing values across each column. When we call pivot_longer() we tell it that we want to pivot all of the columns by using 'cols=everything()'. We record the names of all the individual columns we've included in our pivot in a new column called 'variable', which is what names_to='variable' does. Finally, we record all of the pivoted values in a new column called 'number_of_missing' via values_to='number_of_missing'.
Congratulations!
Well done on making it this far. Hopefully you now feel confident data wrangling with dplyr!
If you'd like some extra reading make sure to check out Hadley Wickham's free online book on using the tidyverse for data science or you can buy a copy here. If you want to brush up on your base R or learn more about how R works behind the scenes you can check out this Introduction to R post.
Rstudio also have loads of useful articles on dplyr and the tidyverse as well as this handy cheat sheet of all the main dplyr functions. If you want to see how you can do similar data wrangling in python check out this post on data wrangling using the pandas package.