Data wrangling with pandas tutorial
In this post you'll learn how to use the pandas package in python to explore, select, filter and sort your data, create new variables and produce summary statistics.
Contents
What is pandas?
Pandas is the main data manipulation package in python. It was created by Wes McKinney and first released in 2008. It provides a huge range of tools for the manipulation of data. The pandas documentation is also some of the best I've ever seen and well worth checking out. It even includes handy comparisons between how pandas operates compared to other languages like R, SQL or SAS.
"pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive." - pandas.pydata.org
For this tutorial we're going to be using the Starwars data set which contains information on 87 different characters from the starwars universe. It usually comes loaded as part of the dplyr package in R but a chap called Vincent kindly uploaded it to github along with all the other R data sets. If you fancy seeing the equivalent of this post for the dplyr package in R you can check out this post: data wrangling with dplyr. But for now let's go ahead and load the pandas and numpy packages and read in the data:
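Something like the following should do the trick (the exact URL is an assumption - any copy of the dplyr starwars CSV will work - and index_col=0 simply keeps the original R row numbers as the row labels):

```python
import pandas as pd
import numpy as np

# read the starwars csv from the Rdatasets repo (URL assumed)
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/dplyr/starwars.csv"
starwars = pd.read_csv(url, index_col=0)  # keep the R row numbers (1-87) as row labels
```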
Exploring the data
Pandas has a ton of useful methods to investigate the data. We can start by calling head() and tail() to see a few rows from the data set. By default these display 5 rows when called but you can change this to any number of rows within the brackets e.g. head(10):
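For example:

```python
starwars.head()    # first 5 rows
starwars.tail(10)  # last 10 rows
```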
As well as seeing snippets of the data, we can quickly get some top line summaries about the size and shape of our data. Data frames in pandas have the shape property which we can call to tell us the number of rows and columns in our data set:
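```python
starwars.shape  # (number of rows, number of columns)
```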
If we want a bit more info than just rows and columns, the info() method provides a summary of the size and shape of the data set as well as a summary of each column including the data type and how many rows have non-missing data. From this we can already see that not every column has the full 87 rows and so our data set has some missing data:
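```python
starwars.info()
```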
We can see that mass and birth_year have the most missing data. Later on we'll see how we can use pandas to impute or replace missing data.
We can also start to look at the distribution of our data using describe() which generates some descriptive statistics for each of the columns. By specifying the include="all" option we can generate summaries for string/character as well as numeric columns. For numeric columns we get the count of non-missing values, the mean, standard deviation, min, max and quartiles. For strings we get a count of non-missing values, the number of unique values in the column, the first value in the column and the most common value. The numeric metrics are NaN for string columns and vice versa.
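```python
starwars.describe(include="all")
```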
Sorting your data
As well as getting handy summaries of our data, we can use the sort_values() method to sort our data and see the first and last rows. For example, say we're interested in who's the shortest character in the data set. We can use the sort_values() method and tell it we want to sort by the 'height' column:
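Something like:

```python
starwars.sort_values('height').head()
```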
We can see that Yoda is the shortest character followed by Ratts Tyerell.
What if we want to know who's the tallest? We can run the same query but this time tell pandas that we don't want to sort ascending i.e. to sort descending instead. We do this by adding "ascending=False" into our method:
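```python
starwars.sort_values('height', ascending=False).head()
```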
Now we can see that Yarael Poof and Tarfful are the two tallest characters.
We can pass multiple sort conditions as a list too. Say we wanted to find the shortest and then tallest characters for each homeworld. To do this we put 'homeworld' first in our list to sort by and then 'height' to sort by height after sorting by homeworld.
By default sort goes in ascending order so we'll get homeworld A to Z and then height short to tall. To get homeworld A to Z and height tall to short we need to provide another list to the ascending option telling pandas which elements from our first list we want to be ascending or not:
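```python
# homeworld A-Z, then tallest to shortest within each homeworld
starwars.sort_values(['homeworld', 'height'], ascending=[True, False])
```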
Renaming columns
As well as looking at the underlying data we might want to rename or change the format of some of the column names. We can access the column names of a data frame by using .columns which we can then modify and overwrite the original with. For example, as the column names in python are case sensitive, let's say we wanted to change them all to be upper case:
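```python
starwars.columns = starwars.columns.str.upper()
```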
Alternatively we might find all caps a bit shouty and instead want to have them all in lower case:
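```python
starwars.columns = starwars.columns.str.lower()
```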
We can also rename columns by using the rename() method and passing a dictionary of the new and old names. This takes the form of data_set.rename(columns={'old_name1' : 'new_name1', 'old_name2' : 'new_name2'})
Selecting and dropping columns
There are two ways to select individual columns from a data set in pandas. The 'dot method' which is simply dataframe.column_name and the other uses square brackets [ ]:
The dot version is a bit easier to write but can get into issues when the column name has things like spaces in it or shares the name of an actual python method. For example what happens if we rename the "mass" column to be called "mean" and then try to select it?
The dot method fails in this instance as python thinks we're calling the mean method on the data frame. However the square brackets method does work and successfully brings back our 'mean' column from the data set. This extra robustness is why the square brackets approach is preferred although it's good to be aware of the dot method as you'll see it get used a lot out in the wild.
Up until now we've only been selecting one column but what if we want to select more than one? To do this we can reuse our [ ] notation but instead of having just the single column name we can pass a list of names. Remember that lists are one of the four main data types in python and can be used to store multiple items in one place. We construct our list using [ ] so to select multiple columns we can write:
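For example (the columns picked here are just for illustration):

```python
starwars[['name', 'height', 'species']]
```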
The first set of [ ] is the same as before and effectively tells python to get ready to select some columns, the second set of [ ] inside then creates our list of the columns we'd like returned. Notice that by passing a list using [[ ]] we get a data frame returned rather than a series like when we selected 1 column using a single [ ]. We can call the two side by side to see the difference:
Our series has a length (87) which is the number of observations in it whereas our data frame is tabular in nature with 1 column called 'name' and 87 rows of data.
As well as writing out our list, if we wanted to reuse it multiple times in the same piece of code we could save it as its own object which we can then call directly. For example:
If we want to drop columns from our data set we can do it in a similar way too. We simply call drop() and tell it the columns we want to drop as a list:
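A quick sketch (the columns dropped here are just examples):

```python
starwars.drop(columns=['skin_color', 'eye_color'])
```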
Slicing rows
Pandas offers the slicing function to allow us to quickly select rows from our data frame. We simply tell it what row numbers we're after. We can do this by again using [ ] but instead of writing out column names we pass in the range of row numbers we're after. Remember that python starts counting from 0 and that the last number in a range is exclusive i.e. not included. This means to get the first three rows of data we write the range 0:3 - it looks like we're asking for four rows (0 through to 3) but the row at position 3 (the 4th row) won't actually get returned:
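```python
starwars[0:3]  # rows at positions 0, 1 and 2 - position 3 (the 4th row) is excluded
```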
If we want to select all rows up to or from a certain point we can leave one half of the range blank e.g. to select the first 10 rows we can simply say that we want all rows up to 10:
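```python
starwars[:10]  # everything up to (but not including) position 10
```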
We can use negative numbers to specify ranges starting at the end of the data set. For example to get the last 3 rows we can combine -3 with : to ask for all the rows from the third last till the end of the file:
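```python
starwars[-3:]  # the last 3 rows
```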
Chained indexing and SettingWithCopy warning
So far we've been selecting either rows or columns. To do both at once we might try combining our previous two approaches:
This gets us the result we're after and can be a handy way to view the data but will cause issues further down the line. Having two sets of [ ] next to each other is known as 'chained indexing' and causes problems when we try to use it to assign new values to our selection. For example let's say we want to replace the name of 'Luke Skywalker' with 'Mr Skywalker'. If we try to assign new values using chained indexes we get a 'SettingWithCopy' warning:
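The offending pattern looks something like this (a sketch of the kind of assignment that triggers the warning):

```python
# chained indexing - raises SettingWithCopyWarning and may not change starwars at all
starwars[starwars['name'] == 'Luke Skywalker']['name'] = 'Mr Skywalker'
```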
From the pandas documentation: "assigning to the product of chained indexing has inherently unpredictable results". This is because sometimes pandas will return a copy of the data frame but other times you'll be modifying the original data set. When chained indexing is used there's no way for pandas to guarantee which is being used and so any assignments we make might not behave as expected which is why we get the warning.
The correct way to subset on rows and columns in a safe and robust manner is actually included in the error message too: 'Try using .loc[row_indexer,col_indexer] = value instead'.
Filtering columns and rows using .loc and .iloc
Both loc and iloc allow us to select and filter columns and rows in a way that is safe to use with assignment. The main difference between loc and iloc is how we select the rows/columns, either by name/label (loc) or by index (iloc). The easiest way to remember which is which is that loc stands for label based indexing i.e. we reference rows/columns by their names whilst iloc is for integer based indexing i.e. we use the row/column index.
We can select our rows/columns just like we've been doing with either a single value, a list of values or a slice range. Both loc and iloc also work with boolean/logical arrays which we'll cover later.
One thing to keep in mind when the row labels are numbers rather than say text labels is that although the row labels are numeric they might not align perfectly with the underlying table index. For example in our data set the first row has the label 1 but its row index is still 0.
This means depending on whether we're using loc or iloc to access the first row we need to refer to it slightly differently:
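Assuming the row labels run 1-87 as described above:

```python
starwars.loc[1]   # the row whose label is 1 (the first row)
starwars.iloc[0]  # the row at position 0 (also the first row)
```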
Let's run through some more examples using labels and loc:
In each example we've used the text labels for the columns e.g. 'name', 'species', etc. but used numbers for the rows. As mentioned this is because our data set has numeric row labels for us to refer to although these are different to the row indexes. If our data had text labelled rows then we'd need to refer to these in the same way as we did for the columns:
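A sketch of how that copy might be created (the starwars_copy name is reused in the examples that follow):

```python
starwars_copy = starwars.set_index('name')
starwars_copy.head()
```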
See how the row index is now the name of the characters and no longer a number. If we wanted to select the first three rows now, we'd need to know what their index label is and refer to it in a list like we did for the columns:
We'll tend to use .iloc[] when we're less interested in selecting specific characters and more in selecting specific rows/columns because of their location in the data. For example using the same starwars_copy data with the named index if we just want the first 3 characters and first 2 columns we can write this very simply without having to look up any labels:
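```python
starwars_copy.iloc[0:3, 0:2]  # first 3 rows, first 2 columns
```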
This is helpful because not only does it offer a concise expression of what we're doing but the selection criteria are robust against changes in the underlying data. For example what if we change the order of the data set but still want to get the first 3 rows? Using loc we'd need to rewrite our filter to be the new character names appearing at the top of the data set. But with iloc we don't need to change a thing:
Filtering using boolean arrays (True/False)
Up until now we've been filtering based on labels or indexes but often we'll want to filter our data based on some sort of condition. This can be done in pandas using boolean arrays. 'Boolean' just means that the values can only be True/False and an array is just a sequence of values.
We can construct our boolean arrays by applying logical conditions to our data. Below are some of the logical operators available to use in pandas. One thing to note with pandas is that for logical operations it treats NA values as False which is different to other languages such as R.
To show how filtering by boolean arrays works, let's construct one a step at a time. First we'll use the height column which records the height of every character in cm. Let's say we want to eventually filter to only bring back characters that are over 200cm tall. First we'll use our column selection skills to bring back the column we want to filter on:
We can then turn our series of all the heights into a boolean array by applying a logical condition against it. We want to know when a character's height is over 200cm so we can apply this condition to the series:
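```python
heights = starwars['height']  # the series of heights
heights > 200                 # True where height is over 200cm, False otherwise
```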
Now we have our boolean array! Rather than a series of all the heights we've converted it into a series of True/False - True when the value was >200 and False when it wasn't. We can see from our previous table that index 4 had the value of 202cm and indeed index 4 in our boolean array is 'True'. We can also see that indexes 83-86 where the height was NA are also recorded as 'False' in our array.
Now we have our boolean array, we're able to pick out each row in our data frame where the condition height>200 is 'True'. We can then use this to filter our original data set. We do this by passing our boolean array like we would any other list of values inside [ ]:
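```python
starwars[starwars['height'] > 200]
```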
We can combine the boolean filter with other types of filters as both loc and iloc accept boolean arrays. For example we can filter for characters taller than 200cm and then only bring back their name and height:
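```python
starwars.loc[starwars['height'] > 200, ['name', 'height']]
```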
The boolean array acts as a row label/index to only bring back rows where height>200cm is 'True' and then we filter the columns we want by passing a list of column labels.
Let's try creating filters using some of the other logical operators available to us. For example what if we wanted to filter for characters that are either Wookiees or Droids? We can do this by using the .isin() method and pass it a list of the different species that we're interested in:
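```python
starwars[starwars['species'].isin(['Wookiee', 'Droid'])]
```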
We can use isna() to find rows that have missing values. For example to find characters that have a missing height:
We can also use ~ to negate our boolean array e.g. filter for where our condition is False rather than True. For example what if we wanted to find all the characters whose height is not missing? We can use the same query to find those with missing heights but then negate it with ~ to return those where it's not the case their height is missing:
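```python
starwars[starwars['height'].isna()]   # characters with a missing height
starwars[~starwars['height'].isna()]  # characters whose height is not missing
```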
Just like how we saved our list of columns into an object earlier, we can do the same with our boolean array. Say we wanted to keep a record of tall characters (height >200cm) rather than keep retyping the filter we can save it into an object and then call it directly as part of our filter:
We can also combine our logical operators with | (or) and & (and) to create more complex filtering conditions. For example what about characters that are tall and human? We can do this by constructing two boolean arrays and joining them with '&' to show we want them to be >200cm and Human:
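```python
starwars[(starwars['height'] > 200) & (starwars['species'] == 'Human')]
```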
Notice how we have to put brackets ( ) around each of our conditions. This is the same even if we wanted to use our boolean object instead of writing out the conditions:
Combining conditions allows us to create more and more complicated filters but can result in a lot of typing. In the first example we needed to type out the name of our data set 'starwars' 3 times! Once to tell python to filter it and once again for each filter condition we created. If we had 5 conditions we'd have to write it out 6 times! In the second example we only had to type 'starwars' twice as one of our conditions was stored in an object but we still had to create that object earlier in the code.
Creating more complex filters with query()
Often the data set we want to filter is the same as the one supplying the columns for our filter conditions. For example we wanted to filter the starwars data set using the height and species column from the starwars data set. Thankfully pandas provides us a handy way to create more complicated filter conditions and save us some typing.
The query() method allows us to write out our filter conditions as a string that are then applied to our data frame. This can save us a lot of typing. For example compare the query() version of our 'tall human' filter to the original below:
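```python
starwars.query('height > 200 and species == "Human"')
```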
The time saved typing becomes even more apparent when we want to create more and more complicated filters. For example what if we want humans from Tatooine that are between 100-200cm tall?
The only thing to watch out for when using query is that it needs a slightly different way of finding NA values. The isna() method doesn't work inside query() but we can make use of an interesting quirk in how python resolves operations involving NA values. Remember that when applying logical conditions to them they return False? We can use this bit of knowledge to create a missing value filter that works with query:
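One way this can look (a sketch relying on NaN never being equal to itself):

```python
# NaN never equals itself, so this returns the rows where height is missing
starwars.query('height != height')
```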
Creating new columns
Just as we saw how we can select a column using dataframe["col_name"] type notation we can use the same format for creating new columns. To create a new column we can simply type data_frame["new_column"] = and then whatever it is we want our new column to be. For example let's say we wanted to create a new column that records each character's height in feet:
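```python
starwars['height_ft'] = starwars['height'] / 30.48  # 1 foot = 30.48cm
```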
We can borrow the where function from numpy to create new columns based on conditional logic. Remember at the start we imported the numpy package as np? When we want to call functions from specific packages we can write package.function. So for 'numpy where' we write this as np.where().
np.where() works similarly to an ifelse() in R or if() in excel. We feed it a column from our data frame with a logical condition to test, the value if that condition is true and then the value if it is false. For example, now we have our height_ft column, we can flag which characters are over 6ft tall:
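```python
starwars['over_6ft'] = np.where(starwars['height_ft'] > 6, 'over 6ft', 'under 6ft')
```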
Inside np.where() we first have the condition to test e.g. starwars["height_ft"]>6 and then the comma tells python what to put in the column if that condition is true with the next comma telling it what to put if the condition is false. For example we could create a binary flag for anyone that has the phrase 'Skywalker' in their name:
We can link multiple np.where() clauses together to create multi condition clauses. In the example below we flag any character over 200cm as 'Tall' but for those that aren't we then have another np.where(). This then says that anyone <100cm is 'Short', otherwise, say they're 'Average':
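```python
starwars['height_group'] = np.where(starwars['height'] > 200, 'Tall',
                            np.where(starwars['height'] < 100, 'Short', 'Average'))
```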
At the moment we've been creating each new column in a separate query e.g. create a height in feet column in one query and then create a flag for characters over 6ft in another etc. We can actually create multiple new columns in one go using assign().
Even better is that pandas is capable of creating new columns that refer to columns created in the same assign. We just have to reference them in a slightly different way. Let's have a go at creating all 3 new columns in one assign step. For simplicity we'll only select character name and height from our data frame to begin with:
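A sketch of what that single assign step might look like (the lambda for over_6ft is explained below):

```python
(starwars[['name', 'height']]
 .assign(height_ft=starwars['height'] / 30.48,
         height_group=np.where(starwars['height'] > 200, 'Tall',
                       np.where(starwars['height'] < 100, 'Short', 'Average')),
         over_6ft=lambda x: np.where(x['height_ft'] > 6, 'over 6ft', 'under 6ft')))
```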
The first two new columns 'height_ft' and 'height_group' make use of the existing 'height' column in the data frame. The final 'over_6ft' flag however uses the 'height_ft' column that gets created as part of the assign. This means it's not readily available in the same way to use like 'height' was and we instead need to use a lambda function to refer to it.
A lambda function is a so called 'anonymous function' in python. Anonymous as it is not defined and given a name like we would normally for a function. They're often used when we want to make an ad hoc function on the fly. Lambda functions take the form of: "lambda arguments: expression". You can think of 'arguments' as the things we might feed into our function e.g. the data set we want to call the function on and the 'expression' is the activity we want the function to perform.
In our case we want the expression to be our np.where to flag which characters are >6ft and our argument is the data set we are using in our assign. We can represent our 'starwars' data set with a placeholder value of x (any value would actually do). By using a function it allows us to reference the 'height_ft' variable which is created as part of the assign but doesn't yet exist on the starwars data set itself.
When creating new columns we can also use summary functions either on their own or as part of a longer process. For example, if we wanted to, we can create a new column that records the average height across the data set:
This isn't super helpful on its own but it's handy to know that pandas is clever enough to not just simply calculate the average height but to also populate it for each row in our data frame. This is because the average height is a 'scalar value' i.e. a single value so pandas has no qualms about repeating it for each row in our data set.
We can make use of this when we want to use population level summary statistics to create new features. For example, rather than deciding cut off values ourselves for which characters should be classed as Tall, Short or Average we can base it on how much taller or shorter they are than average. We might decide that any character who is 20%+ taller than average is 'Tall' and anyone 20%+ shorter than average is 'Short'. We can do this easily by incorporating the mean() method into our new feature creation:
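A sketch of that 20% rule in action:

```python
avg_height = starwars['height'].mean()

starwars['height_group'] = np.where(starwars['height'] > avg_height * 1.2, 'Tall',
                            np.where(starwars['height'] < avg_height * 0.8, 'Short', 'Average'))
```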
Window functions with transform()
Comparing character heights to the average gives us a good indication of which characters could be classed as taller/shorter but another useful comparison might be to compare heights within the same species. This way although we know ewoks tend to be shorter than average we could still flag characters who might be "tall for an ewok" or "short for a wookiee".
To perform this calculation we once again need to calculate average heights but this time we want to do it by a group i.e. 'species'. We also want our output to be the same size as our original data set rather than shrink/be aggregated. These types of calculations are called window functions: "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row".
To do this in pandas we can use groupby() and transform(). The groupby() allows us to tell pandas we want to create a summary by groups and transform() always returns a data frame the same size as the original so with both together we can create our window function. To show how they work it's easiest to build up the query in stages:
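```python
# average height per species, returned with one value per row rather than one per group
starwars.groupby('species')['height'].transform('mean')
```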
We can see from the output that we have 87 rows so the size of our output is the same as our data frame. We can also see that the average height values vary so we've successfully calculated different average heights.
We can then add this new average species height calculation back to our original data frame like we would when creating any other type of new variable:
We can see now that the repeating values correspond to characters from the same species e.g. all Humans have an average species height of 176.65cm whereas for Droids it's 131.2cm. If we wanted to repeat the exercise of calculating each character's height relative to the average we can do so by dividing each character's height by our groupby/transform query:
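A sketch of how that might look, reusing the 20% cut-offs from earlier (the column names here are just illustrative):

```python
starwars['avg_species_height'] = starwars.groupby('species')['height'].transform('mean')
starwars['relative_height'] = starwars['height'] / starwars['avg_species_height']

starwars['species_height_group'] = np.where(starwars['relative_height'] > 1.2, 'Tall',
                                    np.where(starwars['relative_height'] < 0.8, 'Short', 'Average'))
```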
We can now see that C-3PO who was only "Average" height compared to the total population is now classed as "Tall" when compared to other droids.
Aggregating data
Up until now when we've been calculating summary statistics we've been preserving the overall size and shape of our data set i.e. it stays 87 rows long. Now we're going to be looking at summarising our data set into a smaller, aggregated version. We can perform a lot of simple aggregations on our data using pandas' built in summary methods. By default they also exclude missing values which is handy. We just select the columns we want to use and then call the summary method on them:
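```python
starwars[['height', 'mass']].mean()
```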
If we want to perform multiple summary statistics at the same time we can use the agg() method. We simply take our list of selected columns and call agg() where we pass another list of all the different statistical summaries we want to calculate:
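```python
starwars[['height', 'mass']].agg(['mean', 'median', 'min', 'max'])
```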
Notice that the mean mass is quite a bit higher than the median? This suggests we've got a particularly heavy outlier that is skewing our data (any guesses who this could be?). We'll explore this at the end when we put all our new data wrangling skills to the test.
We can actually calculate different summary statistics for different columns if we want to. We do this by passing a dictionary of terms to the agg method that tells pandas which column we want and the corresponding statistic:
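A sketch of the dictionary approach (the column/statistic combinations here are just for illustration):

```python
starwars.agg({'height': ['mean', 'median'],
              'mass': ['mean', 'median'],
              'birth_year': ['mean', 'min']})
```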
The downside of doing it this way is that for columns where we didn't call the same summary function they'll have NaN e.g. we didn't call 'min' for height or mass and so the min values are missing for those columns. Since we called 'mean' on all our columns this row is populated for each in the output.
Conditional aggregations
Conditional aggregations allow us to create summary statistics for data that meet certain conditions. For example let's say we wanted to know how many characters in our data set are over 200cm tall. We can filter the data set and then count the number of rows:
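```python
len(starwars[starwars['height'] > 200])
```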
We can construct more complicated conditions in the same way we did when we were filtering our data:
Like before we might prefer to use query() to save us some typing:
The pre filtering works well if we just want the summary to be about one group of data e.g. where our condition is true. But what about if we want to do different calculations all with slightly different criteria? e.g. if we wanted to count how many characters are tall, how many are short and how many are average height?
We could do 3 separate filters and aggregations or we can be a bit clever in how we set up our query. Rather than filter the data and count the results we can instead use assign() to make some new temporary columns that are either 0/1 when our condition is met and then sum those columns to get our counts:
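A sketch of that approach:

```python
(starwars
 .assign(tall=np.where(starwars['height'] > 200, 1, 0),
         short=np.where(starwars['height'] < 100, 1, 0),
         average=np.where(starwars['height'].between(100, 200), 1, 0))
 [['tall', 'short', 'average']]
 .sum())
```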
Aggregating within groups with groupby()
We've already seen groupby() at work when we were creating our window functions. Now to create summary stats by group, rather than use transform to preserve the number of rows, we can call our summary method directly:
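```python
starwars.groupby(['species'])['height'].mean()
```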
Once again the groupby() tells pandas which groups we want to use to calculate the summary by e.g. groupby(['species']), we then specify which columns to use in the summary e.g. ['height'] and then finish it off with our summary method e.g. mean().
To get an output similar to our Tall/Short/Average counts, rather than create 3 separate columns and then sum them, we can instead make a new column with the classification in and then group by that:
We can also create summaries across all columns of a specific type by calling our method straight onto our grouped data set. For example to calculate the median value of all numeric features by gender we can run:
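For example (newer versions of pandas need numeric_only=True to skip the text columns):

```python
starwars.groupby(['gender']).median(numeric_only=True)
```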
Notice on the output that the two values of gender have now become the new index labels. We might prefer to keep the index label as an integer and we can do this by calling reset_index() at the end of our aggregation:
Resetting the index can be particularly useful when grouping by more than one column at a time:
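```python
starwars.groupby(['gender', 'homeworld'])[['height']].median()
```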
For those familiar with Microsoft Excel, this output looks more like something we might get from a pivot table. We'll also cover the pandas implementation of pivot table later. For now though we can use the reset_index() to 'flatten' the table and fill in our gender index:
This improves the table but we still have the floating row for height median. Unfortunately there's no automated way to remove this so we have to add in an interim step that renames the column:
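A sketch of the full chain:

```python
(starwars
 .groupby(['gender', 'homeworld'])[['height']]
 .median()
 .rename(columns={'height': 'height_median'})
 .reset_index())
```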
The renaming of the column comes before the reset index as this means we only need to pass it new names for our aggregations e.g. 'height_median'. You could rename the columns after resetting the index but then you'd need to specify names for 'gender' and 'homeworld' too.
Removing duplicates
Often we'll be working with data that's messy or the output from a process involving multiple different systems and as such can often contain duplicate values. Pandas has a couple of handy functions for dealing with duplicate data whether they are true duplicates e.g. the entire row of data is duplicated or if certain key columns have duplicates e.g. duplicate occurrence of what are meant to be unique identifier keys. First let's check whether our data has any entirely duplicated rows in it:
We call the method duplicated() on our data frame and it returns a boolean array with True whenever there is a row that is identical to another row in the data frame. We can sum this array to get the total number of duplicates:
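```python
starwars.duplicated().sum()
```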
So we can see that there are no true duplicates in our data and that each row is unique. We can manufacture some duplicates by appending the data set to itself and rerunning the test:
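A sketch using pd.concat to stack the data on top of itself (the older append() method has since been removed from pandas):

```python
starwars_dupes = pd.concat([starwars, starwars])
starwars_dupes.duplicated().sum()
```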
Now we can see we've got 87 duplicates which makes sense as there were 87 rows in our original data frame. We can remove these by calling the drop_duplicates() method on our data frame:
We can see that the duplicates have been successfully removed and we're back to having 87 unique rows again!
Often the duplicates in our data might not be completely true duplicates i.e. not every value across all the columns is the same but it might be that a supposedly unique key is duplicated even if other values in the row are different. For example if our process relies on manual data entry where each new row is meant to be assigned a new ID it is easy to imagine someone losing count and accidentally inputting the same ID twice or making a typo and recording say row 111 as 11 which would cause our supposedly unique key to no longer be unique.
We can test and deal with these instances in the same way we did with the true dupes but instead of passing our entire data frame to duplicated(), we just use the columns we're interested in. Let's edit our data slightly to introduce a duplicate in the 'name' column but in a way that isn't a true duplicate of the entire row:
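One way to manufacture that situation (the row edited here is chosen purely for illustration):

```python
starwars_dupes = starwars.copy()
# overwrite one row's name so 'C-3PO' appears twice without the whole row matching
starwars_dupes.loc[starwars_dupes['name'] == 'R5-D4', 'name'] = 'C-3PO'
```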
As it's just the name column that's duplicated, if we call our duplicated() on the entire data frame it won't get picked up as not all the values are the same. What we can do instead is just pass the column we want to check for duplicates using the 'subset=' parameter.
The way duplicated() works is that the first occurrence of a duplicate isn't flagged but any subsequent occurrences are. We can tell pandas whether to flag the first, last or all occurrences as duplicates with the (keep=first/last/False) parameter:
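```python
starwars_dupes.duplicated(subset=['name'], keep='first')  # flag 2nd and later occurrences only
starwars_dupes.duplicated(subset=['name'], keep=False)    # flag every occurrence
```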
To see the full output of our duplicated rows we can filter our data frame with our boolean array. For example we might want to see both rows returned to help us decide which one we want to keep:
Then using our subject matter expertise we know that C-3PO is a relatively tall character who comes from Tatooine so that gives us confidence that the first occurrence is the correct one and that we want to drop the second duplicate.
We can do this simply by negating our boolean array of duplicates. Whereas until now we've been returning True for dupes and False for unique values, we can simply use the opposite of this to filter e.g. filter where it's not true that the row is a duplicate. As we want to exclude the second row we can tell our duplicated() that we want to keep the first occurrence:
Dealing with missing data
As well as duplicates, it's common to have missing data in our tables. We've already seen how to find missing data with isna() and we can combine this with other methods to get a handy summary of all the missing data in our table.
Here we can use isna() to return True/False for missing, add up all the missing records by column with sum() and then sort to get the column with the most missing at the top.
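```python
starwars.isna().sum().sort_values(ascending=False)
```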
We've a couple of different ways of dealing with this missing data. The dropna() method allows us to remove rows or columns with missing data. By default it removes any rows with missing data:
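```python
starwars.dropna()
```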
This only leaves us with 29 rows of data though which is quite a reduction! We can change the strictness of the criteria by changing some of the options within dropna(). For example using the 'how' option we can decide when to drop rows e.g. how='all' which tells pandas only to drop rows where all the values are missing:
We can see that this time we get the full 87 rows so we know that none of our characters have entirely missing data. This is good to know but hasn't really helped us with our missing data. If we think dropping a row with any missing data is too strict but only dropping rows that are 100% missing is too lax is there a middle way?
We can use the 'thresh' parameter to provide a threshold of missingness that removes any rows that have missing data above our specified amount. For example we have 11 columns in our data set so we might want to keep rows that are not missing in 10 or more columns:
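```python
starwars.dropna(thresh=10)  # keep rows with at least 10 non-missing values
```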
This time we return 67 rows which seems like a good compromise. We can also make our condition slightly more generalisable by replacing the hard coded '10' columns with a formula. Say we want to keep any rows that have 80%+ of their columns populated:
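```python
starwars.dropna(thresh=len(starwars.columns) * 0.8)
```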
The starwars.columns returns all of the column indexes in the data frame which we can then count using len(). We then multiply this total number of columns by 0.8 to create our 80% of columns threshold.
By default dropna() works by looking at missing values by row but we can use the axis parameter to change it to look at columns instead. For example, say we're building a model and want to exclude any columns that have 20%+ missing data from the feature set. We can adapt our previous code to easily implement this restriction by telling pandas that we now want to look at column completeness. We do this by setting axis=1 and we use len() on the index to get the number of rows in our data frame which we then multiply by 0.8:
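```python
# drop any column that is less than 80% populated
starwars.dropna(axis=1, thresh=len(starwars.index) * 0.8)
```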
We can see that of the 11 columns in our data set only 9 of them pass the threshold of being 80% complete.
As well as dropping missing data we might want to replace it with something else instead. Replacing missing values is often known as 'imputing'. For example for something like 'species' we might want to record the fact the data is missing but not want to keep it recorded as explicitly na due to how pandas then uses those values in calculations like groupby().
A simple fix is to replace all na occurrences in species with the string 'Missing' instead. To replace na values pandas has the handy fillna() method:
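```python
starwars['species'].fillna('Missing')
```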
As well as just using a single value to replace missing values we can actually use formulas too. For example we can replace any missing height values with the average from the population:
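```python
starwars['height'].fillna(starwars['height'].mean())
```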
We don't even have to use the fillna() if we don't want to. For example, with a bit of cleverness we can construct our own imputation. Say, for instance, rather than use the somewhat crude population average for missing height values we wanted to use the average species height as we think this'll be more accurate.
This works in most cases but there are some characters that are the only example of their species and their height is missing! This means that we'll still get na for those characters. For those few cases we might decide that using the total data set average is ok. We can construct this imputation ourselves by combining multiple np.where() clauses together:
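A sketch of that imputation (the height_filled column name is just illustrative):

```python
species_avg = starwars.groupby('species')['height'].transform('mean')

# use the character's own height if present, else their species average, else the population average
starwars['height_filled'] = np.where(starwars['height'].notna(), starwars['height'],
                             np.where(species_avg.notna(), species_avg,
                                      starwars['height'].mean()))
```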
We can see the difference between this approach and just using the total population average. Captain Phasma keeps the population average height for example but Rey, Poe Dameron and BB8 now have the average height of their species which differs from the total population height.
There's a lot going on in the query so I've tried to write a description of what each section is doing and how it relates back to the code in python:
Ultimately there are three things we're trying to achieve with the query: if the character has a valid height then to use that, if it's missing then to try and replace it with their species average height but if that would still be missing then to use the population average. We can do all this with our np.where(), we just need to be a bit clever about the order we test our conditions.
Reshaping data with pivot_table()
For those who've used pivot tables in Microsoft Excel the pandas package provides something very similar with the pivot_table() method. It allows us to reshape our data and also perform various aggregations on it. It can be a handy way to quickly gain insight into data as well as offering a shortcut to reshape our data from long to wide.
The basic options for pivot_table are below. You can read the full list of parameters in the pandas documentation:
A quick summary of the different options:
- pd.pivot_table = tells python we want to use the pivot_table function specifically from the pandas package
- data = the data frame we want to pivot on
- values (optional) = the column whose values we want to use in our calculation e.g. the thing we want to sum/average/etc
- index (optional) = tells pandas which column we want to take our row values from in the final output. This performs a similar job to telling pandas to group by this column.
- columns (optional) = the column whose values we want to form the new columns in the output
- aggfunc = the type of aggregation(s) we want to perform e.g. sum/mean/etc
- fill_value (optional) = what to replace missing values with
- sort (optional) = whether or not to sort the final table by the index
The easiest way is to see it in action. Say we wanted to look at the difference in average height by species split by gender:
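```python
pd.pivot_table(starwars, values='height', index='species',
               columns='gender', aggfunc='mean')
```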
Apart from a lot of species not having any female characters we can see that males are typically taller than females from the same species. Using .reset_index() can be useful here too:
Pivot tables can be very handy if we want to quickly aggregate our data to have a look at it. Say we want average height by species:
We can then quickly add in a second group e.g. homeworld:
We can even start asking pandas to calculate multiple statistics in one pivot although the headings start to get a bit messy:
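A sketch of what that might look like (saving the result as wide so we can tidy it up in the next step):

```python
wide = pd.pivot_table(starwars, values='height', index=['homeworld', 'species'],
                      aggfunc=['mean', 'median', 'count']).reset_index()
wide.head()
```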
It's that second row of data where we just have 'height' repeated 3 times that's the issue. This is actually a second index but we can use another handy method .droplevel() to remove it. Since it's the second index of the columns we want to remove we use the parameters: droplevel(1, axis=1)
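```python
wide.droplevel(1, axis=1).head()
```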
Much neater! As well as pivoting our data, which generally aggregates and widens it (in the above example we went from 3 columns - homeworld, species and height - to 5 columns), sometimes we want to do the opposite. We can do this with melt().
Unpivoting a data frame with melt()
Up until now we've been pivoting our data but it's also good to know how to unpivot it should the need ever arise. The melt() method can be thought of as the opposite of a pivot. It's usually used to take data that is wide i.e. lots of columns and make it long i.e. lots of rows. To demonstrate it we'll take 3 columns from the starwars data frame:
The syntax of melt is similar to pivot_table:
- id_vars = the column(s) we want as our row identifiers and will become the new row index
- value_vars = the columns we want to unpivot (if left blank it'll use all the columns that aren't id_vars)
- var_name = the name we want to give to the column that shows which columns we've unpivoted (defaults to 'variable')
- value_name = the name we want to give to the values/contents of the columns we've unpivoted (defaults to 'value')
As ever, it's easier to see it in practice so let's take our 3 columns and start melting! So at the moment we have 'name' as our row ID with 87 rows and 3 columns of data. If we start by melting just one of the columns we can see what's happening:
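A sketch, starting from the 3 column subset mentioned above:

```python
starwars_3cols = starwars[['name', 'height', 'mass']]

starwars_3cols.melt(id_vars='name', value_vars='height')
```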
So 'name' is still our unique ID and we have 87 rows of data but instead of having a column called 'height' with all the height values we now have a generic 'value' column and a new column 'variable' that tells us where that value came from i.e. the 'height' column.
If we do some more melting, this time including 'height' and 'mass' we can see how melt lengthens or 'stacks' the data to create the new table:
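```python
starwars_3cols.melt(id_vars='name', value_vars=['height', 'mass'])
```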
We've still got our 'name', 'variable' and 'value' columns but our output has doubled in size to 174 rows. This is because we have both 'height' and 'mass' being recorded in the 'variable' column. This is where melt() gets its name from. We've taken an 87 row and 3 column data set and now we've 'melted' each of the columns so they now sit below each other in our output.
Helper function apply()
Pandas has a handy helper function apply() which allows us to specify a function and have it applied to all the columns or rows in our data frame or series. This is very helpful and can save us lots of typing if we want to repeat the same function across lots of columns e.g. get the mean of all numeric columns. However it comes at the cost of speed so should be used sparingly.
Up until now we've mostly been using vectorised functions, so called as they work on the entire row/column i.e. vector of values at once. This allows for lots of optimisation to happen behind the scenes so it runs quickly. When we use apply() however, because pandas doesn't know what function we're going to be using, it works by applying it to a single value at a time making it a lot slower.
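A small illustration of the two flavours (column-wise on a data frame, element-wise on a series):

```python
# column-wise: np.mean applied to each numeric column (slower than the built-in .mean())
starwars[['height', 'mass']].apply(np.mean)

# element-wise on a series: the length of each character's name
starwars['name'].apply(len)
```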
Case Study: picking players for a basketball tournament
Time to put our new pandas skills to the test! Suppose we've been asked to help pick teams for a starwars universe basketball tournament. Both the droids and humans have enough players to field a whole team each whereas for the other species we'll make an "All-Star 5". As it's basketball we'll want to pick the tallest players first and then for any players tied on height use mass to pick the lighter player.
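One way the selection might be put together - the team split and tie-break logic below are my reading of the brief rather than the post's original code:

```python
teams = (starwars
         .assign(team=np.where(starwars['species'] == 'Human', 'Human',
                  np.where(starwars['species'] == 'Droid', 'Droid', 'All-Star 5')))
         .sort_values(['team', 'height', 'mass'], ascending=[True, False, True])
         .groupby('team')
         .head(5))  # tallest 5 per team, lighter player wins any height tie

teams[['team', 'name', 'height', 'mass']]
```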
If we have a look at the players selected we can see that the droids might struggle as they've 3 players all under 1m. The humans are all pretty consistent at ~1.9m but it looks like our All-Star 5 are probably the pre-tournament favourites with all their players 2.2m+ and the inclusion of two Wookiees.
To help our understanding of which team might be favourites, we might want to plot the results to make them clearer. Pandas actually contains a couple of basic plot functions to do precisely this!
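A rough sketch of the sort of plot this might be (matplotlib needs to be installed, and colouring by a numeric team code is an assumption):

```python
teams.plot.scatter(x='name', y='height',
                   c=teams['team'].astype('category').cat.codes,
                   colormap='viridis', rot=90)
```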
The results aren't super pretty(!) and there's also a bug that drops the labelling of the x-axis...but for a quick plot without needing any other packages we can see that the All-Star 5 do indeed appear to be the favourites with all their players (yellow dots) being the tallest. The humans might be second favourites and then the droids last with a couple of shorter players.
Case Study: finding the outlier
Remember when we were calculating our summary statistics and we saw that mean mass was higher than the median?
The median mass is the value from the character in the middle of the data set when it's sorted. The fact the mean is different from the median tells us that the distribution is not symmetrical. Since the mean is higher this tells us that we've got a particularly heavy character (or set of characters) that's skewing our data.
A common way to identify outliers is to find observations whose value is more/less than 3 standard deviations from the mean. We can use this same approach using assign to calculate the mean, standard deviation and then create our outlier flag:
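A sketch of that approach (the column names are just illustrative):

```python
outliers = (starwars
            .assign(mass_mean=starwars['mass'].mean(),
                    mass_std=starwars['mass'].std(),
                    outlier=lambda x: np.where(
                        abs(x['mass'] - x['mass_mean']) > 3 * x['mass_std'], 1, 0)))

outliers.query('outlier == 1')[['name', 'mass']]
```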
Unsurprisingly Jabba the Hutt is our outlier! We can see how much of an outlier Jabba the Hutt truly is by again creating a quick plot:
Congratulations!
Well done on making it this far! Hopefully you now feel confident data wrangling in python with pandas. If you'd like some extra reading the pandas documentation is definitely worth a look. The author of the pandas package Wes McKinney also has a book 'Python for Data Analysis' which you can buy here.
If you fancy seeing how to do similar techniques in the R language you can check out this post on Data wrangling with dplyr.