A step by step introduction to R
In this post you'll learn about the history of R, its different data types and structures, and how to create your own objects and functions, as well as how to install helpful packages such as the tidyverse from RStudio.
Contents
A brief history of R...
R is an open source language that has its roots in the 1970s language 'S' that was developed in Bell Labs. R has many applications in data science from data manipulation and statistical modelling to producing beautiful graphs and plots. It also benefits from a vast community of contributors who create and maintain all sorts of different packages which means R can be used for a huge variety of projects.
"R is a language and environment for statistical computing and graphics." - R-project.org
There's also RStudio, who not only make the integrated development environment you're very likely using to run R but also created and continue to develop an entire universe (known as the 'tidyverse'): a collection of different packages aimed at making working in R a lot easier by providing a consistent design and philosophy along with lots of powerful functionality.
Given its long history and huge array of community-created, open-source content, writing an introduction to R is a bit of a challenge and depends on how people like to learn. This post is designed to be a bit like reading the instruction manual, explaining the different data types and structures that form the building blocks of the R language. If you'd rather get stuck in, play around and start working with data, I'd suggest heading over to 'Data wrangling with the dplyr package' and then referring back to this if you get stuck.
In this post I'll mainly be sticking to what's commonly known as 'base R' i.e. R without any extra packages apart from at the end where I'll go into the tidyverse in a bit more detail.
Code, text and comments
R works straight out of the box as a calculator that we can use to solve mathematical equations. R recognises the standard mathematical operators (+, -, /, *, ^) and we can simply plug our equation in using these and R does the rest e.g.
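For example, a few sums of the kind the original examples would show (the numbers here are just for illustration):

```r
# R evaluates standard arithmetic expressions directly
5 + 6   # addition
10 - 4  # subtraction
12 / 3  # division
2 * 8   # multiplication
2 ^ 3   # exponentiation: 2 to the power 3
```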
However, if we try to pass R some raw text it'll assume it's code we want to run which will cause it to error:
We can get R to recognise our block of text (also known as a 'character string') by putting quotes around it. R accepts both single ' and double " quotes:
We can see that R has 'printed' our message 'Here is some text' in the console. It implicitly knew to do this since we didn't give it any other commands. This is fine if we just want to have a quick look at things in an ad hoc manner but when it comes to writing productionised code that's part of a longer program, it's best practice to be explicit about what we want R to do. We can do this by putting the print() function around our string which explicitly tells R we want our string printed:
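Something along these lines (the exact string is just an example):

```r
"Here is some text"          # implicitly printed when run on its own
'Single quotes work too'
print("Here is some text")   # explicitly printed - best practice in scripts
```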
The other way we can write text without R trying to run it as code is to tell R it's a comment using the hash # symbol. Comments are bits of text in our program that don't get run by R but can be read either by ourselves or other users as handy notes or reminders explaining what bits of the program are doing. R knows anything after the # is not to be run. If you need multiple lines of comments you need a # at the start of each one.
Creating objects with assignment <-
Up until now we've been running our code and printing our results to the console. R also allows us to save these outputs into objects that can be used repeatedly and persist until we remove them or close our session. One thing to note about how R saves objects is that they're all stored in memory. This makes them fast to access but can cause issues if you try to read in very large data sets!
In R we create objects using the assignment operator '<-' and there's a shortcut to create it in RStudio ('Alt' + '-'). It looks like an arrow: the bit on the left is the name of the object and the bit on the right is what we want to assign to it. Say for instance we wanted to save the output from our previous equation to an object called 'answer'. We'd write this as follows:
If we run it we see that nothing gets printed to the console this time but we can see a new object has been created in the environment called 'answer' which holds the value 11 i.e. the answer to our equation:
Rather than hold the equation itself in the object, R resolves what is on the right hand side and then saves that output into the object. So if we then ask R to run this new object we see that it now prints out the value 11:
This is super handy as it means we can use this new object in subsequent bits of code and because R resolves it to a number we can use it just like we would any other number in our code. For example let's say we want to add +5 to it:
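A sketch of the steps so far, assuming the earlier equation was 5 + 6 (which matches the value 11 mentioned above):

```r
answer <- 5 + 6  # nothing is printed; 'answer' appears in the environment
answer           # running the object prints its value: 11
answer + 5       # prints 16, but doesn't change 'answer' itself
```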
Like earlier though, we are only printing the value to the console when we run this code. The actual value of our object 'answer' is unchanged:
If we want to overwrite it we can do so simply by assigning a new value to it. This can even make use of the current value of 'answer' due to how R resolves the value and then assigns it:
So now we've overwritten 'answer' with the value 16. Let's create a second object and assign it the value of 4 and try adding both objects together:
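Continuing the sketch (object names here are illustrative):

```r
answer <- 5 + 6          # our original object holding 11
answer <- answer + 5     # R resolves the right-hand side first, then assigns
answer                   # now 16
another_object <- 4
answer + another_object  # 20
```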
This works as each object simply holds a number and R is clever enough to recognise this. It then allows us to treat them in our code as if they were numbers. The same principle holds for assigning any value to an object. For example:
We can create another object with a different text string and then combine them using paste(). Whereas before our objects were numbers, so we could use standard numeric operations such as '+' on them, now that our objects are strings they need string-specific functions.
Paste is simply an R function that concatenates (joins together) different text strings. The sep=' ' option specifies that we want to join our strings together with a space ' ' in between them. We simply call the two objects in paste separated by a comma to tell R which is the first string and which is the second string:
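For example, with two made-up strings:

```r
greeting <- "Hello"
name <- "world"
paste(greeting, name, sep = " ")  # joins the strings with a space between them
```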
R is super flexible when it comes to what you can store as objects. So far we've seen how we can store individual numbers and text strings. Next up we'll look at storing multiple elements inside an object and later on see how we can even store complicated functions inside of them.
Vectors
Vectors are one of the most common data structures in R and you'll find yourself using them a lot. They come in two flavours: atomic vectors and lists. We'll cover lists later so for now all references to 'vectors' can be treated as referring to 'atomic vectors'.
An atomic vector is simply a collection of values all of the same type e.g. all numbers, all strings, etc. collected into a single object. The different values that make up the vector are called its elements. For example, say we wanted to store a collection of 5 numbers in an object:
We can see that we've successfully saved our 5 values into our object 'my_vector' and when we run it we get a print out of the values. One slightly odd looking thing is the c( ) we used to create our vector. The c( ) stands for 'concatenate' and it simply tells R that everything inside the brackets should be understood as all being part of the same thing. You'll see this c( ) used a lot in R code.
As mentioned, vectors can only contain elements all of the same type, sometimes called 'homogeneous types' i.e. types that are all the same. Rather than throw an error if you try to create vectors with mixed types, however, R will try to be helpful by converting all the elements to what it thinks is the most appropriate type. For example if we try to include text and numbers in our vector, R converts all the elements into text:
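A sketch with some example values:

```r
my_vector <- c(1, 2, 3, 4, 5)  # five numbers collected into one object
my_vector
mixed <- c(1, 2, "three")      # mixing numbers and text...
mixed                          # ..."1" "2" "three" - everything is now a string
```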
If we want to select elements from our vector we can do this using square brackets [ ] and then specify the number or range of numbers for the elements we want to bring back. We can even feed in another vector of numbers to tell R which elements we want to bring back:
A single set of [ ] in R will always return an object of the same type as the one we used the [ ] on. In our case we're subsetting (selecting smaller pieces of the original) a vector and so what we get returned is a vector. We can also tell R to not bring back elements using the minus sign - with our [ ]:
We can modify elements in our vector by using our assign operator <- but rather than have a new object on the left, that we assign values to, we can assign values directly into the vector itself using [ ]:
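Putting those three ideas together (the values are just for illustration):

```r
my_vector <- c(10, 20, 30, 40, 50)
my_vector[3]        # the 3rd element: 30
my_vector[2:4]      # a range of elements: 20 30 40
my_vector[c(1, 5)]  # another vector of positions: 10 50
my_vector[-1]       # everything except the 1st element
my_vector[2] <- 99  # assign a new value directly into the vector
my_vector           # 10 99 30 40 50
```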
As well as recording our data in a vector, we can add names to each of the elements too. We can either do this by specifying them as part of the vector creation or adding them afterwards. There's even a handy names() function that brings back the names of the elements for us:
Alternatively, if we've already got a vector and we want to add names to it we can do this using the names() function. To do this we call names() on our vector and assign the new names into it:
Adding names to vectors opens up another possible way of subsetting them too. Up until now we've been using numbers to retrieve elements e.g. my_vector[3] to return the 3rd element. Now we've added names to our vectors, we can use them to subset our vector in exactly the same way, except now we ask for the name of the element we're after:
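A sketch of both approaches, with some hypothetical element names:

```r
scores <- c(maths = 90, english = 75, art = 60)  # names set at creation
names(scores)                  # "maths" "english" "art"
scores["english"]              # subset by name rather than position

unnamed <- c(1, 2, 3)
names(unnamed) <- c("a", "b", "c")  # or add names afterwards
unnamed
```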
Because all their elements are of the same type, lots of clever optimisation work can go on behind the scenes to make functions that work with vectors super speedy. Say for instance we want to sum up all the elements in our vector. We have a couple of options: we could write them all out and use R as a calculator, or we can pass our vector to the R sum() function:
R has loads of useful functions like these. We can easily calculate the mean() value of our vector, the length() tells us how long it is i.e. how many elements are in it and we can even sort() the values in it:
Another cool feature of vectors is that you can perform 'vectorised operations' on them. 'Vectorised' just means that we apply our function or operation onto the whole vector in one go rather than cycling through each individual element one at a time. For example, what if we wanted to subtract 5 from each element in our vector? We can simply write this as my_vector - 5 and R knows to apply the subtraction to each element:
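For example, with some made-up values:

```r
my_vector <- c(12, 7, 25, 3, 18)
sum(my_vector)     # 65
mean(my_vector)    # 13
length(my_vector)  # 5
sort(my_vector)    # 3 7 12 18 25
my_vector - 5      # vectorised: 5 subtracted from every element at once
```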
Another clever property of vectors is how they interact with each other. We can make a second vector of the same length as our original one and then try adding or dividing them.
Before, when we were subtracting 5 or multiplying by 10 i.e. using a single value also known as a 'scalar value', that operation got performed on each element. When we use a vector, R applies the function between the corresponding pairs of elements in each of the vectors e.g. element 1 of my_vector gets multiplied by element 1 of another_vector and then the second element of each and so on:
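A sketch of the pairwise behaviour (both vectors here are made up):

```r
my_vector <- c(1, 2, 3, 4)
another_vector <- c(10, 20, 30, 40)
my_vector + another_vector  # 11 22 33 44 - element paired with element
my_vector * another_vector  # 10 40 90 160
```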
One thing worth noting here is that this works well when the vectors are the same length and R can match them up nicely. When the vectors are different lengths things get a bit more confusing. Rather than throw an error, R again tries to be helpful by recycling the shorter vector to make it the same length as the longer one. For example, let's try the above example again but with a vector of only two elements this time:
This time, rather than every element being divided by its corresponding element from the other vector, only the first and second elements are; the third and fourth are divided by the first and second elements of the shorter vector again. The process repeats until all of the elements of the longer vector have had the function applied. This can lead to unintended consequences if you have vectors of different lengths without realising.
Finally, we can combine our different vectors into one big vector using the c( ) just like we would to make a regular vector:
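A sketch of recycling and combining, with illustrative values:

```r
long_vector <- c(10, 20, 30, 40)
short_vector <- c(10, 2)
long_vector / short_vector    # 1 10 3 20 - the short vector is recycled
c(long_vector, short_vector)  # one big vector of 6 elements
```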
Data Types in R
Up until now we've mostly been using numeric data but there are actually many types of data in R including two types of numeric data! Below is a brief summary of the ones you'll come across most often:
- Character / string data which we've already seen and are written with either "double" or 'single' quotes around them.
- Doubles which are real numbers and can be decimals e.g. 12.3. By default, all numbers in R are doubles unless we tell it otherwise.
- Integers i.e. whole numbers. These appear in the environment with an L after them to denote that they are integers e.g. 1L, 2L, etc. You might wonder why the need for two types of numbers, but the extra decimal precision in doubles takes up more room in memory even if unused, so integers can be stored more compactly.
- Logicals, sometimes called booleans, are written as TRUE/FALSE or can be abbreviated to T/F. Logicals are a super handy data type that can be used in anything from counting to filtering. Behind the scenes they are stored as integers with FALSE=0 and TRUE=1.
- Factors are used to represent categorical data e.g. High/Medium/Low and are a bit of a blend between strings and integers. The labels of the factors e.g. High/Medium/Low appear as text but under the hood the levels of the categories are stored as integers e.g. High=1, Medium=2 and Low=3. They can require a bit more work to use sometimes but are useful for modelling and plotting.
- There are also raw, complex, date and date time data types but we'll ignore those for now.
Let's have a look at each of these data types in turn and some of the helpful functions R provides to identify and convert between the different types:
Character data
There's a few things worth noting from the above examples. As we've already seen, R converts all the elements in a vector to be of the same type. The typeof() function tells us what type of data we're dealing with, in this case 'character'.
The is.character() function is a more specific test that allows us to explicitly test whether the elements belong to a specific type. In our example, it's testing to see if they're characters. It returns a logical value TRUE/FALSE depending on whether the data matches the type we're testing for.
The final function as.character() tells R that we want to attempt to convert our vector elements into characters. Notice in the output that TRUE/FALSE first get converted into their underlying 1/0 format which then get converted into characters. When using as.type(), R will always try to make the conversion but if it finds it can't it will return NA for that value and post a warning to the output. Let's now apply the same functions to vectors of the different data types.
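A sketch of all three functions in action (the example values are mine):

```r
char_vector <- c("apple", "banana", "cherry")
typeof(char_vector)        # "character"
is.character(char_vector)  # TRUE

# a mixed vector coerces TRUE/FALSE to 1/0 before as.character() runs
as.character(c(TRUE, FALSE, 1.5))  # "1" "0" "1.5"
```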
Numeric / Doubles
We can see in the final example that R was unable to convert our string 'four' into a number which is fair enough. When this happens we get a message in the output telling us that 'NAs introduced by coercion'. NA is a special value in R which stands for 'not available' and is how R represents missing values. We'll learn a bit about these later on. For now let's continue with our different data types.
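A sketch of the double/numeric equivalents (example values are mine):

```r
numeric_vector <- c(1.5, 2, 3.25)
typeof(numeric_vector)      # "double"
is.numeric(numeric_vector)  # TRUE

# "four" can't be converted, so we get NA and a coercion warning
as.numeric(c("1", "2", "four"))  # 1 2 NA
```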
Integers
By default R stores all numbers as doubles so we need to use 'L' to tell R we want our numbers to be saved as integers. Although the integer vector looks the same as the numeric_vector when printed out, we can see that R knows they are in fact of different types. One final thing worth noting is that when we convert our decimal doubles to integers R doesn't round them but rather truncates/removes the decimals instead.
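For example:

```r
integer_vector <- c(1L, 2L, 3L)
typeof(integer_vector)  # "integer"
typeof(c(1, 2, 3))      # "double" - the default for numbers
as.integer(2.9)         # 2 - the decimal is truncated, not rounded
```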
Logicals
Logicals can only take two values: TRUE or FALSE which can be abbreviated to T or F and behind the scenes R stores these as 1 for TRUE and 0 for FALSE. This makes them super handy as we can apply numeric functions like mean/sum to them just like we would a numeric vector.
Logicals generally come about after testing a condition against something such as above with our is.logical(). As well as specific tests such as these, we can also test more general conditions against our data. Remember that R vectorises operations so something like my_vector+5 means +5 gets added to each element? We can do the same with our test conditions.
In the below example we take a vector of numbers and then convert it into a logical vector by testing to see which elements are >10. We know that R vectorises such operations so each element is tested in turn to see if it's >10 and TRUE is returned when it is and FALSE when it's not:
As we've already seen, as R stores T/F as 1/0 behind the scenes we can use other numeric functions with our logical vector. Say for instance we wanted to know how many elements in our vector are >10. We can simply use sum() to add up all the TRUE values which will tell us how many elements met our condition:
R has lots of different logical comparisons available for us to use and we can create even more complicated conditions by combining different ones using and/or:
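A sketch pulling the last few ideas together (the numbers are illustrative):

```r
numbers <- c(5, 12, 8, 20, 15)
numbers > 10                   # FALSE TRUE FALSE TRUE TRUE - vectorised test
sum(numbers > 10)              # 3 - i.e. three elements met our condition
numbers > 10 & numbers < 16    # combine conditions with & (and)
numbers < 6 | numbers > 19     # or with | (or)
```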
Later on we'll use these to see how logicals can be even more helpful when filtering our data but for now let's move onto our final data type.
Factors
The final data type we're going to look at is factors which are used for capturing categorical data. Categorical data in R is data that can only have a set number of categories and all these categories are known in advance e.g. days of the week, months of the year, names of capital cities, etc.
Factors are a bit of a mix between strings and integers which allow them to have some useful properties. To see why they can be so useful let's first try to create some categorical data without them. Let's say we're measuring something and want to capture three categories: High/Medium/Low. We could capture these as character strings, so let's create a vector of data made up of our three categories as strings:
We can see that although the categories are recorded correctly, it misses out on an important feature of our classes which is that they have an implicit order to them e.g. High > Medium > Low whereas if we sort the strings we get them in alphabetical order.
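A sketch of the problem (the particular values are made up):

```r
ratings <- c("High", "Low", "Medium", "High", "Low")
sort(ratings)  # "High" "High" "Low" "Low" "Medium" - alphabetical, not by size
```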
We could instead try to use numbers to represent them e.g. High=3, Medium=2 and Low=1:
Whilst using numbers captured the order correctly we've lost the handy descriptions of the different levels which is a shame. After some time away we might forget whether we encoded the values as High=1 so it sorts first or High=3 as it's the highest value. We've also got some unintended side effects as shown by the fact we can now sum or average our data which doesn't really make sense as we're meant to be dealing with categories. We'd ideally want R to recognise that asking for something like mean(c(Monday, Tuesday, Saturday)) isn't a legitimate operation!
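A sketch of the numeric attempt and its side effects:

```r
ratings_num <- c(3, 1, 2, 3, 1)  # High=3, Medium=2, Low=1
sort(ratings_num)  # 1 1 2 3 3 - the order is right but the labels are gone
mean(ratings_num)  # 2 - R happily averages our 'categories'
```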
We've also got the less obvious side effect of introducing a relative size difference between our values e.g. High is 3x the value of Low. If we were working with calendar months we could end up with December=12 being 12x greater than January=1. This can cause issues for other processes we might run on our data, such as machine learning algorithms, where the model will assume that the relative size is intended to be significant in some way. This is why factors can be so helpful in displaying categorical data. Now let's try doing it the proper way.
When using factors, R will always store the categories as character values even though the categories themselves can be text or numbers e.g. 'Monday'/'Tuesday'/'Wednesday' or 1800/1900/2000. The different values representing each category are known as the 'levels' of the factor. Behind the scenes R assigns to each level an integer value e.g. 1, 2, 3. These levels are what control things like the ordering of the factor so we can have High > Medium > Low but other R processes know to interpret the variable as a factor and so we avoid High being 3x Low like it would if we left it as an integer. To create a factor we simply pass our vector of data to the factor() function:
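For example:

```r
ratings <- c("High", "Low", "Medium", "High", "Low")
ratings_factor <- factor(ratings)
ratings_factor             # Levels: High Low Medium (alphabetical by default)
is.factor(ratings_factor)  # TRUE
typeof(ratings_factor)     # "integer" - the levels are stored as integers
sort(ratings_factor)       # still sorts by the alphabetical levels
```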
We can see that we've successfully converted our data into a factor. It has our three levels: High, Low and Medium, and is.factor() returns TRUE. We can also see though that its type is "integer" as R has converted our factor levels into integers behind the scenes. However it still doesn't sort as we'd like it to!
If our factor was unordered i.e. we just want to capture the different categories but don't care in which order they appear we could leave it as is. Since we do have an order though i.e. High/Medium/Low let's edit our factor to add this in. We can do this by manually specifying the different levels and their order to our factor.
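A sketch of creating the ordered version:

```r
ratings <- c("High", "Low", "Medium", "High", "Low")
ratings_factor <- factor(ratings,
                         levels = c("Low", "Medium", "High"),
                         ordered = TRUE)
sort(ratings_factor)                   # Low Low Medium High High
ratings_factor[1] > ratings_factor[2]  # TRUE: High > Low now makes sense to R
```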
Success!
Missing data
We've already encountered missing data when we were trying to convert between data types but it's worth spending a bit of time learning about how R deals with missing data and some of its quirks. R represents missing values as NA and in his book Hadley Wickham describes NAs as 'contagious' as 'almost any operation involving an unknown value will also be unknown'.
Let's look at some examples of how NA can be 'contagious' and how to deal with them. First up, let's create a vector that's a mix of numbers and missing.
We can see our missing value gets printed out as NA. We can use the is.na() function to test which values in our vector are missing. It essentially asks the question "is this value missing?" and so returns TRUE for any missing values and FALSE for any non-missing:
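A sketch with some example values:

```r
vector_with_na <- c(1, 2, NA, 4, 5)
vector_with_na
is.na(vector_with_na)  # FALSE FALSE TRUE FALSE FALSE
```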
Now we've confirmed we have missing data in our vector let's have a look at how it can be contagious. For example let's say we want to sum up all the values in our vector:
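Something like:

```r
vector_with_na <- c(1, 2, NA, 4, 5)
sum(vector_with_na)  # NA - the single missing value is 'contagious'
```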
Well that's unexpected! What's happened here is that the NA value has caused the entire sum() to return NA. This behaviour differs from other languages, such as pandas in Python, and can be a bit frustrating. It could be argued that R's approach is purer theoretically as NA literally represents an unknown that could be any value.
Thinking of it this way, we're essentially asking R to sum: 1 + 2 + 3 + 4 + 5 + "no idea but it could be anything" and suddenly getting "no idea" back doesn't seem such a silly response (even if it makes it slightly more effort to work with). This interpretation of NA as "no idea but it could be anything" also explains this other quirk of NA:
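Namely:

```r
NA == NA  # NA - R can't say whether two unknown values are equal
```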
Now this one looks even stranger! But again under the interpretation of NA as "it could be anything" then actually checking if two things, that could be anything, are the same getting "no idea" back is understandable.
Thankfully nearly all R functions come with a handy option that we can use to deal with missing values. Most aggregating function include an 'na.rm' option which we can set to TRUE to ignore/remove NA values when we call the function:
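For example:

```r
vector_with_na <- c(1, 2, NA, 4, 5)
sum(vector_with_na, na.rm = TRUE)   # 12 - the NA is removed before summing
mean(vector_with_na, na.rm = TRUE)  # 3
```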
There we go!
Subsetting using logicals
Up until now we've been selecting elements (subsetting) from our vectors by either passing it numbers corresponding to the number of the elements we're after e.g. a_vector[c(1,2,3)] or the name of the element we want e.g. a_vector["a_name"]. There's another very powerful way we can subset using logicals.
When we try to subset we can think of this as akin to asking a_vector['what would you like returned?']. By way of answer we can pass it a same-length vector of logicals (TRUE/FALSE) where TRUE stands for 'yes please return it' and FALSE stands for 'no thanks'. So subsetting a vector using logicals e.g. a_vector[c(T,F,F,T)] would bring back the 1st and 4th element. Let's see how this works in practice:
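A sketch with four made-up values:

```r
a_vector <- c(10, 20, 30, 40)
a_vector[c(TRUE, FALSE, FALSE, TRUE)]  # 10 40 - the 1st and 4th elements
```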
The reason this is so powerful is that we've already seen we can convert any vector into a vector of logicals by testing it against a logical condition. We can use the fact that R applies our logical condition to every element and converts it into TRUE/FALSE depending on whether or not our condition is met. This means we get a logical vector of exactly the same length as the original which is exactly what we need to subset with.
For example say we want to bring back only the elements that are >30 from our vector. How might we do this with logicals?
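Something like:

```r
a_vector <- c(10, 20, 30, 40, 50)
a_vector > 30            # FALSE FALSE FALSE TRUE TRUE
a_vector[a_vector > 30]  # 40 50 - only the elements where the test was TRUE
```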
We can also use the logical operators & (and), | (or), %in% (in) and ! (not) to combine and create complicated subsetting conditions that work in exactly the same way. For example let's say we want to filter for anything that is >=20 but <60 but also not 40:
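A sketch of that condition on some example data:

```r
a_vector <- c(10, 20, 30, 40, 50, 60)
a_vector[a_vector >= 20 & a_vector < 60 & a_vector != 40]  # 20 30 50
```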
We can use our logical subsetting to also update specific values in our vector. Say for instance we want to make any value<=20 negative. First we can bring back all values <=20, we can then overwrite these values with the same values multiplied by -1:
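For example:

```r
a_vector <- c(10, 20, 30, 40, 50)
a_vector[a_vector <= 20]  # 10 20 - the values we want to change
a_vector[a_vector <= 20] <- a_vector[a_vector <= 20] * -1
a_vector                  # -10 -20 30 40 50
```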
Other data structures in R
Up until now we've been using vectors as these are such a fundamental data structure in R but there are others such as lists, arrays, matrices and data frames. We'll also encounter 'tibbles', when we look at the tidyverse, which are a more modern version of a data frame. In this section we'll focus on lists and data frames, as you're likely to encounter these more often, along with a quick look at arrays and matrices.
Lists
One of the big limitations with vectors is that they can only contain data which is all of the same type. This makes them super speedy to use but most of the data we'll be using out in the wild will be a mix of different types. This is where lists come in! Lists can contain data of any type and are often used by other functions or packages in R to hold a mix of different data. Instead of using c( ) like we did for vectors, we can create them using the list() function:
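A sketch with six mixed inputs (the first two are 1 and 2, matching the printout described below; the rest are made up):

```r
my_list <- list(1, 2, "three", TRUE, 5.5, 6L)
my_list  # each input printed in its own [[n]] slot
```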
The print out looks a little odd but gives an idea of how the list is able to hold different types of data by splitting them all into their own areas. The [[1]] denotes the 1st element in the list which is the value 1 i.e. a vector c(1), [[2]] is the second element, in this case a vector c(2), and so on.
A handy function when dealing with more complicated data structures in R is the str() function which shows the underlying structure of the object:
We can see that the list took our 6 inputs and created a list with 6 separate elements. We can also add names to the elements of our list, either when we create them or afterwards using the names() function
As well as single elements, it's possible to store other data structures in lists, such as longer vectors and even other lists! This means lists can get pretty complicated. In the below example we pass three longer vectors to our list and give them names:
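A sketch of such a list, using the 'bedroom' and 'kitchen' names from the analogy below (the third name and all the contents are assumptions of mine):

```r
removal_van <- list(
  bedroom = c(TRUE, FALSE, TRUE),            # a box of logicals
  kitchen = c("Mug", "Saucepan", "Kettle"),  # a box of strings
  garage  = c(1, 2, 3, 4)                    # a box of numbers (name assumed)
)
str(removal_van)  # a list of 3, with our named vectors inside
```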
An analogy I find helpful for understanding how lists work is to imagine you're packing up belongings for storage or moving house.
We might have a bunch of related stuff that we put in a box together. The related stuff in the above example might be a load of logicals that we 'box up' into a vector called 'bedroom' to be put into storage. In total we've got 3 vectors/boxes that we want to put into our list:
Our list essentially does the job of our storage locker/removal van and houses our boxes. It has 3 elements which are the 3 vectors/boxes we passed to it:
All our boxes are in the list but they're separate from each other and all our original stuff - the logicals, strings, etc - are still there in their boxes too. This way of thinking also introduces the helpful idea that if we want to get the stuff in the boxes, first we need to go to our removals van, select the box we want and then get the right stuff out of the box. The introduction of this extra step in our process of getting what we want is a handy reminder of how we subset lists, as it works a little differently to subsetting vectors.
Whereas before we could use [ ] to extract individual elements, for lists we need to use [[ ]] instead. This is because [ ] always returns an object with the same structure as the one it's subsetting. So if we use [ ] on a list we'd just get another list. The [[ ]] behaves much the same as [ ] except that it'll always try to simplify the data structure it returns. We can see the difference in each approach below:
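A sketch of both approaches (the list contents follow the analogy above):

```r
removal_van <- list(kitchen = c("Mug", "Saucepan", "Kettle"),
                    bedroom = c(TRUE, FALSE, TRUE))
removal_van["kitchen"]    # a list of 1 - the box is still in the van
removal_van[["kitchen"]]  # "Mug" "Saucepan" "Kettle" - the vector itself
```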
The difference on the output is subtle but the first approach returns a list, with 1 element, which is our vector called 'kitchen' (the giveaway it's still a list is the $kitchen). In the second approach we simply get the elements of our vector: "Mug" "Saucepan" "Kettle". Think of it like this: using [ ] meant that we identified the right box but it was still in the van (a list) whereas using [[ ]] meant not only did we identify the right box but we also took it out the van and got back our box/vector.
If the idea of writing lots of [[ ]] doesn't appeal, R has another way of achieving the same result. When we have a list where the elements are named, we have the option of using the $ sign instead of [[ ]]:
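For example:

```r
removal_van <- list(kitchen = c("Mug", "Saucepan", "Kettle"))
removal_van$kitchen  # the same result as removal_van[["kitchen"]]
```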
Now we're successfully picking out individual elements from the list, what if we also want a subset of elements from the vector we're returning? Say we don't just want to get the box marked 'kitchen' out of the van, we now want to get 'Kettle' out of the box too. We can do this by chaining multiple subsetting statements together. For example, if [[ ]] is the equivalent of getting our box/vector out of the van then we can select from it in the normal way again using [ ]:
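Something like (assuming 'Kettle' is the 3rd item in the box):

```r
removal_van <- list(kitchen = c("Mug", "Saucepan", "Kettle"))
removal_van[["kitchen"]][3]  # "Kettle" - box out of the van, item out of the box
```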
Before we move on to our next data structure it's worth looking at the list function's final party trick: lists of lists! Yep, the flexibility of lists extends to even being able to contain other lists inside them. Let's have a look at how it works in practice and why it isn't actually as scary as it sounds:
We can see that in our new list, 'listception', we have 2 elements, which makes sense as we passed two things to our list: a string and our removal_van list (kind of like putting our removals van on a ferry..?) We can see that our removal_van list is still there with its 3 elements and if we want to access them we do it exactly like before except this time we've got another layer of list to work through before we can get to 'Kettle':
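A sketch of the nested access (the string element is an assumption of mine):

```r
removal_van <- list(kitchen = c("Mug", "Saucepan", "Kettle"))
listception <- list("a string", removal_van = removal_van)
length(listception)                           # 2
listception[["removal_van"]][["kitchen"]][3]  # "Kettle" - one extra layer
```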
Arrays and Matrices
So far we've been working with 1-dimensional data i.e. vectors and lists. Now we'll see how we can create data with more dimensions in the form of arrays and matrices. These structures are actually built on top of vectors: arrays and matrices are essentially just vectors with some extra row/column dimensions and as such are limited to holding data that is all of the same type. They tend not to be used a lot day to day but you might occasionally get them as outputs of other statistical functions in R e.g. a correlation matrix from cor().
Arrays are probably the least common data structure so we'll quickly look at those before moving on to matrices. They're helpful when working with image data where you can read in the red/green/blue values of the pixels as a 3-dimensional array. For creating our array we have two options. We can use the array() function and feed it some data and parameters e.g. number of rows and columns, or we can turn some existing data into an array by assigning dimensions to it. Let's have a look at both ways of creating them:
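A sketch of both approaches (the data is just 1 to 12):

```r
array(1:12, dim = c(3, 4))  # 3 rows, 4 columns

my_data <- 1:12
dim(my_data) <- c(3, 4)     # turn an existing vector into an array
my_data
```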
Now we've got something that looks much closer to a data table! Notice how R filled our array column by column: it runs down all the rows of the first column before moving on to the next. Arrays in R are actually n-dimensional objects meaning they can store data in more than 2 dimensions. We can see how they achieve this by adding an extra dimension option to our array creation:
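For example, a 3-dimensional version (think two 'layers' of a 3x4 table, a bit like colour channels in an image):

```r
array(1:24, dim = c(3, 4, 2))  # 3 rows, 4 columns, 2 layers
```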
Multidimensional arrays aren't all that common but it's handy to know they exist in case you do ever come across one. More likely is that any arrays you encounter will be 2D in the form of a matrix. A matrix is essentially a special case of an array that can only ever have 2 dimensions and as such is a bit nicer to work with than an array. As it's still just a vector behind the scenes, a matrix can only hold data all of the same type, but we'll see how we can hold data of different types next with data frames.
First let's create a matrix using the matrix() function and confirm that it's just a 2D array:
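A sketch, using 4 rows and 5 columns to match the dimensions discussed below:

```r
a_matrix <- matrix(1:20, nrow = 4, ncol = 5)
a_matrix
is.matrix(a_matrix)  # TRUE
is.array(a_matrix)   # TRUE - a matrix is just a 2D array
```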
We can add some row and column names to our matrix to make it look a bit nicer:
Matrices also come with some other handy helper functions such as nrow() and ncol() that we can use to count the number of rows and columns in our matrix:
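A sketch of both, with some hypothetical row and column names:

```r
a_matrix <- matrix(1:20, nrow = 4, ncol = 5)
rownames(a_matrix) <- paste0("row_", 1:4)  # names assumed for illustration
colnames(a_matrix) <- paste0("col_", 1:5)
nrow(a_matrix)  # 4
ncol(a_matrix)  # 5
```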
We can also append new data to our matrix using rbind() to add new rows and cbind() to add new columns. As the names suggest, these functions just bind data to the matrix and we don't need to worry about join keys or sort orders like we might in other languages. As a matrix is fundamentally a vector, we can use vectors to add new data. We just need to make sure they're the same length as whichever dimension of our matrix we're binding to.
For example our matrix currently has 4 rows of data and 5 columns. So to bind another row we need to make sure we have 5 values i.e. 1 for each column. To bind columns we need to make sure it has 4 elements i.e. one for each row. We can name our vectors so they match our new column and row names too:
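A sketch of both binds on a made-up 4x5 named matrix:

```r
a_matrix <- matrix(1:20, nrow = 4, ncol = 5,
                   dimnames = list(paste0("row", 1:4), paste0("col", 1:5)))

# a named vector of 5 values (one per column) bound on as a new row
new_row <- c(col1 = 21, col2 = 22, col3 = 23, col4 = 24, col5 = 25)
rbind(a_matrix, row5 = new_row)

# a named vector of 4 values (one per row) bound on as a new column
new_col <- c(row1 = 21, row2 = 22, row3 = 23, row4 = 24)
cbind(a_matrix, col6 = new_col)
```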
Being built on top of vectors also means that we can filter/subset our matrix in a similar way to how we did for our vectors using [ ]. The only difference this time is that we have two dimensions to subset on. This means we need to specify a row number/name and a column number/name which we separate with a comma. For example a_matrix[1, 1] will get the element in our matrix in the 1st row and 1st column. We can also leave one side of the comma blank as a shortcut to bring back everything from that dimension e.g. a_matrix[ , 1] brings back all rows and the 1st column:
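For example, on the same hypothetical named matrix:

```r
a_matrix <- matrix(1:20, nrow = 4, ncol = 5,
                   dimnames = list(paste0("row", 1:4), paste0("col", 1:5)))
a_matrix[1, 1]          # row 1, column 1
a_matrix[ , 1]          # all rows of column 1
a_matrix["row2", ]      # all columns of row 2, by name this time
a_matrix[1:2, c(1, 3)]  # rows 1-2 of columns 1 and 3
```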
Also carried over from vectors is the ability to access each of the elements all at once when we perform operations on our matrix:
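For instance, with some arbitrary values:

```r
a_matrix <- matrix(1:6, nrow = 2)
a_matrix * 10        # every element multiplied at once
a_matrix + a_matrix  # element-wise addition
```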
Data Frames
Arrays and matrices are built on top of vectors and as such can only hold data all of the same type. Data frames on the other hand can hold mixed types of data. Any guesses for what they might be built from? Yep, it's lists! A data frame is just a list of vectors that all have the same length. Data frames have a 2D data structure so in terms of how we work with them they're a bit of a mix between a list and a matrix. Along with the vector they're the most common data type you'll come across in R. Let's convert our matrix into a data frame and have a look at how they differ:
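Sticking with the same made-up values, as.data.frame() handles the conversion:

```r
a_matrix <- matrix(1:20, nrow = 4, ncol = 5,
                   dimnames = list(paste0("row", 1:4), paste0("col", 1:5)))
a_dataframe <- as.data.frame(a_matrix)
a_dataframe
str(a_matrix)     # one long vector with dimensions attached
str(a_dataframe)  # a collection of named column vectors
```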
Apart from the structure being a lot clearer for our data frame, there are a few things worth noting. Remember a matrix is just a vector with some dimensions (rows/columns) chucked on top. We can see this in how it prints out the elements like it's one long vector with a note at the start about its dimensions. The column and row names were also something extra we added on which is why they're split out as separate attributes attached to the matrix.
In contrast, the data frame is a bit more formalised and structured. We can see that each column prints out like its own vector and each column is stored as a group in our data frame but separate from the others (like how we saw lists work). Technically, rather than rows and columns, a data frame has observations (instead of rows) and variables (instead of columns). We can see that these are recorded at the top of the print out. The variables in a data frame also have to be named (and each name unique within the data frame) rather than being an optional extra like in the matrix.
The other big difference of course is that data frames can hold data of different types unlike matrices:
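A quick sketch with some invented columns, one of each type:

```r
mixed_df <- data.frame(
  numbers  = c(1, 2, 3),
  letters  = c("a", "b", "c"),
  logicals = c(TRUE, FALSE, TRUE)
)
str(mixed_df)  # each column keeps its own type
```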
We can add rows and columns with rbind() and cbind(), like we did for matrices, and a lot of the functions that worked on matrices are carried over to data frames:
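For example, on a small made-up data frame (note the new row is passed as a list because the columns have different types):

```r
a_dataframe <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"))
a_dataframe <- rbind(a_dataframe, list(4, "d"))                        # new row
a_dataframe <- cbind(a_dataframe, col3 = c(TRUE, FALSE, TRUE, FALSE))  # new column
nrow(a_dataframe)  # 4
ncol(a_dataframe)  # 3
```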
In terms of how we work with them, the main differences between matrices and data frames show up in how we subset them. There are a couple of general rules about subsetting data frames and then annoyingly there are a couple of exceptions to these rules too! These exceptions are removed from tibbles, a modern reimagining of data frames, which we'll look at later. For now it's worth being aware of the quirks of data frames so if you see strange things happening in your code you have an idea of where to look for the issue.
As our data frame has got rows and columns it's got 2 dimensions that we can subset it by. First up let's look at subsetting it in 1 dimension i.e. just asking for columns. In this case it behaves like a list:
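For example, on a hypothetical two-column data frame:

```r
a_dataframe <- data.frame(col1 = 1:4, col2 = letters[1:4])
a_dataframe["col1"]             # a one-column data frame
a_dataframe[c("col1", "col2")]  # a two-column data frame
```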
We know we can use [[ ]] and $ to simplify the structure and get a vector back from lists and we can do this for data frames too. However if we do that we're capped at only asking for 1 column back at a time:
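Using the same sort of made-up data frame:

```r
a_dataframe <- data.frame(col1 = 1:4, col2 = letters[1:4])
a_dataframe[["col1"]]  # simplified down to a vector
a_dataframe$col1       # the same vector
```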
When we want to subset by 2 dimensions (rows and columns) our data frame behaves much like a matrix would: my_dataframe[rows, columns]:
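For instance, with some placeholder columns:

```r
a_dataframe <- data.frame(col1 = 1:4, col2 = letters[1:4], col3 = 5:8)
a_dataframe[1:2, c("col1", "col2")]  # rows 1-2 of two columns: a data frame
a_dataframe[3, ]                     # all columns of row 3
```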
The exception to the rule that subsetting in 2 dimensions always returns another data frame is if the subsetting conditions pick out a single column/variable. Then instead of a data frame, we get a vector:
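We can see the quirk (and the drop = FALSE escape hatch) with a small example:

```r
a_dataframe <- data.frame(col1 = 1:4, col2 = letters[1:4])
a_dataframe[ , "col1"]                # a single column silently becomes a vector
a_dataframe[ , "col1", drop = FALSE]  # drop = FALSE keeps it a data frame
```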
As mentioned this quirk is removed in tibbles which will always return another tibble if [ ] is used. The other quirk of data frames comes from when we try to select columns using $ rather than [[ ]]. The $ decides to use 'partial matching' where if it can't find a column in the data frame with the exact name we asked for, it'll pick out ones that start with the same letters...
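For example, with a deliberately misspelled column name:

```r
a_dataframe <- data.frame(my_column = 1:3)
a_dataframe$my_col       # partial match: returns my_column's values anyway!
a_dataframe[["my_col"]]  # [[ ]] matches exactly: returns NULL
```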
If else...
R has the ability to run different pieces of code depending on whether conditions we pass it resolve to TRUE or FALSE. It does this using if-else statements. What makes if-else so handy is that the condition we pass to it can make reference to objects that we created earlier in our code. An if-else statement takes the form of:
The condition in the first if() is tested and if it's TRUE, then the code in the curly brackets { } gets run. If the condition is not TRUE, then the code in the catch-all else gets run. The { } tells R which bits of code are inside of the if-else. We'll see these crop up again when we look at functions. As well as if-else we can just have an if on its own on occasions where the 'else' might just be to do nothing. We can also combine our if-else statements to make longer statements. Let's give it a try:
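A sketch with some made-up values (x = 1, y = 2) showing a lone if, an if-else and a combined version:

```r
x <- 1
y <- 2

# a simple if on its own - no else needed
if (x < y) {
  print("x is less than y")
}

# if-else
if (x < y) {
  print("x is less than y")
} else {
  print("x is not less than y")
}

# if / else if / else
if (x < y) {
  print("x is less than y")
} else if (x == y) {
  print("x equals y")
} else {
  print("x is greater than y")
}
```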
Our objects x and y are set up so that x < y is TRUE, so we'd expect the first print() of each of the different if-else statements to run, which is what we see. The final example actually combines multiple if-else statements with an 'else-if' in the middle section. You can think of this as, knowing that the first if condition was false, offering a second chance with another condition to be true before we move to the catch-all else.
Let's change our x and y values to see how the same if-else statements change their output:
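Swapping to x = 3 and y = 2 so the first condition is now FALSE:

```r
x <- 3
y <- 2

# FALSE with no else: nothing gets printed at all
if (x < y) {
  print("x is less than y")
}

# FALSE with an else: the catch-all branch runs instead
if (x < y) {
  print("x is less than y")
} else {
  print("x is not less than y")
}
```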
We can see that by changing the values of x and y before the if-else statements we've changed which bits of code get run by them. Also notice for the first if() that nothing got printed on this occasion. This is because the if() resolved to FALSE and we'd only given it code to run when TRUE.
Functions
Functions are super handy to automate tasks and can remove the need for lots of copying and pasting in our code. They allow us to package up what can be quite long or complicated bits of code into a single function call that we can then run with different options at our leisure.
The basic format of a function in R is: the function name (so we can call it later), the arguments of the function (these allow us to pass it different bits of code to run) and the body (the main block of code that we want to run each time the function is called). A typical R function will look something like this:
We tell R we want to create a function by typing function(). Inside the brackets are where we put our arguments and then the main body is surrounded by curly brackets { } just like for our if-else. We assign the function to an object, where we can give it a name, so that it saves the function and we can call it later. We then call our function by writing function_name() with the arguments we want to pass to it in the ( ). Let's have a look at a really simple function to see how they work in practice:
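A minimal sketch — the function just prints whatever it's given, and we call it 4 times with different values:

```r
my_func <- function(x) {
  print(x)
}

my_func(1)
my_func("hello")
my_func(TRUE)
my_func(c(1, 2, 3))
```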
The above function just prints out whatever we pass to it. First we tell R we want to make a function and save it in an object called 'my_func'. Our argument is just a simple 'x' which acts as a placeholder in the main body of our code. Typically, placeholders are short letters like 'x' or 'y' but they can be anything. If you've got lots of arguments or they have specific purposes, it's good practice to give them more informative names like 'data', 'column', etc. In the body of the function we've got the repeatable bit of code that we want to run each time the function is called which is our print(). Finally at the bottom we call our function 4 times and pass it a different value each time.
Let's now try creating a function with 2 arguments:
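Subtraction is used here (rather than the original snippet's exact body) so we can see that the argument order matters:

```r
my_func <- function(x, y) {
  print(x - y)
}

my_func(1, 2)          # x = 1, y = 2 by position: -1
my_func(2, 1)          # x = 2, y = 1 by position: 1
my_func(y = 2, x = 1)  # naming the arguments overrides the ordering: -1
# my_func(1, 2, z = 3) # Error: unused argument (z = 3)
```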
Functions use the order in which the arguments are created to match them to the arguments when we call the function. For example, we tell our function that our first argument is 'x' and that our second is 'y'. So when we call our function, with say my_func(1, 2), the number 1 gets assigned to the x placeholder as it occurs first and the 2 to the y as it occurs second. We can override this implicit ordering by specifying the argument names directly e.g. my_func(x=, y=) like we do in the third call to our function where we reverse the ordering by explicitly passing y= first.
On our final call we tried to add an extra option that hadn't been specified in our original function and this resulted in an error. We can make our function accept extra arguments by using '...' in the arguments creation. The ... tells our function to accept anything extra we might pass it even if it doesn't get used when the function is run. A lot of the pre-made functions you'll encounter include ... to make them more flexible so it's good to be aware of what it's doing:
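Adding ... to the same sketch makes the previously failing call run happily:

```r
my_func <- function(x, y, ...) {
  print(x + y)
}

my_func(1, 2, z = 3)  # the extra z = 3 is quietly swallowed by ...
```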
The other thing to note about functions is that they create their own environment when they run. This can sometimes cause confusion about referring to objects within the function or creating new objects with the output of a function.
For example, in our previous function we created a new object 'z' but if we have a look in our environment in RStudio there is no object called 'z'. This is because that object is created inside the function when it runs and then ceases to exist once the function is finished. It never makes it out into the global environment which is where we'd create objects normally. If we want to use the output of the function in the global environment we can do this by assigning our function call to an object in the usual way:
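For example (assuming no object called 'z' already sits in your global environment):

```r
my_func <- function(x, y) {
  z <- x + y
  z
}

my_func(1, 2)            # prints 3 but nothing is saved
result <- my_func(1, 2)  # assign the output to keep it around
result                   # 3
exists("z")              # FALSE - z never left the function's environment
```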
Functions having their own environments can also have another unusual side effect that it's good to be aware of. If we have two objects with the same name, one in the global environment and one in the function, then the function will use the one in the function which can lead to confusing results if you're not careful:
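A sketch of the name clash, with arbitrary values:

```r
z <- 100  # in the global environment
my_func <- function() {
  z <- 1  # in the function's own environment
  print(z)
}

my_func()  # prints 1 - the function's own z wins
z          # still 100 in the global environment
```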
Functions will always try to look for objects in the function environment first and only if they can't find them will they then look in the global environment. For example, the below function runs as expected because there is nothing in the function environment for it to use:
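```r
z <- 100
my_func <- function() {
  print(z)  # no z inside the function, so R finds the global one
}

my_func()  # prints 100
```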
As well as passing the output of a function to save it, we can be more explicit about what we want returned from our function using return(). By default, a function returns the last thing it evaluated. For example, let's see what happens with the example below:
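A sketch where the intended result z gets computed but something else comes last:

```r
my_func <- function(x, y) {
  z <- x + y
  something_else <- "oops"
}

result <- my_func(1, 2)
result  # "oops" - the last thing evaluated, not z
```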
As the last thing we create is the 'something_else' object this is unfortunately what gets returned to us. We can override this by using return() which causes the function to stop running and return the value as soon as it encounters our return().
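Adding return() to the same sketch fixes it:

```r
my_func <- function(x, y) {
  z <- x + y
  return(z)                # the function stops here and hands back z
  something_else <- "oops" # never reached
}

my_func(1, 2)  # 3
```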
The challenge with how functions return their output, whether explicitly or by default as the last thing processed, is that it means we can only get 1 output at a time. For example the code below only ever returns one object (the first when we use two return()s, or the second when we have none).
How we can get around this is by creating a handy catch-all, composite object that can store both our outputs...a list! We can then return the list as a single output from our function and then subset it to get our separate outputs:
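A sketch returning two (made-up) results bundled into one named list:

```r
my_func <- function(x, y) {
  total   <- x + y
  product <- x * y
  list(total = total, product = product)  # one list, two outputs
}

results <- my_func(2, 3)
results$total    # 5
results$product  # 6
```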
Loops
As well as functions, you can also perform loops in R. We can use loops to iterate along a sequence of values e.g. 1:10, through the elements of an object or while some condition is true. Looping through elements of an object will often be a lot slower than using R's vectorised functions, so it's usually worth trying a function first before resorting to a loop. Let's have a look at how we can loop through a simple sequence of values to start with:
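```r
for (i in 1:10) {
  print(i)
}
```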
We can see in the output that the loop goes through numbers 1-10 and at each iteration it runs the code between the { } which is print(i) where i is the placeholder that takes the value from the sequence. In loops it's common to use a single letter like i, short for 'iteration', but it can in fact be anything:
As well as manually specifying a sequence for our loop we can use the seq_along() function to iterate through a sequence based on the number of values in an object. For example:
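Here's a sketch with a small made-up vector — seq_along() builds the sequence 1, 2, 3 to match its 3 elements:

```r
my_vector <- c("a", "b", "c")
for (i in seq_along(my_vector)) {
  print(my_vector[i])
}
```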
As well as for() loops, which work their way through each value in a sequence, there are also while() loops. These run (potentially forever so be careful!) until some condition is met. In the example below we create an object called i and assign it the value 0. In our while() condition we then specify that our loop is to run whilst it is less than 10. In the body of the loop we then have two bits of code. One prints the value of i on the current iteration and the other adds +1 to i before the next loop begins. Let's see what happens:
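```r
i <- 0
while (i < 10) {
  print(i)    # the value of i on this iteration
  i <- i + 1  # add 1 before the condition is checked again
}
```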
We can see that the while() loop ran 10 times before our condition of i<10 became false. We can see how on the first loop i was 0 as that's the value we assigned to the object. This then gets updated by i <- i + 1 so that on the next loop i=1, and so on until i reaches 10, at which point the condition is FALSE and the loop stops running.
As well as iterating through a sequence we can also iterate through the elements in an object. For vector based data structures (vectors, arrays and matrices) the loop iterates over each individual element in turn. For list based structures (lists and data frames) each element is iterated over but those elements might be data structures in their own right e.g. another list inside a list or a vector of values in a data frame. The easiest way to see the difference is with a quick example:
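The values are arbitrary — the point is what each loop treats as 'an element':

```r
# vector: each individual value is visited in turn
for (el in c(10, 20, 30)) {
  print(el)
}

# list: each element may itself be a whole structure
for (el in list(1:3, "hello")) {
  print(el)
}
```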
Packages and the tidyverse
Packages in R are packaged up bundles of other peoples' code that you can download and use for yourself. The long history of R and its large user base means that whatever it is you want to do in R, it's likely someone has already made a package to help! This is great news the more familiar you become with R but it can feel overwhelming when you're starting out and make the learning curve feel steeper than it needs to be.
When I first started learning R (and even now!) I would google things and often there would be multiple solutions to solve the posted problem. Each solution would potentially use a different package and syntax which left me feeling overwhelmed by how much I had to learn. It also meant my early R scripts each had about 20 packages loaded that I'd collected over time from Stack Overflow.
The good news is that a lot of the most modern and powerful packages in R today all come from the tidyverse which is a collection of packages designed for data science by RStudio. This means they all operate in a consistent and coherent way and benefit from ongoing support and development from RStudio. It also means that other package developers are starting to use the styles and structure of the tidyverse in their own work.
Let's go ahead and install the tidyverse. We only need to install a package once but each time we start a new session we call library() to tell R to load it:
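```r
# install once (this can take a few minutes - it's a big collection!)
install.packages("tidyverse")

# then load it at the start of each session
library(tidyverse)
```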
You might recognise some of the package names such as dplyr, ggplot2 or stringr as these are some of the most commonly used packages in R and all members of the tidyverse.
For now, we'll finish with a quick mention of tibbles which are a tidyverse upgrade on data frames. The idea behind tibbles is that they display data in a nicer way than data frames and they also have some extra benefits behind the scenes that make working with them a bit easier too. Let's first see how tibbles look compared to data frames:
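A sketch using some made-up columns (this assumes you've installed the tidyverse, which includes the tibble package):

```r
library(tibble)  # or library(tidyverse)

a_dataframe <- data.frame(col1 = 1:3, New_col = c("a", "b", "c"))
a_tibble <- as_tibble(a_dataframe)

a_dataframe
a_tibble  # reports its dimensions and each column's type, with no row names
```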
We can see from the print out that our tibble looks pretty similar to our data frame apart from a few changes. The tibble lets us know it's a tibble and tells us how many rows and columns we have which is handy. It's also removed the row names from our data frame. Another small change is how the text strings appear in our 'New_col' which are left-aligned in the tibble which makes them a bit easier to read.
Some of the behind the scenes improvements that make tibbles easier to work with can be seen with how they respond to subsetting. Earlier we saw that subsetting data frames has two quirks: if you use [ , ] to bring back a single column you get a vector and not a data frame and also that my_dataframe$ does partial matching on column names. Let's see how the tibble avoids both of these:
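Repeating our earlier data frame experiments on a tibble (again assuming the tidyverse is installed):

```r
library(tibble)

a_tibble <- tibble(my_column = 1:3)
a_tibble[ , "my_column"]  # still a tibble, not a vector
a_tibble$my_col           # NULL plus a warning - no partial matching
```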
Congratulations!
Well done! This has been a pretty technical 'instruction manual' type post but hopefully you've found it useful. If you want to start having a play around with some data in R using the tidyverse check out 'Data wrangling with the dplyr package'. If you'd like some extra reading make sure to try Hadley Wickham's (Chief Scientist at RStudio) free online book on using the tidyverse for data science or you can buy a copy here.