3 For Students

This section of the book is aimed at those who are either complete beginners in data analysis and statistical computing or who have some prior experience with another language or software package and want to learn more about R.

Here, we’ll look at the basics without going into so much detail that it becomes confusing. If you make it through this section, you’ll come out the other side with more than enough knowledge to complete your first real R project.

For those of you who feel you want to understand what underpins the concepts and code we’re going to look at, or those who want to eventually teach R to others, the following section (“For Teachers”) will go into more detail. For example, you’ll need to understand what a function is to use R, but you won’t need to know how to create one. In the “For Students” section, we look at what a function is, and then we learn how to create one in the “For Teachers” section.

3.1 Operators

Operators perform an action or represent something. A great example of an operator is +, which is a type of arithmetic operator that adds things together.

In this section, we’re going to look at the more common arithmetic operators that are used for simple maths in R, and then at some logical operators that are used to evaluate whether a criterion has been fulfilled.

3.1.1 Arithmetic operators

At the base of lots of programming languages are the arithmetic operators. These are your symbols that perform things like addition, subtraction, multiplication, etc. Because these operations are so ubiquitous however, the symbols that are used are often very similar across languages, so if you’ve used Excel or Python or SPSS or anything similar before, then these should be fairly straightforward.

Here are the main operators in use:

2 + 2  # addition
## [1] 4
10 - 5 # subtraction
## [1] 5
5 * 4 # multiplication
## [1] 20
100 / 25 # division
## [1] 4
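
Beyond those four, there are a few more arithmetic operators that you're likely to bump into sooner or later. This is just a quick sketch rather than an exhaustive list:

```r
2 ^ 3   # exponent (2 to the power of 3)
## [1] 8
7 %% 3  # modulo (the remainder after dividing 7 by 3)
## [1] 1
7 %/% 3 # integer division (how many whole 3s fit into 7)
## [1] 2
```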

3.1.2 Logical operators

Logical operators are slightly different to arithmetic operators - they are used to evaluate a particular criterion. For example, are two values equal? Or, are two values equal and two other values different?

To compare whether two things are equal, we use two equal signs (==) in R:

1 == 1 # equal
## [1] TRUE

Why two, I hear you say? Well, a bit later on we’ll see that we use a single equals sign for something else.

To compare whether two things are different (not equal), we use !=:

1 != 2 # not equal
## [1] TRUE

The ! sign is also used in other types of criteria, so the best way to think about it is that it inverts the criterion you’re testing. So in this case, it’s inverting the “equals” criterion, making it “not equal”.

Testing whether a value is smaller or larger than another is done with the < and > operators:

2 > 1  # greater than
## [1] TRUE
2 < 4 # less than
## [1] TRUE

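Applying our logic with the ! sign, we can also test whether something is not smaller or not larger. To do that, wrap the comparison in brackets and put the ! in front:

!(1 > 2) # not greater than
## [1] TRUE
!(2 < 4) # not less than
## [1] FALSE

You may come across code like 1 >! 2, but be careful: >! and <! aren’t operators in R. R reads 1 >! 2 as 1 > (!2), which applies the ! to the 2 first and then compares 1 against the result. It can return the answer you expect by coincidence, but it isn’t testing “not greater than”. In practice, “not greater than” is the same as “less than or equal to”, and “not less than” is the same as “greater than or equal to”, so you can also use the <= and >= operators:

1 <= 2 # less than or equal to
## [1] TRUE
2 >= 4 # greater than or equal to
## [1] FALSE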

There are three more logical operators: the “or”, “and”, and “xor” operators. These test whether at least one, both, or exactly one of the logical comparisons are TRUE, respectively:

1 == 1 | 2 == 3 # or (i.e. are either of these TRUE)
## [1] TRUE
1 == 1 & 2 == 3 # and (i.e. are these both TRUE)
## [1] FALSE

The xor operator is a bit different:

xor(1 == 1, 2 == 3) # TRUE because only one is TRUE
## [1] TRUE
xor(1 == 1, 2 == 2) # FALSE because both are TRUE
## [1] FALSE

For xor(), you need to provide your criteria in brackets, but this will make much more sense once we look at functions.

3.1.3 Questions

  1. Why does 1 == "1" return TRUE?
    • Hint: the answer is revealed in the data types section
  2. Reading the R documentation on logical operators, what is the difference between | and || (and & and &&)?

3.2 Variable assignment

Do you ever tell a story to a friend, and then someone else walks in once you’ve finished and so you have to tell the whole thing again?

Well, imagine after the second friend walks in, another friend comes in, and you have to start the story over again, and then another friend comes in and so on and so forth. What would be the best way to save you repeating yourself? As weird as it would look, if you wrote the story down then anyone who came in could just read it, rather than you having to go through the effort of explaining the whole thing each time.

This is essentially what we can do in R. Sometimes you’ll use the same value again and again in your script. For example, say you’re looking at total expenditure over a year, the value for the amount spent would probably come up quite a lot. Now, you could just type that value in every time you need it, but what happens if the value changed? You’d then have to go through and change it every time it appears.

Instead, you could store the value in a variable, and then reference the variable every time you need it. This way, if you ever have to change the value, you only need to change it once.

3.2.1 Creating variables

Creating variables in R is really easy. All you need to do is provide a valid name, use the <- symbol, and then provide a value to assign:

hello_im_a_variable <- 100
hello_im_a_variable
## [1] 100

Now, whenever you want to use your variable, you just need to provide the variable name in place of the value:

hello_im_a_variable / 10
## [1] 10

You can even use your variable to create new variables:

hello_im_another_variable <- hello_im_a_variable / 20
hello_im_another_variable
## [1] 5

When you come across other people’s work, you may see that they use = instead of <- when they create their variables. Even though it’s not the end of the world if you do do that, I would recommend getting into the habit of using <-. <- is purely used for assignment, whereas = is actually also used when we call functions, and so it can get a bit confusing if you use them interchangeably.

As a side note, you’ll see that the value of the variable isn’t printed when we assign it. If we want to see the value, we just need to type the variable’s name.

3.2.2 Naming

Naming objects and variables in R mostly comes down to preference. There are some hard and fast rules that need to be followed which we’ll discuss and also a few common naming conventions but which one you use is up to you.

3.2.2.1 Valid names

R is pretty lenient when it comes to names but there are some red lines:

  • Names must start with a letter or a dot (and if a name starts with a dot, the second character can’t be a digit)
  • Names can only contain letters, numbers, underscores, and dots

Similar to this, there are some reserved words that can’t be used as object names:

  • break
  • if
  • else
  • FALSE
  • TRUE
  • for
  • function
  • Inf
  • NaN
  • NA
  • next
  • repeat
  • NULL
  • while

As a sidenote, names are case sensitive. That means that you can have two objects called test and Test that can be referred to separately. Generally, this isn’t the best idea.
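
To make the case sensitivity point concrete, here's a minimal sketch using the two names mentioned above:

```r
test <- 1
Test <- 2
test == Test # two separate objects holding different values
## [1] FALSE
```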

3.2.2.2 Naming conventions

3.2.2.2.1 Nouns and verbs

Roughly speaking, it’s advisable to name your variables as nouns and your functions (which we get to later) as verbs. This is because variables can be considered things whereas functions do things.

For example, an appropriate name for the energy-based dataset you’re working on might be energy_dataset. This is descriptive and unique. Good examples of function names are sum() and mean(); what they do is easily discerned from their names.

3.2.2.2.2 Multiple words

Sometimes, you’ll want to use names that have more than one word, like our energy_dataset example. If you’re convinced that the best way to do this is to include an actual space, you can create objects with spaces in their names by surrounding the name in backticks `:

`dont call your variable this` <- 1

Please don’t ever do this. It will just make things 100% more complicated down the line. Instead, I highly recommend that you use camel case (EnergyDataset), _s (energy_dataset) or .s (energy.dataset).

Personally, I use _s because camel case is more difficult to read at a glance and there is a group of R functions that use . in their name and you don’t want to get confused with those, but it’s really just a preference. I would only say that it’s better to be consistent than to choose the right convention.

3.2.3 Reassigning variables

Variables are very flexible. You can overwrite a previously defined variable just by assigning a new value to the same name:

variable_1 <- 100
variable_1 <- "I'm not 100 anymore"
variable_1
## [1] "I'm not 100 anymore"

R will also give the variable an appropriate type based on the value you assign. So for example, if you assign 20 to a variable, then that variable will be stored as a number. If you assign something in quotation marks like "hello", then R will store it as text.
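
If you want to see this for yourself, the typeof() function reports how a value is stored. This is just a small sketch and the variable name is made up for illustration:

```r
my_value <- 20
typeof(my_value)
## [1] "double"
my_value <- "hello"
typeof(my_value)
## [1] "character"
```

Don't worry about what "double" means just yet, that's covered in the data types section.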

Let’s look in a bit more detail at the different data types…

3.2.4 Questions

  1. Is .2nd a valid name? Why/why not?
  2. Why are names like if, function, and TRUE not allowed?
  3. Why might it be a bad idea to assign a value to a name like mean or sum?

3.3 Data types

Data can be stored in lots of different forms. For example, "TRUE" and TRUE are stored as two different types, even though they look very similar to us.

The main different data types are:

  • logical

    • TRUE
    • FALSE
  • double (numeric)

    • 12.5
    • 19
    • 99999
  • integer (numeric)

    • 2L
    • 34L
  • character

    • "hello"
    • "my name is"
  • factors

  • dates

    • 2019-06-01
  • datetime (POSIXct)

    • 2019-06-01 12:00:00

Let’s have a look at each one in detail:

3.3.1 Logical

A logical variable can only have two real values, TRUE or FALSE. I say two real values, because you can also have things like NA, but that’s true of any data type. We look at NA values a little bit later on.

Logical variables are used a lot in response questionnaires, where the answer to the question is either “Yes” or “No” (TRUE or FALSE). I would recommend converting any character strings like “Yes” or “No” or “TRUE” or “FALSE” to a logical variable rather than leaving them as characters, because it’ll make your analysis less verbose (use fewer lines of code), even if it doesn’t change the underlying logic.

To test whether something is stored as logical, we use the is.logical() function:

is.logical(TRUE)
## [1] TRUE
is.logical("TRUE")
## [1] FALSE

To convert a value to logical, use the as.logical() function:

as.logical(1)
## [1] TRUE
as.logical(0)
## [1] FALSE
as.logical("TRUE")
## [1] TRUE
as.logical("FALSE")
## [1] FALSE

Be careful though, just because a conversion seems obvious to you, doesn’t mean you’ll get the expected result! For instance, what do you think as.logical(2) should return? See for yourself.

3.3.2 Double

The best way to think of a double value is as a number. It can be a whole number (but see Integers) or a decimal. R will often take care of any implicit number conversion that needs to be done under the hood, so the only thing you really need to keep in mind is that when you assign a number, be it a whole number or decimal, it will be stored as double by default.

As an aside, it’s called double because it’s stored using double precision.

To check whether a value is stored as double (or more generally numeric), use the is.double() and is.numeric() functions:

is.double(2)
## [1] TRUE
is.numeric("not numeric")
## [1] FALSE
is.double(2L) # see the next section for why this returns FALSE
## [1] FALSE

To convert a value to a double, use the as.double() or as.numeric() functions:

as.double("5")
## [1] 5
as.numeric("10")
## [1] 10
as.double("im going to cause a warning")
## Warning: NAs introduced by coercion
## [1] NA

3.3.3 Integer

Whilst also storing numeric data (like double), integers are specific to whole numbers. By default, even when you assign a whole number, like this: number <- 1, R will store that value as double rather than as an integer. To store something explicitly as an integer, suffix the value with an L, like this: number <- 1L. Attempting to store something that isn’t a whole number as an integer will result in a warning, and R will store the value as a double instead:

1.5L
## [1] 1.5

For the most part, I let R take care of how it stores numbers, unless I explicitly need it to be of a certain type. That is pretty rare though.

To check if something is an integer, use the is.integer() function:

is.integer(2)
## [1] FALSE
is.integer(2L)
## [1] TRUE

To convert to an integer, use the L suffix or the as.integer() function:

1L
## [1] 1
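
The L suffix only works when you're typing the value in yourself. For values that already exist, as.integer() does the conversion. A quick sketch of how it behaves:

```r
as.integer("7")
## [1] 7
as.integer(3.9) # the decimal part is dropped, not rounded
## [1] 3
is.integer(as.integer(2))
## [1] TRUE
```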

3.3.4 Character

Sometimes called characters, or character strings, or just strings, characters store text. If you assign a value within quotation marks, regardless of what’s inside the quotation marks, it will be stored as character. For example, "5" stores a character string with the text “5”, not the number 5. This is particularly important when you want to start combining variables. For example, "5" + 5 doesn’t work, because you’re trying to add text to a number, which doesn’t make sense.

To check whether something is stored as a character, use the is.character() function:

is.character("hello")
## [1] TRUE
is.character(5)
## [1] FALSE
is.character(TRUE)
## [1] FALSE

To convert something to a character, use the as.character() function:

as.character(5)
## [1] "5"
as.character(TRUE)
## [1] "TRUE"
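
Going back to the "5" + 5 problem from the start of this section, one way around it is to convert the text to a number before doing the sum. A minimal sketch, with a made-up variable name:

```r
five_as_text <- "5"
as.numeric(five_as_text) + 5
## [1] 10
```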

3.3.5 Factors

Factors are a unique but useful data type in R. Essentially, factors store different levels that represent some sort of grouping. For example, say you were collecting some information on people from different countries, the column that holds which country the respondent is from could be stored as a factor, with the levels England, Spain, France, etc.

A factor level is made up of two things. A label and a number that represents that group. In my countries example, our factor would have the labels “England”, “Spain”, “France” and the values 1, 2, 3. This means that internally, a factor is essentially a collection of integers representing the level position and character strings representing the level label.

To create a factor, we just use the factor() function:

factor(c("England", "France", "Spain"))
## [1] England France  Spain  
## Levels: England France Spain
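
To see the two parts of a factor described above (the labels and the numbers behind them), you can pull them out with levels() and as.integer(). This is only a sketch; note that when you don't specify the levels yourself, R sorts them alphabetically, which is why "Spain" ends up as 3 here:

```r
countries <- factor(c("England", "Spain", "France", "Spain"))
levels(countries)     # the labels
## [1] "England" "France"  "Spain"
as.integer(countries) # the number behind each value
## [1] 1 3 2 3
```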

It’s also worth remembering that you can have levels that don’t appear in the data you have. For example, in a questionnaire, you may provide the options “None”, “Some”, “All”. But in your responses, you may see that no-one chose the “None” option. In that case, you would still create a factor with three levels, even though only two of them appear.

You can also specify whether a factor is ordered. You would use an ordered factor when the levels have meaningful order. For instance, in the above example, it would make sense that “Some” is better than “None”, and “All” is better than “Some”. To create an ordered factor, just specify ordered = TRUE in your function. By default, the levels will be sorted alphabetically, unless you specify levels (see below).

To convert something to a factor, use the factor() function if you want to specify levels and labels, or as.factor() to do it for you:

factor(c("Some", "All"), levels = c("None", "Some", "All"))
## [1] Some All 
## Levels: None Some All
factor(c("Some", "All"), levels = c("None", "Some", "All"), ordered = TRUE)
## [1] Some All 
## Levels: None < Some < All
as.factor(c("Some", "All"))
## [1] Some All 
## Levels: All Some

Notice the difference in the output of those three lines. The first allows us to specify the levels (i.e. the values that were possible). The second does the same but we also specify the ordering of the levels, and the third just converts the provided values and generates the levels based on that data.

Note: An important change in R version 4.0.0 is that R will no longer automatically convert strings (characters) to factors when you create or import data using data.frame() or read.table(). Prior to 4.0.0, it would automatically convert strings to factors unless otherwise specified.

3.3.5.1 Converting from factors

Sometimes you’ll need to convert data from a factor to something else, usually a character. This is fairly straightforward using the tools we’ve already seen:

as.character(factor(c("Some", "None", "All")))
## [1] "Some" "None" "All"
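
One thing to watch out for is converting a factor straight to a number when the labels themselves look like numbers. Because a factor stores integers under the hood, you get the level positions back rather than the labels. A small sketch with made-up data:

```r
ages <- factor(c("30", "18", "30"))
as.numeric(ages)               # the level positions, not the labels
## [1] 2 1 2
as.numeric(as.character(ages)) # go via character to get the labels
## [1] 30 18 30
```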

3.3.6 Dates

Dates in any language are tricky. Different countries store dates in different formats and different bits of software store dates in different ways (looking at you Excel). This can make storing values as dates tough.

The most common way of creating a date is to use the as.Date() function. To use this function, you just need to provide your date as a character string:

as.Date("2019/01/01")
## [1] "2019-01-01"

But Adam, how does R know which one is the month and which is the day? Good question, thank you for asking. By default, R expects your character string to be in the order “Year/Month/Day”. If you don’t provide it in that format, you’ll get a nonsense output:

as.Date("01/12/2019")
## [1] "1-12-20"

If your data is in a different format however, you can specify the format:

as.Date("01/12/2019", format = "%d/%m/%Y")
## [1] "2019-12-01"

Here, we’re telling R that the string is in the format “Day/Month/Year”. A list of the different codes that can be used in the format parameter can be found in the help page for strptime() (type ?strptime into the console), or by typing “R date codes” into Google.

Because nothing in life is simple, sometimes you’ll get some data that has the date stored as a number. This is because the source of that data has the date stored as the number of days that have passed since an origin date. Because it’s a number, our as.Date(..., format = ...) doesn’t work. Instead, we can still use the as.Date() function, but we need to specify what the origin date is that the number refers to.

For dates that come through as numbers from Excel on Windows, the origin date to use is December 30th 1899. More commonly, the date January 1st 1970 (also known as the epoch date) is used.

Anyway, to specify your origin, we use the origin parameter, like this:

as.Date(18262, origin = "1970/01/01")
## [1] "2020-01-01"

Notice the format I’ve provided the origin in. It’s the same as the default that R expects, and I would recommend copying that format wherever possible. If you’re someone who just wants to watch the world burn, then you can specify a format for your origin as well…

as.Date(18262, origin = as.Date("01/01/1970", format = "%d/%m/%Y"))
## [1] "2020-01-01"

but where’s the humanity in that?

Testing whether something is a date is not as simple as the other data types unfortunately. Instead, we just use the is() or class() functions. If the first value returned is “Date”, then you know it’s a date:

is(as.Date("2020/01/01"))
## [1] "Date"     "oldClass"
class(as.Date("2020/01/01"))
## [1] "Date"
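
If you'd rather get a plain TRUE or FALSE back, like you do with the other is.x() functions, base R's inherits() function is one option. A minimal sketch, where my_date is just an illustrative name:

```r
my_date <- as.Date("2020/01/01")
inherits(my_date, "Date")
## [1] TRUE
inherits("2020/01/01", "Date") # still just a character string
## [1] FALSE
```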

3.3.7 Datetimes (POSIXct)

If you thought dates were annoying, datetimes are like dates’ little brother who didn’t get enough attention as a child and so acts up all the time. One of the reasons for this is that datetimes aren’t actually called datetimes. They’re called POSIXct in R. So whenever you see that dreadful word, just remember “ah, Adam told me that means datetime” and you’ll be fine.

Another thing that makes datetimes tough is that in addition to dates, datetimes (as you may have guessed) also store the time. The issue with that is that time is a more relative concept - there are lots of different time zones, so how do you know which one you’re referring to? By default, R will use the time zone that your computer is set to. You can override that default by setting the TZ environment variable with the Sys.setenv() function, or you can use the tz parameter when creating your datetime as we’ll see below.

With these annoyances aside however, creating datetimes isn’t all that different to creating dates except that we use the as.POSIXct() function instead. We just provide a character string (with a format specification if necessary), or a number with an origin. One important departure from dates though, is that now our number is counted in seconds from the origin, not days, to allow us to represent the time.

as.POSIXct("2020/01/01 12:00:00")
## [1] "2020-01-01 12:00:00 UTC"
as.POSIXct("2020/01/01 12:00:00", tz = "NZ")
## [1] "2020-01-01 12:00:00 NZDT"
as.POSIXct(1577880000, origin = "1970/01/01")
## [1] "2020-01-01 12:00:00 UTC"

Similar to dates, there is no is.POSIXct() function in base R, so we use the is() and class() functions instead:

is(as.POSIXct("2020/01/01 12:00:00"))
## [1] "POSIXct"  "POSIXt"   "oldClass"
class(as.POSIXct("2020/01/01 12:00:00"))
## [1] "POSIXct" "POSIXt"
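
Because a datetime is really a count of seconds, you can peek at the number underneath and do simple arithmetic with it. This sketch sets tz = "UTC" explicitly so that the result doesn't depend on where your computer thinks it is:

```r
my_datetime <- as.POSIXct("2020/01/01 12:00:00", tz = "UTC")
inherits(my_datetime, "POSIXct")
## [1] TRUE
as.numeric(my_datetime) # seconds since 1970-01-01 00:00:00 UTC
## [1] 1577880000
my_datetime + 60        # adding a number adds that many seconds
## [1] "2020-01-01 12:01:00 UTC"
```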

3.3.8 NA and NULL

The R language has two loosely related special values, NA and NULL.

3.3.8.1 NULL

NULL indicates the absence of a value. It means that a value is missing (or has length zero). A null value has no ‘type’ because it represents an absence of something, so passing a null value to any of the is.[type]() functions will return FALSE. Instead, checking whether a value is NULL is done with the is.null() function:

is.character(NULL)
## [1] FALSE
is.null(NULL)
## [1] TRUE

is.character() returns FALSE because NULL doesn’t have a data type (it’s the absence of a value), whereas is.null() returns TRUE because it is NULL.

What happens when you compare values to NULL using our logical operators? Well, this is where the idea of NULL having a length of 0 comes in. If you type length(NULL), you’ll see it will return 0. So when you’re comparing a value against NULL like this:

"Am I NULL?" == NULL
## logical(0)

You get an output that is of length 0. And that makes sense, because NULL is missing and has length 0. So basically you’re then comparing something ("Am I NULL?") to nothing (NULL). Therefore, your output is nothing (or length 0).

Dealing with NULL can be a bit tricky, so the important thing to remember is that NULL represents something missing and has length 0, unlike NA.

3.3.8.2 NA

On the other hand, NA (not available) represents an invalid value. This most often occurs when you try to convert one datatype to another where R can’t assign an appropriate value.

NA isn’t of length 0 like NULL because it is a value, it’s just invalid. In other words, there’s something there, but it doesn’t really make sense what it is.

For example, attempting to parse a character string to a date format that doesn’t match will result in a NA value:

as.Date("10/01/20", format = "%m%Y%d")
## [1] NA

R has tried to parse a value from one type into another. The value isn’t NULL because it clearly isn’t missing, but it couldn’t be converted to a date type, so it’s NA. Unlike NULL, there are different NA values for each datatype (although they’ll all look like NA in the console). In the example above, we actually created a NA that has the Date type:

errant_date <- as.Date("10/01/20", format = "%m%Y%d")
is(errant_date)
## [1] "Date"     "oldClass"

This is because R knows what type the NA should be, but it couldn’t assign it a proper value.

The same behaviour can be observed for other data types:

as.numeric("not a number")
## Warning: NAs introduced by coercion
## [1] NA
as.logical("not a logical")
## [1] NA
is(as.numeric("not a number"))
## Warning in is(as.numeric("not a number")): NAs introduced by coercion
## [1] "numeric" "vector"
is(as.logical("not a logical"))
## [1] "logical" "vector"

To test whether something is NA, we use the is.na() function. You don’t need to worry about what type the NA is, this will test if it is an NA of any type.

is.na(NA_character_) # this will be an NA that is of type 'character'
## [1] TRUE
is.na(NA_integer_) # this will be an NA that is of type 'integer'
## [1] TRUE
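
It's tempting to test for NA with == instead, but that won't give you a TRUE or FALSE. A short sketch of the difference:

```r
NA == NA           # comparing with == just gives you another NA
## [1] NA
is.na(c(1, NA, 3)) # is.na() checks each value in turn
## [1] FALSE  TRUE FALSE
```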

3.3.8.3 Dealing with NAs

Dealing with NAs is often contextual. Attempting to perform a mathematical calculation on a vector of values that contains at least one NA will often return NA:

sum(1,2,NA)
## [1] NA
mean(c(1,NA))
## [1] NA
NA + 1
## [1] NA

In some cases, those NAs will represent real issues with the values and so removing the NAs or converting them to 0 will just mask the error without fixing it.

Alternatively, data imports can often return NA values because of differing data types or similar and so converting those values to 0 or removing them outright may be appropriate.

Ultimately, how you deal with NA values is a question that you’ll need to answer when it happens, depending on the situation. I will give you a helping hand though and say that if you want to just remove the NA values when summing or calculating an average or similar, then these functions often have an na.rm parameter that can be used to remove the NA values from the supplied list of values:

sum(1,2,NA, na.rm = TRUE)
## [1] 3
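
The same parameter works for several other summary functions. A quick sketch with made-up numbers:

```r
mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2
max(c(1, NA, 3), na.rm = TRUE)
## [1] 3
```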

3.3.9 NaN and Infinity

A special case is NaN (not a number). NaNs are distinct from NA in that they represent a valid value. More precisely, NaN represents the result of a calculation that isn’t a real number (a numeric value that cannot be represented with numbers). For example, dividing 0 by 0 returns NaN.

Inf and -Inf are similar constructs. They represent infinity and minus infinity respectively. They are valid values but they are not representable with numbers, so they have their own reserved words.

NaN, Inf and -Inf are all of the numeric type and do not have equivalent values in other data types.
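
You can produce all three with some simple division, and test for them with is.nan() and is.finite(). A minimal sketch:

```r
0 / 0
## [1] NaN
1 / 0
## [1] Inf
-1 / 0
## [1] -Inf
is.nan(0 / 0)
## [1] TRUE
is.finite(1 / 0)
## [1] FALSE
```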

3.3.10 Implicit Conversion

Converting between types using the as.x() functions is nice and easy because the type that you’re converting to is explicitly defined. However, there is also implicit conversion in R. That is where R basically avoids an error by converting a variable into a different data type so that the function can run correctly. Let’s look at a proper example.

paste0("I'm a character", 100)
## [1] "I'm a character100"

The paste0() function provides a pretty good example of implicit conversion in the wild. paste0() expects character strings when it’s called, so what happens when you provide it with a number like we did above? Why doesn’t it complain? Well, it’s because R automatically converts it for us. It takes that number, converts it to a character and then pastes things together. This way, you can be a bit more fast and loose with your data types because R will do the type conversion for you. Maybe the best way to describe it is that it’s a bit like a programming autocorrect.

3.3.10.1 Dangers of implicit conversion

As you can imagine, this automatic conversion between types when the user didn’t specifically ask for it can cause problems. For example, when I teach R courses I often ask the attendees what they think R is going to return when I type "TRUE" == TRUE. Without fail, the majority will say that it’s going to be FALSE; one is a character string and the other is a logical value.

My heart always dies a little bit then when I show them that it actually returns TRUE. “But why?”, they rightly ask. And it’s all because of implicit conversion. When we use the == operator, R will try to coerce the values so that they’re the same type. In this case, it coerces the logical TRUE to the character string "TRUE", finds that the two strings match, and so returns TRUE.

There are, of course, going to be cases where this makes things easier. For instance, say you’ve got some data where there are two columns of logical values but one has incorrectly been imported as a character column. This implicit conversion will save you from having to change the character column. But implicit conversion will inevitably cause you problems in the future.

So my advice is simply to be aware of it. Don’t fight it, don’t worry about it unduly, just keep it in the back of your mind for that one time where you can’t seem to figure out why you can divide things by TRUE.
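
In case you're curious what dividing by TRUE actually looks like, here's a short sketch. When a logical value is used in arithmetic, TRUE is treated as 1 and FALSE as 0:

```r
10 / TRUE
## [1] 10
TRUE + TRUE
## [1] 2
sum(c(TRUE, FALSE, TRUE)) # handy for counting how many values are TRUE
## [1] 2
```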

3.3.11 Questions

  1. Why are 2 and 2L different?
  2. What is an ordered factor and how is it different to a character string?
  3. Why does as.Date("19/01/2019", format = "%d/%m/%y") return the date 19th Jan 2020 and not 19th Jan 2019?
  4. Why does as.numeric(TRUE) return 1? What will as.logical(2) return?
  5. Look at the identical() function. How is that different from ==?

3.4 Data structures

It’s rare that you’re ever going to be working on a single value in R. Instead, you’re going to want to work on collections of values, like a dataset or a list or something similar. So we need to know what data structures are available in R. Here’s a list of the structures we’ll look at in more detail:

  • vectors
  • lists
  • matrices
  • data frames

3.4.1 Vectors

Vectors are simple arrays of data in a single dimension. You can think of vectors as being like a very simple list. For instance, you can store the numbers 1 to 10 in a vector. Or each word in a string of text could be stored in a vector.

One of the main things to remember about vectors however, is that they are atomic. That’s basically a fancy word to mean that each value in a vector must be a single unit. For instance, the number 1 is a single unit. But a vector containing all the numbers between 1 and 10 is not. This is in direct contrast to lists, which are recursive, and we’ll look at those next.

To create a vector, we use the c() function, which is short for concatenate. In other words, we’re pulling together lots of different values and concatenating them into one structure.

c(1,2,3,4)
## [1] 1 2 3 4

Technically speaking, even single values are stored as a vector in R, they just have length one. That’s why if you type is(1), the second thing that pops up after “numeric” is “vector”. R is telling us that 1 is a number and that it’s also a vector.

3.4.1.1 Coercion

All the values in a vector must be of the same type (e.g. character, numeric, etc.). If you try and create a vector with different data types in it, you’ll see that all the values will be coerced to the same type. This is because the type of a vector is stored at the structure level (i.e., what type is the vector?), not at the individual level (i.e. what type is the value in the vector?). Let’s look at an example:

c(1, "hello")
## [1] "1"     "hello"

You can see that both values get coerced to character strings. If we try:

c(1L, 1.5)
## [1] 1.0 1.5

We see that our integer (1L) becomes a double.

This is related to the concept of implicit conversion in R. Roughly speaking, all of the values in your vector will get coerced to the most complex type.
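
As a rough illustration of what "most complex" means here, you can mix types in pairs and ask typeof() what the vector ended up as. Logical gives way to integer, integer to double, and double to character:

```r
typeof(c(TRUE, 2L))     # logical and integer
## [1] "integer"
typeof(c(2L, 1.5))      # integer and double
## [1] "double"
typeof(c(1.5, "hello")) # double and character
## [1] "character"
```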

3.4.1.2 Naming

You can name the values in a vector. To give a value a name, you can simply provide one with a = sign when you create your vector:

c(this_is_the_first_value = 1, this_is_the_second_value = 2)
##  this_is_the_first_value this_is_the_second_value 
##                        1                        2
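
Names come in handy because you can then use them to pull values back out. This is a small preview sketch with made-up names and values:

```r
scores <- c(maths = 80, english = 65)
names(scores)
## [1] "maths"   "english"
scores["maths"]
## maths 
##    80
```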

3.4.2 Lists

Lists are similar to vectors in that they store values one after another. However, there are two main differences:

  • Lists can contain other data structures, including other lists - they are recursive.

Recursion is the action of doing something again and again. We call lists recursive, because we could have a list, that contains a list, that contains a list, and so on and so forth like Russian Dolls.

  • Lists do not have to be made up of values of the same type.

So for instance, whilst a vector must always be the same, like c(1,2,3), we could have a list that looks like this:

list(1, "hello", TRUE)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] TRUE

As you just saw, we create lists using the list() function, and providing names is done the same way as it is for vectors:

list(
  first_value = 1,
  second_value = "hello"
)
## $first_value
## [1] 1
## 
## $second_value
## [1] "hello"
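
Once the values in a list have names, you can get at them with $ or with double square brackets. A minimal sketch, reusing the names from the example above (my_list is just an illustrative variable name):

```r
my_list <- list(first_value = 1, second_value = "hello")
my_list$second_value
## [1] "hello"
my_list[["first_value"]]
## [1] 1
length(my_list)
## [1] 2
```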

3.4.3 Lists vs Vectors

Given that lists and vectors are intrinsically linked, it’s very natural to wonder when to use one over the other. Well, the basic answer is to use whichever one has the requirements you need. If all of your values are of the same type and are atomic (numeric, integer, logical, etc.), use a vector. If they aren’t all the same, or you need to have a list of data structures like vectors and lists rather than just single values, then use a list.

I appreciate that this answer isn’t particularly satisfactory, so let me give a real life example of when I’ve used each.

Vector

I was recently producing a simulation that I needed to run multiple times with a different value each time. The value itself was a single number ranging between 1 and 30. So I used a vector like so:

my_vector <- c(1:30)
# this is just shorthand for saying "all of the numbers from 1 to 30"

So when I ran my simulation, I had all the values I wanted to run it for in a single structure.

List

When doing data modelling, it can sometimes be helpful to create and evaluate multiple models. One way of doing that is to create multiple models and assign them to different variables:

model1 <- model(...)
model2 <- model(...)
model3 <- model(...)

The problem with this, however, is that if I then want to compare the models, I’ll have to write out modelX each time. If I have 50 models or similar, it may take a while. So instead, I often store all my models in a list. The values are complex (i.e. a model isn’t just a numeric or character value) so they can’t be stored in a vector, but they can be stored in a list. This means that I keep all my models together, and if I then decide that I want to add more models to my list, this is significantly easier than typing out more modelX <- model(...) lines and assigning each one as a new variable.
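As a rough sketch of what that can look like, the example below uses lm() and the built-in mtcars dataset as stand-ins for the model(...) placeholder. The formulas and the names given to the list entries are assumptions made purely for illustration.

```r
# lm() and mtcars stand in for the model(...) placeholder used above
models <- list(
  weight_only      = lm(mpg ~ wt, data = mtcars),
  weight_and_power = lm(mpg ~ wt + hp, data = mtcars)
)

# Adding another model is one more list entry, not another variable
models$all_three <- lm(mpg ~ wt + hp + qsec, data = mtcars)

length(models)
## [1] 3
names(models)
## [1] "weight_only"      "weight_and_power" "all_three"
```

Because everything lives in one structure, the same action can later be applied to every model in turn rather than to model1, model2 and model3 separately.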

Before moving onto the other data structures, I just want to quickly mention that in my learning experience, understanding vectors and lists is one of the most important parts of getting to grips with R. R for many is about automating analysis and reducing the amount of time taken to do something. And vectors and lists are at the heart of this. Later on, we’ll look at functions and for loops, which we can use to perform the same action or calculation on all the values in a list or vector. Together, these will be your strongest R tools.

3.4.4 Matrices

Unlike vectors, matrices are two-dimensional. In fact, matrices resemble a watered-down version of a spreadsheet or table.

I say watered down, because matrices can only contain values of the same type. In fact, a matrix is really just a vector in 2D. For example, if I had a vector of the numbers 1 to 10, I could easily convert it into a matrix just by setting what I wanted the dimensions to be with the dim() function like so:

im_gunna_be_a_matrix <- 1:10
dim(im_gunna_be_a_matrix) <- c(2,5)
im_gunna_be_a_matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
is(im_gunna_be_a_matrix)
## [1] "matrix"    "array"     "structure" "vector"

Matrices aren’t really designed for storing complex datasets. Instead, matrices are an efficient way of storing and performing matrix mathematics on sets of numbers.

Creating a matrix is easy. You can give a vector dimensions like we did above, or you can use the matrix() function. We provide the values we want to put into the matrix, and how many rows and columns they should be split into:

matrix(c(1:4), nrow = 2, ncol = 2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

By default, the matrix is filled by column first (i.e. it starts at column 1 and fills that column, then moves onto the next one). To change this, use byrow = TRUE.
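For comparison, here is the same call as above with byrow = TRUE added, so the values run across each row before moving down to the next one:

```r
# Same values as before, but filled row by row instead of column by column
matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
```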

3.4.5 Dataframes

Dataframes are the more typical dataset storage medium. They can have columns of different types (although all of the values within a column need to be of the same type), and they resemble an Excel spreadsheet or table more than matrices do.

To create a dataframe, we use the data.frame() function. To this function, we provide our values as columns:

data.frame(col_1 = c(1,2,3),
           col_2 = c("hello", "world", "howsitgoing"))
##   col_1       col_2
## 1     1       hello
## 2     2       world
## 3     3 howsitgoing

More specifically, R stores a dataframe as essentially a list of columns, with each entry in the list being a vector of the same length that represents a different column. It’s a tiny bit like the relationship between vectors and matrices; matrices are built on vectors and dataframes are built on lists. To demonstrate this, when we type…

is(data.frame())
## [1] "data.frame" "list"       "oldClass"   "vector"

The second value in the returned vector is “list”.
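You can poke at this a little further with a few base R functions. This is only a minimal sketch, and df is just an illustrative name for the dataframe created earlier.

```r
df <- data.frame(col_1 = c(1, 2, 3),
                 col_2 = c("hello", "world", "howsitgoing"))

is.list(df)   # the dataframe itself is a list...
## [1] TRUE
length(df)    # ...with one entry per column
## [1] 2
names(df)     # and the entry names are the column names
## [1] "col_1" "col_2"
```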

So at its heart, a dataframe is a list, and each column within a dataframe is an entry in that list. A column is usually an ordinary vector, but it can also be a list itself. Why is that useful to know? Well, for one, this should make things make a bit more sense when we move onto subsetting. Secondly, when you start to move onto more complicated analysis, you can utilise the features of a list to create datasets that wouldn’t be possible in something like Excel. For instance, we know that we can store models in a list. Well, let’s say we had a dataset that had data for lots of different countries and we wanted to create a separate model for each country. We could have a dataset that had the country in one column and then the model in another:

data.frame(
  country = c("England", "Spain", "France"),
  model =  I(list(model(...), model(...), model(...)))
  # The I() just tells R to leave it as a list
)

For now though, don’t worry too much about the internals. Just remember that dataframes are the most flexible dataset storage medium and they’ll be what you do most of your analysis with. And if you can remember that a dataframe is technically a list of columns, then you’re ahead of the game.

3.4.6 Questions

  1. If I want to store a set of integers, what data structure should I use and why?
  2. Reading Excel and .csv files into R will convert them into data.frames. Why do you think this is?
  3. What does is(matrix()) return? What does this tell us about the underlying difference between matrices and dataframes?
    • Hint: This explains why matrices need to have columns of the same type

3.5 Subsetting

There will be occasions where you don’t want all the values in a vector/list/matrix/dataframe. Instead, you’ll only want a subset. The way to do that is slightly different depending on the data structure you’re using.

Note: In some programming languages, an index starts from 0. This means that if you have a list or array or similar, the first value is at position 0, then 1, then 2, etc. In R, the first value is at position 1. In other words, we index from 1 in R.

3.5.1 Vectors

Vectors are simple. Just use square brackets ([] or [[]]) after your vector and provide the index or indices of the values that you want:

c(10,20,30,40)[1]
## [1] 10
c(10,20,30,40)[c(1,4)]
## [1] 10 40
c(10,20,30,40)[1:3]
## [1] 10 20 30
c(10,20,30,40)[[1]]
## [1] 10

P.S. If you have a vector of named values, you can also use the names instead of the indices. Like c(value_1 = 1)[["value_1"]].

But Adam, I hear you ask, c(10,20,30,40)[1] and c(10,20,30,40)[[1]] just gave us the same thing, so are they interchangeable?

Well, they kind of returned the same thing, but they didn’t. So no, they’re not interchangeable.

Essentially, [] returns the container at the provided index, whereas [[]] returns the value at the provided index. Another way of thinking about it is that [] is a subsetting operator (you’re taking a subset of the original set) whereas [[]] is an extraction operator (you’re extracting a value out of the vector). As a real-world analogy, say you have ten salt shakers lined up in a row with salt in each one. Using the [] operator will give you the salt shaker (and the salt) you asked for. Using [[]] will give you just the salt inside the shaker, but not the shaker.

Let’s see a practical example in R of the difference:

c(value_1 = 10,
  value_2 = 20)[1]
## value_1 
##      10
c(value_1 = 10,
  value_2 = 20)[[1]]
## [1] 10

In the first call, we get the name of the value and the value itself. In other words, rather than just returning the value at that index, we’ve essentially chopped up the vector so that it only contains what was at the first position. Conversely, in the second call, we’ve just been given the value. What we’ve done here is extracted the value out from that position.

As a result of this difference, [] can be used with more than one index (e.g. [1:5] or [c(1,3)]) whereas [[]] can only be used with a single index.

It’s a very subtle difference, but it is an important one. Make sure that if you want the value, use [[]], and if you want the whole part of the vector, use [].
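To see the single-index rule in action, here is a small sketch. The vector x and its names are made up for the example, and tryCatch() is used only so that the error is caught and the code keeps running; you don’t need to understand it yet.

```r
x <- c(value_1 = 10, value_2 = 20, value_3 = 30)

x[c(1, 3)]   # [] accepts several indices and keeps the names
## value_1 value_3 
##      10      30

# [[]] accepts exactly one index on a vector; asking for two is an error.
# tryCatch() swaps the error for the string "error" so we can look at it.
result <- tryCatch(x[[c(1, 3)]], error = function(e) "error")
result
## [1] "error"
```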

3.5.2 Lists

Lists can be subsetted in the same way as vectors - [] returns the container at the index provided and [[]] returns just the value:

list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)[[1]]
## [1] 1 2 3
list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)[1]
## $value_1
## [1] 1 2 3

A key difference with lists however, is that you can also subset based on the name of the value in the list using the $ operator:

list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)$value_1
## [1] 1 2 3

This is equivalent to:

list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)[["value_1"]]
## [1] 1 2 3

Another key difference is that lists can, of course, hold recursive values. This means that subsetting a list can return another list, that can also be subsetted and so on:

list(
  list_1 = list(
    list_2 = list(
      list_3 = "hello"
    )
  )
)[1][1][1]
## $list_1
## $list_1$list_2
## $list_1$list_2$list_3
## [1] "hello"

And of course, you can do the same thing with the [[]] operator if you only want the value and not the container.
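It’s worth seeing the two operators side by side on a nested list, because they behave quite differently when chained. In this sketch, nested is just an illustrative name for the same list as above.

```r
nested <- list(list_1 = list(list_2 = list(list_3 = "hello")))

# [1] hands back a one-element list each time, so chaining it
# never moves any deeper into the nesting
identical(nested[1][1][1], nested[1])
## [1] TRUE

# [[1]] steps inside the container, so chaining it drills down
nested[[1]][[1]][[1]]
## [1] "hello"

# The same thing using names and the $ operator
nested$list_1$list_2$list_3
## [1] "hello"
```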

3.5.3 Matrices

Matrices are two-dimensional, meaning a single index isn’t enough to pick out a row and a column. Instead, we still use the [] operator, but we provide two values: one for the row and another for the column:

matrix(c(1:10), nrow = 5, ncol = 2)
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
matrix(c(1:10), nrow = 5, ncol = 2)[4,1]
## [1] 4
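You aren’t limited to one cell at a time. As a short sketch (m is an illustrative name for the same matrix), leaving the row or the column blank returns the whole of the other dimension, and either position can take a vector of indices:

```r
m <- matrix(1:10, nrow = 5, ncol = 2)

m[4, ]      # column left blank: the whole of row 4
## [1] 4 9
m[, 2]      # row left blank: the whole of column 2
## [1]  6  7  8  9 10
m[2:3, 1]   # rows 2 to 3 of column 1
## [1] 2 3
```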

3.5.4 Dataframes

Dataframes can be subsetted in the same way as matrices (using the [] operator). However, dataframes can also be subsetted (like lists), using the $ operator and the name of the column:

data.frame(
  col_1 = c(1,2,3),
  col_2 = c("hello", "there", "everybody")
)$col_1
## [1] 1 2 3

Why does this approach work for dataframes? Well, as I alluded to before, a dataframe is stored as a kind of list, with each column being another entry in that list. So, just like we can subset lists using $, we can subset dataframes with it as well because a dataframe is like a fancy list.
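Here is a brief sketch of the matrix-style and list-style approaches on the same dataframe. The name df is illustrative, and the output assumes R 4.0 or later, where text columns stay as character strings by default.

```r
df <- data.frame(
  col_1 = c(1, 2, 3),
  col_2 = c("hello", "there", "everybody")
)

df[2, 1]         # matrix-style: row 2, column 1
## [1] 2
df[["col_2"]]    # list-style: extract a column by name
## [1] "hello"     "there"     "everybody"
df[2, "col_2"]   # a mix: row by index, column by name
## [1] "there"
```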

3.5.5 Subsetting by criteria

Sometimes, you might not know the indices of the items you want to extract from a data structure. Instead, you might want to do something like “extract all numbers from a vector that are less than three”. To do this, we essentially find the indices of the values that match our criteria and then subset the data structure like we learned previously.

Let’s look at subsetting a vector as an example:

vector1 <- c(10,15,14,20,21,50)

Let’s say we want to extract all of the values below 20. To find the indices of the values that match our criteria, we just use our logical operators:

vector1 < 20
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

This returns TRUE if the value is less than 20, and FALSE if it isn’t. We can then pass this vector of TRUEs and FALSEs in [] after the vector to only return the values we want:

vector1[vector1 < 20]
## [1] 10 15 14

Other data structures can also be subsetted in the same way, but for matrices or dataframes, it’s easier to use something like subset or dplyr::filter() (although subset has its own limitations).

You’ll notice that this is ever so slightly different to the way we were subsetting before. Previously, we were providing just the indices of the values we wanted (e.g. 1,2 and 4). But here, we’re actually providing a vector of TRUE and FALSE values to indicate which values we want. The structure is slightly different, but the logic is the same.

This does mean however, that you can also provide a vector of TRUEs and FALSEs yourself manually if you wish. There are two reasons why I would avoid this however:

  1. It takes longer to write out
  2. If you don’t provide the same number of logical values (i.e. TRUEs and FALSEs) as there are values in the vector, then the logical values are recycled. That means that if you have a vector that’s 6 values long, and you provide a logical vector to subset it that is only three values long, then your logical vector is going to be repeated. This can lead to unwanted results:
vector1[c(TRUE, FALSE)]
## [1] 10 14 21

Here, because I’ve only specified two logical values, when it comes to subsetting time, those two values will be recycled to create a vector like this c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE). This is why we get three values returned instead of the expected one.

So while you can manually subset with a vector of logical values indicating whether each value should be returned, it’s best to stay away from it.
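To tie the two styles of subsetting together, the sketch below uses the base R function which(), which turns a logical vector into the positions that are TRUE. It then applies the same logical idea to the rows of a dataframe. The scores dataframe is made up purely for illustration.

```r
vector1 <- c(10, 15, 14, 20, 21, 50)

which(vector1 < 20)            # the positions where the test is TRUE
## [1] 1 2 3
vector1[which(vector1 < 20)]   # same result as vector1[vector1 < 20]
## [1] 10 15 14

# Hypothetical dataframe: keep only the rows where score is below 20
scores <- data.frame(name = c("a", "b", "c"), score = c(10, 25, 14))
scores[scores$score < 20, ]
##   name score
## 1    a    10
## 3    c    14
```

Note the comma inside the square brackets for the dataframe: the logical vector goes in the row position, and the column position is left blank to keep every column.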

3.5.6 Questions

  1. Why can dataframes and lists be subsetted in a similar way?
  2. What happens if you miss the last character off when subsetting a dataframe column with $ (e.g. df$co instead of df$col)? Does the same thing happen when subsetting using [[]]?

3.6 Functions

R is a functional programming language, so functions are at its heart. We’ve already used lots of functions in the previous chapters, but now we’re going to look in more detail at what a function is.

3.6.1 Function basics

John Chambers, creator of the S programming language upon which R is based and core member of the R programming language project, said this:

"To understand computations in R, two slogans are helpful:

  • Everything that exists is an object.
  • Everything that happens is a function call.

— John Chambers"

For now, we’re going to focus on that second statement. What does it mean?

Well, a function is quite simple. It has an input, it does something, and then it gives an output. A really simple example of this is just typing print(1) into the console and hitting enter. You’ve given an input, there was a calculation, and now there’s an output. Something’s happened (1 was printed in the console) and it was done by calling a function (print).

If you’re well-versed in mathematics, you’ll know that functions in maths are the same. \(y = f(x) = 3x\) means that to get y, you take x and multiply it by three. In this case, our input is x, our bit in the middle is multiplying by three, and then our output is y.

If you haven’t used functions in mathematics then don’t worry. Even by getting this far in the book, you’ve already used functions loads of times. For example, how do you create a vector? If you remember, you use the c() function, which we know stands for “concatenate”. So, every time you’ve created a vector, you’ve used a function without even knowing it. The input was whatever you provided in the brackets. The computation was to concatenate everything together. And then the output was the vector.

Similarly, whenever you created a factor or a matrix or a dataframe or whatever, you used a function. You provided an input, there was a computation to change that input, and then you got an output.

As confusing as functions will inevitably become, just try to remember the core of what a function is: When you call a function, there’s an input, something happens, and there’s an output.

3.6.1.1 Functions in R

So more specifically, what do functions look like in R? Well, a good starting point is that when we call (use) a function, its name is almost always followed by brackets (()). This helps make it clear what values you’re providing as your inputs. For example, the c() function, the data.frame() function, the sum() function are all followed by (), which is how you provide your inputs.

I say almost all functions are followed by (), because some aren’t. A simple example of this is +. + is still a function:

is.function(`+`)
## [1] TRUE
# the backticks just mean I'm referring
# to the + function without using it

But it doesn’t have brackets. Instead, we can use a shorthand where we provide the values we want to give to the function either side of it (e.g. 1 + 2). Importantly however, the logic is exactly the same, and you can still use the + like a normal function with brackets:

`+`(1,2)
## [1] 3

It’s just that this looks a little weird to us, so we often use the shorthand way. But the long and short of it is: an easy way to tell when someone is calling (using) a function is to look for the () after the function name.

3.6.1.2 Inputs

We know that to use a function in R, we have to provide inputs*. And we also know that we provide our inputs within the brackets after the function name. But how do we know what values are allowed?

*Technically, sometimes you don’t have to provide an input to a function (e.g. Sys.Date(), which gives us the current date without putting anything in the brackets). But in the interests of clarity, just imagine that the inputs to these functions are blank rather than that they don’t have any input at all.

By typing a ? followed by the name of the function into the console (e.g. ?length()), you’ll get a help page showing you the input parameters allowed by the function. So if we use ?length() as an example, the help page tells us that the length() function expects one input parameter, x, and that needs to be an R object. Nice and simple.

In some cases, you’ll see a ... as one of the input parameters. This essentially means that you can provide an indeterminate number of values for that input. I know that sounds confusing, but the c() function is a good way of demonstrating this. When you create a vector, you can provide an (essentially) infinite number of values to the function. So the c() function basically bundles everything you provide to it into that ... parameter.

3.6.1.2.1 Explicit input parameters

If you type ?c() into the console however, you’ll see that there are also some other input parameters: recursive and use.names. Well Adam, if ... just bundles everything I provide into a single input, then how do those work? Well this outlines the importance of providing explicit input parameters. When we’re explicit, we’re saying exactly which input parameter we’re referring to with each value we provide. And to do this, we just provide the name of the input parameter when we give it. Let’s look at the substr() function as an example.

The substr() function simply returns part of a character string that you provide. So, if I was to type:

substr("Hello", 1, 3)
## [1] "Hel"

I get the first to the third characters in the string “Hello”. With this function call however, I haven’t been explicit. Instead, I’ve just provided the inputs in the order that they’re listed in the documentation:

  • x
    • a character vector
  • start
    • the first element to be extracted
  • stop
    • the last element to be extracted

To be explicit, I need to provide the name of the input parameter that I’m referring to when I provide my inputs:

substr(x = "Hello", start = 1, stop = 3)
## [1] "Hel"
substr(start = 1, stop = 3, x = "Hello")
## [1] "Hel"

Notice how, when I’m being explicit, it doesn’t matter what order I provide my inputs in, R knows which value should be mapped to which input parameter.

Also, notice how we’re using = here and not anything else like <-? This is another reason why I suggest not using = for assignment: we use = when we’re providing input parameters and so it’s good to keep them separate.

So how does this link back with the ...? Well, with the c() function, every unnamed parameter you provide is bundled into the ... parameter. To give values for the recursive and use.names parameters, you’d need to provide them explicitly (e.g. recursive = TRUE). This will be true of many functions where you see a .... If you’re not explicit with the parameters that you don’t want to be included in the ..., you’re going to have a bad time.
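The base R function paste() is a handy way to see this go wrong. It takes ... followed by named parameters such as sep (the separator placed between the values). In this small sketch, the only difference between the two calls is whether sep is named:

```r
# Unnamed: "-" is swallowed by ... and pasted on like any other value
paste("a", "b", "-")
## [1] "a b -"

# Named: "-" is matched to the sep parameter instead
paste("a", "b", sep = "-")
## [1] "a-b"
```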

3.6.1.2.2 Optional input parameters

For many functions, certain parameters have a predefined value that they will default to. This provides a level of flexibility whilst not requiring lines and lines of code for every function call; there’s a default value, but you can override it if needed.

Optional parameters are easily distinguished in the documentation of a function because they will have a value already assigned to them like this: use.names = TRUE.

For instance, when we create a vector using the c() function, there are two optional parameters (recursive and use.names) that already have the values FALSE and TRUE assigned to them, respectively. To override these defaults, we just need to provide a new value to the parameter like this:

c(1,10,15, use.names = FALSE)
## [1]  1 10 15
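The effect of use.names is easier to spot when the values actually have names. A minimal sketch with made-up names a and b:

```r
c(a = 1, b = 2)                      # default (use.names = TRUE) keeps the names
## a b 
## 1 2 
c(a = 1, b = 2, use.names = FALSE)   # overriding the default drops them
## [1] 1 2
```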

3.6.1.3 Outputs

First and foremost, in R you can have as many inputs as you like to a function. However, a function will only ever return one thing. I say one thing, because functions can return a list which itself can contain multiple values, but just keep this in mind: Functions in R have a single return value.

3.6.1.3.1 Reassigning outputs

Functions in R do not edit the inputs you provide in place. Instead, they essentially work on copies of the inputs you provide. Here’s a quick example:

x <- 1
sum(x, 1)
## [1] 2
x
## [1] 1

As you can see, when we call the sum() function with x as an input parameter, the value of x stays the same.

If you do want to edit your original value, you just need to reassign the output of the function call back to the variable. I know that sounds complicated, but it’s quite simple:

x <- 1
x <- sum(x,1)
x
## [1] 2

This works because the right-hand side of the assignment line is executed first. In other words, when the sum(x,1) is evaluated, x is still equal to one. This makes sense because otherwise it’d be very hard to keep track of what x was equal to!
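The same pattern applies to vectors. As a small sketch using the base R function sort() (values is just an illustrative name):

```r
values <- c(3, 1, 2)

sort(values)   # returns a sorted copy...
## [1] 1 2 3
values         # ...but the original is untouched
## [1] 3 1 2

values <- sort(values)   # reassign to keep the sorted version
values
## [1] 1 2 3
```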

This behaviour (not changing the input parameter value in place) is a major point of difference between functions and what are called methods in other languages. If you’re coming from something like Python, you may be used to changing objects through methods: object.AddNew or something like that. In R, functions do not change variables in the global environment because they are executed in their own environment. To learn more about environments, there is an environments chapter in the “For Teachers” section.

3.6.2 Questions

  1. Why does `<-`(test, 2) work? What does this tell us about <-?
  2. Why does mean(1,2) not return the output you’d expect but sum(1,2) does?
    • Hint: the documentation of both functions will help.
  3. Other than Sys.Date(), can you think of another example of a function that can be executed without any explicit input parameters?