3.5 Subsetting

There will be occasions where you don't want all the values in a vector/list/matrix/dataframe. Instead, you'll only want a subset. The way to do that is slightly different depending on the data structure you're using.

Note: In some programming languages, an index starts from 0. This means that if you have a list or array or similar, the first value is at position 0, then 1, then 2, etc. In R, the first value is at position 1. In other words, we index from 1 in R.
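As a quick sketch of what indexing from 1 looks like in practice (the vector name letters_vec is made up for this illustration):

```r
# A small example vector (hypothetical name, just for this illustration)
letters_vec <- c("a", "b", "c")

first_value <- letters_vec[1]  # position 1 holds the first value
zero_value  <- letters_vec[0]  # position 0 holds nothing in R

first_value         # "a"
length(zero_value)  # 0, i.e. an empty vector rather than an error
```

Note that asking for position 0 doesn't throw an error. It quietly gives you an empty vector, which is worth knowing if you're coming from a language that indexes from 0.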

3.5.1 Vectors

Vectors are simple. Just use square brackets ([] or [[]]) after your vector and provide the index or indices of the values that you want:

c(10,20,30,40)[1]
## [1] 10
c(10,20,30,40)[c(1,4)]
## [1] 10 40
c(10,20,30,40)[1:3]
## [1] 10 20 30
c(10,20,30,40)[[1]]
## [1] 10

P.S. If you have a vector of named values, you can also use the names instead of the indices, like c(value_1 = 1)[["value_1"]].

But Adam, I hear you ask, c(10,20,30,40)[1] and c(10,20,30,40)[[1]] just gave us the same thing, so are they interchangeable?

Well, they kind of returned the same thing, but they didn't. So no, they're not interchangeable.

Essentially, [] returns the container at the provided index, whereas [[]] returns the value at the provided index. Let's see a practical example of how these are different:

c(value_1 = 10,
  value_2 = 20)[1]
## value_1 
##      10
c(value_1 = 10,
  value_2 = 20)[[1]]
## [1] 10

In the first call, we get the name of the value and the value itself. In other words, rather than just returning the value at that index, we've essentially chopped up the vector so that it only contains whatever is at the first position. Conversely, in the second call, we've just been given the value. What we've done here is extract the value from that position.

As a result of this difference, [] can be used with more than one index (e.g. [1:5] or [c(1,3)]) whereas [[]] can only be used with a single index.

It's a very subtle difference, but it is an important one. Make sure that if you want the value, use [[]], and if you want the whole part of the vector, use [].
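To make the difference concrete, here is a minimal sketch using a named vector (scores and its names are made up for this illustration). The tryCatch() is only there so that the script keeps running when [[]] is given more than one index:

```r
# Example named vector (hypothetical names, just for this illustration)
scores <- c(value_1 = 10, value_2 = 20, value_3 = 30)

kept_names <- scores["value_2"]    # [] keeps the container, so the name stays
just_value <- scores[["value_2"]]  # [[]] extracts the bare value, name dropped
several    <- scores[c(1, 3)]      # [] is happy with more than one index

# [[]] with more than one index is an error on a vector like this,
# so we catch the error here instead of letting it stop the script
double_many <- tryCatch(scores[[c(1, 3)]], error = function(e) "error")

names(kept_names)  # "value_2"
names(just_value)  # NULL
length(several)    # 2
double_many        # "error"
```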

3.5.2 Lists

Lists can be subsetted in the same way as vectors - [] returns the container at the index provided and [[]] returns the value:

list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)[[1]]
## [1] 1 2 3
list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)[1]
## $value_1
## [1] 1 2 3

A key difference with lists, however, is that you can also subset based on the name of the value in the list using the $ operator:

list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)$value_1
## [1] 1 2 3

This is equivalent to:

list(
  value_1 = c(1,2,3),
  value_2 = c("hello", "there", "everyone")
)[["value_1"]]
## [1] 1 2 3

Another key difference is that lists can, of course, hold recursive values. This means that subsetting a list can return another list, which can also be subsetted, and so on:

list(
  list_1 = list(
    list_2 = list(
      list_3 = "hello"
    )
  )
)[1]
## $list_1
## $list_1$list_2
## $list_1$list_2$list_3
## [1] "hello"

And of course, you can do the same thing with the [[]] operator if you only want the value and not the container.
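Here's a small sketch of drilling down through a nested list one level at a time (the object name nested is made up for this illustration). Chaining [[]] or $ walks down to the value, whereas [] keeps handing you back a list:

```r
# A nested list like the one above (hypothetical object name)
nested <- list(
  list_1 = list(
    list_2 = list(
      list_3 = "hello"
    )
  )
)

# Each [[]] steps one level further into the list
inner_value <- nested[["list_1"]][["list_2"]][["list_3"]]

# $ can be chained in the same way
same_value <- nested$list_1$list_2$list_3

# [] returns the container, so we still have a list here
still_list <- nested["list_1"]

inner_value          # "hello"
is.list(still_list)  # TRUE
```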

3.5.3 Matrices

Matrices are two-dimensional, so picking out a single cell takes two values rather than one. We still use the [] operator, but we provide one value for the row and another for the column:

matrix(c(1:10), nrow = 5, ncol = 2)
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
matrix(c(1:10), nrow = 5, ncol = 2)[4,1]
## [1] 4
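As a rough sketch of a few other things you can do with the same matrix (the name m is just for this illustration): leaving one side of the comma empty returns the whole row or column, and a single index without a comma treats the matrix as one long vector, counting down the columns.

```r
# The same matrix as above, stored under a hypothetical name
m <- matrix(1:10, nrow = 5, ncol = 2)

row_4  <- m[4, ]    # leave the column blank to get the whole row
col_2  <- m[, 2]    # leave the row blank to get the whole column
block  <- m[1:2, ]  # several rows at once gives a smaller matrix
single <- m[7]      # no comma: counts down column 1, then column 2

row_4       # 4 9
col_2       # 6 7 8 9 10
dim(block)  # 2 2
single      # 7
```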

3.5.4 Dataframes

Dataframes can be subsetted in the same way as matrices (using the [] operator). However, dataframes can also be subsetted (like lists), using the $ operator and the name of the column:

data.frame(
  col_1 = c(1,2,3),
  col_2 = c("hello", "there", "everybody")
)$col_1
## [1] 1 2 3

Why does this approach work for dataframes? Well, as I alluded to before, dataframes store columns as lists. But technically, the dataframe itself is also stored as a kind of list, with each column being another entry in that list. So, just like we can subset lists using $, we can subset dataframes with it as well because a dataframe is like a fancy list.
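If you want to convince yourself of the "fancy list" idea, here's a minimal sketch (the name df is made up for this illustration, and stringsAsFactors = FALSE is only there so that the text column behaves the same in older versions of R):

```r
# A small dataframe (hypothetical name, just for this illustration)
df <- data.frame(
  col_1 = c(1, 2, 3),
  col_2 = c("hello", "there", "everybody"),
  stringsAsFactors = FALSE
)

is.list(df)  # TRUE: a dataframe really is a kind of list
length(df)   # 2: one entry per column

col_value     <- df[["col_1"]]   # list-style: the column's values
col_container <- df["col_1"]     # list-style: still a dataframe
one_cell      <- df[2, "col_2"]  # matrix-style: row 2 of col_2

col_value                     # 1 2 3
is.data.frame(col_container)  # TRUE
one_cell                      # "there"
```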

3.5.5 Subsetting by criteria

Sometimes, you might not know the indices of the items you want to extract from a data structure. Instead, you might want to do something like "extract all numbers from a vector that are less than three". To do this, we essentially find the indices of the values that match our criteria and then subset the data structure like we learned previously.

Let's look at subsetting a vector as an example:

vector1 <- c(10,15,14,20,21,50)

Let's say we want to extract all of the values below 20. To find the indices of the values that match our criteria, we just use our logical operators:

vector1 < 20

This returns TRUE if the value is less than 20, and FALSE if it isn't. We can then pass this vector of TRUEs and FALSEs in [] after the vector to only return the values we want:

vector1[vector1 < 20]
## [1] 10 15 14

Other data structures can also be subsetted in the same way, but for matrices or dataframes, it's easier to use something like subset() or dplyr::filter() (although subset() has its own limitations).
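As a sketch of what that looks like for a dataframe using only base R (the name df and its columns are made up for this illustration), here is the same filter written with [] and with subset(). Note the comma in the [] version: the logical vector goes in the row position.

```r
# A small dataframe (hypothetical name and columns)
df <- data.frame(
  col_1 = c(1, 2, 3),
  col_2 = c("hello", "there", "everybody"),
  stringsAsFactors = FALSE
)

# Keep the rows where col_1 is greater than 1, and all columns
rows_bracket <- df[df$col_1 > 1, ]

# subset() lets us refer to the column by name directly
rows_subset <- subset(df, col_1 > 1)

rows_bracket$col_2  # "there" "everybody"
nrow(rows_subset)   # 2
```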

You'll notice that this is ever so slightly different to the way we were subsetting before. Previously, we were providing just the indices of the values we wanted (e.g. 1,2 and 4). But here, we're actually providing a vector of TRUE and FALSE values to indicate which values we want. The structure is slightly different, but the logic is the same.
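To see that the logic really is the same, here's a small sketch that uses which() to turn the TRUEs and FALSEs into the positions they point at (the intermediate names keep and positions are made up for this illustration):

```r
vector1 <- c(10, 15, 14, 20, 21, 50)

keep      <- vector1 < 20  # TRUE TRUE TRUE FALSE FALSE FALSE
positions <- which(keep)   # 1 2 3: the indices where keep is TRUE

by_logical  <- vector1[keep]       # subsetting with TRUEs and FALSEs
by_position <- vector1[positions]  # subsetting with the indices

identical(by_logical, by_position)  # TRUE
```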

This does mean, however, that you can also provide a vector of TRUEs and FALSEs yourself manually if you wish. There are two reasons why I would avoid this:

  1. It takes longer to write out
  2. If you don't provide the same number of logical values (i.e. TRUEs and FALSEs) as there are values in the vector, then the logical values are recycled. That means that if you have a vector that's 6 values long, and you provide a logical vector to subset it that is only three values long, then your logical vector is going to be repeated. This can lead to unwanted results:
vector1[c(TRUE, FALSE)]
## [1] 10 14 21

Here, because I've only specified two logical values, when it comes to subsetting time, those two values are recycled to create a vector like this: c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE). This is why we get three values returned instead of the expected one.

So while you can manually subset with a vector of logical values indicating whether each value should be returned, it's best to stay away from it.
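If you'd like to see the recycling for yourself, here's a minimal sketch that builds the recycled vector by hand with rep_len() and checks that it gives the same result (short_mask and recycled_mask are made-up names for this illustration):

```r
vector1 <- c(10, 15, 14, 20, 21, 50)

short_mask <- c(TRUE, FALSE)

# Repeat the short logical vector until it is as long as vector1,
# which mimics what happens behind the scenes when we subset
recycled_mask <- rep_len(short_mask, length(vector1))

recycled_mask  # TRUE FALSE TRUE FALSE TRUE FALSE
identical(vector1[short_mask], vector1[recycled_mask])  # TRUE

# A simple check you could run before trusting a hand-written mask
length(short_mask) == length(vector1)  # FALSE
```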

3.5.6 Questions

  1. Why can dataframes and lists be subsetted in a similar way?
  2. What happens if you miss the last character off when subsetting a dataframe column with $ (e.g. df$co instead of df$col)? Does the same thing happen when subsetting using [[]]?