4.6 Iteration

Functions are an important stepping stone in reducing the amount of code you need to do something (they make your code less verbose). Another tool in acheiving this goal is the idea of iteration. Iteration is just the process of doing something more than once.

We'll often find ourselves doing the same thing again and again in programming. Calculating the means for lots of different columns, or different datasets, or making plots for different groups are all examples of operations that you'll rarely only do once. There are two approaches to this: imperative programming and functional programming. First, we'll take a look at imperative programming (for loops) as this is more akin to lots of other programming language. Later, however, we'll look at how we can better utilise the fact that R is a functional programming to solve some iteration problems more easily and with fewer errors.

4.6.1 Imperative programming

4.6.2 For loops

For loops are almost completely ubiquitous across different programming languages. They allow us to perform actions over a list or set of objects. They are flexible and explicit even if they can be difficult to understand. A simple for loop in R follows a simple structure:

for (identifier in list) {
 do_something_with_it(identifier)
}

In the above, the identifier is used to access the value that is currently being iterated upon. Let's look at a practical example:

for (i in c(1,2,3)) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3

In this loop, we go through the vector of values 1, 2 and 3, and we print it using the i identifier that we've assigned to the value we're currently iterating upon. So the code inside the loop will run three times (once for each value in the vector we've provided). The first time, i will equal 1. The code will execute, and then i will take the value of 2 and so on.

This isn't a particular useful example. Let's look at more realistic example. Let's say you've got a list with 2 dataframes in that have the same structure:

dataframe_list <- list(
                    data.frame(
                      obs = c(1,2,3),
                      value = c(10,11,9)
                     ),
                     data.frame(
                      obs = c(1,2,3),
                      value = c(100,200,150)
                     )
                   )

And you want to calculate the mean for the value column for each one:

for (df in dataframe_list) {
  mean(df$value)
}

Or, if we refer back to our function example from the Functions chapter, we could apply our function to each dataset to get a normal distribution for each. Rather than just printing those numbers, we'll also construct a list to store the output:

output_list <- list()
for (df in seq_along(dataframe_list)) {
  output_list[[df]] <- create_norm_dist_from_column(dataframe_list[[df]], "value")
}
head(output_list[[1]], 5)
## [1] 11.201910  9.585865 10.117422  9.964396  8.369581

In this case, rather than looping through the values in the dataframe_list list, we're looping through the indices using seq_along(). seq_along() just creates a vector with all the indices in. So for a list with two values in it, this is almost equivalent to 1:2. Using the seq_along() approach lets us keep track of where we are in the dataframe_list, as we know exactly how many times the loop as run at any given time, so we can assign the output to the right position in output_list.

Note: If you can avoid expanding a list (i.e. increasing the size of a list by one everytime to add to it), then this is preferable. However, for very small lists, expanding a list rather than pre-defining its size and filling those values isn't really a big deal.

4.6.3 While loops

While loops are closely related to for loops. Instead of looping through an object or through the indices of an object, a while loop runs when a criteria is fulfilled:

x <- 0
while(x < 2) {
  x <- x + 1
  print(x)
}
## [1] 1
## [1] 2

4.6.4 Functional programming

For loops and its derivatives are very powerful, but they are arguably less important in functional languages like R than they are in other languages like C and Python.

Instead, within R we can leverage functions to wrap the loops. For example, let's look back at our dataframe_list example. We can turn our loop into a function that we can call whenever we have a list of dataframes:

norm_dists <- function(dfs, column = "value") {
  output_list <- list()
  for (df in seq_along(dfs)) {
    output_list[[df]] <- create_norm_dist_from_column(dfs[[df]], column)
  }
  output_list
}

head(
  norm_dists(dataframe_list)[[1]], 1
)
## [1] 10.57989

Now, whenever we have a list of dataframes we can just use our norm_dists() function and provide the column name to the column parameter. This is very closely related to the idea of vectorized functions, which we looked at the in Functions. All vectorised functions will essentially wrap around a loop in same way or another.

4.6.4.1 Applying functions

Because of its importance in R, applying multiple values to the same function and keeping track of the output has bred its own functions and methodology. More specifically, it's led to the development of the apply set of functions. These act as syntactic sugar to decrease the number of for loops present in your code*.

* A common argument is that apply functions are faster than for loops. Generally speaking, this is not true. Most times the function is just an implementation of a for loop at its core, meaning the apply functions aren't more efficient or faster. Instead, the benefit comes from the readability and cleanliness of the code.

We've already seen an example of the lapply (list-apply) function in action, but lets do another one. Let's say we have a list of vectors containing numbers and we want the maximum for each.

number_list <- list(
                  c(1,2,3,4),
                  c(500,250,100,10),
                  c(1000,1001,1001,1000)
                )

We can apply the max() function to each item in our list:

lapply(number_list, FUN = max)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 500
## 
## [[3]]
## [1] 1001

As you can see, this is much shorter than writing out a for loop to acheive the same goal.

The FUN parameter accepts any function (including anonymous) functions, meaning that we can apply any function we like to each item in our list.

4.6.4.2 purrr

The apply family of functions in base R are extremely powerful. However, they are not quite as user-friendly as one might have hoped, and there are some inconsistencies across the different apply functions. To solve this, the purrr provides the map() functions to provide the same functionality but in a more succinct and universal way.

Personally, I do prefer the purrr functions to the base R functions as I find they're easier to learn to use, however the base R functions do provide the same functionality. Documentation for the purr package can be found at https://purrr.tidyverse.org.

4.6.5 Questions

  1. In what situation would you use i in x over i in seq_along(x)?
  2. Why is seq_along(x) preferable to 1:length(x)?
  3. What would be the for loop code that would be required to replicate lapply(number_list, FUN = max)? Which would be easier to debug?