17 Summary

Overall, I hope that this book has helped you grasp the basics of the R programming language. If you’re a student, then I hope this was pitched at the correct level to challenge you. If you’re a teacher, then I hope this book helped round out your knowledge of R such that you would feel comfortable imparting your knowledge onto those that want to hear it.

If you feel that there are improvements to be made to the content or how it was taught, then please do let me know via a GitHub issue on the teacheR repository.

17.1 Answers

17.1.1 For Students

17.1.1.1 Operators

  1. Why do 1 != 2 and !(1 == 2) both equal true?
    • The ! operator negates the following criteria. So our first statement is 1 not equal to 2 and whereas our second is not 1 equal to 2. But both of these evaluate to TRUE.
  2. Reading the R documentation on logical operators, what is the difference between | and || (and & and &&)?
    • | and & perform the comparison in a step-wise method for vectors. So c(1,2) == 1 & c(2,3) == 2 returns TRUE first (1 == 1 & 2 == 2) and then FALSE (2 == 1 & 3 == 2), because it evaluates the criteria for the first and second values in the vectors separately. || and && operate on the first value in each vector only. So c(1,2) == 1 & c(2,3) == 2 returns only TRUE and nothing else because it only evaluates the first values in each vector against the defined criteria (i.e. 1 == 1 & 2 == 2).

17.1.1.2 Variable Assigment

  1. Is .2nd a valid name? Why/why not?
    • It begins with a . which is valid, but the first character after the . is a digit which isn’t allowed.
  2. Why are names like if, function, and TRUE not allowed?
    • These are reserved words in R. There are reserved because they are used for specific functions, and so creating variables with these names could very easily confuse things!
  3. Why might it be a bad idea to assign a value to a name like mean or sum?
    • mean() and sum() are functions in base R. Therefore, giving variables the same names can make reading a script very difficult, and might lead you to inadvertently use the incorrect value.

17.1.1.3 Data Types

  1. Why are 2 and 2L different?
    • 2 is stored as a double, whereas because we’ve appended L to 2L, we’re telling R to store it as an integer. Although, both are numeric.
  2. What is an ordered factor and how is it different to a character string?
    • An ordered factor stores a set of groups in a set order. A character string does not necessarily represent any type of grouping and has no inherent ordering.
  3. Why does as.Date("19/01/2019", format = "%d/%m/%y") return the date 19th Jan 2020 and not 19th Jan 2019?
    • The %y part of the format paramter tells R that the year value is 2 digits long. R therefore parses the date as 19/01/20 and ignores everything after it - resulting in 19th Jan 2020.
  4. Why does as.numeric(TRUE) return 1? What will as.logical(2) return?
    • TRUE and FALSE have historically been stored as 1 and 0 respectively, so as.logical(1) will also return TRUE. R treats any numeric value greater than 0 as TRUE, so as.logical(2) will return TRUE.
  5. Look at the identical() function. How is that different from ==?
    • The identical() function does not use implicit conversion whereas == does. You can test this by comparing identical(1, 1L) and 1 == 1L.

17.1.1.4 Data Structures

  1. If I want to store a set of integers, what data structure should I use and why?
    • A vector would be most appropriate. All values are of the same type, and so a list would be unnecessary (but still valid).
  2. Reading in Excel and .csv files into R will convert them into data.frames. Why do you think this is?
    • Excel and .csv files have a tabular structure (they are similar to tables), and often have columns of different types. As data frames fulfill both of these requirements, data frames are the default data structures for this type of data.
  3. What does is(matrix()) return? What does this tell us about the underlying difference between matrices and dataframes?
    • Matrices don’t store their columns as lists, but as vectors (and that’s why is(matrix()) returns vector but not list). As a vector stores atomic values that must all be of the same type, that’s why matrices must have values of the same type whilst dataframes (that rely on columns as lists) can have values of any type.

17.1.1.5 Subsetting

  1. Why can dataframes and lists be subsetted in a similar way?
    • Dataframes are essentially just lists of lists, so they can be subsetted like lists because they are lists!
  2. What happens if you miss the last character off when subsetting a dataframe column with $ (e.g. df$co instea of df$col)? Does the same thing happen when subsetting using [[]]?
    • R will find the closest match when using $, but will not do the same with [[]]. $ is therefore good for interactive use (when you’re trying lots of things quickly), but not if you’re programming, because you might accidently subset the wrong column without noticing.

17.1.1.6 Functions (students)

  1. Why does `<-`(test, 2) work? What does this tell us about <-?
    • `<-`(test, 2) works because <- is just another function. It assigns the second value to the first, and so we can use it like any other function (with brackets).
  2. Why does mean(1,2) not return the output you’d expect but sum(1,2) does?
    • mean() is expecting a vector of values provided as the x argument. So the correct use of mean would be mean(c(1,2)). sum(), however, uses ... to capture all arguments that are provided within the brackets and sum them together, so sum(1,2) is appropriate.
  3. Other than Sys.Date(), can you think of another example of a function that be executed without any explicit input parameters?
    • Sys.time() is another in the same style. c() and matrix() also do not require explicit input parameters to operate.

17.1.2 For Teachers

17.1.2.1 Functions (teachers)

  1. How are mean() and sum() different in their implementation of ...?
    • sum() uses ... to accept an indeterminate number of arguments to sum. mean() requires a vector of values to operate on, and uses ... to pass on to its subsequent methods (i.e. how exactly it calculates the mean for different data structures).
  2. If functions are objects, how would you construct a function that returns a function? What might be a use for this?
    • Example function:
    function_factory <- function(power = 2){
      ret_function <- function(x) {
       x ^ power
      }
      ret_function
    }
    
    square_function <- function_factory(power = 2)
    square_function(3)
    ## [1] 9
  3. Look at the ellipsis package. What issue does this package look to solve?
    • If you provide incorrect or misspelled arguments to a function via ..., then these will be silently ignored, making debugging difficult. For instance, if we run mean(1,2,3), which looks like perfectly valid code. We get 1. This is because mean() expects a vector of values to calculate the mean for as the first argument (x), with all other arguments being passed via .... The ellipsis package can be used to check that the parameters provided are actually used.
  4. What issues might arise when passing ... to multiple functions inside your function?
    • If an argument provided in the ellipsis matches an argument name in more than one function, then every function with the matching argument name will use that value. For instance, if you pass the ... to two functions, both of which accept an x parameter, then the value of x provided to your function will be used for both.
  5. How could you solve these issues using the do.call() function?
    • The do.call() function allows you to call a function and provide a list of input parameters. To define a function that passes arguments to more than one function, you could have input parameters such as fun1_options and fun2_options that are lists of input parameters. Then, by using do.call(fun1, args = fun1_options) and do.call(fun2, fun2_options), you can specify arguments to each function separately without worrying about name conflicts with ....
  6. Why might one use the missing() approach instead of assigning a default value? What are the drawbacks of this?
    • missing() allows you to handle errors in more detail and lets you set more complex or conditional default values. However, using missing() in this way means that someone who uses your function might not be able to tell what the default value is going to be without seeing the body of the function!

17.1.2.2 Environments

  1. In what situations can we have two environments with the same name? Why is this?
    • As long as those two environments don’t share the same parent environment, then they can have the same name. This is because they will still both have unique identifiers by virtue of having different inherited environments.
  2. Search ‘namespacing’. How does that concept relate to environments?
    • Namespacing is the concept of naming objects to allow them to be uniquely identifiable. Environments allow for namespacing by encapsulating names within a scope. Then, even if the name isn’t unique, the pattern of inheritance and the name will be together, so the object will be uniquely identifiable.
  3. What might be an issue with create a function that uses superassignment on an object with the name x?
    • x is quite a common variable name. If someone is using your function and doesn’t realise that you’re using super assignment with an object named x, then you might overwrite their x variable.

17.1.2.3 Objects and classes

  1. Why does is(1L) return integer, then double, then numeric in that order?
    • integer inherits from the double class, which in turn inherits from numeric. Integer inherits from double because all integers are doubles but not all doubles are integers. Double inherits from numeric because all doubles are numeric but not all numerics are doubles.
  2. Now we understand inheritance, why can data frames be subsetted with a $?
    • Data frames inherit from the list class, meaning that methods that operate on lists (should) also work on data frames, like $.
  3. If we create a dataframe and then call print.default() on it, why do we get the output that we do? Why does it look more like a list than a dataframe?
    • When we print a data frame normally, print.data.frame() is called. When we call print.default(), we’re using a different method that prints the object differently (more akin to a list).
  4. Why are constructors not a fool-proof way of making sure that all our objects of our class will have the same structure? How else might one create an object of class address?
    • Because we can change the class of an object with assignment, someone could create an object that didn’t match your constructor function and then assign the class like this: class(not_an_address) <- "address"
  5. methods("mean") returns methods for dates and datetimes (and .default which operates on numeric objects), but no other object. Why is this? What does this mean when constructing methods for a new generic?
    • These are the only data types for which a mean function would make sense. For instance, finding the mean for a bunch of character strings wouldn’t really make much sense (although you could build one if you wanted). This means that you only really need to create methods for your generic that make sense (i.e. that are appropriate for what you want your generic to do).

17.1.2.4 Expressions

  1. Why might an expression like this fun(x) be useful? Particularly, when paired with substitute(). + By using subsitute, we can replace fun with any function we like:
quoted_fun <- substitute(fun(x), env = list(fun = sum, x = 1))
quoted_fun
## .Primitive("sum")(1)
eval(quoted_fun)
## [1] 1
  1. What’s the difference between
quote({
  x <- 1
  x + 10
})

and

list(
  quote(x <- 1),
  quote(x + 10)
)

and why do they evaluate to different things? + The first is a single expression. When we evaluate the first, both lines x <- 1 and x + 10 are evaluated at (roughly) the same time:

eval(
    quote({
        x <- 1
        x + 10
    })
)
## [1] 11

  + The second is a list of separate expressions. If we evaluate them using:

lapply(
    list(
      quote(x <- 1),
      quote(x + 10)
    ),
    eval
)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 11

Then each line is evaluated separately, resulting in two lines instead of 1.

17.1.2.5 If / Else

  1. Rather than writing if (x == 1 | x == 2 | x == 3), how could you use the %in% operator to make it shorter?
    • if (x %in% c(1,2,3))
  2. How does the switch() function relate to if / else statements?
    • switch takes an expression that returns a character string or number and then evaluates the appropriate argument depending on the value. The switch function therefore operates like a string of if else (x == x) statements.

17.1.2.6 Iteration

  1. In what situation would you use i in x over i in seq_along(x)?
    • i in x is used to access the value at each position in x, whereas i in seq_along(x) will go along the indices. So if you want to access the value (and you’re not doing something like incrementing another variable)
  2. Why is seq_along(x) preferable to 1:length(x)?
    • If you have something that has a length of 0, then 1:length(x) will return 1 0, which isn’t what you want as you’ll go from the first index (which won’t exist anyway and so will error) and then to the 0th index. seq_along(x) will return an empty integer object instead, which is preferable.
  3. What would be the for loop code that would be required to replicate lapply(number_list, FUN = max)? Which would be easier to debug?
    • Loop code:
    max_list <- list(rep(NULL, length(number_list)))
    for (i in seq_along(number_list)) {
      max_list[[i]] <- max(number_list[[i]])
    }

17.2 Next Steps

17.2.1 opeRate

To help you learn how to apply the skills and understanding you’ve acquired through this book, I’ve developed a second book, opeRate. opeRate was developed in exactly the same way as teacheR, but focuses on applied data analysis and practical learning. If you’re interested in applying your knowledge, then I would recommend giving it a read.