17 Summary
Overall, I hope that this book has helped you grasp the basics of the R programming language. If you’re a student, then I hope this was pitched at the correct level to challenge you. If you’re a teacher, then I hope this book helped round out your knowledge of R such that you would feel comfortable imparting your knowledge onto those that want to hear it.
If you feel that there are improvements to be made to the content or how it was taught, then please do let me know via a GitHub issue on the teacheR repository.
17.1 Answers
17.1.1 For Students
17.1.1.1 Operators
- Why do
1 != 2
and!(1 == 2)
both equal true?- The
!
operator negates the following criteria. So our first statement is1 not equal to 2
and whereas our second isnot 1 equal to 2
. But both of these evaluate to TRUE.
- The
- Reading the R documentation on logical operators, what is the difference between
|
and||
(and&
and&&
)?|
and&
perform the comparison in a step-wise method for vectors. Soc(1,2) == 1 & c(2,3) == 2
returnsTRUE
first (1 == 1 & 2 == 2
) and thenFALSE
(2 == 1 & 3 == 2
), because it evaluates the criteria for the first and second values in the vectors separately.||
and&&
operate on the first value in each vector only. Soc(1,2) == 1 & c(2,3) == 2
returns onlyTRUE
and nothing else because it only evaluates the first values in each vector against the defined criteria (i.e.1 == 1 & 2 == 2
).
17.1.1.2 Variable Assigment
- Is
.2nd
a valid name? Why/why not?- It begins with a
.
which is valid, but the first character after the.
is a digit which isn’t allowed.
- It begins with a
- Why are names like
if
,function
, andTRUE
not allowed?- These are reserved words in R. There are reserved because they are used for specific functions, and so creating variables with these names could very easily confuse things!
- Why might it be a bad idea to assign a value to a name like
mean
orsum
?mean()
andsum()
are functions in base R. Therefore, giving variables the same names can make reading a script very difficult, and might lead you to inadvertently use the incorrect value.
17.1.1.3 Data Types
- Why are
2
and2L
different?2
is stored as a double, whereas because we’ve appendedL
to2L
, we’re telling R to store it as an integer. Although, both are numeric.
- What is an ordered factor and how is it different to a character string?
- An ordered factor stores a set of groups in a set order. A character string does not necessarily represent any type of grouping and has no inherent ordering.
- Why does
as.Date("19/01/2019", format = "%d/%m/%y")
return the date 19th Jan 2020 and not 19th Jan 2019?- The
%y
part of theformat
paramter tells R that the year value is 2 digits long. R therefore parses the date as19/01/20
and ignores everything after it - resulting in 19th Jan 2020.
- The
- Why does
as.numeric(TRUE)
return 1? What willas.logical(2)
return?TRUE
andFALSE
have historically been stored as1
and0
respectively, soas.logical(1)
will also return TRUE. R treats any numeric value greater than 0 asTRUE
, soas.logical(2)
will return TRUE.
- Look at the
identical()
function. How is that different from==
?- The
identical()
function does not use implicit conversion whereas==
does. You can test this by comparingidentical(1, 1L)
and1 == 1L
.
- The
17.1.1.4 Data Structures
- If I want to store a set of integers, what data structure should I use and why?
- A vector would be most appropriate. All values are of the same type, and so a list would be unnecessary (but still valid).
- Reading in Excel and .csv files into R will convert them into data.frames. Why do you think this is?
- Excel and .csv files have a tabular structure (they are similar to tables), and often have columns of different types. As data frames fulfill both of these requirements, data frames are the default data structures for this type of data.
- What does
is(matrix())
return? What does this tell us about the underlying difference between matrices and dataframes?- Matrices don’t store their columns as lists, but as vectors (and that’s why
is(matrix())
returnsvector
but notlist
). As a vector stores atomic values that must all be of the same type, that’s why matrices must have values of the same type whilst dataframes (that rely on columns as lists) can have values of any type.
- Matrices don’t store their columns as lists, but as vectors (and that’s why
17.1.1.5 Subsetting
- Why can dataframes and lists be subsetted in a similar way?
- Dataframes are essentially just lists of lists, so they can be subsetted like lists because they are lists!
- What happens if you miss the last character off when subsetting a dataframe column with
$
(e.g.df$co
instea ofdf$col
)? Does the same thing happen when subsetting using[[]]
?- R will find the closest match when using
$
, but will not do the same with[[]]
.$
is therefore good for interactive use (when you’re trying lots of things quickly), but not if you’re programming, because you might accidently subset the wrong column without noticing.
- R will find the closest match when using
17.1.1.6 Functions (students)
- Why does
`<-`(test, 2)
work? What does this tell us about<-
?`<-`(test, 2)
works because<-
is just another function. It assigns the second value to the first, and so we can use it like any other function (with brackets).
- Why does
mean(1,2)
not return the output you’d expect butsum(1,2)
does?mean()
is expecting a vector of values provided as thex
argument. So the correct use ofmean
would bemean(c(1,2))
.sum()
, however, uses...
to capture all arguments that are provided within the brackets and sum them together, sosum(1,2)
is appropriate.
- Other than
Sys.Date()
, can you think of another example of a function that be executed without any explicit input parameters?Sys.time()
is another in the same style.c()
andmatrix()
also do not require explicit input parameters to operate.
17.1.2 For Teachers
17.1.2.1 Functions (teachers)
- How are
mean()
andsum()
different in their implementation of...
?sum()
uses...
to accept an indeterminate number of arguments to sum.mean()
requires a vector of values to operate on, and uses...
to pass on to its subsequent methods (i.e. how exactly it calculates the mean for different data structures).
- If functions are objects, how would you construct a function that returns a function? What might be a use for this?
- Example function:
function(power = 2){ function_factory <- function(x) { ret_function <-^ power x } ret_function } function_factory(power = 2) square_function <-square_function(3)
## [1] 9
- Look at the
ellipsis
package. What issue does this package look to solve?- If you provide incorrect or misspelled arguments to a function via
...
, then these will be silently ignored, making debugging difficult. For instance, if we runmean(1,2,3)
, which looks like perfectly valid code. We get1
. This is becausemean()
expects a vector of values to calculate the mean for as the first argument (x
), with all other arguments being passed via...
. Theellipsis
package can be used to check that the parameters provided are actually used.
- If you provide incorrect or misspelled arguments to a function via
- What issues might arise when passing
...
to multiple functions inside your function?- If an argument provided in the ellipsis matches an argument name in more than one function, then every function with the matching argument name will use that value. For instance, if you pass the
...
to two functions, both of which accept anx
parameter, then the value ofx
provided to your function will be used for both.
- If an argument provided in the ellipsis matches an argument name in more than one function, then every function with the matching argument name will use that value. For instance, if you pass the
- How could you solve these issues using the
do.call()
function?- The
do.call()
function allows you to call a function and provide a list of input parameters. To define a function that passes arguments to more than one function, you could have input parameters such asfun1_options
andfun2_options
that are lists of input parameters. Then, by usingdo.call(fun1, args = fun1_options)
anddo.call(fun2, fun2_options)
, you can specify arguments to each function separately without worrying about name conflicts with...
.
- The
- Why might one use the
missing()
approach instead of assigning a default value? What are the drawbacks of this?missing()
allows you to handle errors in more detail and lets you set more complex or conditional default values. However, usingmissing()
in this way means that someone who uses your function might not be able to tell what the default value is going to be without seeing the body of the function!
17.1.2.2 Environments
- In what situations can we have two environments with the same name? Why is this?
- As long as those two environments don’t share the same parent environment, then they can have the same name. This is because they will still both have unique identifiers by virtue of having different inherited environments.
- Search ‘namespacing’. How does that concept relate to environments?
- Namespacing is the concept of naming objects to allow them to be uniquely identifiable. Environments allow for namespacing by encapsulating names within a scope. Then, even if the name isn’t unique, the pattern of inheritance and the name will be together, so the object will be uniquely identifiable.
- What might be an issue with create a function that uses superassignment on an object with the name
x
?x
is quite a common variable name. If someone is using your function and doesn’t realise that you’re using super assignment with an object namedx
, then you might overwrite theirx
variable.
17.1.2.3 Objects and classes
- Why does
is(1L)
return integer, then double, then numeric in that order?- integer inherits from the double class, which in turn inherits from numeric. Integer inherits from double because all integers are doubles but not all doubles are integers. Double inherits from numeric because all doubles are numeric but not all numerics are doubles.
- Now we understand inheritance, why can data frames be subsetted with a
$
?- Data frames inherit from the
list
class, meaning that methods that operate on lists (should) also work on data frames, like$
.
- Data frames inherit from the
- If we create a dataframe and then call
print.default()
on it, why do we get the output that we do? Why does it look more like a list than a dataframe?- When we print a data frame normally,
print.data.frame()
is called. When we callprint.default()
, we’re using a different method that prints the object differently (more akin to a list).
- When we print a data frame normally,
- Why are constructors not a fool-proof way of making sure that all our objects of our class will have the same structure? How else might one create an object of class
address
?- Because we can change the class of an object with assignment, someone could create an object that didn’t match your constructor function and then assign the class like this:
class(not_an_address) <- "address"
- Because we can change the class of an object with assignment, someone could create an object that didn’t match your constructor function and then assign the class like this:
methods("mean")
returns methods for dates and datetimes (and.default
which operates on numeric objects), but no other object. Why is this? What does this mean when constructing methods for a new generic?- These are the only data types for which a
mean
function would make sense. For instance, finding the mean for a bunch of character strings wouldn’t really make much sense (although you could build one if you wanted). This means that you only really need to create methods for your generic that make sense (i.e. that are appropriate for what you want your generic to do).
- These are the only data types for which a
17.1.2.4 Expressions
- Why might an expression like this
fun(x)
be useful? Particularly, when paired withsubstitute()
. + By using subsitute, we can replacefun
with any function we like:
substitute(fun(x), env = list(fun = sum, x = 1))
quoted_fun <- quoted_fun
## .Primitive("sum")(1)
eval(quoted_fun)
## [1] 1
- What’s the difference between
quote({
1
x <-+ 10
x })
and
list(
quote(x <- 1),
quote(x + 10)
)
and why do they evaluate to different things?
+ The first is a single expression. When we evaluate the first, both lines x <- 1
and x + 10
are evaluated at (roughly) the same time:
eval(
quote({
1
x <-+ 10
x
}) )
## [1] 11
+ The second is a list of separate expressions. If we evaluate them using:
lapply(
list(
quote(x <- 1),
quote(x + 10)
),
eval )
## [[1]]
## [1] 1
##
## [[2]]
## [1] 11
Then each line is evaluated separately, resulting in two lines instead of 1.
17.1.2.5 If / Else
- Rather than writing
if (x == 1 | x == 2 | x == 3)
, how could you use the%in%
operator to make it shorter?if (x %in% c(1,2,3))
- How does the
switch()
function relate to if / else statements?switch
takes an expression that returns a character string or number and then evaluates the appropriate argument depending on the value. Theswitch
function therefore operates like a string ofif else (x == x)
statements.
17.1.2.6 Iteration
- In what situation would you use
i in x
overi in seq_along(x)
?i in x
is used to access the value at each position inx
, whereasi in seq_along(x)
will go along the indices. So if you want to access the value (and you’re not doing something like incrementing another variable)
- Why is
seq_along(x)
preferable to1:length(x)
?- If you have something that has a length of 0, then
1:length(x)
will return1 0
, which isn’t what you want as you’ll go from the first index (which won’t exist anyway and so will error) and then to the 0th index.seq_along(x)
will return an empty integer object instead, which is preferable.
- If you have something that has a length of 0, then
- What would be the for loop code that would be required to replicate
lapply(number_list, FUN = max)
? Which would be easier to debug?- Loop code:
list(rep(NULL, length(number_list))) max_list <-for (i in seq_along(number_list)) { max(number_list[[i]]) max_list[[i]] <- }
17.2 Next Steps
17.2.1 opeRate
To help you learn how to apply the skills and understanding you’ve acquired through this book, I’ve developed a second book, opeRate. opeRate was developed in exactly the same way as teacheR, but focuses on applied data analysis and practical learning. If you’re interested in applying your knowledge, then I would recommend giving it a read.