3.3 Data types

Data can be stored in lots of different forms. For example, "TRUE" and TRUE are stored as two different types, even though they look very similar to us.

The main data types are:

  • logical
    • TRUE
    • FALSE
  • double (numeric)
    • 12.5
    • 19
    • 99999
  • integer (numeric)
    • 2L
    • 34L
  • character
    • "hello"
    • "my name is"
  • factors
  • dates
    • 2019-06-01
  • datetime (POSIXct)
    • 2019-06-01 12:00:00
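
If you're ever unsure how R has stored a value, the base functions typeof() and class() will tell you. Here's a quick sketch using example values like the ones in the list above. The tz = "UTC" argument is just my choice so that the example doesn't depend on your computer's settings:

```r
typeof(TRUE)
## [1] "logical"
typeof(12.5)
## [1] "double"
typeof(2L)
## [1] "integer"
typeof("hello")
## [1] "character"
# Factors, dates and datetimes are built on top of the basic types,
# so class() is the more informative function for them
class(factor("England"))
## [1] "factor"
class(as.Date("2019-06-01"))
## [1] "Date"
class(as.POSIXct("2019-06-01 12:00:00", tz = "UTC"))
## [1] "POSIXct" "POSIXt"
```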

Let's have a look at each one in detail:

3.3.1 Logical

A logical variable can only have two real values, TRUE or FALSE. I say two real values, because you can also have things like NA, but that's true of any data type.

Logical variables are used a lot in response questionnaires, where the answer to the question is either "Yes" or "No" (TRUE or FALSE). I would recommend converting any character strings like "Yes" or "No" or "TRUE" or "FALSE" to a logical variable rather than leaving them as characters, because it'll make your analysis less verbose (use fewer lines of code), even if it doesn't change the underlying logic.
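
As a rough sketch of that conversion, here are some made-up questionnaire answers (the answers and is_yes names are just for illustration). Note that as.logical() doesn't understand "Yes" and "No", so one simple approach is to compare against "Yes" instead:

```r
# Hypothetical questionnaire answers, invented for this example
answers <- c("Yes", "No", "Yes", "Yes")

# as.logical() only recognises strings like "TRUE"/"FALSE", not "Yes"/"No"
as.logical("Yes")
## [1] NA

# Comparing against "Yes" gives a logical vector
is_yes <- answers == "Yes"
is_yes
## [1]  TRUE FALSE  TRUE  TRUE

# TRUE counts as 1 and FALSE as 0, so counting the "Yes" answers is one line
sum(is_yes)
## [1] 3
```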

To test whether something is stored as logical, we use the is.logical() function:

is.logical(TRUE)
## [1] TRUE
is.logical("TRUE")
## [1] FALSE

To convert a value to logical, use the as.logical() function:

as.logical(1)
## [1] TRUE
as.logical(0)
## [1] FALSE
as.logical("TRUE")
## [1] TRUE
as.logical("FALSE")
## [1] FALSE

Be careful though: just because a conversion seems obvious to you doesn't mean you'll get the expected result! For instance, what do you think as.logical(2) should return? See for yourself.

3.3.2 Double

The best way to think of a double value is as a number. It can be a whole number (but see Integers) or a decimal. R will often take care of any implicit number conversion that needs to be done under the hood, so the only thing you really need to keep in mind is that when you assign a number, be it a whole number or decimal, it will be stored as double by default.

As an aside, it's called double because it's stored using double precision.
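
One practical consequence of that storage format is that some decimals can't be represented exactly, so comparing doubles with == can surprise you. A small sketch of the problem, with all.equal() as one way around it:

```r
# 0.1 + 0.2 is not stored as exactly 0.3
0.1 + 0.2 == 0.3
## [1] FALSE

# all.equal() compares with a small tolerance instead
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```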

To check whether a value is stored as double (or more generally numeric), use the is.double() and is.numeric() functions:

is.double(2)
## [1] TRUE
is.numeric("not numeric")
## [1] FALSE
is.double(2L) # see the next section for why this returns FALSE
## [1] FALSE

To convert a value to a double, use the as.double() or as.numeric() functions:

as.double("5")
## [1] 5
as.numeric("10")
## [1] 10
as.double("im going to cause a warning")
## Warning: NAs introduced by coercion
## [1] NA

3.3.3 Integer

Whilst also storing numeric data (like double), integers are specific to whole numbers. Also, by default, even when you assign a whole number, like this: number <- 1, R will store that value as double rather than as an integer. To store something explicitly as an integer, suffix the value with an L, like this: number <- 1L. Attempting to store something that isn't a whole number as an integer will result in a warning, and R will store the value as a double instead:

1.5L
## [1] 1.5

For the most part, I let R take care of how it stores numbers, unless I explicitly need it to be of a certain type. That is pretty rare though.

To check if something is an integer, use the is.integer() function:

is.integer(2)
## [1] FALSE
is.integer(2L)
## [1] TRUE

To convert to an integer, use the L suffix or the as.integer() function:

1L
## [1] 1
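
The L suffix only works when you're typing the number yourself. For values that already exist, as.integer() does the conversion. A short sketch, with one thing to watch out for:

```r
as.integer("7")
## [1] 7

# The decimal part is dropped, not rounded
as.integer(3.9)
## [1] 3
as.integer(-3.9)
## [1] -3

is.integer(as.integer(3.9))
## [1] TRUE
```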

3.3.4 Character

Sometimes called characters, or character strings, or just strings, characters store text. If you assign a value within quotation marks, regardless of what's inside the quotation marks, it will be stored as character. For example, "5" stores a character string with the text "5", not the number 5. This is particularly important when you want to start combining variables. For example, "5" + 5 doesn't work, because you're trying to add text to a number, which doesn't make sense.
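
Here's a small sketch of that failure and one way to fix it. I've wrapped the failing line in tryCatch() purely so the error message is captured rather than stopping the script. The message shown is what you'd see in an English locale:

```r
# Adding text to a number raises an error; tryCatch() captures its message
tryCatch("5" + 5, error = function(e) conditionMessage(e))
## [1] "non-numeric argument to binary operator"

# Converting the text to a number first makes the sum work
as.numeric("5") + 5
## [1] 10
```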

To check whether something is stored as a character, use the is.character() function:

is.character("hello")
## [1] TRUE
is.character(5)
## [1] FALSE
is.character(TRUE)
## [1] FALSE

To convert something to a character, use the as.character() function:

as.character(5)
## [1] "5"
as.character(TRUE)
## [1] "TRUE"

3.3.5 Factors

Factors are a unique but useful data type in R. Essentially, factors store different levels that represent some sort of grouping. For example, say you were collecting some information on people from different countries, the column that holds which country the respondent is from could be stored as a factor, with the levels England, Spain, France, etc.

A factor level is made up of two things: a label, and a number that represents that group. In my countries example, our factor would have the labels "England", "Spain", "France" and the values 1, 2, 3. This means that internally, a factor is essentially a collection of integers representing the level position and character strings representing the level label.

To create a factor, we just use the factor() function:

factor(c("England", "France", "Spain"))
## [1] England France  Spain  
## Levels: England France Spain
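
You can see the two parts of a factor (the labels and the integers behind them) for yourself with levels() and as.integer(). In this sketch I've repeated "France" so you can see the same integer being reused:

```r
countries <- factor(c("England", "France", "Spain", "France"))

# The labels, stored once each
levels(countries)
## [1] "England" "France"  "Spain"

# The integers that point at those labels
as.integer(countries)
## [1] 1 2 3 2
```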

It's also worth remembering that you can have levels that don't appear in the data you have. For example, in a questionnaire, you may provide the options "None", "Some", "All". But in your responses, you may see that no-one chose the "None" option. In that case, you would still create a factor with three levels, even though only two of them appear.

You can also specify whether a factor is ordered. You would use an ordered factor when the levels have meaningful order. For instance, in the above example, it would make sense that "Some" is better than "None", and "All" is better than "Some". To create an ordered factor, just specify ordered = TRUE in your function. By default, the levels will be sorted (alphabetically for character strings) rather than kept in the order the values appear, unless you specify levels (see below).

To convert something to a factor, use the factor() function if you want to specify levels and labels, or as.factor() to do it for you:

factor(c("Some", "All"), levels = c("None", "Some", "All"))
## [1] Some All 
## Levels: None Some All
factor(c("Some", "All"), levels = c("None", "Some", "All"), ordered = TRUE)
## [1] Some All 
## Levels: None < Some < All
as.factor(c("Some", "All"))
## [1] Some All 
## Levels: All Some

Notice the difference in the output of those three lines. The first allows us to specify the levels (i.e. the values that were possible). The second does the same but we also specify the ordering of the levels, and the third just converts the provided values and generates the levels based on that data.
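
To show what the ordering buys you, here's a minimal sketch using the same ordered factor (stored in a variable I've called responses). Because R knows the order of the levels, comparisons like "less than" become meaningful:

```r
responses <- factor(c("Some", "All"),
                    levels = c("None", "Some", "All"),
                    ordered = TRUE)

# All three levels are kept, even though "None" never appears in the data
nlevels(responses)
## [1] 3

# Comparisons use the level order, not alphabetical order
responses < "All"
## [1]  TRUE FALSE
as.character(max(responses))
## [1] "All"
```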

Note: An important change in R version 4.0.0 is that R will no longer automatically convert strings (characters) to factors when you create or import data using data.frame() or read.table(). Prior to 4.0.0, it would automatically convert strings to factors unless otherwise specified.
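
The argument that controls this behaviour is stringsAsFactors. In this sketch I set it explicitly both ways on a made-up data frame, so the result is the same whichever version of R you're running:

```r
# Strings kept as characters (the default from R 4.0.0 onwards)
as_text <- data.frame(country = c("England", "Spain"), stringsAsFactors = FALSE)
class(as_text$country)
## [1] "character"

# Strings converted to factors (the default before R 4.0.0)
as_factor <- data.frame(country = c("England", "Spain"), stringsAsFactors = TRUE)
class(as_factor$country)
## [1] "factor"
```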

3.3.5.1 Converting from factors

Sometimes you'll need to convert data from a factor to something else, usually a character. This is fairly straightforward using the tools we've already seen:

as.character(factor(c("Some", "None", "All")))
## [1] "Some" "None" "All"
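
One case that's worth a quick sketch is a factor whose labels look like numbers. Converting it straight to numeric gives you the internal level positions, not the numbers you can see, so go via character first. The ages values here are made up:

```r
ages <- factor(c("30", "45", "30"))

# Converting directly returns the level positions
as.numeric(ages)
## [1] 1 2 1

# Converting to character first returns the values you actually want
as.numeric(as.character(ages))
## [1] 30 45 30
```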

3.3.6 Dates

Dates in any language are tricky. Different countries store dates in different formats and different bits of software store dates in different ways (looking at you Excel). This can make storing values as dates tough.

The most common way of creating a date is to use the as.Date() function. To use this function, you just need to provide your date as a character string:

as.Date("2019/01/01")
## [1] "2019-01-01"

But Adam, how does R know which one is the month and which is the day? Good question, thank you for asking. By default, R expects your character string to be in the order "Year/Month/Day". If you don't provide it in that format, you'll get a nonsense output:

as.Date("01/12/2019")
## [1] "1-12-20"

If your data is in a different format however, you can specify the format:

as.Date("01/12/2019", format = "%d/%m/%Y")
## [1] "2019-12-01"

Here, we're telling R that the string is in the format "Day/Month/Year". A list of the different codes that can be used in the format parameter can be found on the ?strptime help page, or by typing "R date codes" into Google.
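
A couple more format codes in action, as a rough illustration. Here %y is a two-digit year (whereas %Y is four digits), and format() uses the same codes in the other direction to turn a date back into text:

```r
# Month-Day-Year with a two-digit year
as.Date("06-01-19", format = "%m-%d-%y")
## [1] "2019-06-01"

# The same codes control how a date is printed
format(as.Date("2019-06-01"), "%d/%m/%Y")
## [1] "01/06/2019"
```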

Because nothing in life is simple, sometimes you'll get some data that has the date stored as a number. This is because the source of that data has the date stored as the number of days that have passed since an origin date. Because it's a number, our as.Date(..., format = ...) doesn't work. Instead, we can still use the as.Date() function, but we need to specify what the origin date is that the number refers to.

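By default, when importing from Excel in Windows, the origin date to use is December 30th 1899 (Excel counts from the start of 1900 but wrongly treats 1900 as a leap year, which shifts the origin you need in R back a little). More commonly, the date January 1st 1970 (also known as the epoch date) is used.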

Anyway, to specify your origin, we use the origin parameter, like this:

as.Date(18262, origin = "1970/01/01")
## [1] "2020-01-01"

Notice the format I've provided the origin in. It's the same as the default that R expects, and I would recommend copying that format wherever possible. If you're someone who just wants to watch the world burn, then you can specify a format for your origin as well...

as.Date(18262, origin = as.Date("01/01/1970", format = "%d/%m/%Y"))
## [1] "2020-01-01"

but where's the humanity in that?
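
Coming back to the Excel case mentioned earlier, here's a small sketch of converting a date number that came from Excel on Windows. The number 43831 is just an example value, and the origin of December 30th 1899 is the one suggested on R's own ?as.Date help page for dates from Windows Excel:

```r
# A hypothetical date number as exported from Excel on Windows
excel_number <- 43831
as.Date(excel_number, origin = "1899/12/30")
## [1] "2020-01-01"
```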

Testing whether something is a date is not as simple as the other data types unfortunately. Instead, we just use the is() or class() functions. If the first value returned is "Date", then you know it's a date:

is(as.Date("2020/01/01"))
## [1] "Date"     "oldClass"
class(as.Date("2020/01/01"))
## [1] "Date"
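
If you'd rather get a simple TRUE or FALSE back, like the is.[type]() functions give you, the base function inherits() is one option. A minimal sketch:

```r
inherits(as.Date("2020/01/01"), "Date")
## [1] TRUE

# A character string that merely looks like a date is not a Date
inherits("2020/01/01", "Date")
## [1] FALSE
```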

3.3.7 Datetimes (POSIXct)

If you thought dates were annoying, datetimes are like dates' little brother who didn't get enough attention as a child and so acts up all the time. One of the reasons for this is that datetimes aren't actually called datetimes. They're called POSIXct in R. So whenever you see that dreadful word, just remember "ah, Adam told me that means datetime" and you'll be fine.

Another thing that makes datetimes tough is that in addition to dates, datetimes (as you may have guessed) also store the time. The issue with that is that time is a more relative concept - there are lots of different time zones, so how do you know which one you're referring to? By default, R uses the time zone your computer is set to (you can check it with Sys.timezone()). You can override that default by setting the TZ environment variable, for example with Sys.setenv(TZ = "UTC"), or you can use the tz parameter when creating your datetime as we'll see below.

With these annoyances aside however, creating datetimes isn't all that different to creating dates except that we use the as.POSIXct() function instead. We just provide a character string (with a format specification if necessary), or a number with an origin. One important departure from dates though, is that the number now counts seconds since the origin, not days, which is what allows us to store the time.

as.POSIXct("2020/01/01 12:00:00")
## [1] "2020-01-01 12:00:00 UTC"
as.POSIXct("2020/01/01 12:00:00", tz = "NZ")
## [1] "2020-01-01 12:00:00 NZDT"
as.POSIXct(1577880000, origin = "1970/01/01")
## [1] "2020-01-01 12:00:00 UTC"
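
To make the time zone point a bit more concrete, here's a sketch showing that a datetime is really just a number of seconds, and the time zone only changes how it's displayed. I've set tz = "UTC" explicitly so the result doesn't depend on your computer, and used "Pacific/Auckland" as an example zone:

```r
noon_utc <- as.POSIXct("2020-01-01 12:00:00", tz = "UTC")

# Underneath, it's the number of seconds since 1970-01-01 00:00:00 UTC
as.numeric(noon_utc)
## [1] 1577880000

# The same moment, displayed as New Zealand time
format(noon_utc, tz = "Pacific/Auckland", usetz = TRUE)
## [1] "2020-01-02 01:00:00 NZDT"
```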

Similar to dates, there is no is.POSIXct() function in base R, so we use the is() and class() functions instead:

is(as.POSIXct("2020/01/01 12:00:00"))
## [1] "POSIXct"  "POSIXt"   "oldClass"
class(as.POSIXct("2020/01/01 12:00:00"))
## [1] "POSIXct" "POSIXt"

3.3.8 NA and NULL

The R language has two closely related values, NA and NULL.

NULL indicates the absence of a value. It means that a value is missing (or has length zero). A null value has no 'type' because it represents an absence of something, so passing a null value to any of the is.[type]() functions will return FALSE. Instead, checking whether a value is NULL is done with the is.null() function:

is.character(NULL)
## [1] FALSE
is.null(NULL)
## [1] TRUE
"NULL" == NULL
## logical(0)
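
A quick way to see the difference between the two is to look at how much space they take up in a vector. In this sketch, NULL simply disappears, whereas NA holds its place:

```r
length(NULL)
## [1] 0
length(NA)
## [1] 1

# NULL vanishes when combined with other values...
c(1, NULL, 3)
## [1] 1 3

# ...but NA keeps its position
c(1, NA, 3)
## [1]  1 NA  3
```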

On the other hand, NA (not available) represents a value that isn't available: there's a slot for it, but R has no valid value to put there. This most often occurs when you try to convert one datatype to another where R can't assign an appropriate value.

For example, attempting to parse a character string to a date format that doesn't match will result in a NA value:

as.Date("10/01/20", format = "%m%Y%d")
## [1] NA

R has tried to parse a value from one type into another. The value isn't NULL because it clearly isn't missing, but it couldn't be converted to a date type, so it's NA. Unlike NULL, there are different NA values for each datatype (although they'll all look like NA in the console). In the example above, we actually created a NA that has the Date type:

errant_date <- as.Date("10/01/20", format = "%m%Y%d")
is(errant_date)
## [1] "Date"     "oldClass"

This is because R knows what type the NA should be, but it couldn't assign it a proper value.

The same behaviour can be observed for other data types:

as.numeric("not a number")
## Warning: NAs introduced by coercion
## [1] NA
as.logical("not a logical")
## [1] NA
is(as.numeric("not a number"))
## Warning in is(as.numeric("not a number")): NAs introduced by coercion
## [1] "numeric" "vector"
is(as.logical("not a logical"))
## [1] "logical" "vector"

To test whether something is NA, we use the is.na() function. You don't need to worry about what type the NA is; this will test whether it is an NA of any type.

is.na(NA_character_) # this will be an NA that is of type 'character'
## [1] TRUE
is.na(NA_integer_) # this will be an NA that is of type 'integer'
## [1] TRUE
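
Here's a short sketch of the typed NAs, plus the reason is.na() exists in the first place: comparing anything to NA with == just gives you another NA.

```r
# A plain NA is logical unless something tells R otherwise
typeof(NA)
## [1] "logical"
typeof(NA_character_)
## [1] "character"
typeof(NA_real_)
## [1] "double"

# == can't be used to test for NA
NA == NA
## [1] NA

# is.na() works element by element on a vector
is.na(c(1, NA, 3))
## [1] FALSE  TRUE FALSE
```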

3.3.8.1 Dealing with NAs

Dealing with NAs is often contextual. Attempting to perform a mathematical calculation on a vector of values that contains at least one NA will often return NA:

sum(1,2,NA)
## [1] NA
mean(c(1,NA))
## [1] NA
NA + 1
## [1] NA

In some cases, those NAs will represent real issues with the values and so removing the NAs or converting them to 0 will just mask the error without fixing it.

Alternatively, data imports can often return NA values because of differing data types or similar and so converting those values to 0 or removing them outright may be appropriate.

Ultimately, how you deal with NA values is a question that you'll need to answer when it happens and depending on the situation. I will give you a helping hand though and say that if you want to just remove the NA values when summing or calculating an average or similar, then these functions often have an na.rm parameter that can be used to remove the NA values from the supplied values:

sum(1,2,NA, na.rm = TRUE)
## [1] 3
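
For the other options mentioned above (removing NAs outright or converting them to 0), here's a rough sketch using a made-up vector called scores. Which of these is appropriate depends entirely on your data:

```r
scores <- c(4, NA, 8)

# Ignore the NA just for this calculation
mean(scores, na.rm = TRUE)
## [1] 6

# Drop the NA values from the vector
scores[!is.na(scores)]
## [1] 4 8

# Swap the NA values for 0
replace(scores, is.na(scores), 0)
## [1] 4 0 8
```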

3.3.9 NaN and Infinity

A special case is NaN (not a number). NaNs are distinct from NAs in that they represent a valid value. More precisely, NaN is the result of a numeric calculation whose answer is undefined, and so can't be represented as a number. For example, dividing 0 by 0 (0/0) gives NaN.

Inf and -Inf are similar constructs. They represent infinity and minus infinity respectively. They are valid values but they are not representable with numbers, so they have their own reserved words.

NaN, Inf and -Inf are all of the numeric type and do not have equivalent values in other data types.
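
A few calculations that produce these values, along with the base functions for testing for them, as a quick sketch:

```r
0 / 0
## [1] NaN
1 / 0
## [1] Inf
Inf - Inf
## [1] NaN

# is.nan() only picks up NaN, whereas is.na() is TRUE for both NA and NaN
is.nan(0 / 0)
## [1] TRUE
is.nan(NA)
## [1] FALSE
is.na(NaN)
## [1] TRUE

# is.finite() is FALSE for Inf, NaN and NA alike
is.finite(c(1, Inf, NaN, NA))
## [1]  TRUE FALSE FALSE FALSE

typeof(Inf)
## [1] "double"
```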

3.3.10 Questions

  1. Why are 2 and 2L different?
  2. What is an ordered factor and how is it different to a character string?
  3. Why does as.Date("19/01/2019", format = "%d/%m/%y") return the date 19th Jan 2020 and not 19th Jan 2019?
  4. Why does as.numeric(TRUE) return 1? What will as.logical(2) return?