library(readr) # read and write tabular data
library(dplyr) # manipulate data
library(here) # file paths
library(tibble) # tibbles are an updated version of data frames
Understanding data
Questions
- How does R store and represent data?
Objectives
- Understand data types and missing values
- Learn about two data structures: vectors and data.frame
We started the previous lessons with read_csv(). To better understand the data returned by read_csv(), we will learn how R represents and stores data.
inat_data <- read_csv(here('data/cleaned/cnc-los-angeles-observations.csv'))
Let’s look at the information returned by glimpse().
glimpse(inat_data)
Rows: 191,638
Columns: 37
$ id <dbl> 2931940, 2934641, 2934961, 2934980, 2934994…
$ observed_on <date> 2016-04-14, 2016-04-14, 2016-04-14, 2016-0…
$ time_observed_at <chr> "2016-04-14 19:25:00 UTC", "2016-04-14 19:0…
$ user_id <dbl> 151043, 10814, 80445, 80445, 80445, 121033,…
$ user_login <chr> "msmorales", "smartrf", "cdegroof", "cdegro…
$ user_name <chr> "Michael Morales", "Richard Smart (he, him)…
$ created_at <chr> "2016-04-14 07:28:36 UTC", "2016-04-14 19:0…
$ updated_at <chr> "2021-12-26 06:58:04 UTC", "2018-05-28 02:0…
$ quality_grade <chr> "research", "needs_id", "research", "resear…
$ license <chr> "CC-BY", "CC-BY-NC", NA, NA, NA, "CC-BY-NC"…
$ url <chr> "http://www.inaturalist.org/observations/29…
$ image_url <chr> "https://inaturalist-open-data.s3.amazonaws…
$ sound_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ tag_list <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ description <chr> "Spotted on a the wall of a planter, while …
$ captive_cultivated <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ latitude <dbl> 34.05829, 34.01742, 34.13020, 34.13143, 34.…
$ longitude <dbl> -117.8219, -118.2892, -118.8226, -118.8215,…
$ positional_accuracy <dbl> 4, 5, NA, NA, NA, NA, 17, 55, 55, 55, NA, 5…
$ public_positional_accuracy <dbl> 4, 5, NA, NA, NA, NA, 17, 55, 55, 55, NA, 5…
$ geoprivacy <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxon_geoprivacy <chr> NA, NA, NA, "open", "open", NA, "open", NA,…
$ coordinates_obscured <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ scientific_name <chr> "Cornu aspersum", "Oestroidea", "Arphia ram…
$ common_name <chr> "Garden Snail", "Bot Flies, Blow Flies, and…
$ iconic_taxon_name <chr> "Mollusca", "Insecta", "Insecta", "Reptilia…
$ taxon_id <dbl> 480298, 356157, 54247, 36100, 36204, 69731,…
$ taxon_kingdom_name <chr> "Animalia", "Animalia", "Animalia", "Animal…
$ taxon_phylum_name <chr> "Mollusca", "Arthropoda", "Arthropoda", "Ch…
$ taxon_class_name <chr> "Gastropoda", "Insecta", "Insecta", "Reptil…
$ taxon_order_name <chr> "Stylommatophora", "Diptera", "Orthoptera",…
$ taxon_family_name <chr> "Helicidae", NA, "Acrididae", "Phrynosomati…
$ taxon_genus_name <chr> "Cornu", NA, "Arphia", "Uta", "Sceloporus",…
$ taxon_species_name <chr> "Cornu aspersum", NA, "Arphia ramona", "Uta…
$ taxon_subspecies_name <chr> NA, NA, NA, "Uta stansburiana elegans", NA,…
$ threatened <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ establishment_means <chr> "introduced", NA, "native", "native", "nati…
Data Types
glimpse() shows <dbl>, <date>, <chr>, and <lgl>. Those are data types.
In computer programming, a data type is a way to group data values. Every value has a data type. As an analogy, human languages have both numbers and words. A value’s data type determines what the programming language can do with the value. For instance, in R we can add numbers, but we can’t add words.
Adding numbers is fine.
1 + 2
[1] 3
Adding words causes an error.
"cat" + "dogs"
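To inspect the error without stopping a script, one option is to catch it with base R's tryCatch(). This is a minimal sketch; the names result and joined are only for illustration.

```r
# Try to add two strings and capture the error message instead of halting.
result <- tryCatch(
  "cat" + "dogs",
  error = function(e) conditionMessage(e)
)
result

# To join words, use paste() rather than +.
joined <- paste("cat", "dogs")
joined
```

In an English locale, result should hold the message "non-numeric argument to binary operator", which is R's way of saying that + only works on numbers.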
There are 4 main types in R: numeric, integer, logical, and character.
- numeric: numbers that can contain a decimal point (e.g. 1.2, 10.5). By default, R also stores whole numbers such as 1 and 10 as numeric.
- integer: whole numbers with no decimal point (e.g. 1L, 10L). The L suffix forces the number to be an integer, since by default R uses decimal numbers.
- logical: values of TRUE or FALSE.
- character: strings of characters (e.g. "abc", 'dog'). Characters are mainly letters and punctuation. Numbers combined with letters, such as '1apple', are treated as characters. Strings must be surrounded by quotes, either single or double.
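The four types described above can be checked with class(). The following is a small sketch; the vector name types and its element names are only for illustration.

```r
# class() reports the data type of each value.
types <- c(
  decimal   = class(1.5),      # numeric
  whole     = class(1),        # numeric: whole numbers are numeric by default
  suffixed  = class(1L),       # integer: the L suffix forces an integer
  truth     = class(TRUE),     # logical
  text      = class("1apple")  # character: quoted values are strings
)
types
```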
Data Structures
A data structure is a way to organize and store a collection of values.
Vectors
A vector is a data structure in R that holds a series of values. All the values in a vector must be the same data type.
To create a vector, we use the c() (combine) function and pass in the values as arguments.
We can use the class() function to find the type or class of any object.
numeric vector
numbers <- c(1, 2, 5)
numbers
[1] 1 2 5
class(numbers)
[1] "numeric"
character vector
characters <- c("apple", 'pear', "grape")
characters
[1] "apple" "pear" "grape"
class(characters)
[1] "character"
logical vector
logicals <- c(TRUE, FALSE, TRUE)
logicals
[1] TRUE FALSE TRUE
class(logicals)
[1] "logical"
If you try to put values of different data types into a vector, all the values are converted to the same data type. In the following example, everything is converted to character type.
mixed <- c(1, "apple", TRUE)
mixed
[1] "1" "apple" "TRUE"
class(mixed)
[1] "character"
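The conversion does not always end in character. As a rough sketch of how R picks the common type, the examples below mix other combinations; the variable names are only for illustration.

```r
# Logical and numeric together: TRUE becomes 1, and the result is numeric.
num_lgl <- c(1, TRUE)
class(num_lgl)

# Integer and decimal together: the result is numeric.
int_dbl <- c(1L, 2.5)
class(int_dbl)

# Converting a character vector back to numbers: values that are not
# numbers become NA. suppressWarnings() hides the coercion warning.
mixed <- c(1, "apple", TRUE)
converted <- suppressWarnings(as.numeric(mixed))
converted
```

Only "1" survives the conversion back to numeric; "apple" and "TRUE" are text that R cannot read as a number, so they become NA.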
data.frame
A data.frame is a data structure built into R that represents tabular data with rows and columns. Each column in a data.frame is a vector, so all the values in a column must be of the same data type.
We can create a data.frame from the previous vectors with data.frame(). For each column, we provide a name and a vector.
df <- data.frame(Numbers = numbers, Characters = characters)
df
Numbers Characters
1 1 apple
2 2 pear
3 5 grape
When we call class() on a data.frame, it returns “data.frame”.
class(df)
[1] "data.frame"
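To confirm that each column of a data.frame is a vector, we can pull a column out with $ and check its class. This sketch rebuilds the same small data.frame so that it runs on its own.

```r
numbers <- c(1, 2, 5)
characters <- c("apple", "pear", "grape")
df <- data.frame(Numbers = numbers, Characters = characters)

# $ extracts one column as a vector.
df$Numbers
class(df$Numbers)

# In R 4.0 and later this is character; older versions of R converted
# strings to factors by default.
class(df$Characters)

# str() summarizes the type of every column at once.
str(df)
```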
A tibble is an updated version of data.frame, provided by the tibble package.
We can create a tibble from the previous vectors with tibble().
tb <- tibble(Numbers = numbers, Characters = characters)
tb
# A tibble: 3 × 2
Numbers Characters
<dbl> <chr>
1 1 apple
2 2 pear
3 5 grape
When we call class() on a tibble, it returns “tbl_df” (tibble data.frame), “tbl” (tibble), and “data.frame”.
class(tb)
[1] "tbl_df" "tbl" "data.frame"
readr returns results as a special type of tibble. When we call class() on inat_data, it returns “spec_tbl_df” (specification tibble data.frame), “tbl_df”, “tbl”, and “data.frame”.
class(inat_data)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
When readr loads a CSV, it tries to figure out the data type of each column. For inat_data, id is numeric, captive_cultivated is logical, and user_login is character. Several columns, such as license and sound_url, contain NA.
readr also parses some columns as date, a data type used to represent dates. Column observed_on is date. Other columns, such as time_observed_at, are treated as character because those strings contain extra information that readr does not recognize as a date.
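The difference between a date and a character string that merely looks like a date can be seen with base R functions. This is a sketch using base R rather than readr; the variable names are only for illustration, and the sample values are copied from the glimpse() output above.

```r
# A plain date string converts cleanly to the Date class.
observed <- as.Date("2016-04-14")
class(observed)

# A date-time string with a trailing " UTC" is just character text.
time_text <- "2016-04-14 19:25:00 UTC"
class(time_text)

# Supplying a format lets base R parse the leading part of the string;
# characters after the part matched by the format are ignored.
observed_at <- as.POSIXct(time_text, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(observed_at)
```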
Missing data
In tabular data, there are times when a record does not have a value for a particular field. In spreadsheet programs, cells with no value are left blank. R represents missing values as NA, without quotes. NA stands for “not available”.
NA is allowed in vectors of any data type.
numbers <- c(1, 2, NA)
numbers
[1] 1 2 NA
class(numbers)
[1] "numeric"
min() returns the smallest number in a vector. When you pass a numeric vector containing NA to a math function like min(), the function returns NA.
min(numbers)
[1] NA
Many math functions have an argument, na.rm, that removes NA values before the calculation.
min(numbers, na.rm = TRUE)
[1] 1
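Before removing missing values, it can help to find out how many there are. The sketch below uses base R's is.na() for that, then applies na.rm to two other math functions; the variable names are only for illustration.

```r
numbers <- c(1, 2, NA)

# is.na() returns TRUE for each missing value.
missing_flags <- is.na(numbers)
missing_flags

# Summing the logical vector counts the missing values (TRUE counts as 1).
missing_count <- sum(missing_flags)
missing_count

# na.rm works the same way in other math functions.
average <- mean(numbers, na.rm = TRUE)
largest <- max(numbers, na.rm = TRUE)
average
largest
```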