Working with data

Questions

How do you work with iNaturalist CSV data in R?

Objectives

Import CSV data into R.
Select rows and columns of data.frames.
Use pipes to link steps together into pipelines.
Create new data.frame columns using existing columns.
Export data to a CSV file.

Exploring iNaturalist data

A CSV of iNaturalist observations for City Nature Challenge Los Angeles from 2015 to 2024 is located at “data/cleaned/cnc-los-angeles-observations.csv”. We are going to read that CSV using R.

Functions

Functions are predefined bits of code that do a specific task. Arguments are values that we pass into a function. A function usually takes one or more arguments as input, does something to the values, and produces an output.
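As a small illustration of a function call, here is a sketch using round(), a function built into R. The number 3.14159 is only an example input.

```r
# round() takes a number (x) and the number of decimal places to keep (digits)
rounded <- round(x = 3.14159, digits = 2)
rounded
```

Here round is the function, x and digits are the arguments, and the value stored in rounded is the output.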

R packages

R itself has many built-in functions, but we can access many more by installing and loading other packages of functions and data into R. We will use several R packages for the workshop.

To install a package, use the install.packages() function from R. We pass in the package name as an argument. The package name must be in quotes.

install.packages("readr")

R will connect to the internet and download packages from servers that have R packages. R will then install the packages on your computer. The console window will show you the progress of the installation process.

To save time, we have already installed all the packages we need for the workshop.

In order to use a package, use the library() function from R to load the package. We pass in the name of the package as an argument. Do not use quotes around the package name when using library().

library(readr)

Reading a CSV file

In order to analyze the iNaturalist CSV, we need to load the readr, dplyr, lubridate, here, and stringr packages.

Generally it is a good idea to list all the libraries that you will use in the script at the beginning of the script. You want to install a package to your computer once, and then load it with library() in each script where you need to use it.

library(readr) # read and write tabular data
library(dplyr) # manipulate data
library(lubridate) # manipulate dates
library(here) # file paths
library(stringr) # work with strings

File paths

When we reference other files from an R script, we need to give R precise instructions on where those files are. We do that using something called a file path.

There are two kinds of paths: absolute and relative. Absolute paths are specific to a particular computer, whereas relative paths are relative to a certain folder. Because we are using the RStudio “project” feature, all of our paths are relative to the project folder. For instance, an absolute path is “/Users/username/Documents/CNC-coding-workshop/data/cleaned/cnc-los-angeles-observations.csv”, and the matching relative path is “data/cleaned/cnc-los-angeles-observations.csv”.

here is an R package that makes it easier to handle file paths.
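A minimal sketch of how here() builds a path, assuming the script runs inside the workshop's RStudio project. Passing the folder names as separate arguments is an alternative to writing the path as a single string; both forms should point to the same file. The object name csv_path is only for illustration.

```r
library(here)

# here() starts from the project folder and appends the pieces we pass in
csv_path <- here("data", "cleaned", "cnc-los-angeles-observations.csv")
csv_path

# file.exists() returns TRUE if a file is found at that path
file.exists(csv_path)
```

If file.exists() returns FALSE, check the spelling of the folder and file names before trying to read the file.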

We call the read_csv() function from readr and pass in the path to a CSV file in order to load the CSV. We write the path relative to the project folder, and here() turns it into a full path.

read_csv() will read the file and return the content of the file as a data.frame. A data.frame is how R handles data with rows and columns. In order for us to access the content later on, we will assign the content to an object called inat_data.

inat_data <- read_csv(here('data/cleaned/cnc-los-angeles-observations.csv'))
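To see what read_csv() returns without opening the workshop file, here is a sketch that reads a tiny made-up CSV typed directly into the code. It assumes readr 2.0 or later, where wrapping text in I() tells read_csv() to treat it as literal data instead of a file path. The column names and values are invented for illustration.

```r
library(readr)

# Three lines of CSV text: a header row and two made-up observations
csv_text <- "id,common_name,observed_on
1,Garden Snail,2016-04-14
2,Mallard,2016-04-15"

# I() marks the string as literal data instead of a file path
tiny_data <- read_csv(I(csv_text), show_col_types = FALSE)
tiny_data
```

Each line of text after the header becomes one row, and each comma-separated value becomes one column.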

We can use the glimpse() function from dplyr to get a summary of the contents of inat_data. It shows the number of rows and columns. For each column, it shows the name, data type (dbl, chr, lgl, date), and the first few values.

glimpse(inat_data)

Rows: 191,638
Columns: 37
$ id                         <dbl> 2931940, 2934641, 2934961, 2934980, 2934994…
$ observed_on                <date> 2016-04-14, 2016-04-14, 2016-04-14, 2016-0…
$ time_observed_at           <chr> "2016-04-14 19:25:00 UTC", "2016-04-14 19:0…
$ user_id                    <dbl> 151043, 10814, 80445, 80445, 80445, 121033,…
$ user_login                 <chr> "msmorales", "smartrf", "cdegroof", "cdegro…
$ user_name                  <chr> "Michael Morales", "Richard Smart (he, him)…
$ created_at                 <chr> "2016-04-14 07:28:36 UTC", "2016-04-14 19:0…
$ updated_at                 <chr> "2021-12-26 06:58:04 UTC", "2018-05-28 02:0…
$ quality_grade              <chr> "research", "needs_id", "research", "resear…
$ license                    <chr> "CC-BY", "CC-BY-NC", NA, NA, NA, "CC-BY-NC"…
$ url                        <chr> "http://www.inaturalist.org/observations/29…
$ image_url                  <chr> "https://inaturalist-open-data.s3.amazonaws…
$ sound_url                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ tag_list                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ description                <chr> "Spotted on a the wall of a planter, while …
$ captive_cultivated         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ latitude                   <dbl> 34.05829, 34.01742, 34.13020, 34.13143, 34.…
$ longitude                  <dbl> -117.8219, -118.2892, -118.8226, -118.8215,…
$ positional_accuracy        <dbl> 4, 5, NA, NA, NA, NA, 17, 55, 55, 55, NA, 5…
$ public_positional_accuracy <dbl> 4, 5, NA, NA, NA, NA, 17, 55, 55, 55, NA, 5…
$ geoprivacy                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxon_geoprivacy           <chr> NA, NA, NA, "open", "open", NA, "open", NA,…
$ coordinates_obscured       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ scientific_name            <chr> "Cornu aspersum", "Oestroidea", "Arphia ram…
$ common_name                <chr> "Garden Snail", "Bot Flies, Blow Flies, and…
$ iconic_taxon_name          <chr> "Mollusca", "Insecta", "Insecta", "Reptilia…
$ taxon_id                   <dbl> 480298, 356157, 54247, 36100, 36204, 69731,…
$ taxon_kingdom_name         <chr> "Animalia", "Animalia", "Animalia", "Animal…
$ taxon_phylum_name          <chr> "Mollusca", "Arthropoda", "Arthropoda", "Ch…
$ taxon_class_name           <chr> "Gastropoda", "Insecta", "Insecta", "Reptil…
$ taxon_order_name           <chr> "Stylommatophora", "Diptera", "Orthoptera",…
$ taxon_family_name          <chr> "Helicidae", NA, "Acrididae", "Phrynosomati…
$ taxon_genus_name           <chr> "Cornu", NA, "Arphia", "Uta", "Sceloporus",…
$ taxon_species_name         <chr> "Cornu aspersum", NA, "Arphia ramona", "Uta…
$ taxon_subspecies_name      <chr> NA, NA, NA, "Uta stansburiana elegans", NA,…
$ threatened                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ establishment_means        <chr> "introduced", NA, "native", "native", "nati…

We can view the first six rows with the head() function, and the last six rows with the tail() function:

head(inat_data)

# A tibble: 6 × 37
       id observed_on time_observed_at   user_id user_login user_name created_at
    <dbl> <date>      <chr>                <dbl> <chr>      <chr>     <chr>     
1 2931940 2016-04-14  2016-04-14 19:25:…  151043 msmorales  Michael … 2016-04-1…
2 2934641 2016-04-14  2016-04-14 19:02:…   10814 smartrf    Richard … 2016-04-1…
3 2934961 2016-04-14  2016-04-14 19:15:…   80445 cdegroof   Chris De… 2016-04-1…
4 2934980 2016-04-14  2016-04-14 19:18:…   80445 cdegroof   Chris De… 2016-04-1…
5 2934994 2016-04-14  2016-04-14 19:19:…   80445 cdegroof   Chris De… 2016-04-1…
6 2935037 2016-04-14  2016-04-14 19:36:…  121033 ttempel    <NA>      2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …

tail(inat_data)

# A tibble: 6 × 37
         id observed_on time_observed_at user_id user_login user_name created_at
      <dbl> <date>      <chr>              <dbl> <chr>      <chr>     <chr>     
1 254128969 2024-04-28  2024-04-28 17:1… 2834615 thannavic… Thanna V… 2024-12-0…
2 255041807 2024-04-26  2024-04-26 23:3… 5347031 epiphyte78 <NA>      2024-12-1…
3 255041881 2024-04-26  2024-04-26 22:1… 5347031 epiphyte78 <NA>      2024-12-1…
4 255041985 2024-04-26  2024-04-26 22:1… 5347031 epiphyte78 <NA>      2024-12-1…
5 255042063 2024-04-26  2024-04-26 20:4… 5347031 epiphyte78 <NA>      2024-12-1…
6 255042124 2024-04-26  2024-04-26 19:1… 5347031 epiphyte78 <NA>      2024-12-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …

You can use the View() function from R to open an interactive viewer, which behaves like a simplified version of a spreadsheet program. If you hover over the tab for the interactive View(), you can click the “x” that appears, which will close the tab.

View(inat_data)

You can use names() from R to see the fields in the data frame.

names(inat_data)

 [1] "id"                         "observed_on"               
 [3] "time_observed_at"           "user_id"                   
 [5] "user_login"                 "user_name"                 
 [7] "created_at"                 "updated_at"                
 [9] "quality_grade"              "license"                   
[11] "url"                        "image_url"                 
[13] "sound_url"                  "tag_list"                  
[15] "description"                "captive_cultivated"        
[17] "latitude"                   "longitude"                 
[19] "positional_accuracy"        "public_positional_accuracy"
[21] "geoprivacy"                 "taxon_geoprivacy"          
[23] "coordinates_obscured"       "scientific_name"           
[25] "common_name"                "iconic_taxon_name"         
[27] "taxon_id"                   "taxon_kingdom_name"        
[29] "taxon_phylum_name"          "taxon_class_name"          
[31] "taxon_order_name"           "taxon_family_name"         
[33] "taxon_genus_name"           "taxon_species_name"        
[35] "taxon_subspecies_name"      "threatened"                
[37] "establishment_means"

We can use the dim() function from R to get the dimensions of a data frame. It returns the number of rows and the number of columns.

dim(inat_data)

[1] 191638     37

inat_data has over 191K rows and 37 columns.
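If you only need one of the two numbers, base R also has nrow() and ncol(). A small sketch with a made-up data frame (toy_data is hypothetical, not the workshop data):

```r
# A made-up data frame with 3 rows and 2 columns
toy_data <- data.frame(
  user_login = c("a", "b", "c"),
  quality_grade = c("research", "casual", "needs_id")
)

dim(toy_data)   # number of rows, then number of columns
nrow(toy_data)  # number of rows only
ncol(toy_data)  # number of columns only
```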

More about functions

To learn more about a function, you can type a ? in front of the name of the function, which will bring up the official documentation for that function:

?head

Function documentation is written by the authors of the functions, so it can vary pretty widely in style and readability. The first section, Description, gives you a concise description of what the function does, but it may not always be enough. The Arguments section defines all the arguments for the function and is usually worth reading thoroughly. Finally, the Examples section at the end will often have some helpful examples that you can run to get a sense of what the function is doing.

The help Arguments section for head() shows four arguments. The first argument x is required, the rest are optional. For example, the n argument in head() specifies the number of rows to print. It defaults to 6, but we can override that by specifying a different number:

head(x = inat_data, n = 10)

# A tibble: 10 × 37
        id observed_on time_observed_at  user_id user_login user_name created_at
     <dbl> <date>      <chr>               <dbl> <chr>      <chr>     <chr>     
 1 2931940 2016-04-14  2016-04-14 19:25…  151043 msmorales  Michael … 2016-04-1…
 2 2934641 2016-04-14  2016-04-14 19:02…   10814 smartrf    Richard … 2016-04-1…
 3 2934961 2016-04-14  2016-04-14 19:15…   80445 cdegroof   Chris De… 2016-04-1…
 4 2934980 2016-04-14  2016-04-14 19:18…   80445 cdegroof   Chris De… 2016-04-1…
 5 2934994 2016-04-14  2016-04-14 19:19…   80445 cdegroof   Chris De… 2016-04-1…
 6 2935037 2016-04-14  2016-04-14 19:36…  121033 ttempel    <NA>      2016-04-1…
 7 2935117 2016-04-15  <NA>                76855 bradrumble <NA>      2016-04-1…
 8 2935139 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 9 2935176 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
10 2935181 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …

If we put the arguments in the same order they are listed in the help Arguments section, we don’t have to name them:

head(inat_data, 10)

# A tibble: 10 × 37
        id observed_on time_observed_at  user_id user_login user_name created_at
     <dbl> <date>      <chr>               <dbl> <chr>      <chr>     <chr>     
 1 2931940 2016-04-14  2016-04-14 19:25…  151043 msmorales  Michael … 2016-04-1…
 2 2934641 2016-04-14  2016-04-14 19:02…   10814 smartrf    Richard … 2016-04-1…
 3 2934961 2016-04-14  2016-04-14 19:15…   80445 cdegroof   Chris De… 2016-04-1…
 4 2934980 2016-04-14  2016-04-14 19:18…   80445 cdegroof   Chris De… 2016-04-1…
 5 2934994 2016-04-14  2016-04-14 19:19…   80445 cdegroof   Chris De… 2016-04-1…
 6 2935037 2016-04-14  2016-04-14 19:36…  121033 ttempel    <NA>      2016-04-1…
 7 2935117 2016-04-15  <NA>                76855 bradrumble <NA>      2016-04-1…
 8 2935139 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 9 2935176 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
10 2935181 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …

Additionally, if we name them, we can put them in any order we want:

head(n = 10, x = inat_data)

# A tibble: 10 × 37
        id observed_on time_observed_at  user_id user_login user_name created_at
     <dbl> <date>      <chr>               <dbl> <chr>      <chr>     <chr>     
 1 2931940 2016-04-14  2016-04-14 19:25…  151043 msmorales  Michael … 2016-04-1…
 2 2934641 2016-04-14  2016-04-14 19:02…   10814 smartrf    Richard … 2016-04-1…
 3 2934961 2016-04-14  2016-04-14 19:15…   80445 cdegroof   Chris De… 2016-04-1…
 4 2934980 2016-04-14  2016-04-14 19:18…   80445 cdegroof   Chris De… 2016-04-1…
 5 2934994 2016-04-14  2016-04-14 19:19…   80445 cdegroof   Chris De… 2016-04-1…
 6 2935037 2016-04-14  2016-04-14 19:36…  121033 ttempel    <NA>      2016-04-1…
 7 2935117 2016-04-15  <NA>                76855 bradrumble <NA>      2016-04-1…
 8 2935139 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 9 2935176 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
10 2935181 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
#   iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …

Manipulating data

One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The dplyr package provides a series of powerful functions for many common data manipulation tasks.

select()

The select() function is used to select certain columns of a data frame. The first argument is the data frame, and the rest of the arguments are unquoted names of the columns you want.

Our inat_data data frame has 37 columns. We want four columns: user_login, common_name, scientific_name, observed_on.

select(inat_data, user_login, common_name, scientific_name, observed_on)

# A tibble: 191,638 × 4
   user_login    common_name                         scientific_name observed_on
   <chr>         <chr>                               <chr>           <date>     
 1 msmorales     Garden Snail                        Cornu aspersum  2016-04-14 
 2 smartrf       Bot Flies, Blow Flies, and Allies   Oestroidea      2016-04-14 
 3 cdegroof      California Orange-winged Grasshopp… Arphia ramona   2016-04-14 
 4 cdegroof      Western Side-blotched Lizard        Uta stansburia… 2016-04-14 
 5 cdegroof      Western Fence Lizard                Sceloporus occ… 2016-04-14 
 6 ttempel       <NA>                                Coelocnemis     2016-04-14 
 7 bradrumble    House Sparrow                       Passer domesti… 2016-04-15 
 8 deedeeflower5 Amur Carp                           Cyprinus rubro… 2016-04-14 
 9 deedeeflower5 Red-eared Slider                    Trachemys scri… 2016-04-14 
10 deedeeflower5 Mallard                             Anas platyrhyn… 2016-04-14 
# ℹ 191,628 more rows

select() creates a new data frame with 191K rows and 4 columns.

filter()

The filter() function is used to select rows that match certain criteria. The first argument is the name of the data frame, and the second argument is the selection criteria.

select observations by common_name

Let’s find all the observations for ‘Western Fence Lizard’, the most popular species in CNC Los Angeles. We want all the rows where common_name is equal to ‘Western Fence Lizard’. Use == to test for equality.

filter(inat_data, common_name == 'Western Fence Lizard')

# A tibble: 3,339 × 37
        id observed_on time_observed_at  user_id user_login user_name created_at
     <dbl> <date>      <chr>               <dbl> <chr>      <chr>     <chr>     
 1 2934994 2016-04-14  2016-04-14 19:19…   80445 cdegroof   Chris De… 2016-04-1…
 2 2935263 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 3 2935420 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 4 2935748 2016-04-14  2016-04-14 21:01…   80445 cdegroof   Chris De… 2016-04-1…
 5 2935965 2016-04-14  2016-04-14 19:44…  171443 lchroman   <NA>      2016-04-1…
 6 2938607 2016-04-14  2016-04-14 23:33…  146517 maiz       <NA>      2016-04-1…
 7 2940103 2016-04-15  2016-04-15 16:31…   80984 kimssight  Kim Moore 2016-04-1…
 8 2940838 2016-04-15  2016-04-15 17:11…  201119 sarahwenn… <NA>      2016-04-1…
 9 2940848 2016-04-15  2016-04-15 17:17…  201119 sarahwenn… <NA>      2016-04-1…
10 2940855 2016-04-15  2016-04-15 17:42…  201119 sarahwenn… <NA>      2016-04-1…
# ℹ 3,329 more rows
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>, …

filter() creates a new data frame with 3,339 rows, and 37 columns.

Keep in mind that a species can have zero, one, or multiple common names. If you want to search by common name, you need to use the exact common name that iNaturalist uses.
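The sketch below shows what "exact" means for ==, using a made-up data frame. The lowercase name in the third row is invented to show that the match is case-sensitive; it is not how iNaturalist stores the name.

```r
library(dplyr)

# Made-up observations; the second row has no common name (NA)
toy_data <- tibble(
  common_name = c("Western Fence Lizard", NA, "western fence lizard"),
  scientific_name = c("Sceloporus occidentalis", "Coelocnemis", "Sceloporus occidentalis")
)

# == is an exact, case-sensitive match, so only the first row is kept
exact_match <- filter(toy_data, common_name == "Western Fence Lizard")
nrow(exact_match)
```

The row with a missing common name is also left out, because comparing NA with a value does not give TRUE.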

select observations by scientific_name

Let’s find all the observations for ‘Sceloporus occidentalis’, the Latin scientific name for ‘Western Fence Lizard’.

filter(inat_data, scientific_name == 'Sceloporus occidentalis')

# A tibble: 3,339 × 37
        id observed_on time_observed_at  user_id user_login user_name created_at
     <dbl> <date>      <chr>               <dbl> <chr>      <chr>     <chr>     
 1 2934994 2016-04-14  2016-04-14 19:19…   80445 cdegroof   Chris De… 2016-04-1…
 2 2935263 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 3 2935420 2016-04-14  <NA>               216108 deedeeflo… <NA>      2016-04-1…
 4 2935748 2016-04-14  2016-04-14 21:01…   80445 cdegroof   Chris De… 2016-04-1…
 5 2935965 2016-04-14  2016-04-14 19:44…  171443 lchroman   <NA>      2016-04-1…
 6 2938607 2016-04-14  2016-04-14 23:33…  146517 maiz       <NA>      2016-04-1…
 7 2940103 2016-04-15  2016-04-15 16:31…   80984 kimssight  Kim Moore 2016-04-1…
 8 2940838 2016-04-15  2016-04-15 17:11…  201119 sarahwenn… <NA>      2016-04-1…
 9 2940848 2016-04-15  2016-04-15 17:17…  201119 sarahwenn… <NA>      2016-04-1…
10 2940855 2016-04-15  2016-04-15 17:42…  201119 sarahwenn… <NA>      2016-04-1…
# ℹ 3,329 more rows
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>, …

We get 3,339 rows, and 37 columns, the same as common_name == 'Western Fence Lizard'.

We will cover how to search for species more in the “Higher taxa” lesson.

The pipe: %>%

What happens if we want to select columns and filter rows?

We use the pipe operator %>% to call multiple functions.

Tip

You can insert %>% by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows).

select observations by user_login

iNaturalist has two fields for the user name: user_login and user_name. iNaturalist displays the user_login for each observation, and displays user_name on the user’s profile page.

Let’s get all observations for iNaturalist user ‘natureinla’, and we only want columns user_login, common_name, scientific_name, observed_on. Since we need both filter() and select(), we use pipe operator %>%.

The pipe operator takes the thing on the left-hand side and inserts it as the first argument of the function on the right-hand side.

inat_data %>%
  filter(user_login == 'natureinla') %>%
  select(user_login, common_name, scientific_name, observed_on)

# A tibble: 2,956 × 4
   user_login common_name            scientific_name               observed_on
   <chr>      <chr>                  <chr>                         <date>     
 1 natureinla Red-eared Slider       Trachemys scripta elegans     2016-04-14 
 2 natureinla Monarch                Danaus plexippus              2016-04-14 
 3 natureinla San Diego Gopher Snake Pituophis catenifer annectens 2016-04-14 
 4 natureinla California Towhee      Melozone crissalis            2016-04-14 
 5 natureinla Cooper's Hawk          Astur cooperii                2016-04-14 
 6 natureinla Monarch                Danaus plexippus              2016-04-14 
 7 natureinla tropical milkweed      Asclepias curassavica         2016-04-14 
 8 natureinla Allen's Hummingbird    Selasphorus sasin             2016-04-14 
 9 natureinla Northern Mockingbird   Mimus polyglottos             2016-04-15 
10 natureinla House Sparrow          Passer domesticus             2016-04-15 
# ℹ 2,946 more rows

It can be helpful to think of %>% as meaning “and then”. inat_data is sent to filter() function. filter() selects rows with ‘natureinla’. And then the output from filter() is sent into the select() function. select() selects 4 columns.
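To make the "and then" reading concrete, here is a sketch that writes the same two steps with and without the pipe, using a made-up data frame (toy_data and its values are hypothetical).

```r
library(dplyr)

toy_data <- tibble(
  user_login = c("natureinla", "cdegroof", "natureinla"),
  common_name = c("Monarch", "Western Fence Lizard", "House Sparrow"),
  quality_grade = c("research", "research", "needs_id")
)

# Without the pipe: select() wraps filter(), so we read from the inside out
nested <- select(filter(toy_data, user_login == "natureinla"), user_login, common_name)

# With the pipe: the same steps, read from top to bottom
piped <- toy_data %>%
  filter(user_login == "natureinla") %>%
  select(user_login, common_name)

# Compare the two results
identical(nested, piped)
```

Both versions do the same work; the piped version is usually easier to read once there are more than one or two steps.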

select observations by coordinates_obscured

Sometimes the coordinates for iNaturalist observations are obscured. For instance, when the observation involves an endangered species, iNaturalist will automatically obscure the coordinates in order to protect the species. Sometimes people choose to obscure their location when they are making observations so that other people will not know their exact location. iNaturalist has information about obscured coordinates.

To access one column in a data frame, use dataframe$column_name.

inat_data$coordinates_obscured

When we pass a data frame column to the table() function from R, it returns the unique values in the column and the number of rows that contain each value.

Use table() to get a count of how many observations have obscured locations by passing in the data frame column.

table(inat_data$coordinates_obscured)


 FALSE   TRUE 
176942  14696

176K rows are FALSE (coordinates are normal), and 14K rows are TRUE (coordinates are obscured).
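In this dataset the two counts add up to the total number of rows, so no values are missing in coordinates_obscured. For columns that do have missing values, table() leaves them out by default. A small sketch with a made-up vector, using the useNA argument of table():

```r
# A made-up logical column with one missing value
obscured <- c(FALSE, TRUE, FALSE, NA, FALSE)

# By default table() leaves out missing values
table(obscured)

# useNA = "ifany" adds a count for NA when there is at least one
counts <- table(obscured, useNA = "ifany")
counts
```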

If the exact location of the observation will affect your analysis, then you want unobscured coordinates. Let’s get the observations where the coordinates are not obscured.

inat_data %>%
  filter(coordinates_obscured == FALSE) %>%
  select(user_login, common_name, scientific_name, observed_on)

# A tibble: 176,942 × 4
   user_login    common_name                         scientific_name observed_on
   <chr>         <chr>                               <chr>           <date>     
 1 msmorales     Garden Snail                        Cornu aspersum  2016-04-14 
 2 smartrf       Bot Flies, Blow Flies, and Allies   Oestroidea      2016-04-14 
 3 cdegroof      California Orange-winged Grasshopp… Arphia ramona   2016-04-14 
 4 cdegroof      Western Side-blotched Lizard        Uta stansburia… 2016-04-14 
 5 cdegroof      Western Fence Lizard                Sceloporus occ… 2016-04-14 
 6 ttempel       <NA>                                Coelocnemis     2016-04-14 
 7 bradrumble    House Sparrow                       Passer domesti… 2016-04-15 
 8 deedeeflower5 Amur Carp                           Cyprinus rubro… 2016-04-14 
 9 deedeeflower5 Red-eared Slider                    Trachemys scri… 2016-04-14 
10 deedeeflower5 Mallard                             Anas platyrhyn… 2016-04-14 
# ℹ 176,932 more rows

Tip

When using both filter() and select(), it is a good idea to use filter() before select(). The following code will cause an error “object ‘coordinates_obscured’ not found”.

inat_data %>%
  select(user_login, common_name, scientific_name, observed_on)  %>% 
  filter(coordinates_obscured == FALSE)

select() creates a data frame with four fields. When we try to filter() using coordinates_obscured, we get an error because the 4-field data frame we pass to filter() does not have the field coordinates_obscured.
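Two ways to avoid that error are sketched below on a made-up data frame (toy_data is hypothetical): filter before selecting, or keep the column that filter() needs in the select() call.

```r
library(dplyr)

toy_data <- tibble(
  user_login = c("a", "b", "c"),
  common_name = c("Mallard", "Monarch", "Garden Snail"),
  coordinates_obscured = c(FALSE, TRUE, FALSE)
)

# Option 1: filter first, while coordinates_obscured is still available
filter_first <- toy_data %>%
  filter(coordinates_obscured == FALSE) %>%
  select(user_login, common_name)

# Option 2: keep coordinates_obscured in select() so filter() can use it
select_first <- toy_data %>%
  select(user_login, common_name, coordinates_obscured) %>%
  filter(coordinates_obscured == FALSE)
```

Both options keep the same two rows; the second one has an extra column in the result.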

select observations by quality_grade

iNaturalist gives a quality grade to each observation. The observations are labeled as ‘needs_id’, ‘research’, or ‘casual’. See the iNaturalist FAQ about quality grade for details.

To see all the unique values for a column, use the unique() function from R and pass in the data frame column.

unique(inat_data$quality_grade)

[1] "research" "needs_id" "casual"

When researchers use iNaturalist data, they normally use research grade observations. Let’s get the observations that are research grade.

inat_data %>%
  filter(quality_grade == 'research')  %>%
  select(user_login, common_name, scientific_name, observed_on)

# A tibble: 107,491 × 4
   user_login    common_name                         scientific_name observed_on
   <chr>         <chr>                               <chr>           <date>     
 1 msmorales     Garden Snail                        Cornu aspersum  2016-04-14 
 2 cdegroof      California Orange-winged Grasshopp… Arphia ramona   2016-04-14 
 3 cdegroof      Western Side-blotched Lizard        Uta stansburia… 2016-04-14 
 4 cdegroof      Western Fence Lizard                Sceloporus occ… 2016-04-14 
 5 deedeeflower5 Red-eared Slider                    Trachemys scri… 2016-04-14 
 6 deedeeflower5 Mallard                             Anas platyrhyn… 2016-04-14 
 7 lchroman      Cactus Wren                         Campylorhynchu… 2016-04-14 
 8 deedeeflower5 Desert Cottontail                   Sylvilagus aud… 2016-04-14 
 9 deedeeflower5 Western Fence Lizard                Sceloporus occ… 2016-04-14 
10 deedeeflower5 Eastern Fox Squirrel                Sciurus niger   2016-04-14 
# ℹ 107,481 more rows

Errors in code

We are writing instructions for the computer. If there is a typo, misspelling, wrong function arguments, etc, the code will not work and we will see errors. R will display the errors in red. You need to fix the errors in order for the code to work. Here are some example errors.

typo: we used %>, when it should be %>%

inat_data %>
  select(user_login, observed_on, common_name)

Misspelled user_logi

inat_data %>%
  select(user_logi, observed_on, common_name)

typo: we use =, when it should be ==

inat_data %>%
  filter(user_login = 'natureinla')

typo: extra )

inat_data %>%
  select(user_login, observed_on, common_name))
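For comparison, here is a sketch of the corrected code, run on a made-up data frame so it works on its own (toy_data and its values are hypothetical).

```r
library(dplyr)

toy_data <- tibble(
  user_login = c("natureinla", "cdegroof"),
  observed_on = as.Date(c("2016-04-14", "2016-04-15")),
  common_name = c("Monarch", "Western Fence Lizard")
)

# Complete pipe %>%, correctly spelled user_login, balanced parentheses
selected <- toy_data %>%
  select(user_login, observed_on, common_name)

# == tests for equality inside filter()
filtered <- toy_data %>%
  filter(user_login == "natureinla")
```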

Exercise 1

Get all your City Nature Challenge observations.

Use read_csv() to load the CNC CSV. Assign the results to my_inat_data object.
Use filter() to select observations with your iNaturalist username. If you don’t have any CNC observations, use ‘quantron’, the most prolific community scientist for CNC Los Angeles.
Use select() to select 4 columns. One of the columns should be common_name.
Assign the results of filter() and select() to my_obs object.
Click on my_obs in the Environment tab to see the results.

my_inat_data <- read_csv(here('data/cleaned/cnc-los-angeles-observations.csv'))

Rows: 191638 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (26): time_observed_at, user_login, user_name, created_at, updated_at, ...
dbl   (7): id, user_id, latitude, longitude, positional_accuracy, public_pos...
lgl   (3): captive_cultivated, coordinates_obscured, threatened
date  (1): observed_on

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

my_obs <- my_inat_data %>%
  filter(user_login == 'natureinla') %>%
  select(user_login, observed_on, common_name, scientific_name)

Logical operators

In previous examples we used one criterion in filter() to select some rows. Often we want to use multiple criteria to select rows. Logical operators allow you to do multiple comparisons at once.

and operator: &

If there are multiple criteria, and we want to get rows that match all of the criteria, we use the and operator & in between the criteria.

condition_1 & condition_2
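A small sketch of how & behaves on its own, using two made-up logical vectors (the names is_lizard and is_research are hypothetical):

```r
# Two made-up conditions checked for four observations
is_lizard   <- c(TRUE, TRUE, FALSE, FALSE)
is_research <- c(TRUE, FALSE, TRUE, FALSE)

# & compares element by element; the result is TRUE only where both are TRUE
both <- is_lizard & is_research
both
```

Only the first observation meets both conditions, so only the first element of the result is TRUE. filter() keeps the rows where the combined condition is TRUE.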

select observations by common_name and quality_grade

Let’s get all ‘Western Fence Lizard’ observations that are research grade. This means we want to get rows where common_name is ‘Western Fence Lizard’ and quality_grade is ‘research’.

my_data <- inat_data %>%
  filter(common_name == 'Western Fence Lizard' & 
           quality_grade == 'research')  %>%
  select(user_login, common_name, scientific_name, observed_on, quality_grade)

View(my_data)

We can check the results to make sure we got the data we want. We can use unique() to check the column values.

unique(my_data$common_name)

[1] "Western Fence Lizard"

unique(my_data$quality_grade)

[1] "research"

select observations by coordinates_obscured and positional_accuracy

Previously we looked at coordinates_obscured. In addition to coordinates being intentionally obscured, another thing that can affect the coordinates for an observation is the accuracy of the coordinates. The accuracy of GPS on smartphones depends on the hardware, software, physical environment, etc. The positional_accuracy from iNaturalist measures the coordinate error in meters. For example, if an observation has a positional accuracy of 65 meters, this means the measured coordinates are within 65 meters of the actual coordinates.

When given a column in a data frame, summary() displays statistics about the values. Let’s use summary() to look at the positional accuracy of observations where the coordinates are not obscured.

my_data <- inat_data %>%
  filter(coordinates_obscured == FALSE)  

 
summary(my_data$positional_accuracy)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
       0        5       12     2070       65 13227987    36601

Min. means the minimal value is 0.

1st Qu. means 25% of the values are less than 5, and 75% are greater than 5.

Median means 50% of the values are less than 12, 50% are greater than 12.

Mean is the sum of the values divided by number of items.

3rd Qu. means 75% of the values are less than 65, and 25% are greater than 65.

Max. means the maximum value is 13,227,987.

NA’s means there are 36,601 rows without positional_accuracy.
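The large gap between the median (12) and the mean (2070) above comes from a small number of very large values. The following is a minimal sketch using made-up numbers (not taken from the iNaturalist data) that shows how one outlier pulls the mean up while the median barely moves, and how na.rm = TRUE tells R to ignore missing values.

```r
# Made-up accuracy values in meters; NA stands for a missing value
accuracy <- c(1, 2, 3, 4, 100, NA)

# Without na.rm = TRUE, a single NA makes the result NA
mean(accuracy)

# Ignore the NA: the outlier (100) pulls the mean well above the median
accuracy_mean <- mean(accuracy, na.rm = TRUE)
accuracy_median <- median(accuracy, na.rm = TRUE)

accuracy_mean
accuracy_median
```

summary() removes the missing values for you and reports how many there were in the NA's column.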

Even though we selected unobscured locations, 25% of the observations could be 65 meters or more away from the actual location due to the accuracy of the GPS device.

If location accuracy is important to your analysis, you can select a small number for positional accuracy. Let’s get observations with unobscured locations that have a positional accuracy of 5 meters or less.

my_data <- inat_data %>%
  filter(coordinates_obscured == FALSE & 
           positional_accuracy <= 5) %>%
  select(user_login, common_name, scientific_name, positional_accuracy, coordinates_obscured)

dim(my_data)

[1] 41417     5

We have 41K observations with a positional accuracy of 5 meters or less.

unique(my_data$coordinates_obscured)

[1] FALSE

unique(my_data$positional_accuracy)

[1] 4 5 3 2 1 0
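Notice that none of the rows with a missing positional_accuracy made it into the results. filter() only keeps rows where the condition is TRUE, so rows where the comparison evaluates to NA are dropped. A small sketch with a made-up data frame (the toy object and its values are assumptions for illustration only):

```r
library(dplyr)

# Toy data: the last observation has no positional accuracy
toy <- data.frame(
  id = 1:4,
  positional_accuracy = c(3, 5, 20, NA)
)

# NA <= 5 evaluates to NA, so filter() drops row 4 as well as row 3
accurate <- toy %>%
  filter(positional_accuracy <= 5)

accurate$id
```

If you need to keep rows with missing values, you have to ask for them explicitly, for example with is.na(positional_accuracy).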

or operator: |

If there are multiple criteria, and we want to get rows that match one or more of the criteria, we use the or operator | between the criteria.

condition_1 | condition_2

select observations by multiple common_name

Let’s get observations where common_name is ‘Western Fence Lizard’ or ‘Western Honey Bee’.

my_data <- inat_data %>%
  filter(common_name == 'Western Honey Bee' | 
        common_name == 'Western Fence Lizard')  %>%
  select(user_login, observed_on, common_name)

dim(my_data)

[1] 5399    3

unique(my_data$common_name)

[1] "Western Fence Lizard" "Western Honey Bee"

& (and) versus | (or)

& (and) returns rows where all conditions are true. This code looks for observations where user_login is ‘natureinla’ and common_name is ‘Western Fence Lizard’.

and_data <- inat_data %>%
  filter(user_login == 'natureinla' & 
           common_name == 'Western Fence Lizard')

dim(and_data)

[1] 79 37

unique(and_data$user_login)

[1] "natureinla"

unique(and_data$common_name)

[1] "Western Fence Lizard"

We get 79 rows with 1 user_login and 1 common_name.

| (or) returns rows where any of the conditions is true. This code looks for observations where user_login is ‘natureinla’ plus observations where common_name is ‘Western Fence Lizard’.

or_data <- inat_data %>%
  filter(user_login == 'natureinla' | 
           common_name == 'Western Fence Lizard')

dim(or_data)

[1] 6216   37

unique(or_data$user_login) %>% length

[1] 1052

unique(or_data$common_name) %>% length

[1] 1031

We get 6,216 rows with 1,052 user_login values and 1,031 common_name values.

& and | return different results. Check the results of your code to make sure they match what you intended.
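The difference is easier to see on a data frame small enough to check by eye. This is a sketch with made-up rows (the user names are hypothetical, not real iNaturalist users):

```r
library(dplyr)

# Four made-up observations
toy <- data.frame(
  user_login = c('user_a', 'user_a', 'user_b', 'user_c'),
  common_name = c('Western Fence Lizard', 'Monarch',
                  'Western Fence Lizard', 'Monarch')
)

# Both conditions must be true: only row 1 qualifies
and_rows <- toy %>%
  filter(user_login == 'user_a' & common_name == 'Western Fence Lizard')

# Either condition may be true: rows 1, 2, and 3 qualify
or_rows <- toy %>%
  filter(user_login == 'user_a' | common_name == 'Western Fence Lizard')

nrow(and_rows)
nrow(or_rows)
```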

%in% c()

Another way to get rows that match one or more of the criteria is with the in operator %in%.

Note

A vector is the way R stores multiple values. The c() (combine) function creates a vector from the values passed in.

c(1, 2, 5)

[1] 1 2 5

The %in% operator returns TRUE if an item matches any value in a given vector.

1 %in% c(1, 2, 5)

[1] TRUE

3 %in% c(1, 2, 5)

[1] FALSE
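When the left side is a vector, %in% checks every item, which is what makes it useful inside filter(). The sketch below, using a few made-up values, suggests that %in% gives the same answer as chaining == comparisons with |.

```r
# A few made-up common names
toy_names <- c('Western Fence Lizard', 'Monarch', 'Western Honey Bee')

# One check per item using %in%
with_in <- toy_names %in% c('Western Fence Lizard', 'Western Honey Bee')

# The same check written with == and |
with_or <- toy_names == 'Western Fence Lizard' | toy_names == 'Western Honey Bee'

with_in
identical(with_in, with_or)
```

%in% becomes more convenient than | as the number of values grows.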

select observations by multiple license

iNaturalist observations, photos, and sounds are covered by licenses. The default license is CC BY-NC (Creative Commons: Attribution-NonCommercial) so other people can use the content if they give attribution to you and use it for non-commercial purposes. More info about iNaturalist licenses and various Creative Commons licenses.

iNaturalist exports observations with No Copyright (CC0), Attribution (CC BY), and Attribution-NonCommercial (CC BY-NC) license to Global Biodiversity Information Facility (GBIF), an international organization that provides access to biodiversity information. Many researchers who use iNaturalist data get their data from GBIF. This means if iNaturalist observers want their data to be used by scientists, they need to use one of those three licenses.

We can use table() to see the license types and count.

table(inat_data$license)


      CC-BY    CC-BY-NC CC-BY-NC-ND CC-BY-NC-SA    CC-BY-ND    CC-BY-SA 
       5384      129677        1199        2934          35          79 
        CC0 
       4934

Let’s get observations with CC0, CC-BY, or CC-BY-NC license. filter(license %in% c('CC0', 'CC-BY', 'CC-BY-NC')) will return rows where the license field is in the vector (‘CC0’, ‘CC-BY’, ‘CC-BY-NC’).

my_data <- inat_data %>%
  filter(license %in% c('CC0', 'CC-BY', 'CC-BY-NC')) %>%
  select(user_login, observed_on, common_name, license)

dim(my_data)

[1] 139995      4

unique(my_data$license)

[1] "CC-BY"    "CC-BY-NC" "CC0"

Exercise 2

Get all your observations that are research grade

Use my_inat_data from Exercise 1 to access CNC observations
Use & with filter() since we want to pick observations by both username and quality grade. Use ‘quantron’ as the user if you don’t have CNC observations.
Use select() to pick 4 columns

my_inat_data %>%
  filter(user_login == 'natureinla' & 
           quality_grade == 'research') %>%
  select(user_login, observed_on, common_name, scientific_name)

# A tibble: 1,556 × 4
   user_login observed_on common_name            scientific_name              
   <chr>      <date>      <chr>                  <chr>                        
 1 natureinla 2016-04-14  Red-eared Slider       Trachemys scripta elegans    
 2 natureinla 2016-04-14  Monarch                Danaus plexippus             
 3 natureinla 2016-04-14  San Diego Gopher Snake Pituophis catenifer annectens
 4 natureinla 2016-04-14  California Towhee      Melozone crissalis           
 5 natureinla 2016-04-14  Cooper's Hawk          Astur cooperii               
 6 natureinla 2016-04-14  Monarch                Danaus plexippus             
 7 natureinla 2016-04-14  Allen's Hummingbird    Selasphorus sasin            
 8 natureinla 2016-04-15  Northern Mockingbird   Mimus polyglottos            
 9 natureinla 2016-04-15  House Sparrow          Passer domesticus            
10 natureinla 2016-04-15  Indian Peafowl         Pavo cristatus               
# ℹ 1,546 more rows

Find items with wildcard or partial search

Previously we used common_name == 'Western Fence Lizard' which did an exact match for 'Western Fence Lizard'. But often we want to search for a word or phrase within a name, not an exact match.

Let’s find all species common names that have the word ‘lizard’.

unique(inat_data$common_name) will return all common names. Use length() to get the number of items.

common_names <- unique(inat_data$common_name) 

length(common_names)

[1] 7260

We have over 7000 common names.

str_subset() from the stringr package finds all items that match a given pattern. The first argument is the items we are searching through. The second argument, pattern, is the pattern we are looking for.

Here we are searching through common names for any names that contain ‘lizard’.

str_subset(common_names, pattern = 'lizard')

character(0)

When we use pattern = 'lizard', we get zero results. The reason is that str_subset() is case sensitive. It is looking for lowercase ‘lizard’.

To have a case insensitive match, we need to pass in (?i) at the beginning of the pattern. This will find matches for ‘lizard’ no matter the case.

str_subset(common_names, pattern = '(?i)lizard')

 [1] "Western Side-blotched Lizard"   "Western Fence Lizard"          
 [3] "Southern Alligator Lizard"      "Great Basin Fence Lizard"      
 [5] "Common Side-blotched Lizard"    "Island Night Lizard"           
 [7] "San Diego Alligator Lizard"     "Sceloporine Lizards"           
 [9] "Lizards"                        "Blainville's Horned Lizard"    
[11] "Southern Sagebrush Lizard"      "Snakes and Lizards"            
[13] "Wall Lizards"                   "Yellow-backed Spiny Lizard"    
[15] "Ocellated Lizard"               "Spiny Lizards"                 
[17] "San Diegan Legless Lizard"      "Desert Night Lizard"           
[19] "Zebra-tailed Lizard"            "Northern Legless Lizard"       
[21] "Southern Italian Wall Lizard"   "Phrynosomatid Lizards"         
[23] "San Clemente Night Lizard"      "Texas Alligator Lizard"        
[25] "Long-nosed Leopard Lizard"      "Italian Wall Lizard"           
[27] "North American Legless Lizards" "Desert Collared Lizard"        
[29] "Ornate Tree Lizard"

All the results have ‘Lizard’ with a capital L, which explains why pattern = 'lizard' did not work.
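stringr also offers a regex() helper with an ignore_case argument, which some people find easier to read than (?i). A minimal sketch on a made-up vector of names, showing that both forms find the same items:

```r
library(stringr)

# Made-up names with different capitalization
toy_names <- c('Western Fence Lizard', 'lizards', 'Monarch')

# Case insensitive match using the inline (?i) flag
inline_flag <- str_subset(toy_names, pattern = '(?i)lizard')

# The same match using stringr's regex() helper
regex_helper <- str_subset(toy_names,
                           pattern = regex('lizard', ignore_case = TRUE))

inline_flag
identical(inline_flag, regex_helper)
```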

Let’s look for all common names with the word ‘ants’.

str_subset(common_names, pattern = '(?i)ants')

 [1] "plants"                                 
 [2] "century plants"                         
 [3] "Typical American Harvester Ants"        
 [4] "Ants, Bees, Wasps, and Sawflies"        
 [5] "currants and gooseberries"              
 [6] "flowering plants"                       
 [7] "Ants"                                   
 [8] "vascular plants"                        
 [9] "pincushion plants"                      
[10] "Stone plants"                           
[11] "bird-of-paradise plants"                
[12] "Ants, Bees, and Stinging Wasps"         
[13] "Pyramid Ants"                           
[14] "Wood, Mound, and Field Ants"            
[15] "Myrmicine Ants"                         
[16] "Odorous Ants"                           
[17] "Cormorants and Shags"                   
[18] "Carpenter Ants"                         
[19] "Narrow-waisted Wasps, Ants, and Bees"   
[20] "Molesta-group Thief Ants"               
[21] "Acorn Ants and Allies"                  
[22] "gumplants"                              
[23] "Big-headed Ants"                        
[24] "dewplants"                              
[25] "Leptomyrmecin Ants"                     
[26] "Solenopsis Fire Ants and Thief Ants"    
[27] "Lasiin Ants"                            
[28] "fallax-group Big-headed Ants"           
[29] "Acrobat Ants"                           
[30] "cast-iron plants"                       
[31] "Cigar Plants and Allies"                
[32] "Formicine Ants"                         
[33] "Citronella Ants, Fuzzy Ants, and Allies"
[34] "ice plants"                             
[35] "Furrowed Ants"                          
[36] "Ruminants"                              
[37] "fusca-group Field Ants and Allies"      
[38] "Velvety Tree Ants"                      
[39] "Airplants"                              
[40] "Sneaking Ants"                          
[41] "radiator plants"                        
[42] "Camponotin Ants"                        
[43] "American Cormorants"                    
[44] "Californicus-group Harvester Ants"      
[45] "Pheasants, Grouse, and Allies"          
[46] "threadplants"                           
[47] "Spider Wasps, Velvet Ants, and Allies"  
[48] "North American pitcher plants"          
[49] "Pavement Ants"                          
[50] "Pincushion plants"

The results include names with the word ‘plants’ because ‘ants’ is part of ‘plants’. If we want to only find matches for the word ‘ants’, we need to use \\b, which matches a word boundary.

\\b before and after “ants” will look for the word “ants”.

str_subset(common_names, pattern = "(?i)\\bants\\b")

 [1] "Typical American Harvester Ants"        
 [2] "Ants, Bees, Wasps, and Sawflies"        
 [3] "Ants"                                   
 [4] "Ants, Bees, and Stinging Wasps"         
 [5] "Pyramid Ants"                           
 [6] "Wood, Mound, and Field Ants"            
 [7] "Myrmicine Ants"                         
 [8] "Odorous Ants"                           
 [9] "Carpenter Ants"                         
[10] "Narrow-waisted Wasps, Ants, and Bees"   
[11] "Molesta-group Thief Ants"               
[12] "Acorn Ants and Allies"                  
[13] "Big-headed Ants"                        
[14] "Leptomyrmecin Ants"                     
[15] "Solenopsis Fire Ants and Thief Ants"    
[16] "Lasiin Ants"                            
[17] "fallax-group Big-headed Ants"           
[18] "Acrobat Ants"                           
[19] "Formicine Ants"                         
[20] "Citronella Ants, Fuzzy Ants, and Allies"
[21] "Furrowed Ants"                          
[22] "fusca-group Field Ants and Allies"      
[23] "Velvety Tree Ants"                      
[24] "Sneaking Ants"                          
[25] "Camponotin Ants"                        
[26] "Californicus-group Harvester Ants"      
[27] "Spider Wasps, Velvet Ants, and Allies"  
[28] "Pavement Ants"

\\b before ‘ant’ will look for words that start with ‘ant’ such as ‘ant’, ‘ants’, ‘anthuriums’. We use [1:30] to show the first 30 matches.

str_subset(common_names, pattern = "(?i)\\bant")[1:30]

 [1] "Typical American Harvester Ants"     
 [2] "Argentine Ant"                       
 [3] "Ants, Bees, Wasps, and Sawflies"     
 [4] "California Harvester Ant"            
 [5] "Ants"                                
 [6] "Western Velvety Tree Ant"            
 [7] "American Winter Ant"                 
 [8] "Anthemid Aphids"                     
 [9] "Francoeur's Field Ant"               
[10] "Ant-mimic Sac Spiders"               
[11] "Ergatogyne Trailing Ant"             
[12] "Southern Fire Ant"                   
[13] "Pacific Velvet Ant"                  
[14] "Ants, Bees, and Stinging Wasps"      
[15] "Pyramid Ants"                        
[16] "Red Imported Fire Ant"               
[17] "Wood, Mound, and Field Ants"         
[18] "Apache Twig Ant"                     
[19] "Andre's Harvester Ant"               
[20] "Myrmicine Ants"                      
[21] "Odorous Ants"                        
[22] "Odorous House Ant"                   
[23] "Antlions and Owlflies"               
[24] "Bicolored Pyramid Ant"               
[25] "Antlions, Lacewings, and Allies"     
[26] "Dark Rover Ant"                      
[27] "Anteater Scarabs"                    
[28] "Carpenter Ants"                      
[29] "Black Harvester Ant"                 
[30] "Narrow-waisted Wasps, Ants, and Bees"

\\b after ‘ant’ will look for words that end with ‘ant’ such as ‘ant’, ‘plant’, ‘giant’.

str_subset(common_names, pattern = "(?i)ant\\b")[1:30]

 [1] "Double-crested Cormorant"    "Argentine Ant"              
 [3] "giant reed"                  "Giant Canyon Woodlouse"     
 [5] "California Harvester Ant"    "fragrant pitcher sage"      
 [7] "Elegant Clarkia"             "golden currant"             
 [9] "Four-lined Plant Bug"        "Fiddleneck Plant Bug"       
[11] "Western Giant Swallowtail"   "Spider plant"               
[13] "jade plant"                  "distant phacelia"           
[15] "Crystalline ice plant"       "Brandt's Cormorant"         
[17] "Giant Kelp"                  "American century plant"     
[19] "pink trailing iceplant"      "Slender Iceplant"           
[21] "giant chain fern"            "Giant Water Bugs"           
[23] "Western Velvety Tree Ant"    "Island Tarplant"            
[25] "American Winter Ant"         "California beeplant"        
[27] "Plant-parasitic Hemipterans" "giant woollystar"           
[29] "fragrant sumac"              "Snowplant"

Now that we have a list of ant names, we can use %in% to look for multiple ant species.

ants <- c(
"Acorn Ants and Allies",
"Acrobat Ants",
"Argentine Ant",
"Big-headed Ants",
"Californicus-group Harvester Ants",
"Camponotin Ants",
"Carpenter Ants",
"Citronella Ants, Fuzzy Ants, and Allies",
"fallax-group Big-headed Ants",
"Formicine Ants",
"Furrowed Ants",
"Lasiin Ants",
"Leptomyrmecin Ants",
"Molesta-group Thief Ants",
"Myrmicine Ants",
"Pavement Ants",
"Pyramid Ants",
"Sneaking Ants",
"Solenopsis Fire Ants and Thief Ants",
"Velvety Tree Ants"
)

ants_obs <- inat_data %>%
  filter(common_name %in% ants) %>%
  select(user_login, observed_on, common_name)

dim(ants_obs)

[1] 446   3
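Instead of typing out a vector of names, a pattern can also be used directly inside filter() with str_detect() from stringr, which returns TRUE or FALSE for each item. The following is a sketch on a made-up data frame; the pattern '(?i)\\bants?\\b' (the word ‘ant’ or ‘ants’) is an assumption chosen for illustration. On the real data it would also match names such as "Ant-mimic Sac Spiders", so the results still need to be checked.

```r
library(dplyr)
library(stringr)

# Made-up common names, including one missing value
toy <- data.frame(
  common_name = c('Argentine Ant', 'Carpenter Ants', 'ice plants',
                  'Giant Kelp', NA)
)

# Keep rows where common_name contains the word 'ant' or 'ants'.
# str_detect() returns NA for the missing name, so filter() drops that row.
ant_rows <- toy %>%
  filter(str_detect(common_name, '(?i)\\bants?\\b'))

ant_rows$common_name
```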

More complex queries

Sometimes we want to use both & and | to select rows. One way is to use multiple filter() statements. Chaining multiple filter() calls is equivalent to combining their conditions with &.

select observations by multiple user_login and common_name

Let’s get observations where user is ‘cdegroof’ or ‘deedeeflower5’, and species is ‘Western Fence Lizard’.

complex_query <- inat_data %>%
  filter(user_login == 'cdegroof' | 
           user_login == 'deedeeflower5') %>%
  filter(common_name == 'Western Fence Lizard')  %>%
  select(user_login, common_name, scientific_name, observed_on)

dim(complex_query)

[1] 33  4

unique(complex_query$common_name)

[1] "Western Fence Lizard"

unique(complex_query$user_login)

[1] "cdegroof"      "deedeeflower5"

Note

This query, which uses | and & in one filter() without parentheses, does not give us what we want.

alt_1 <- inat_data %>%
  filter(user_login == 'cdegroof' | 
           user_login == 'deedeeflower5' & 
           common_name == 'Western Fence Lizard')  %>%
  select(user_login, common_name, scientific_name, observed_on)

dim(alt_1)

[1] 374   4

unique(alt_1$user_login)

[1] "cdegroof"      "deedeeflower5"

unique(alt_1$common_name) %>% length

[1] 137

We get 2 users but 137 common names.

In R, as in most programming languages, and (&) is evaluated before or (|). Our query asked for all observations by ‘deedeeflower5’ for ‘Western Fence Lizard’, and all observations by ‘cdegroof’.

This query using |, &, and parentheses does give us what we want. We put parentheses around the two user_login conditions.

alt_2 <- inat_data %>%
  filter((user_login == 'cdegroof' | user_login == 'deedeeflower5') &
           common_name == 'Western Fence Lizard')  %>%
  select(user_login, common_name, scientific_name, observed_on)

dim(alt_2)

[1] 33  4

unique(alt_2$user_login)

[1] "cdegroof"      "deedeeflower5"

unique(alt_2$common_name)

[1] "Western Fence Lizard"

We get 2 users and 1 common name.
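The order of evaluation described in this note can be checked directly on single logical values. A minimal sketch:

```r
# & is evaluated before |, so this is read as TRUE | (FALSE & FALSE)
without_parens <- TRUE | FALSE & FALSE

# Parentheses force | to be evaluated first
with_parens <- (TRUE | FALSE) & FALSE

without_parens
with_parens
```

The two expressions differ only in the parentheses, yet the first is TRUE and the second is FALSE. When mixing & and | it is safer to always add parentheses.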

Exercise 3

Get all your observations for two species

Use my_inat_data to access CNC observations
Use unique(my_obs$common_name) from Exercise 1 to find two species names.
Use filter() and | to pick two species
Use filter() to pick your username. Use ‘quantron’ as the user if you don’t have CNC observations.
Use select() to pick four columns.

unique(my_obs$common_name)[1:10]

 [1] "Red-eared Slider"       "Monarch"                "San Diego Gopher Snake"
 [4] "California Towhee"      "Cooper's Hawk"          "tropical milkweed"     
 [7] "Allen's Hummingbird"    "Northern Mockingbird"   "House Sparrow"         
[10] "Indian Peafowl"

my_inat_data %>%
  filter(user_login == 'natureinla') %>%
  filter(common_name == 'Red-eared Slider' | common_name=='Monarch') %>%
  select(user_login, observed_on, common_name, scientific_name)

# A tibble: 44 × 4
   user_login observed_on common_name      scientific_name          
   <chr>      <date>      <chr>            <chr>                    
 1 natureinla 2016-04-14  Red-eared Slider Trachemys scripta elegans
 2 natureinla 2016-04-14  Monarch          Danaus plexippus         
 3 natureinla 2016-04-14  Monarch          Danaus plexippus         
 4 natureinla 2016-04-14  Monarch          Danaus plexippus         
 5 natureinla 2016-04-14  Red-eared Slider Trachemys scripta elegans
 6 natureinla 2016-04-16  Monarch          Danaus plexippus         
 7 natureinla 2016-04-15  Monarch          Danaus plexippus         
 8 natureinla 2016-04-17  Monarch          Danaus plexippus         
 9 natureinla 2016-04-15  Monarch          Danaus plexippus         
10 natureinla 2016-04-15  Monarch          Danaus plexippus         
# ℹ 34 more rows

Add new columns with mutate()

Another common task is creating a new column based on values in existing columns. For example, we could add a new column for year.

Tip

A vector is an ordered collection of items. We can access a specific value in a vector by using vector_name[number]. Numbering starts at 1. To access a range of values, use vector_name[start_number:end_number].

letters <- c('a','b','c', 'd')

get first item

letters[1]

[1] "a"

get second and third item

letters[2:3]

[1] "b" "c"

Let’s get observed_on for rows 10317 to 10320. The reason we picked these rows is because the year changes from 2016 to 2017.

inat_data$observed_on[10317:10320]

[1] "2016-04-18" "2016-04-16" "2017-04-14" "2017-04-15"

Let’s use year() to get the year from observed_on for rows 10317 to 10320.

year(inat_data$observed_on)[10317:10320]

[1] 2016 2016 2017 2017

We can use mutate() from dplyr and year() from lubridate to add a year column. For mutate(), we pass in the name of the new column, and the value of the column.

temp <- inat_data %>%
  mutate(year = year(observed_on))

We can also use table() to see the number of observations per year.

table(temp$year)


 2016  2017  2018  2019  2020  2021  2022  2023  2024 
10392 17495 19164 34057 19524 22549 19597 26602 22258

Use class() to check the data type.

class(temp$year)

[1] "numeric"
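mutate() can add more than one column in a single call. The sketch below uses a made-up data frame and adds both a year and a month column; month() also comes from lubridate.

```r
library(dplyr)
library(lubridate)

# Three made-up observation dates
toy <- data.frame(
  observed_on = as.Date(c('2016-04-18', '2017-04-14', '2020-05-01'))
)

# Add two new columns, each computed from observed_on
with_dates <- toy %>%
  mutate(year = year(observed_on),
         month = month(observed_on))

with_dates
```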

select observations by year

Let’s get all observations for 2020. Use mutate() and year() to add year column. Then use filter() to select rows where year is 2020.

temp <- inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(year == 2020)

unique(temp$year)

[1] 2020

Since the year column contains numbers, we can do greater than or less than comparisons.

Let’s get observations between 2018 and 2020 (i.e. 2018, 2019, and 2020).

temp <- inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(year >= 2018 & year <= 2020)

unique(temp$year)

[1] 2018 2019 2020
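dplyr also provides between() as a shortcut for this kind of range check. It includes both end values. A sketch on a made-up data frame, comparing it with the >= and <= version:

```r
library(dplyr)

# Made-up years
years <- data.frame(year = c(2016, 2018, 2019, 2020, 2024))

# Range check written with two comparisons and &
with_and <- years %>%
  filter(year >= 2018 & year <= 2020)

# The same range check using between(), which includes 2018 and 2020
with_between <- years %>%
  filter(between(year, 2018, 2020))

with_between$year
identical(with_and$year, with_between$year)
```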

Exercise 4

Get all of your observations from 2024.

Use my_inat_data to access CNC observations
Use mutate() and year() to add year column
Use filter() to pick observations with your username and year is 2024. Use ‘quantron’ as the user if you don’t have CNC observations.
Use select() to pick 4 columns

my_inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(user_login == 'natureinla' & year == 2024) %>%
  select(user_login, observed_on, common_name, scientific_name)

# A tibble: 1 × 4
  user_login observed_on common_name                scientific_name             
  <chr>      <date>      <chr>                      <chr>                       
1 natureinla 2024-04-29  San Diego Alligator Lizard Elgaria multicarinata webbii

Count the number of rows with count()

We can use count() from dplyr to count the number of rows for each unique value in one or more columns. We pass in the column names as arguments to count().

get observations per year

Let’s count all observations by year. Use mutate() to add a year column. Use count() to count the number of observations for each year. By default, count() will add a new column called n.

inat_data %>%
  mutate(year = year(observed_on)) %>%
  count(year)

# A tibble: 9 × 2
   year     n
  <dbl> <int>
1  2016 10392
2  2017 17495
3  2018 19164
4  2019 34057
5  2020 19524
6  2021 22549
7  2022 19597
8  2023 26602
9  2024 22258

We can specify the name of the count column by passing in name argument to count().

inat_data %>%
  mutate(year = year(observed_on)) %>%
  count(year, name='obs_count')

# A tibble: 9 × 2
   year obs_count
  <dbl>     <int>
1  2016     10392
2  2017     17495
3  2018     19164
4  2019     34057
5  2020     19524
6  2021     22549
7  2022     19597
8  2023     26602
9  2024     22258
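count() is a shortcut for grouping rows and then counting the rows in each group. The sketch below, on a made-up data frame, shows the longer form using group_by(), summarise(), and n() from dplyr, which gives the same counts.

```r
library(dplyr)

# Six made-up observations across three years
toy <- data.frame(year = c(2016, 2016, 2017, 2017, 2017, 2018))

# Short form
with_count <- toy %>%
  count(year, name = 'obs_count')

# Longer form: group the rows by year, then count the rows in each group
with_summarise <- toy %>%
  group_by(year) %>%
  summarise(obs_count = n())

with_count
```

The longer form is useful to know because summarise() can calculate other values for each group, not only counts.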

get top ten most observed species

Let’s count the number of observations for each species. We will pass in both ‘common_name’ and ‘scientific_name’ to count() because some species don’t have a common_name.

counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count')

counts

# A tibble: 9,865 × 3
   common_name                             scientific_name        obs_count
   <chr>                                   <chr>                      <int>
 1 Abalone                                 Haliotis                       7
 2 Abbott's bushmallow                     Malacothamnus abbottii         1
 3 Abelias                                 Abelia                         1
 4 Abert's Thread-waisted Wasp             Ammophila aberti               3
 5 Abyssinian banana                       Ensete ventricosum             1
 6 Acacia Psyllid                          Acizzia uncatoides             2
 7 Acacias, Mimosas, mesquites, and allies Mimosoideae                   10
 8 Acalyptrate Flies                       Acalyptratae                  66
 9 Acanthus                                Acanthus                      23
10 Achilid Planthoppers                    Achilidae                      1
# ℹ 9,855 more rows

It’s often useful to take a look at the results in some order, like the lowest count to highest. We can use the arrange() function from dplyr for that. We pass in the columns we want to order by to arrange(). By default, arrange() will return values from lowest to highest.

counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count')   %>%
  arrange(obs_count)

counts

# A tibble: 9,865 × 3
   common_name                     scientific_name         obs_count
   <chr>                           <chr>                       <int>
 1 Abbott's bushmallow             Malacothamnus abbottii          1
 2 Abelias                         Abelia                          1
 3 Abyssinian banana               Ensete ventricosum              1
 4 Achilid Planthoppers            Achilidae                       1
 5 Acorn Moth                      Blastobasis glandulella         1
 6 Acotylean Flatworms             Acotylea                        1
 7 Active Free-living Bristleworms Errantia                        1
 8 Afghan Tortoise                 Testudo horsfieldii             1
 9 African Clawed Frog             Xenopus laevis                  1
10 African Milk Weed               Euphorbia trigona               1
# ℹ 9,855 more rows

If we want to reverse the order, we can wrap the column names in desc() from dplyr. This will return values from highest to lowest.

counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(desc(obs_count))

counts

# A tibble: 9,865 × 3
   common_name            scientific_name          obs_count
   <chr>                  <chr>                        <int>
 1 Western Fence Lizard   Sceloporus occidentalis       3339
 2 Western Honey Bee      Apis mellifera                2060
 3 dicots                 Magnoliopsida                 2013
 4 plants                 Plantae                       1712
 5 Eastern Fox Squirrel   Sciurus niger                 1475
 6 House Finch            Haemorhous mexicanus          1263
 7 Mourning Dove          Zenaida macroura              1205
 8 flowering plants       Angiospermae                  1161
 9 California poppy       Eschscholzia californica       934
10 Convergent Lady Beetle Hippodamia convergens          929
# ℹ 9,855 more rows

Use slice() from dplyr to return only a certain number of rows. slice(start:end) will return rows from the starting number to the ending number.

Let’s get the top ten species with the most observations.

counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(desc(obs_count))  %>%
  slice(1:10)

counts

# A tibble: 10 × 3
   common_name            scientific_name          obs_count
   <chr>                  <chr>                        <int>
 1 Western Fence Lizard   Sceloporus occidentalis       3339
 2 Western Honey Bee      Apis mellifera                2060
 3 dicots                 Magnoliopsida                 2013
 4 plants                 Plantae                       1712
 5 Eastern Fox Squirrel   Sciurus niger                 1475
 6 House Finch            Haemorhous mexicanus          1263
 7 Mourning Dove          Zenaida macroura              1205
 8 flowering plants       Angiospermae                  1161
 9 California poppy       Eschscholzia californica       934
10 Convergent Lady Beetle Hippodamia convergens          929
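The same kind of "top N" result can be written a little more compactly. This sketch assumes a recent version of dplyr (1.0.0 or later) and uses a made-up data frame: sort = TRUE asks count() to order the counts from highest to lowest, and slice_head() keeps the first n rows.

```r
library(dplyr)

# Six made-up observations of three species
toy <- data.frame(
  common_name = c('Monarch', 'Monarch', 'Monarch',
                  'House Finch', 'House Finch', 'Mourning Dove')
)

# Count, sort from highest to lowest, and keep the top two
top_two <- toy %>%
  count(common_name, name = 'obs_count', sort = TRUE) %>%
  slice_head(n = 2)

top_two
```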

Count higher taxa

Let’s count the observations by kingdom.

counts <- inat_data %>%
  count(taxon_kingdom_name, name='obs_count') %>%
  arrange(desc(obs_count)) 

counts

# A tibble: 8 × 2
  taxon_kingdom_name obs_count
  <chr>                  <int>
1 Plantae                98242
2 Animalia               90127
3 Fungi                   2149
4 Chromista                743
5 Protozoa                 187
6 <NA>                     174
7 Bacteria                  11
8 Viruses                    5

Let’s count observations for phyla in the Animal kingdom. Use filter() to select ‘Animalia’ kingdom. Then count the taxon_phylum_name.

counts <- inat_data %>%
  filter(taxon_kingdom_name == 'Animalia') %>%
  count(taxon_phylum_name, name='obs_count') %>%
  arrange(desc(obs_count)) 

counts

# A tibble: 17 × 2
   taxon_phylum_name obs_count
   <chr>                 <int>
 1 Arthropoda            42739
 2 Chordata              40073
 3 Mollusca               5735
 4 Cnidaria                600
 5 Echinodermata           327
 6 Annelida                300
 7 <NA>                    114
 8 Platyhelminthes          93
 9 Bryozoa                  44
10 Rotifera                 40
11 Porifera                 37
12 Nematoda                  9
13 Nematomorpha              8
14 Ctenophora                3
15 Phoronida                 3
16 Nemertea                  1
17 Tardigrada                1
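The counts above include a row for <NA>, which are observations with no value for that taxon level. If you want to leave those out, one option is to remove them with is.na() before counting. A sketch on a made-up data frame:

```r
library(dplyr)

# Made-up phylum values, two of them missing
toy <- data.frame(
  taxon_phylum_name = c('Arthropoda', 'Chordata', NA, 'Arthropoda', NA)
)

# !is.na() keeps only the rows that have a phylum name
counts_no_na <- toy %>%
  filter(!is.na(taxon_phylum_name)) %>%
  count(taxon_phylum_name, name = 'obs_count') %>%
  arrange(desc(obs_count))

counts_no_na
```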

Exercise 5

Get the number of observations you made per year

Use my_inat_data to access CNC observations
Use mutate() and year() to add year column
Use count() to count the number of observations per year
Use filter() to select observations with your username. Use ‘quantron’ as the user if you don’t have CNC observations.

my_inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(user_login == 'natureinla') %>%
  count(year, name='obs_count')

# A tibble: 8 × 2
   year obs_count
  <dbl>     <int>
1  2016       930
2  2017      1055
3  2018       599
4  2019       350
5  2020        10
6  2021         2
7  2023         9
8  2024         1

Save data

If you want to save your results, you can save the data frames as CSVs.

For instance, a user might only want their observations that are research grade and have unobscured locations.

First, assign the data frame to an object.

my_obs <- inat_data %>%
  filter(user_login == 'natureinla' & 
           quality_grade == 'research' & 
           coordinates_obscured == FALSE) 

my_obs

# A tibble: 1,296 × 37
        id observed_on time_observed_at user_id user_login user_name  created_at
     <dbl> <date>      <chr>              <dbl> <chr>      <chr>      <chr>     
 1 2935688 2016-04-14  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 2 2935724 2016-04-14  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 3 2935782 2016-04-14  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 4 2954406 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 5 2954533 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 6 2954609 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 7 2954698 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 8 2954805 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
 9 2966003 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
10 2966084 2016-04-16  <NA>               21786 natureinla NHMLA Com… 2016-04-1…
# ℹ 1,286 more rows
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
#   url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
#   description <chr>, captive_cultivated <lgl>, latitude <dbl>,
#   longitude <dbl>, positional_accuracy <dbl>,
#   public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
#   coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>, …

Then use write_csv() from readr to create a CSV.

The first argument is the data frame to save.
The second argument is the relative path of where to save the file.
To keep our files organized, we can save the csv in data/cleaned or results.
You should give the file a sensible name to help you remember what is in the file. Some people add the date to the file name to keep track of the various versions.
By default NA values will be saved as ‘NA’ string. na='' will save NA values as empty strings.

write_csv(my_obs, here('data/cleaned/my_observations.csv'), na='')
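It is a good habit to read a saved file back in and check that it contains what you expect. The sketch below uses a made-up data frame and a temporary file created with tempfile() (an assumption for illustration, so that nothing is written into your project folders).

```r
library(readr)

# Made-up data with one missing common name
toy <- data.frame(
  user_login = c('user_a', 'user_b'),
  common_name = c('Monarch', NA)
)

# Save to a temporary file, writing NA values as empty strings
csv_path <- tempfile(fileext = '.csv')
write_csv(toy, csv_path, na = '')

# Read the file back in; empty values are read as NA again
saved <- read_csv(csv_path, show_col_types = FALSE)

dim(saved)
is.na(saved$common_name)
```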