library(readr) # read and write tabular data
library(dplyr) # manipulate data
library(lubridate) # manipulate dates
library(here) # file paths
library(stringr) # work with strings
Working with data
Questions
- How do you work with iNaturalist CSV data in R?
Objectives
- Import CSV data into R.
- Select rows and columns of data.frames.
- Use pipes to link steps together into pipelines.
- Create new data.frame columns using existing columns.
- Export data to a CSV file.
Exploring iNaturalist data
A CSV of iNaturalist observations for City Nature Challenge Los Angeles from 2015 to 2024 is located at “data/cleaned/cnc-los-angeles-observations.csv”. We are going to read that CSV using R.
Functions
Functions are predefined bits of code that do a specific task. Arguments are values that we pass into a function. A function usually takes one or more arguments as input, does something to the values, and produces an output.
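As a small illustration that is not part of the workshop data, the built-in round() function takes a number and a digits argument as input, and produces the rounded number as output. The value 3.14159 is just a made-up example.

```r
# round() is a built-in R function.
# Arguments: the number to round, and how many decimal places to keep.
rounded <- round(3.14159, digits = 2)

# Typing the name of an object prints its value.
rounded
```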
R packages
R itself has many built-in functions, but we can access many more by installing and loading other packages of functions and data into R. We will use several R packages for the workshop.
To install these packages, use the install.packages() function from R. We pass in the package names as arguments. The names of the packages must be in quotes.
install.packages("readr")
R will connect to the internet and download packages from servers that have R packages. R will then install the packages on your computer. The console window will show you the progress of the installation process.
To save time, we have already installed all the packages we need for the workshop.
In order to use a package, use the library() function from R to load the package. We pass in the name of the package as an argument. Do not use quotes around the package name when using library().
library(readr)
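If you are not sure whether a package is already installed, one possible check is sketched below. It assumes you only want a TRUE or FALSE answer; requireNamespace() is a base R function, and the message text is made up for this example.

```r
# Check whether the readr package is available on this computer.
# requireNamespace() returns TRUE if it is, FALSE otherwise.
is_installed <- requireNamespace('readr', quietly = TRUE)

if (!is_installed) {
  # Only a reminder; nothing is installed automatically.
  message('readr is not installed; run install.packages("readr") first')
}
```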
Reading a CSV file
In order to analyze the iNaturalist CSV, we need to load the readr, lubridate, dplyr, and here packages.
Generally it is a good idea to list all the libraries that you will use in the script at the beginning of the script. You want to install a package to your computer once, and then load it with library() in each script where you need to use it.
When we reference other files from an R script, we need to give R precise instructions on where those files are. We do that using something called a file path.
There are two kinds of paths: absolute and relative. Absolute paths are specific to a particular computer, whereas relative paths are relative to a certain folder. Because we are using the RStudio “project” feature, all of our paths are relative to the project folder. For instance, an absolute path is “/Users/username/Documents/CNC-coding-workshop/data/cleaned/cnc-los-angeles-observations.csv”, and a relative path is “data/cleaned/cnc-los-angeles-observations.csv”.
here is an R package that makes it easier to handle file paths.
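As a rough sketch of what here does, the code below builds the path to the workshop CSV starting from the project folder. Passing the folder names as separate arguments is one of the ways here() accepts a path; the object name csv_path is made up for this example.

```r
library(here)

# here() builds a path that starts at the project folder,
# so the same code works on different computers.
csv_path <- here('data', 'cleaned', 'cnc-los-angeles-observations.csv')
csv_path

# file.exists() is a base R function that checks whether the file is really there.
file.exists(csv_path)
```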
We call the read_csv() function from readr, and pass in a relative path to a CSV file in order to load the CSV.
read_csv() will read the file and return the content of the file as a data.frame. A data.frame is how R handles data with rows and columns. So that we can access the content later on, we assign the content to an object called inat_data.
inat_data <- read_csv(here('data/cleaned/cnc-los-angeles-observations.csv'))
We can use the glimpse() function from dplyr to get a summary of the contents of inat_data. It shows the number of rows and columns. For each column, it shows the name, data type (dbl, chr, lgl, date), and the first few values.
glimpse(inat_data)
Rows: 191,638
Columns: 37
$ id <dbl> 2931940, 2934641, 2934961, 2934980, 2934994…
$ observed_on <date> 2016-04-14, 2016-04-14, 2016-04-14, 2016-0…
$ time_observed_at <chr> "2016-04-14 19:25:00 UTC", "2016-04-14 19:0…
$ user_id <dbl> 151043, 10814, 80445, 80445, 80445, 121033,…
$ user_login <chr> "msmorales", "smartrf", "cdegroof", "cdegro…
$ user_name <chr> "Michael Morales", "Richard Smart (he, him)…
$ created_at <chr> "2016-04-14 07:28:36 UTC", "2016-04-14 19:0…
$ updated_at <chr> "2021-12-26 06:58:04 UTC", "2018-05-28 02:0…
$ quality_grade <chr> "research", "needs_id", "research", "resear…
$ license <chr> "CC-BY", "CC-BY-NC", NA, NA, NA, "CC-BY-NC"…
$ url <chr> "http://www.inaturalist.org/observations/29…
$ image_url <chr> "https://inaturalist-open-data.s3.amazonaws…
$ sound_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ tag_list <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ description <chr> "Spotted on a the wall of a planter, while …
$ captive_cultivated <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ latitude <dbl> 34.05829, 34.01742, 34.13020, 34.13143, 34.…
$ longitude <dbl> -117.8219, -118.2892, -118.8226, -118.8215,…
$ positional_accuracy <dbl> 4, 5, NA, NA, NA, NA, 17, 55, 55, 55, NA, 5…
$ public_positional_accuracy <dbl> 4, 5, NA, NA, NA, NA, 17, 55, 55, 55, NA, 5…
$ geoprivacy <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxon_geoprivacy <chr> NA, NA, NA, "open", "open", NA, "open", NA,…
$ coordinates_obscured <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ scientific_name <chr> "Cornu aspersum", "Oestroidea", "Arphia ram…
$ common_name <chr> "Garden Snail", "Bot Flies, Blow Flies, and…
$ iconic_taxon_name <chr> "Mollusca", "Insecta", "Insecta", "Reptilia…
$ taxon_id <dbl> 480298, 356157, 54247, 36100, 36204, 69731,…
$ taxon_kingdom_name <chr> "Animalia", "Animalia", "Animalia", "Animal…
$ taxon_phylum_name <chr> "Mollusca", "Arthropoda", "Arthropoda", "Ch…
$ taxon_class_name <chr> "Gastropoda", "Insecta", "Insecta", "Reptil…
$ taxon_order_name <chr> "Stylommatophora", "Diptera", "Orthoptera",…
$ taxon_family_name <chr> "Helicidae", NA, "Acrididae", "Phrynosomati…
$ taxon_genus_name <chr> "Cornu", NA, "Arphia", "Uta", "Sceloporus",…
$ taxon_species_name <chr> "Cornu aspersum", NA, "Arphia ramona", "Uta…
$ taxon_subspecies_name <chr> NA, NA, NA, "Uta stansburiana elegans", NA,…
$ threatened <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ establishment_means <chr> "introduced", NA, "native", "native", "nati…
We can view the first six rows with the head() function, and the last six rows with the tail() function:
head(inat_data)
# A tibble: 6 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2931940 2016-04-14 2016-04-14 19:25:… 151043 msmorales Michael … 2016-04-1…
2 2934641 2016-04-14 2016-04-14 19:02:… 10814 smartrf Richard … 2016-04-1…
3 2934961 2016-04-14 2016-04-14 19:15:… 80445 cdegroof Chris De… 2016-04-1…
4 2934980 2016-04-14 2016-04-14 19:18:… 80445 cdegroof Chris De… 2016-04-1…
5 2934994 2016-04-14 2016-04-14 19:19:… 80445 cdegroof Chris De… 2016-04-1…
6 2935037 2016-04-14 2016-04-14 19:36:… 121033 ttempel <NA> 2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
# iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …
tail(inat_data)
# A tibble: 6 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 254128969 2024-04-28 2024-04-28 17:1… 2834615 thannavic… Thanna V… 2024-12-0…
2 255041807 2024-04-26 2024-04-26 23:3… 5347031 epiphyte78 <NA> 2024-12-1…
3 255041881 2024-04-26 2024-04-26 22:1… 5347031 epiphyte78 <NA> 2024-12-1…
4 255041985 2024-04-26 2024-04-26 22:1… 5347031 epiphyte78 <NA> 2024-12-1…
5 255042063 2024-04-26 2024-04-26 20:4… 5347031 epiphyte78 <NA> 2024-12-1…
6 255042124 2024-04-26 2024-04-26 19:1… 5347031 epiphyte78 <NA> 2024-12-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
# iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …
You can use the View() function from R to open an interactive viewer, which behaves like a simplified version of a spreadsheet program. If you hover over the tab for the interactive View(), you can click the “x” that appears, which will close the tab.
View(inat_data)
You can use names() from R to see the fields in the data frame.
names(inat_data)
[1] "id" "observed_on"
[3] "time_observed_at" "user_id"
[5] "user_login" "user_name"
[7] "created_at" "updated_at"
[9] "quality_grade" "license"
[11] "url" "image_url"
[13] "sound_url" "tag_list"
[15] "description" "captive_cultivated"
[17] "latitude" "longitude"
[19] "positional_accuracy" "public_positional_accuracy"
[21] "geoprivacy" "taxon_geoprivacy"
[23] "coordinates_obscured" "scientific_name"
[25] "common_name" "iconic_taxon_name"
[27] "taxon_id" "taxon_kingdom_name"
[29] "taxon_phylum_name" "taxon_class_name"
[31] "taxon_order_name" "taxon_family_name"
[33] "taxon_genus_name" "taxon_species_name"
[35] "taxon_subspecies_name" "threatened"
[37] "establishment_means"
We can use the dim() function from R to get the dimensions of a data frame. It returns the number of rows and the number of columns.
dim(inat_data)
[1] 191638 37
inat_data has over 191K rows and 37 columns.
More about functions
To learn more about a function, you can type a ? in front of the name of the function, which will bring up the official documentation for that function:
?head
Function documentation is written by the authors of the functions, so they can vary pretty widely in their style and readability. The first section, Description, gives you a concise description of what the function does, but it may not always be enough. The Arguments section defines all the arguments for the function and is usually worth reading thoroughly. Finally, the Examples section at the end will often have some helpful examples that you can run to get a sense of what the function is doing.
The help Arguments section for head() shows four arguments. The first argument x is required, the rest are optional. For example, the n argument in head() specifies the number of rows to print. It defaults to 6, but we can override that by specifying a different number:
head(x = inat_data, n = 10)
# A tibble: 10 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2931940 2016-04-14 2016-04-14 19:25… 151043 msmorales Michael … 2016-04-1…
2 2934641 2016-04-14 2016-04-14 19:02… 10814 smartrf Richard … 2016-04-1…
3 2934961 2016-04-14 2016-04-14 19:15… 80445 cdegroof Chris De… 2016-04-1…
4 2934980 2016-04-14 2016-04-14 19:18… 80445 cdegroof Chris De… 2016-04-1…
5 2934994 2016-04-14 2016-04-14 19:19… 80445 cdegroof Chris De… 2016-04-1…
6 2935037 2016-04-14 2016-04-14 19:36… 121033 ttempel <NA> 2016-04-1…
7 2935117 2016-04-15 <NA> 76855 bradrumble <NA> 2016-04-1…
8 2935139 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
9 2935176 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
10 2935181 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
# iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …
If we put the arguments in the same order as they are listed in the help Arguments section, we don’t have to name them:
head(inat_data, 10)
# A tibble: 10 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2931940 2016-04-14 2016-04-14 19:25… 151043 msmorales Michael … 2016-04-1…
2 2934641 2016-04-14 2016-04-14 19:02… 10814 smartrf Richard … 2016-04-1…
3 2934961 2016-04-14 2016-04-14 19:15… 80445 cdegroof Chris De… 2016-04-1…
4 2934980 2016-04-14 2016-04-14 19:18… 80445 cdegroof Chris De… 2016-04-1…
5 2934994 2016-04-14 2016-04-14 19:19… 80445 cdegroof Chris De… 2016-04-1…
6 2935037 2016-04-14 2016-04-14 19:36… 121033 ttempel <NA> 2016-04-1…
7 2935117 2016-04-15 <NA> 76855 bradrumble <NA> 2016-04-1…
8 2935139 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
9 2935176 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
10 2935181 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
# iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …
Additionally, if we name them, we can put them in any order we want:
head(n = 10, x = inat_data)
# A tibble: 10 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2931940 2016-04-14 2016-04-14 19:25… 151043 msmorales Michael … 2016-04-1…
2 2934641 2016-04-14 2016-04-14 19:02… 10814 smartrf Richard … 2016-04-1…
3 2934961 2016-04-14 2016-04-14 19:15… 80445 cdegroof Chris De… 2016-04-1…
4 2934980 2016-04-14 2016-04-14 19:18… 80445 cdegroof Chris De… 2016-04-1…
5 2934994 2016-04-14 2016-04-14 19:19… 80445 cdegroof Chris De… 2016-04-1…
6 2935037 2016-04-14 2016-04-14 19:36… 121033 ttempel <NA> 2016-04-1…
7 2935117 2016-04-15 <NA> 76855 bradrumble <NA> 2016-04-1…
8 2935139 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
9 2935176 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
10 2935181 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>,
# iconic_taxon_name <chr>, taxon_id <dbl>, taxon_kingdom_name <chr>, …
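The three ways of passing arguments can be compared on a small input. This sketch uses letters, a vector of the lowercase alphabet that is built into R, instead of inat_data; the object names are made up for this example.

```r
# Three equivalent calls to head() on the built-in letters vector.
by_name     <- head(x = letters, n = 3)   # both arguments named
by_position <- head(letters, 3)           # arguments matched by position
reordered   <- head(n = 3, x = letters)   # named, in a different order

# identical() checks whether two objects are exactly the same.
identical(by_name, by_position)
identical(by_name, reordered)
```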
Manipulating data
One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The dplyr package provides a series of powerful functions for many common data manipulation tasks.
select()
The select() function is used to select certain columns of a data frame. The first argument is the data frame, and the rest of the arguments are unquoted names of the columns you want.
Our inat_data data frame has 37 columns. We want four columns: user_login, common_name, scientific_name, observed_on.
select(inat_data, user_login, common_name, scientific_name, observed_on)
# A tibble: 191,638 × 4
user_login common_name scientific_name observed_on
<chr> <chr> <chr> <date>
1 msmorales Garden Snail Cornu aspersum 2016-04-14
2 smartrf Bot Flies, Blow Flies, and Allies Oestroidea 2016-04-14
3 cdegroof California Orange-winged Grasshopp… Arphia ramona 2016-04-14
4 cdegroof Western Side-blotched Lizard Uta stansburia… 2016-04-14
5 cdegroof Western Fence Lizard Sceloporus occ… 2016-04-14
6 ttempel <NA> Coelocnemis 2016-04-14
7 bradrumble House Sparrow Passer domesti… 2016-04-15
8 deedeeflower5 Amur Carp Cyprinus rubro… 2016-04-14
9 deedeeflower5 Red-eared Slider Trachemys scri… 2016-04-14
10 deedeeflower5 Mallard Anas platyrhyn… 2016-04-14
# ℹ 191,628 more rows
select() creates a new data frame with 191K rows and 4 columns.
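To see that select() returns a new data frame and leaves the original untouched, here is a minimal sketch on a tiny made-up data frame. The names toy_data and two_columns and all the values are invented for illustration.

```r
library(dplyr)

# A tiny made-up data frame standing in for inat_data.
toy_data <- tibble(
  user_login = c('user_a', 'user_b', 'user_c'),
  common_name = c('Mallard', 'Monarch', 'Mallard'),
  quality_grade = c('research', 'casual', 'needs_id')
)

# select() returns a new data frame with only the requested columns.
two_columns <- select(toy_data, user_login, common_name)

names(two_columns)  # the columns we asked for
names(toy_data)     # the original still has all three columns
```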
filter()
The filter() function is used to select rows that match certain criteria. The first argument is the name of the data frame, and the second argument is the selection criteria.
select observations by common_name
Let’s find all the observations for ‘Western Fence Lizard’, the most popular species in CNC Los Angeles. We want all the rows where common_name is equal to ‘Western Fence Lizard’. Use == to test for equality.
filter(inat_data, common_name == 'Western Fence Lizard')
# A tibble: 3,339 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2934994 2016-04-14 2016-04-14 19:19… 80445 cdegroof Chris De… 2016-04-1…
2 2935263 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
3 2935420 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
4 2935748 2016-04-14 2016-04-14 21:01… 80445 cdegroof Chris De… 2016-04-1…
5 2935965 2016-04-14 2016-04-14 19:44… 171443 lchroman <NA> 2016-04-1…
6 2938607 2016-04-14 2016-04-14 23:33… 146517 maiz <NA> 2016-04-1…
7 2940103 2016-04-15 2016-04-15 16:31… 80984 kimssight Kim Moore 2016-04-1…
8 2940838 2016-04-15 2016-04-15 17:11… 201119 sarahwenn… <NA> 2016-04-1…
9 2940848 2016-04-15 2016-04-15 17:17… 201119 sarahwenn… <NA> 2016-04-1…
10 2940855 2016-04-15 2016-04-15 17:42… 201119 sarahwenn… <NA> 2016-04-1…
# ℹ 3,329 more rows
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>, …
filter() creates a new data frame with 3,339 rows and 37 columns.
Keep in mind that a species can have zero, one, or multiple common names. If you want to search by common name, you need to use the exact common name that iNaturalist uses.
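The need for an exact name can be seen on a tiny made-up data frame. In this sketch only the first row matches, because == compares the full text, including upper and lower case; the data and object names are invented for illustration.

```r
library(dplyr)

# Made-up common names with small differences in spelling and case.
toy_data <- tibble(
  common_name = c('Western Fence Lizard', 'western fence lizard', 'Fence Lizard', NA)
)

# == only matches the exact text, including upper and lower case.
exact_match <- filter(toy_data, common_name == 'Western Fence Lizard')
nrow(exact_match)
```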
select observations by scientific_name
Let’s find all the observations for ‘Sceloporus occidentalis’, the Latin scientific name for ‘Western Fence Lizard’.
filter(inat_data, scientific_name == 'Sceloporus occidentalis')
# A tibble: 3,339 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2934994 2016-04-14 2016-04-14 19:19… 80445 cdegroof Chris De… 2016-04-1…
2 2935263 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
3 2935420 2016-04-14 <NA> 216108 deedeeflo… <NA> 2016-04-1…
4 2935748 2016-04-14 2016-04-14 21:01… 80445 cdegroof Chris De… 2016-04-1…
5 2935965 2016-04-14 2016-04-14 19:44… 171443 lchroman <NA> 2016-04-1…
6 2938607 2016-04-14 2016-04-14 23:33… 146517 maiz <NA> 2016-04-1…
7 2940103 2016-04-15 2016-04-15 16:31… 80984 kimssight Kim Moore 2016-04-1…
8 2940838 2016-04-15 2016-04-15 17:11… 201119 sarahwenn… <NA> 2016-04-1…
9 2940848 2016-04-15 2016-04-15 17:17… 201119 sarahwenn… <NA> 2016-04-1…
10 2940855 2016-04-15 2016-04-15 17:42… 201119 sarahwenn… <NA> 2016-04-1…
# ℹ 3,329 more rows
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>, …
We get 3,339 rows and 37 columns, the same as common_name == 'Western Fence Lizard'.
We will cover how to search for species more in the “Higher taxa” lesson.
The pipe: %>%
What happens if we want to select columns and filter rows?
We use the pipe operator %>% to call multiple functions.
You can insert %>% by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows).
select observations by user_login
iNaturalist has two fields for the user name: user_login and user_name. iNaturalist displays the user_login for each observation, and displays user_name on the user’s profile page.
Let’s get all observations for iNaturalist user ‘natureinla’, and we only want columns user_login, common_name, scientific_name, observed_on. Since we need both filter() and select(), we use the pipe operator %>%.
The pipe operator takes the thing on the left-hand side and inserts it as the first argument of the function on the right-hand side.
inat_data %>%
  filter(user_login == 'natureinla') %>%
  select(user_login, common_name, scientific_name, observed_on)
# A tibble: 2,956 × 4
user_login common_name scientific_name observed_on
<chr> <chr> <chr> <date>
1 natureinla Red-eared Slider Trachemys scripta elegans 2016-04-14
2 natureinla Monarch Danaus plexippus 2016-04-14
3 natureinla San Diego Gopher Snake Pituophis catenifer annectens 2016-04-14
4 natureinla California Towhee Melozone crissalis 2016-04-14
5 natureinla Cooper's Hawk Astur cooperii 2016-04-14
6 natureinla Monarch Danaus plexippus 2016-04-14
7 natureinla tropical milkweed Asclepias curassavica 2016-04-14
8 natureinla Allen's Hummingbird Selasphorus sasin 2016-04-14
9 natureinla Northern Mockingbird Mimus polyglottos 2016-04-15
10 natureinla House Sparrow Passer domesticus 2016-04-15
# ℹ 2,946 more rows
It can be helpful to think of %>% as meaning “and then”. inat_data is sent to the filter() function. filter() selects rows with ‘natureinla’. And then the output from filter() is sent into the select() function. select() selects 4 columns.
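To make the “and then” idea concrete, the sketch below writes the same two steps twice on a tiny made-up data frame: once as nested function calls, and once with the pipe. The data and the object names are invented for illustration.

```r
library(dplyr)

# A tiny made-up data frame standing in for inat_data.
toy_data <- tibble(
  user_login = c('user_a', 'user_b', 'user_a'),
  common_name = c('Mallard', 'Monarch', 'House Sparrow'),
  quality_grade = c('research', 'casual', 'research')
)

# Without the pipe: the filter() call is nested inside select().
nested <- select(filter(toy_data, user_login == 'user_a'), user_login, common_name)

# With the pipe: the same steps, read from top to bottom.
piped <- toy_data %>%
  filter(user_login == 'user_a') %>%
  select(user_login, common_name)

# Both versions give the same result.
identical(nested, piped)
```

The piped version is usually easier to read because the steps appear in the order they happen.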
select observations by coordinates_obscured
Sometimes the coordinates for iNaturalist observations are obscured. For instance, when the observation involves an endangered species, iNaturalist will automatically obscure the coordinates in order to protect the species. Sometimes people choose to obscure their location when they are making observations so that other people will not know their exact location. iNaturalist has information about obscured coordinates.
To access one column in a data frame, use dataframe$column_name.
inat_data$coordinates_obscured
When we pass a data frame column to the table() function from R, it will return the unique values in the column, and the number of rows that contain each value.
Use table() to get a count of how many observations have obscured locations by passing in the data frame column.
table(inat_data$coordinates_obscured)
FALSE TRUE
176942 14696
176K rows are FALSE (coordinates are normal), and 14K rows are TRUE (coordinates are obscured).
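One detail worth knowing is that table() leaves out missing values (NA) by default. The sketch below uses a short made-up vector rather than the workshop data; useNA is an argument of table() from base R.

```r
# A made-up logical vector standing in for a data frame column.
obscured <- c(FALSE, TRUE, FALSE, NA, FALSE)

# By default table() leaves out missing values (NA).
table(obscured)

# useNA = 'ifany' adds a count for NA when there are any.
counts_with_na <- table(obscured, useNA = 'ifany')
counts_with_na
```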
If the exact location of the observation will affect your analysis, then you want unobscured coordinates. Let’s get the observations where the coordinates are not obscured.
inat_data %>%
  filter(coordinates_obscured == FALSE) %>%
  select(user_login, common_name, scientific_name, observed_on)
# A tibble: 176,942 × 4
user_login common_name scientific_name observed_on
<chr> <chr> <chr> <date>
1 msmorales Garden Snail Cornu aspersum 2016-04-14
2 smartrf Bot Flies, Blow Flies, and Allies Oestroidea 2016-04-14
3 cdegroof California Orange-winged Grasshopp… Arphia ramona 2016-04-14
4 cdegroof Western Side-blotched Lizard Uta stansburia… 2016-04-14
5 cdegroof Western Fence Lizard Sceloporus occ… 2016-04-14
6 ttempel <NA> Coelocnemis 2016-04-14
7 bradrumble House Sparrow Passer domesti… 2016-04-15
8 deedeeflower5 Amur Carp Cyprinus rubro… 2016-04-14
9 deedeeflower5 Red-eared Slider Trachemys scri… 2016-04-14
10 deedeeflower5 Mallard Anas platyrhyn… 2016-04-14
# ℹ 176,932 more rows
When using both filter() and select(), it is a good idea to use filter() before select(). The following code will cause an error “object ‘coordinates_obscured’ not found”.
inat_data %>%
  select(user_login, common_name, scientific_name, observed_on) %>%
  filter(coordinates_obscured == FALSE)
select() creates a data frame with four fields. When we try to filter() using coordinates_obscured, we get an error because the 4-field data frame we pass to filter() does not have the field coordinates_obscured.
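The sketch below shows two orderings that do work, using a tiny made-up data frame. Either filter() runs first, or select() keeps the column that filter() needs. The data and the object names are invented for illustration.

```r
library(dplyr)

# A tiny made-up data frame standing in for inat_data.
toy_data <- tibble(
  user_login = c('user_a', 'user_b', 'user_c'),
  common_name = c('Mallard', 'Monarch', 'Mallard'),
  coordinates_obscured = c(FALSE, TRUE, FALSE)
)

# Works: filter() runs while coordinates_obscured still exists.
filter_first <- toy_data %>%
  filter(coordinates_obscured == FALSE) %>%
  select(user_login, common_name)

# Also works: select() keeps coordinates_obscured, so filter() can still use it.
select_first <- toy_data %>%
  select(user_login, common_name, coordinates_obscured) %>%
  filter(coordinates_obscured == FALSE)

nrow(filter_first)
nrow(select_first)
```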
select observations by quality_grade
iNaturalist gives a quality grade to each observation. The observations are labeled as ‘needs_id’, ‘research’, or ‘casual’. The iNaturalist FAQ has more information about quality grade.
To see all the unique values for a column, use the unique() function from R and pass in the data frame column.
unique(inat_data$quality_grade)
[1] "research" "needs_id" "casual"
When researchers use iNaturalist data, they normally use research grade observations. Let’s get the observations that are research grade.
inat_data %>%
  filter(quality_grade == 'research') %>%
  select(user_login, common_name, scientific_name, observed_on)
# A tibble: 107,491 × 4
user_login common_name scientific_name observed_on
<chr> <chr> <chr> <date>
1 msmorales Garden Snail Cornu aspersum 2016-04-14
2 cdegroof California Orange-winged Grasshopp… Arphia ramona 2016-04-14
3 cdegroof Western Side-blotched Lizard Uta stansburia… 2016-04-14
4 cdegroof Western Fence Lizard Sceloporus occ… 2016-04-14
5 deedeeflower5 Red-eared Slider Trachemys scri… 2016-04-14
6 deedeeflower5 Mallard Anas platyrhyn… 2016-04-14
7 lchroman Cactus Wren Campylorhynchu… 2016-04-14
8 deedeeflower5 Desert Cottontail Sylvilagus aud… 2016-04-14
9 deedeeflower5 Western Fence Lizard Sceloporus occ… 2016-04-14
10 deedeeflower5 Eastern Fox Squirrel Sciurus niger 2016-04-14
# ℹ 107,481 more rows
Errors in code
We are writing instructions for the computer. If there is a typo, misspelling, wrong function argument, etc., the code will not work and we will see errors. R will display the errors in red. You need to fix the errors in order for the code to work. Here are some example errors.
Typo: we used %>, when it should be %>%
inat_data %>
  select(user_login, observed_on, common_name)
Misspelled column name: we used user_logi, when it should be user_login
inat_data %>%
  select(user_logi, observed_on, common_name)
Typo: we used =, when it should be ==
inat_data %>%
  filter(user_login = 'natureinla')
Typo: extra )
inat_data %>%
  select(user_login, observed_on, common_name))
Exercise 1
Get all your City Nature Challenge observations.
- Use read_csv() to load the CNC CSV. Assign the results to the my_inat_data object.
- Use filter() to select observations with your iNaturalist username. If you don’t have any CNC observations, use ‘quantron’, the most prolific community scientist for CNC Los Angeles.
- Use select() to select 4 columns. One of the columns should be common_name.
- Assign the results of filter() and select() to the my_obs object.
- Click on my_obs in the Environment tab to see the results.
my_inat_data <- read_csv(here('data/cleaned/cnc-los-angeles-observations.csv'))
Rows: 191638 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): time_observed_at, user_login, user_name, created_at, updated_at, ...
dbl (7): id, user_id, latitude, longitude, positional_accuracy, public_pos...
lgl (3): captive_cultivated, coordinates_obscured, threatened
date (1): observed_on
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
my_obs <- my_inat_data %>%
  filter(user_login == 'natureinla') %>%
  select(user_login, observed_on, common_name, scientific_name)
Logical operators
In previous examples we used one criterion in filter() to select some rows. Often we want to use multiple criteria to select some rows. Logical operators allow you to do multiple comparisons at once.
and operator: &
If there are multiple criteria, and we want to get rows that match all of the criteria, we use the and operator & in between the criteria.
condition_1 & condition_2
select observations by common_name and quality_grade
Let’s get all ‘Western Fence Lizard’ observations that are research grade. This means we want to get rows where common_name is ‘Western Fence Lizard’ and quality_grade is ‘research’.
my_data <- inat_data %>%
  filter(common_name == 'Western Fence Lizard' &
           quality_grade == 'research') %>%
  select(user_login, common_name, scientific_name, observed_on, quality_grade)
View(my_data)
We can check the results to make sure we got the data we want. We can use unique() to check the column values.
unique(my_data$common_name)
[1] "Western Fence Lizard"
unique(my_data$quality_grade)
[1] "research"
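As a side note, filter() in dplyr also accepts several conditions separated by commas and combines them with “and”. The sketch below compares the two spellings on a tiny made-up data frame; the data and the object names are invented for illustration.

```r
library(dplyr)

# A tiny made-up data frame standing in for inat_data.
toy_data <- tibble(
  common_name = c('Western Fence Lizard', 'Western Fence Lizard', 'Mallard'),
  quality_grade = c('research', 'casual', 'research')
)

# Conditions joined with &
with_and <- toy_data %>%
  filter(common_name == 'Western Fence Lizard' & quality_grade == 'research')

# Conditions separated by a comma; filter() combines them with "and"
with_comma <- toy_data %>%
  filter(common_name == 'Western Fence Lizard', quality_grade == 'research')

# Both versions keep the same rows.
identical(with_and, with_comma)
```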
select observations by coordinates_obscured and positional_accuracy
Previously we looked at coordinates_obscured. In addition to coordinates being intentionally obscured, another thing that can affect the coordinates for an observation is the accuracy of the coordinates. The accuracy of GPS on smartphones depends on the hardware, software, physical environment, etc. The positional_accuracy field from iNaturalist measures the coordinate error in meters. For example, if an observation has a positional accuracy of 65 meters, this means the measured coordinates are within 65 meters of the actual coordinates.
When given a column in a data frame, summary() displays statistics about the values. Let’s use summary() to look at the positional accuracy of observations where the coordinates are not obscured.
my_data <- inat_data %>%
  filter(coordinates_obscured == FALSE)
summary(my_data$positional_accuracy)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 5 12 2070 65 13227987 36601
Min. means the minimal value is 0.
1st Qu. means 25% of the values are less than 5, and 75% are greater than 5.
Median means 50% of the values are less than 12, 50% are greater than 12.
Mean is the sum of the values divided by number of items.
3rd Qu. means 75% of the values are less than 65, and 25% are greater than 65.
Max. means the maximum value is 13,227,987.
NA’s means there are 36,601 rows without positional_accuracy.
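Even though we selected unobscured locations, about 25% of the observations with a recorded accuracy have a positional accuracy of 65 meters or more. Their coordinates could be 65 or more meters away from the actual location due to the accuracy of the GPS device.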
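The summary above has a mean (2070) that is far larger than the median (12). A few very large values are enough to cause that, as this sketch with made-up numbers suggests. The values in accuracy are invented for illustration and are not taken from the workshop data.

```r
# Made-up accuracy values in meters, with one very large value and one missing value.
accuracy <- c(0, 5, 5, 12, 65, 10000, NA)

summary(accuracy)

# na.rm = TRUE tells median() and mean() to ignore the missing value.
median(accuracy, na.rm = TRUE)
mean(accuracy, na.rm = TRUE)
```

The single value of 10000 pulls the mean far above the median, which is why the median is often a better description of a typical observation.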
If location accuracy is important to your analysis, you can select a small number for positional accuracy. Let’s get observations with unobscured locations that have a positional accuracy of 5 meters or less.
my_data <- inat_data %>%
  filter(coordinates_obscured == FALSE &
           positional_accuracy <= 5) %>%
  select(user_login, common_name, scientific_name, positional_accuracy, coordinates_obscured)
dim(my_data)
[1] 41417 5
We have 41K observations with a positional accuracy of 5 meters or less.
unique(my_data$coordinates_obscured)
[1] FALSE
unique(my_data$positional_accuracy)
[1] 4 5 3 2 1 0
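Notice that NA does not appear among the positional_accuracy values above. filter() only keeps rows where the condition is TRUE, so rows with a missing positional_accuracy are dropped. A minimal sketch on made-up data follows; the data and the object names are invented for illustration.

```r
library(dplyr)

# A tiny made-up data frame with one missing accuracy value.
toy_data <- tibble(
  common_name = c('Mallard', 'Monarch', 'House Sparrow'),
  positional_accuracy = c(3, NA, 40)
)

# Comparing NA with a number gives NA, not TRUE or FALSE.
NA <= 5

# filter() keeps only rows where the condition is TRUE,
# so the row with a missing positional_accuracy is dropped.
accurate <- filter(toy_data, positional_accuracy <= 5)
nrow(accurate)

# is.na() finds the rows where the value is missing.
missing_accuracy <- filter(toy_data, is.na(positional_accuracy))
nrow(missing_accuracy)
```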
or operator: |
If there are multiple criteria, and we want to get rows that match one or more of the criteria, we use the or operator | in between the criteria.
condition_1 | condition_2
select observations by multiple common_name
Let’s get observations where common_name is ‘Western Fence Lizard’ or ‘Western Honey Bee’.
my_data <- inat_data %>%
  filter(common_name == 'Western Honey Bee' |
           common_name == 'Western Fence Lizard') %>%
  select(user_login, observed_on, common_name)
dim(my_data)
[1] 5399 3
unique(my_data$common_name)
[1] "Western Fence Lizard" "Western Honey Bee"
& (and) versus | (or)
& (and) returns rows where all conditions are true. This code looks for observations where user_login is ‘natureinla’ and common_name is ‘Western Fence Lizard’.
and_data <- inat_data %>%
  filter(user_login == 'natureinla' &
           common_name == 'Western Fence Lizard')
dim(and_data)
[1] 79 37
unique(and_data$user_login)
[1] "natureinla"
unique(and_data$common_name)
[1] "Western Fence Lizard"
We get 79 rows with 1 user_login and 1 common_name.
| (or) returns rows where any of the conditions are true. This code looks for observations where user_login is ‘natureinla’ plus observations where common_name is ‘Western Fence Lizard’.
or_data <- inat_data %>%
  filter(user_login == 'natureinla' |
           common_name == 'Western Fence Lizard')
dim(or_data)
[1] 6216 37
unique(or_data$user_login) %>% length
[1] 1052
unique(or_data$common_name) %>% length
[1] 1031
We get 6,216 rows with 1,052 user_login values and 1,031 common_name values.
& and | will return different results. Check the results of your code to make sure they match what you intended.
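The difference is easy to verify by hand on a data frame small enough to read at a glance. This is a sketch with made-up data; the data and the object names are invented for illustration.

```r
library(dplyr)

# Four made-up observations.
toy_data <- tibble(
  user_login = c('user_a', 'user_a', 'user_b', 'user_c'),
  common_name = c('Mallard', 'Monarch', 'Mallard', 'Monarch')
)

# & keeps rows where both conditions are true.
and_rows <- filter(toy_data, user_login == 'user_a' & common_name == 'Mallard')

# | keeps rows where at least one condition is true.
or_rows <- filter(toy_data, user_login == 'user_a' | common_name == 'Mallard')

nrow(and_rows)
nrow(or_rows)
```

Only the first row satisfies both conditions, while the first three rows satisfy at least one of them.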
%in% c()
Another way to get rows that match one or more of the criteria is with the in operator %in%.
A vector is the way R stores multiple values. The c() combine function from R creates a vector from the values passed in.
c(1, 2, 5)
[1] 1 2 5
%in%
operator from R returns true if an item matches values in a given vector.
1 %in% c(1, 2, 5)
[1] TRUE
3 %in% c(1, 2, 5)
[1] FALSE
select observations by multiple license
iNaturalist observations, photos, and sounds are covered by licenses. The default license is CC BY-NC (Creative Commons: Attribution-NonCommercial) so other people can use the content if they give attribution to you and use it for non-commercial purposes. More info about iNaturalist licenses and various Creative Commons licenses.
iNaturalist exports observations with No Copyright (CC0), Attribution (CC BY), and Attribution-NonCommercial (CC BY-NC) licenses to the Global Biodiversity Information Facility (GBIF), an international organization that provides access to biodiversity information. Many researchers who use iNaturalist data get their data from GBIF. This means that if iNaturalist observers want their data to be used by scientists, they need to use one of those three licenses.
We can use table() to see the license types and their counts.
table(inat_data$license)
CC-BY CC-BY-NC CC-BY-NC-ND CC-BY-NC-SA CC-BY-ND CC-BY-SA
5384 129677 1199 2934 35 79
CC0
4934
Let’s get observations with a CC0, CC-BY, or CC-BY-NC license. filter(license %in% c('CC0', 'CC-BY', 'CC-BY-NC')) will return rows where the license field is in the vector c('CC0', 'CC-BY', 'CC-BY-NC').
my_data <- inat_data %>%
  filter(license %in% c('CC0', 'CC-BY', 'CC-BY-NC')) %>%
  select(user_login, observed_on, common_name, license)
dim(my_data)
[1] 139995 4
unique(my_data$license)
[1] "CC-BY" "CC-BY-NC" "CC0"
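A filter with %in% selects the same rows as the same conditions chained with == and |. The sketch below checks this on a small made-up data frame; toy and wanted are hypothetical names used only for illustration.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(
  id      = 1:5,
  license = c('CC0', 'CC-BY', 'CC-BY-NC', 'CC-BY-ND', NA)
)

wanted <- c('CC0', 'CC-BY', 'CC-BY-NC')

# Using %in% with a vector of accepted values
with_in <- toy %>%
  filter(license %in% wanted)

# The same selection written with == and |
with_or <- toy %>%
  filter(license == 'CC0' | license == 'CC-BY' | license == 'CC-BY-NC')

with_in$id  # 1 2 3
with_or$id  # 1 2 3
```

With three values the two forms are about equally readable. As the list of values grows, %in% stays short while the | version gets harder to read.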
Exercise 2
Get all your observations that are research grade.
- Use my_inat_data from Exercise 1 to access CNC observations.
- Use & with filter() since we want to pick observations by both username and quality grade. Use ‘quantron’ as the user if you don’t have CNC observations.
- Use select() to pick 4 columns.
my_inat_data %>%
  filter(user_login == 'natureinla' &
         quality_grade == 'research') %>%
  select(user_login, observed_on, common_name, scientific_name)
# A tibble: 1,556 × 4
user_login observed_on common_name scientific_name
<chr> <date> <chr> <chr>
1 natureinla 2016-04-14 Red-eared Slider Trachemys scripta elegans
2 natureinla 2016-04-14 Monarch Danaus plexippus
3 natureinla 2016-04-14 San Diego Gopher Snake Pituophis catenifer annectens
4 natureinla 2016-04-14 California Towhee Melozone crissalis
5 natureinla 2016-04-14 Cooper's Hawk Astur cooperii
6 natureinla 2016-04-14 Monarch Danaus plexippus
7 natureinla 2016-04-14 Allen's Hummingbird Selasphorus sasin
8 natureinla 2016-04-15 Northern Mockingbird Mimus polyglottos
9 natureinla 2016-04-15 House Sparrow Passer domesticus
10 natureinla 2016-04-15 Indian Peafowl Pavo cristatus
# ℹ 1,546 more rows
Find items with wildcard or partial search
Previously we used common_name == 'Western Fence Lizard', which did an exact match for 'Western Fence Lizard'.
Often, though, we want to search for a word or phrase rather than an exact match.
Let’s find all species common names that contain the word ‘lizard’.
unique(inat_data$common_name) will return all common names. Use length() to get the number of items.
common_names <- unique(inat_data$common_name)
length(common_names)
[1] 7260
We have over 7,000 common names.
str_subset() from the stringr package returns all items that match a given pattern. The first argument is the items we are searching through. The second argument, pattern, is the pattern we are looking for.
Here we are searching through common names for any names that contain ‘lizard’.
str_subset(common_names, pattern = 'lizard')
character(0)
When we use pattern = 'lizard', we get zero results. The reason is that str_subset() is case sensitive: it only matches lowercase ‘lizard’.
For a case-insensitive match, we put (?i) at the beginning of the pattern. This will find matches for ‘lizard’ regardless of case.
str_subset(common_names, pattern = '(?i)lizard')
[1] "Western Side-blotched Lizard" "Western Fence Lizard"
[3] "Southern Alligator Lizard" "Great Basin Fence Lizard"
[5] "Common Side-blotched Lizard" "Island Night Lizard"
[7] "San Diego Alligator Lizard" "Sceloporine Lizards"
[9] "Lizards" "Blainville's Horned Lizard"
[11] "Southern Sagebrush Lizard" "Snakes and Lizards"
[13] "Wall Lizards" "Yellow-backed Spiny Lizard"
[15] "Ocellated Lizard" "Spiny Lizards"
[17] "San Diegan Legless Lizard" "Desert Night Lizard"
[19] "Zebra-tailed Lizard" "Northern Legless Lizard"
[21] "Southern Italian Wall Lizard" "Phrynosomatid Lizards"
[23] "San Clemente Night Lizard" "Texas Alligator Lizard"
[25] "Long-nosed Leopard Lizard" "Italian Wall Lizard"
[27] "North American Legless Lizards" "Desert Collared Lizard"
[29] "Ornate Tree Lizard"
All the results contain capitalized ‘Lizard’, which explains why pattern = 'lizard' did not work.
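stringr also provides a regex() helper with an ignore_case argument, which is another way to ask for a case-insensitive match. The sketch below compares it with the (?i) flag on a small made-up vector; names_vec is a hypothetical name used only for illustration.

```r
library(stringr)

# Toy vector, made up for illustration
names_vec <- c('Western Fence Lizard', 'lizardfish', 'House Finch')

# Case sensitive: only lowercase 'lizard' matches
case_sensitive <- str_subset(names_vec, pattern = 'lizard')

# Inline (?i) flag: case insensitive
inline_flag <- str_subset(names_vec, pattern = '(?i)lizard')

# regex() helper with ignore_case = TRUE: same matches as (?i)
helper <- str_subset(names_vec, pattern = regex('lizard', ignore_case = TRUE))

case_sensitive  # "lizardfish"
inline_flag     # "Western Fence Lizard" "lizardfish"
helper          # "Western Fence Lizard" "lizardfish"
```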
Let’s look for all common names with the word ‘ants’.
str_subset(common_names, pattern = '(?i)ants')
[1] "plants"
[2] "century plants"
[3] "Typical American Harvester Ants"
[4] "Ants, Bees, Wasps, and Sawflies"
[5] "currants and gooseberries"
[6] "flowering plants"
[7] "Ants"
[8] "vascular plants"
[9] "pincushion plants"
[10] "Stone plants"
[11] "bird-of-paradise plants"
[12] "Ants, Bees, and Stinging Wasps"
[13] "Pyramid Ants"
[14] "Wood, Mound, and Field Ants"
[15] "Myrmicine Ants"
[16] "Odorous Ants"
[17] "Cormorants and Shags"
[18] "Carpenter Ants"
[19] "Narrow-waisted Wasps, Ants, and Bees"
[20] "Molesta-group Thief Ants"
[21] "Acorn Ants and Allies"
[22] "gumplants"
[23] "Big-headed Ants"
[24] "dewplants"
[25] "Leptomyrmecin Ants"
[26] "Solenopsis Fire Ants and Thief Ants"
[27] "Lasiin Ants"
[28] "fallax-group Big-headed Ants"
[29] "Acrobat Ants"
[30] "cast-iron plants"
[31] "Cigar Plants and Allies"
[32] "Formicine Ants"
[33] "Citronella Ants, Fuzzy Ants, and Allies"
[34] "ice plants"
[35] "Furrowed Ants"
[36] "Ruminants"
[37] "fusca-group Field Ants and Allies"
[38] "Velvety Tree Ants"
[39] "Airplants"
[40] "Sneaking Ants"
[41] "radiator plants"
[42] "Camponotin Ants"
[43] "American Cormorants"
[44] "Californicus-group Harvester Ants"
[45] "Pheasants, Grouse, and Allies"
[46] "threadplants"
[47] "Spider Wasps, Velvet Ants, and Allies"
[48] "North American pitcher plants"
[49] "Pavement Ants"
[50] "Pincushion plants"
The results include names such as ‘plants’ because ‘ants’ is part of ‘plants’. If we only want matches for the whole word ‘ants’, we need to use \\b, which matches a word boundary. Putting \\b before and after ‘ants’ will look for the word ‘ants’.
str_subset(common_names, pattern = "(?i)\\bants\\b")
[1] "Typical American Harvester Ants"
[2] "Ants, Bees, Wasps, and Sawflies"
[3] "Ants"
[4] "Ants, Bees, and Stinging Wasps"
[5] "Pyramid Ants"
[6] "Wood, Mound, and Field Ants"
[7] "Myrmicine Ants"
[8] "Odorous Ants"
[9] "Carpenter Ants"
[10] "Narrow-waisted Wasps, Ants, and Bees"
[11] "Molesta-group Thief Ants"
[12] "Acorn Ants and Allies"
[13] "Big-headed Ants"
[14] "Leptomyrmecin Ants"
[15] "Solenopsis Fire Ants and Thief Ants"
[16] "Lasiin Ants"
[17] "fallax-group Big-headed Ants"
[18] "Acrobat Ants"
[19] "Formicine Ants"
[20] "Citronella Ants, Fuzzy Ants, and Allies"
[21] "Furrowed Ants"
[22] "fusca-group Field Ants and Allies"
[23] "Velvety Tree Ants"
[24] "Sneaking Ants"
[25] "Camponotin Ants"
[26] "Californicus-group Harvester Ants"
[27] "Spider Wasps, Velvet Ants, and Allies"
[28] "Pavement Ants"
\\b before ‘ant’ will look for words that start with ‘ant’, such as ‘ant’, ‘ants’, and ‘anthuriums’. We use [1:30] to show the first 30 matches.
str_subset(common_names, pattern = "(?i)\\bant")[1:30]
[1] "Typical American Harvester Ants"
[2] "Argentine Ant"
[3] "Ants, Bees, Wasps, and Sawflies"
[4] "California Harvester Ant"
[5] "Ants"
[6] "Western Velvety Tree Ant"
[7] "American Winter Ant"
[8] "Anthemid Aphids"
[9] "Francoeur's Field Ant"
[10] "Ant-mimic Sac Spiders"
[11] "Ergatogyne Trailing Ant"
[12] "Southern Fire Ant"
[13] "Pacific Velvet Ant"
[14] "Ants, Bees, and Stinging Wasps"
[15] "Pyramid Ants"
[16] "Red Imported Fire Ant"
[17] "Wood, Mound, and Field Ants"
[18] "Apache Twig Ant"
[19] "Andre's Harvester Ant"
[20] "Myrmicine Ants"
[21] "Odorous Ants"
[22] "Odorous House Ant"
[23] "Antlions and Owlflies"
[24] "Bicolored Pyramid Ant"
[25] "Antlions, Lacewings, and Allies"
[26] "Dark Rover Ant"
[27] "Anteater Scarabs"
[28] "Carpenter Ants"
[29] "Black Harvester Ant"
[30] "Narrow-waisted Wasps, Ants, and Bees"
\\b after ‘ant’ will look for words that end with ‘ant’, such as ‘ant’, ‘plant’, and ‘giant’.
str_subset(common_names, pattern = "(?i)ant\\b")[1:30]
[1] "Double-crested Cormorant" "Argentine Ant"
[3] "giant reed" "Giant Canyon Woodlouse"
[5] "California Harvester Ant" "fragrant pitcher sage"
[7] "Elegant Clarkia" "golden currant"
[9] "Four-lined Plant Bug" "Fiddleneck Plant Bug"
[11] "Western Giant Swallowtail" "Spider plant"
[13] "jade plant" "distant phacelia"
[15] "Crystalline ice plant" "Brandt's Cormorant"
[17] "Giant Kelp" "American century plant"
[19] "pink trailing iceplant" "Slender Iceplant"
[21] "giant chain fern" "Giant Water Bugs"
[23] "Western Velvety Tree Ant" "Island Tarplant"
[25] "American Winter Ant" "California beeplant"
[27] "Plant-parasitic Hemipterans" "giant woollystar"
[29] "fragrant sumac" "Snowplant"
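The three boundary patterns are easier to compare side by side on a short vector. This is a minimal sketch; names_vec is made up for illustration, although the names in it also appear in the results above.

```r
library(stringr)

# Toy vector, made up for illustration
names_vec <- c('Carpenter Ants', 'ice plants', 'Argentine Ant',
               'Antlions', 'Giant Kelp')

# Whole word 'ants': boundary on both sides
whole_word <- str_subset(names_vec, '(?i)\\bants\\b')

# Words that start with 'ant': boundary before
starts_with <- str_subset(names_vec, '(?i)\\bant')

# Words that end with 'ant': boundary after
ends_with <- str_subset(names_vec, '(?i)ant\\b')

whole_word   # "Carpenter Ants"
starts_with  # "Carpenter Ants" "Argentine Ant" "Antlions"
ends_with    # "Argentine Ant" "Giant Kelp"
```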
Now that we have a list of ant names, we can use %in% to look for multiple ant species.
ants <- c(
  "Acorn Ants and Allies",
  "Acrobat Ants",
  "Argentine Ant",
  "Big-headed Ants",
  "Californicus-group Harvester Ants",
  "Camponotin Ants",
  "Carpenter Ants",
  "Citronella Ants, Fuzzy Ants, and Allies",
  "fallax-group Big-headed Ants",
  "Formicine Ants",
  "Furrowed Ants",
  "Lasiin Ants",
  "Leptomyrmecin Ants",
  "Molesta-group Thief Ants",
  "Myrmicine Ants",
  "Pavement Ants",
  "Pyramid Ants",
  "Sneaking Ants",
  "Solenopsis Fire Ants and Thief Ants",
  "Velvety Tree Ants"
)
ants_obs <- inat_data %>%
  filter(common_name %in% ants) %>%
  select(user_login, observed_on, common_name)
dim(ants_obs)
[1] 446 3
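Typing out every name works, but a pattern can also be used directly inside filter() through str_detect() from stringr, which returns TRUE or FALSE for each row. The sketch below is one possible approach on a made-up data frame (toy is a hypothetical name), not the method used in this lesson. A pattern can also match names that are not ants, such as ‘Ant-mimic Sac Spiders’ or ‘Spider Wasps, Velvet Ants, and Allies’ from the results above, so always check what it returns.

```r
library(dplyr)
library(stringr)

# Toy data, made up for illustration
toy <- tibble(
  user_login  = c('ana', 'ben', 'cal', 'dee'),
  common_name = c('Carpenter Ants', 'ice plants', 'Argentine Ant', NA)
)

# Keep rows whose common_name contains the whole word 'ant' or 'ants'.
# The 's?' makes the final 's' optional.
# Rows with a missing common_name are dropped by filter().
ant_rows <- toy %>%
  filter(str_detect(common_name, '(?i)\\bants?\\b'))

ant_rows$common_name  # "Carpenter Ants" "Argentine Ant"
```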
More complex queries
Sometimes we want to use both & and | to select rows. One option is to use multiple filter() statements. Chaining filter() calls is the equivalent of &.
select observations by multiple user_login and common_name
Let’s get observations where user is ‘cdegroof’ or ‘deedeeflower5’, and species is ‘Western Fence Lizard’.
complex_query <- inat_data %>%
  filter(user_login == 'cdegroof' |
         user_login == 'deedeeflower5') %>%
  filter(common_name == 'Western Fence Lizard') %>%
  select(user_login, common_name, scientific_name, observed_on)
dim(complex_query)
[1] 33 4
unique(complex_query$common_name)
[1] "Western Fence Lizard"
unique(complex_query$user_login)
[1] "cdegroof" "deedeeflower5"
This query, which uses | and & in a single filter(), does not give us what we want.
alt_1 <- inat_data %>%
  filter(user_login == 'cdegroof' |
         user_login == 'deedeeflower5' &
         common_name == 'Western Fence Lizard') %>%
  select(user_login, common_name, scientific_name, observed_on)
dim(alt_1)
[1] 374 4
unique(alt_1$user_login)
[1] "cdegroof" "deedeeflower5"
unique(alt_1$common_name) %>% length
[1] 137
We get 2 users but 137 common names.
In R, as in most programming languages, & (and) is evaluated before | (or). Our query asked for all observations of ‘Western Fence Lizard’ by ‘deedeeflower5’, plus all observations by ‘cdegroof’.
This query, which uses |, &, and (), does give us what we want. We put parentheses around the two user_login conditions.
alt_2 <- inat_data %>%
  filter((user_login == 'cdegroof' | user_login == 'deedeeflower5') &
         common_name == 'Western Fence Lizard') %>%
  select(user_login, common_name, scientific_name, observed_on)
dim(alt_2)
[1] 33 4
unique(alt_2$user_login)
[1] "cdegroof" "deedeeflower5"
unique(alt_2$common_name)
[1] "Western Fence Lizard"
We get 2 users and 1 common name.
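The effect of the parentheses can be seen with plain TRUE and FALSE values, without any data frame. This is a minimal sketch; no_parens and with_parens are names chosen only for illustration.

```r
# Without parentheses, & is evaluated before |,
# so this reads as TRUE | (FALSE & FALSE)
no_parens <- TRUE | FALSE & FALSE

# Parentheses force | to be evaluated first
with_parens <- (TRUE | FALSE) & FALSE

no_parens    # TRUE
with_parens  # FALSE
```

When a filter mixes & and |, adding parentheses even where they are not strictly needed makes the intent clear to the next reader.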
Exercise 3
Get all your observations for two species.
- Use my_inat_data to access CNC observations.
- Use unique(my_obs$common_name) from Exercise 1 to find two species names.
- Use filter() and | to pick two species.
- Use filter() to pick your username. Use ‘quantron’ as the user if you don’t have CNC observations.
- Use select() to pick four columns.
unique(my_obs$common_name)[1:10]
[1] "Red-eared Slider" "Monarch" "San Diego Gopher Snake"
[4] "California Towhee" "Cooper's Hawk" "tropical milkweed"
[7] "Allen's Hummingbird" "Northern Mockingbird" "House Sparrow"
[10] "Indian Peafowl"
my_inat_data %>%
  filter(user_login == 'natureinla') %>%
  filter(common_name == 'Red-eared Slider' | common_name == 'Monarch') %>%
  select(user_login, observed_on, common_name, scientific_name)
# A tibble: 44 × 4
user_login observed_on common_name scientific_name
<chr> <date> <chr> <chr>
1 natureinla 2016-04-14 Red-eared Slider Trachemys scripta elegans
2 natureinla 2016-04-14 Monarch Danaus plexippus
3 natureinla 2016-04-14 Monarch Danaus plexippus
4 natureinla 2016-04-14 Monarch Danaus plexippus
5 natureinla 2016-04-14 Red-eared Slider Trachemys scripta elegans
6 natureinla 2016-04-16 Monarch Danaus plexippus
7 natureinla 2016-04-15 Monarch Danaus plexippus
8 natureinla 2016-04-17 Monarch Danaus plexippus
9 natureinla 2016-04-15 Monarch Danaus plexippus
10 natureinla 2016-04-15 Monarch Danaus plexippus
# ℹ 34 more rows
Add new columns with mutate()
Another common task is creating a new column based on values in existing columns. For example, we could add a new column for year.
A vector is an ordered collection of items. We can access specific values in a vector with vector_name[number]. To access a range of values, use vector_name[start_number:end_number].
letters <- c('a','b','c', 'd')
get first item
letters[1]
[1] "a"
get second and third item
letters[2:3]
[1] "b" "c"
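Positions in an R vector start at 1, not 0. The sketch below uses a made-up vector x to show that a position of 0 is silently skipped, so a range starting at 0 returns the same items as one starting at 1.

```r
# Toy vector, made up for illustration
x <- c('a', 'b', 'c', 'd')

x[1]       # first item: "a"
x[2:3]     # second and third items: "b" "c"
x[0:2]     # position 0 does not exist and is skipped: "a" "b"
length(x)  # number of items: 4
```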
Let’s get observed_on for rows 10317 to 10320. We picked these rows because the year changes from 2016 to 2017.
inat_data$observed_on[10317:10320]
[1] "2016-04-18" "2016-04-16" "2017-04-14" "2017-04-15"
Let’s use year() to get the year from observed_on for rows 10317 to 10320.
year(inat_data$observed_on)[10317:10320]
[1] 2016 2016 2017 2017
We can use mutate() from dplyr and year() from lubridate to add a year column. For mutate(), we pass in the name of the new column and the value of the column.
temp <- inat_data %>%
  mutate(year = year(observed_on))
We can also use table() to see the number of observations per year.
table(temp$year)
2016 2017 2018 2019 2020 2021 2022 2023 2024
10392 17495 19164 34057 19524 22549 19597 26602 22258
Use class() to check the data type.
class(temp$year)
[1] "numeric"
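mutate() can add more than one column in a single call. As a sketch on a made-up data frame (toy and toy_with_parts are hypothetical names), this adds both a year and a month column; month() is another lubridate function that works the same way as year().

```r
library(dplyr)
library(lubridate)

# Toy data, made up for illustration
toy <- tibble(
  observed_on = as.Date(c('2016-04-18', '2017-04-14', '2017-04-15'))
)

# Add two new columns computed from the existing date column
toy_with_parts <- toy %>%
  mutate(year = year(observed_on),
         month = month(observed_on))

toy_with_parts$year   # 2016 2017 2017
toy_with_parts$month  # 4 4 4
```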
select observations by year
Let’s get all observations for 2020. Use mutate() and year() to add a year column. Then use filter() to select rows where year is 2020.
temp <- inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(year == 2020)
unique(temp$year)
[1] 2020
Since the year column contains numbers, we can do greater-than or less-than comparisons.
Let’s get observations from 2018 through 2020 (i.e. 2018, 2019, and 2020).
temp <- inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(year >= 2018 & year <= 2020)
unique(temp$year)
[1] 2018 2019 2020
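dplyr also has a between() helper that expresses the same inclusive range check in one call. The sketch below compares the two forms on a made-up data frame (toy is a hypothetical name); it is an alternative, not a replacement for the version above.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(year = 2016:2022)

# Range check written with >= and <=
with_ops <- toy %>%
  filter(year >= 2018 & year <= 2020)

# The same inclusive range check written with between()
with_between <- toy %>%
  filter(between(year, 2018, 2020))

with_ops$year      # 2018 2019 2020
with_between$year  # 2018 2019 2020
```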
Exercise 4
Get all of your observations from 2024.
- Use my_inat_data to access CNC observations.
- Use mutate() and year() to add a year column.
- Use filter() to pick observations with your username and where year is 2024. Use ‘quantron’ as the user if you don’t have CNC observations.
- Use select() to pick 4 columns.
my_inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(user_login == 'natureinla' & year == 2024) %>%
  select(user_login, observed_on, common_name, scientific_name)
# A tibble: 1 × 4
user_login observed_on common_name scientific_name
<chr> <date> <chr> <chr>
1 natureinla 2024-04-29 San Diego Alligator Lizard Elgaria multicarinata webbii
Count the number of rows with count()
We can use count() from dplyr to count the number of rows for each unique value of one or more columns. We pass in the column names as arguments to count().
get observations per year
Let’s count all observations by year. Use mutate() to add a year column. Use count() to count the number of observations for each year. By default, count() adds a new column called n.
inat_data %>%
  mutate(year = year(observed_on)) %>%
  count(year)
# A tibble: 9 × 2
year n
<dbl> <int>
1 2016 10392
2 2017 17495
3 2018 19164
4 2019 34057
5 2020 19524
6 2021 22549
7 2022 19597
8 2023 26602
9 2024 22258
We can specify the name of the count column by passing the name argument to count().
inat_data %>%
  mutate(year = year(observed_on)) %>%
  count(year, name='obs_count')
# A tibble: 9 × 2
year obs_count
<dbl> <int>
1 2016 10392
2 2017 17495
3 2018 19164
4 2019 34057
5 2020 19524
6 2021 22549
7 2022 19597
8 2023 26602
9 2024 22258
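count() is a shortcut for grouping rows and then counting the rows in each group. As a sketch on a made-up data frame (toy is a hypothetical name), the longer form with group_by() and summarise() from dplyr gives the same counts.

```r
library(dplyr)

# Toy data, made up for illustration
toy <- tibble(year = c(2016, 2016, 2017, 2018, 2018, 2018))

# Short form
with_count <- toy %>%
  count(year, name = 'obs_count')

# Long form: group the rows, then count the rows in each group with n()
with_summarise <- toy %>%
  group_by(year) %>%
  summarise(obs_count = n())

with_count$obs_count      # 2 1 3
with_summarise$obs_count  # 2 1 3
```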
get top ten most observed species
Let’s count the number of observations for each species. We will pass in both common_name and scientific_name to count() because some species don’t have a common_name.
counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count')
counts
# A tibble: 9,865 × 3
common_name scientific_name obs_count
<chr> <chr> <int>
1 Abalone Haliotis 7
2 Abbott's bushmallow Malacothamnus abbottii 1
3 Abelias Abelia 1
4 Abert's Thread-waisted Wasp Ammophila aberti 3
5 Abyssinian banana Ensete ventricosum 1
6 Acacia Psyllid Acizzia uncatoides 2
7 Acacias, Mimosas, mesquites, and allies Mimosoideae 10
8 Acalyptrate Flies Acalyptratae 66
9 Acanthus Acanthus 23
10 Achilid Planthoppers Achilidae 1
# ℹ 9,855 more rows
It’s often useful to look at the results in some order, such as lowest count to highest. We can use the arrange() function from dplyr for that. We pass the columns we want to order by to arrange(). By default, arrange() returns values from lowest to highest.
counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(obs_count)
counts
# A tibble: 9,865 × 3
common_name scientific_name obs_count
<chr> <chr> <int>
1 Abbott's bushmallow Malacothamnus abbottii 1
2 Abelias Abelia 1
3 Abyssinian banana Ensete ventricosum 1
4 Achilid Planthoppers Achilidae 1
5 Acorn Moth Blastobasis glandulella 1
6 Acotylean Flatworms Acotylea 1
7 Active Free-living Bristleworms Errantia 1
8 Afghan Tortoise Testudo horsfieldii 1
9 African Clawed Frog Xenopus laevis 1
10 African Milk Weed Euphorbia trigona 1
# ℹ 9,855 more rows
If we want to reverse the order, we can wrap the column name in desc() from dplyr. This will return values from highest to lowest.
counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(desc(obs_count))
counts
# A tibble: 9,865 × 3
common_name scientific_name obs_count
<chr> <chr> <int>
1 Western Fence Lizard Sceloporus occidentalis 3339
2 Western Honey Bee Apis mellifera 2060
3 dicots Magnoliopsida 2013
4 plants Plantae 1712
5 Eastern Fox Squirrel Sciurus niger 1475
6 House Finch Haemorhous mexicanus 1263
7 Mourning Dove Zenaida macroura 1205
8 flowering plants Angiospermae 1161
9 California poppy Eschscholzia californica 934
10 Convergent Lady Beetle Hippodamia convergens 929
# ℹ 9,855 more rows
Use slice() from dplyr to return only a certain number of rows. slice(start:end) will return rows from the starting number to the ending number.
Let’s get the top ten species with the most observations.
counts <- inat_data %>%
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(desc(obs_count)) %>%
  slice(1:10)
counts
# A tibble: 10 × 3
common_name scientific_name obs_count
<chr> <chr> <int>
1 Western Fence Lizard Sceloporus occidentalis 3339
2 Western Honey Bee Apis mellifera 2060
3 dicots Magnoliopsida 2013
4 plants Plantae 1712
5 Eastern Fox Squirrel Sciurus niger 1475
6 House Finch Haemorhous mexicanus 1263
7 Mourning Dove Zenaida macroura 1205
8 flowering plants Angiospermae 1161
9 California poppy Eschscholzia californica 934
10 Convergent Lady Beetle Hippodamia convergens 929
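dplyr also offers slice_max(), which combines the sorting and slicing steps. The sketch below compares it with arrange() plus slice() on a made-up table of counts; toy_counts and its values are assumptions for illustration. By default slice_max() keeps ties, so it can return more than n rows when counts are equal.

```r
library(dplyr)

# Toy counts, made up for illustration
toy_counts <- tibble(
  common_name = c('Monarch', 'House Finch', 'Mourning Dove', 'Abalone'),
  obs_count   = c(40L, 12L, 25L, 7L)
)

# Sort from highest to lowest, then keep the first two rows
top_two <- toy_counts %>%
  arrange(desc(obs_count)) %>%
  slice(1:2)

# Keep the two rows with the largest obs_count in one step
top_two_max <- toy_counts %>%
  slice_max(obs_count, n = 2)

top_two$common_name      # "Monarch" "Mourning Dove"
top_two_max$common_name  # "Monarch" "Mourning Dove"
```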
Count higher taxa
Let’s count the observations by kingdom.
counts <- inat_data %>%
  count(taxon_kingdom_name, name='obs_count') %>%
  arrange(desc(obs_count))
counts
# A tibble: 8 × 2
taxon_kingdom_name obs_count
<chr> <int>
1 Plantae 98242
2 Animalia 90127
3 Fungi 2149
4 Chromista 743
5 Protozoa 187
6 <NA> 174
7 Bacteria 11
8 Viruses 5
Let’s count observations for phyla in the Animal kingdom. Use filter() to select the ‘Animalia’ kingdom. Then count by taxon_phylum_name.
counts <- inat_data %>%
  filter(taxon_kingdom_name == 'Animalia') %>%
  count(taxon_phylum_name, name='obs_count') %>%
  arrange(desc(obs_count))
counts
# A tibble: 17 × 2
taxon_phylum_name obs_count
<chr> <int>
1 Arthropoda 42739
2 Chordata 40073
3 Mollusca 5735
4 Cnidaria 600
5 Echinodermata 327
6 Annelida 300
7 <NA> 114
8 Platyhelminthes 93
9 Bryozoa 44
10 Rotifera 40
11 Porifera 37
12 Nematoda 9
13 Nematomorpha 8
14 Ctenophora 3
15 Phoronida 3
16 Nemertea 1
17 Tardigrada 1
Exercise 5
Get the number of observations you made per year.
- Use my_inat_data to access CNC observations.
- Use mutate() and year() to add a year column.
- Use filter() to select observations with your username. Use ‘quantron’ as the user if you don’t have CNC observations.
- Use count() to count the number of observations per year.
my_inat_data %>%
  mutate(year = year(observed_on)) %>%
  filter(user_login == 'natureinla') %>%
  count(year, name='obs_count')
# A tibble: 8 × 2
year obs_count
<dbl> <int>
1 2016 930
2 2017 1055
3 2018 599
4 2019 350
5 2020 10
6 2021 2
7 2023 9
8 2024 1
Save data
If you want to save your results, you can save the data frames as CSVs.
For instance, a user might only want their observations that are research grade and have unobscured locations.
First, assign the data frame to an object.
my_obs <- inat_data %>%
  filter(user_login == 'natureinla' &
         quality_grade == 'research' &
         coordinates_obscured == FALSE)
my_obs
# A tibble: 1,296 × 37
id observed_on time_observed_at user_id user_login user_name created_at
<dbl> <date> <chr> <dbl> <chr> <chr> <chr>
1 2935688 2016-04-14 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
2 2935724 2016-04-14 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
3 2935782 2016-04-14 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
4 2954406 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
5 2954533 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
6 2954609 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
7 2954698 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
8 2954805 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
9 2966003 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
10 2966084 2016-04-16 <NA> 21786 natureinla NHMLA Com… 2016-04-1…
# ℹ 1,286 more rows
# ℹ 30 more variables: updated_at <chr>, quality_grade <chr>, license <chr>,
# url <chr>, image_url <chr>, sound_url <chr>, tag_list <chr>,
# description <chr>, captive_cultivated <lgl>, latitude <dbl>,
# longitude <dbl>, positional_accuracy <dbl>,
# public_positional_accuracy <dbl>, geoprivacy <chr>, taxon_geoprivacy <chr>,
# coordinates_obscured <lgl>, scientific_name <chr>, common_name <chr>, …
Then use write_csv() from readr to create a CSV.
- The first argument is the data frame to save.
- The second argument is the relative path of where to save the file.
- To keep our files organized, we can save the CSV in data/cleaned or results.
- You should give the file a sensible name to help you remember what is in the file. Some people add the date to the file name to keep track of the various versions.
- By default, NA values will be saved as the string ‘NA’. na='' will save NA values as empty strings.
write_csv(my_obs, here('data/cleaned/my_observations.csv'), na='')
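To see what na='' does, we can write a small data frame and read the file back. This is a minimal sketch: toy is made up for illustration, and it writes to a temporary file created with tempfile() rather than to the project folders, so it does not touch your data.

```r
library(readr)

# Toy data with one missing value, made up for illustration
toy <- data.frame(
  common_name = c('Monarch', NA),
  obs_count   = c(3, 1)
)

# Temporary file, used here instead of a path built with here()
out_file <- tempfile(fileext = '.csv')
write_csv(toy, out_file, na = '')

# Look at the raw text of the file: the missing value is an empty field
raw_lines <- readLines(out_file)
raw_lines

# Reading the file back turns the empty field into NA again
round_trip <- read_csv(out_file, show_col_types = FALSE)
round_trip
```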