Content from Intro to Data Analysis


Last updated on 2024-05-24

Overview

Questions

  • How do we begin to analyze iNaturalist data?

Objectives

  • Learn how to download data from iNaturalist
  • Learn about the various ways to analyze data

iNaturalist data

When iNaturalist users add an observation through the iNaturalist app, that data is saved to the iNaturalist database. People can download the iNaturalist data as CSVs.

CSV is a text file format for tabular data. Each line represents one record. Each column represents a field. The fields are separated by commas.
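To make the format concrete, here is a minimal sketch that parses a tiny CSV with base R's read.csv(). The two records are made up for illustration and are not taken from the workshop file.

```r
# A tiny, made-up CSV: the first line holds the field names,
# and each following line is one record.
csv_text <- "common_name,scientific_name,observed_on
Western Fence Lizard,Sceloporus occidentalis,2016-04-14
House Finch,Haemorhous mexicanus,2016-04-15"

# read.csv() is part of base R; the text argument lets it parse
# a string instead of a file on disk.
observations <- read.csv(text = csv_text)

nrow(observations)   # number of records
names(observations)  # field names
```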

The iNaturalist CSV has information about:

  • the user who made the observation
  • the location of the observation
  • the observed species
  • links for photos, videos, sounds

Instructions to download iNaturalist data

Here’s a video that shows how to select and download data from iNaturalist.

To save time and ensure everyone at the workshop has the same data, here is a link to a CSV with iNaturalist City Nature Challenge Los Angeles data from 2013 to 2024.

More info about downloading iNaturalist observation data.

https://help.inaturalist.org/en/support/solutions/articles/151000169670

https://help.inaturalist.org/en/support/solutions/articles/151000170342-how-can-i-download-data-from-inaturalist-

Analyzing data

It is difficult for people to see patterns when reading row after row of text. To make patterns easier to see, we can use software to analyze tabular data.

  1. Spreadsheet programs are computer applications that people can use to view, edit, and analyze tabular data. The programs can do calculations and create charts. Examples include Excel and Google Sheets.

  2. Geographic information systems (GIS) are computer applications that people can use to view, edit, and analyze geographic data. The programs can do calculations and create maps. Examples include ArcGIS and QGIS.

  3. Programming languages allow people to write instructions that tell a computer what to do. We write these instructions in the form of code. We can write code to do calculations, create charts, and create maps. Example programming languages include R, Python, and C.

This class uses R because it is a popular language in ecology research and other types of scientific research.

Example of analyzing iNaturalist data using R

Load the software that we will need.

R

library(readr) # read and write tabular data
library(dplyr) # manipulate data
library(ggplot2) # create data visualizations
library(sf) # geospatial
library(lubridate) # manipulate dates
library(mapview) # create interactive maps

Load iNaturalist data from City Nature Challenge Los Angeles 2013-2023. There are over 170,000 observations.

R

inat <- read_csv("data/raw/observations-397280.csv")

OUTPUT

Rows: 171155 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (23): observed_on_string, time_observed_at, time_zone, user_login, user...
dbl  (10): id, user_id, num_identification_agreements, num_identification_di...
lgl   (5): captive_cultivated, private_place_guess, private_latitude, privat...
date  (1): observed_on

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get the 10 most commonly observed ‘species’.

R

top_10 <- inat %>% 
  filter(!is.na(scientific_name)) %>%
  select(common_name, scientific_name) %>% 
  count(common_name, scientific_name, name='count')  %>%
  arrange(desc(count)) %>% 
  slice(1:10) 

top_10

OUTPUT

# A tibble: 10 × 3
   common_name            scientific_name         count
   <chr>                  <chr>                   <int>
 1 Western Fence Lizard   Sceloporus occidentalis  2970
 2 dicots                 Magnoliopsida            1978
 3 Western Honey Bee      Apis mellifera           1818
 4 plants                 Plantae                  1665
 5 Fox Squirrel           Sciurus niger            1323
 6 flowering plants       Angiospermae             1151
 7 House Finch            Haemorhous mexicanus     1122
 8 Mourning Dove          Zenaida macroura         1078
 9 Convergent Lady Beetle Hippodamia convergens     840
10 House Sparrow          Passer domesticus         838

Create a bar chart showing the number of observations per year.

R

inat %>% 
  mutate(year = year(observed_on))  %>%
  group_by(year) %>% 
  ggplot(aes(x = year)) +
  geom_bar()

Create a map showing all the observations of Western Fence Lizards.

https://www.ecologi.st/spatial-r/rdemo.html#converting-a-dataframe-into-a-spatial-object

R

inat_lizard <- inat %>% 
  filter(!is.na(latitude) &
           !is.na(longitude) &
           !is.na(scientific_name)) %>% 
  st_as_sf(coords = c("longitude", "latitude"),   crs = 4326, remove=FALSE) %>% 
  select(id, user_login, common_name, scientific_name, observed_on,  url, longitude, latitude, geometry) %>% 
  filter(common_name == 'Western Fence Lizard')


mapview(inat_lizard)

Content from Introduction to R and RStudio


Last updated on 2024-05-24

Overview

Questions

  • Why should you use R and RStudio?
  • How do you get started working in R and RStudio?

Objectives

  • Understand the difference between R and RStudio
  • Describe the purpose of the different RStudio panes
  • Organize files and directories into R Projects

What are R and RStudio?


R refers to a programming language as well as the software that runs R code.

RStudio is a software application that can make it easier to write and run code in R.

Think of R as English, and RStudio as Microsoft Word or Google Docs.

Both R and RStudio are:

  • free
  • cross platform: they work on Mac, Windows, and Linux
  • open source: anyone can use and modify the software

Getting set up in RStudio

RStudio provides a “Projects” feature that can make it easier to work on individual projects in R. We will create a project in which we will keep everything for this workshop.

  1. Start RStudio (you should see a view similar to the screenshot above).
  2. In the top right, you will see a blue 3D cube and the words “Project: (None)”. Click on this icon.
  3. Click New Project from the dropdown menu.
  4. Click New Directory, then New Project.
  5. Type out a name for the project.
  6. Put it in a convenient location using the “Create project as a subdirectory of:” section. You can always move the project somewhere else later, because it will be self-contained.
  7. Click Create Project and your new project will open.

When you need to work on the project, click on the Rproj icon to start RStudio.

One of the benefits to using RStudio Projects is that they automatically set the working directory to the top-level folder for the project. The working directory is the folder where R is working, so it views the location of all files (including data and scripts) as being relative to the working directory.

Organizing your project directory

Using a consistent folder structure across all your new projects will help keep a growing project organized, and make it easy to find files in the future.

We will use a basic structure for this workshop.

project_folder
│
├── scripts
│
├── data
│    ├── cleaned
│    └── raw
│
├── images
│
└── documents

Within our project folder, we first have a scripts folder to hold any scripts we write. We also have a data folder containing cleaned and raw subfolders. In general, you want to keep your raw data completely untouched, so once you put data into that folder, you do not modify it. Instead, you read it into R, and if you make any modifications, you write that modified file into the cleaned folder. We also have an images folder for plots we make, and a documents folder for any other documents you might produce.
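The folders can be created by hand in a file browser, or from the R console. The following is a minimal sketch of the second option. It assumes the working directory is the top-level project folder, which is the case when the RStudio Project is open; the object names project_root and folders are only for illustration.

```r
# Assumption: the working directory is the top-level project folder
project_root <- getwd()

folders <- c(
  "scripts",
  file.path("data", "cleaned"),
  file.path("data", "raw"),
  "images",
  "documents"
)

for (folder in folders) {
  # recursive = TRUE also creates the parent "data" folder;
  # showWarnings = FALSE keeps R quiet if a folder already exists
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that now exist in the project
list.dirs(project_root, recursive = FALSE)
```

Running the code a second time is harmless, because existing folders are left as they are.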

Content from Exploring and understanding data


Last updated on 2024-05-24

Overview

Questions

  • How does R store and represent data?

Objectives

  • Solve simple arithmetic operations in R.
  • Use comments to document a script.
  • Assign values to objects in R.
  • Call functions and use arguments to change their default options.
  • Understand vector types and missing data.

Setup


Simple arithmetic operations


You can use R to do simple calculations.

R

3 * 5

OUTPUT

[1] 15

R

3 + 5

OUTPUT

[1] 8

The results will be shown in the console.

Comments


All programming languages allow the programmer to include comments in their code to explain the code.

To do this in R we use the # character. Anything to the right of the # sign and up to the end of the line is treated as a comment and is ignored by R. You can start lines with comments or include them after any code on the line.

R

3 * 5  # my first comment 

OUTPUT

[1] 15

R

# my second comment

RStudio makes it easy to comment or uncomment several lines at once: select the lines you want to comment, then press Ctrl + Shift + C. To comment out a single line, place the cursor anywhere on that line (there is no need to select the whole line), then press Ctrl + Shift + C.

Creating objects in R


A fundamental part of programming is assigning values to named objects. The value is stored in memory, and we can refer to the value using the name of the object. To create an object, we give it a name, followed by the assignment operator <- and the value we want to give it.

R

rectangle_length <- 3

What we are doing here is taking the result of the code on the right side of the arrow, and assigning it to an object whose name is on the left side of the arrow. So, after executing rectangle_length <- 3, the value of rectangle_length is 3.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

Objects are displayed in the Environment panel. Objects are stored in R's memory, and can be accessed by typing the name of the object. If you restart R or RStudio, all the objects are deleted from memory.

R

rectangle_length

OUTPUT

[1] 3

Let’s create a second object.

R

rectangle_width <- 5

Now that R has rectangle_length and rectangle_width in memory, we can do arithmetic with them.

R

rectangle_length * rectangle_width

OUTPUT

[1] 15

R

rectangle_length + rectangle_width

OUTPUT

[1] 8

We can also store the results in an object.

R

rectangle_area <- rectangle_length * rectangle_width

When assigning a value to an object, R does not print anything. You can force R to print the value by typing the object name:

R

rectangle_area <- rectangle_length * rectangle_width    # doesn't print anything
rectangle_area        # typing the name of the object prints the value of the object

OUTPUT

[1] 15

We can also change an object’s value by assigning it a new one:

R

rectangle_length <- 4
rectangle_length

OUTPUT

[1] 4

You will be naming a lot of objects in R, and there are a few common naming rules and conventions:

  • make names clear without being too long
  • names cannot start with a number
  • names are case sensitive: rectangle_length is different from Rectangle_length
  • you cannot use reserved words in R, like if, else, or for
  • avoid dots . in names
  • two common formats are snake_case and camelCase
  • be consistent, at least within a script, ideally within a whole project
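A short sketch of two of these rules in action. It assumes that no object called Rectangle_length has been created in the session; exists() and make.names() are base R functions, and 2nd_rectangle is a made-up name that breaks the "cannot start with a number" rule.

```r
rectangle_length <- 3

exists("rectangle_length")  # is there an object with exactly this name?
exists("Rectangle_length")  # different capitalization, so a different name

# make.names() shows how R would repair a name that breaks the rules
make.names("2nd_rectangle")
```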

Functions


Functions are lines of code that are grouped together to do something. The R language has many built-in functions. You can also install and load R packages, which contain functions and data written by other people. You can also create your own functions.

A function usually gets one or more inputs called arguments. Functions will do something with the arguments. Functions often (but not always) return a value. Executing a function (‘running it’) is called calling the function.
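As an illustration of these terms, here is a minimal sketch of a user-defined function. The name calculate_area and its arguments are hypothetical and chosen only for this example.

```r
# calculate_area() is a hypothetical function written for this example.
# It takes two arguments; width has a default value of 1.
calculate_area <- function(length, width = 1) {
  length * width  # the last expression is the value the function returns
}

calculate_area(3, 5)        # calling the function with two arguments
calculate_area(length = 4)  # width falls back to its default
```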

R has a function, round(), that rounds a number to a certain number of decimal places. If we pass in 3.14159, it returns the value 3. That’s because the default is to round to the nearest whole number.

R

round(3.14159)

OUTPUT

[1] 3

To learn more about a function, you can type a ? in front of the name of the function, which will bring up the official documentation for that function:

R

?round

Function documentation is written by the authors of the functions, so it can vary pretty widely in style and readability. The Description section gives you a description of what the function does. The Arguments section defines all the arguments for the function and is usually worth reading thoroughly. The Examples section at the end will often have some helpful examples that you can run to get a sense of what the function is doing.

args() will show the arguments of a function.

R

args(round)

OUTPUT

function (x, digits = 0) 
NULL

round() takes two arguments: x and digits. If we want a different number of digits, we can type digits = 2.

R

round(x = 3.14159, digits = 2)

OUTPUT

[1] 3.14

If you provide the arguments in the exact same order as they are defined you don’t have to name them:

R

round(3.14159, 2)

OUTPUT

[1] 3.14

And if you do name the arguments, you can switch their order:

R

round(digits = 2, x = 3.14159)

OUTPUT

[1] 3.14

Data types in R


Objects can store different types of values such as numbers, letters, etc. These different types of data are called data types.

The function typeof() indicates the type of an object.

The 3 common data types we will use in this class:

  1. numeric, aka double - all numbers with and without decimals.

R

my_number <- 1
typeof(my_number)

OUTPUT

[1] "double"

R

my_number_2 <- 2.2
typeof(my_number_2)

OUTPUT

[1] "double"

  2. character - all characters. The characters must be wrapped in quotes, either double (") or single (').

R

my_character <- 'dog'
typeof(my_character)

OUTPUT

[1] "character"

  3. logical - can only have two values: TRUE and FALSE. Both must be written in capital letters.

R

my_logical <- TRUE
typeof(my_logical)

OUTPUT

[1] "logical"

Vectors


A vector is a collection of values. We can assign a series of values to a vector using the c() function. All values in a vector must be the same data type.

Create a numeric vector.

R

my_numbers <- c(1, 2, 5)
my_numbers

OUTPUT

[1] 1 2 5

R

typeof(my_numbers)

OUTPUT

[1] "double"

Create a character vector.

R

my_words <- c('the', 'dog')
my_words

OUTPUT

[1] "the" "dog"

R

typeof(my_words)

OUTPUT

[1] "character"

If you try to create a vector with multiple types, R will coerce all the values to the same type.

When there are numbers and characters in a vector, all values are coerced to character.

R

mixed <- c(1, 2, 'three')
mixed

OUTPUT

[1] "1"     "2"     "three"

R

typeof(mixed)

OUTPUT

[1] "character"
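Coercion also applies to logical values. The sketch below, with object names made up for the example, mixes logical values first with numbers and then with characters.

```r
# Logical values mixed with numbers are coerced to numbers:
# TRUE becomes 1 and FALSE becomes 0
with_logical <- c(1, TRUE, FALSE)
with_logical
typeof(with_logical)

# Logical values mixed with characters are coerced to character
words_and_logical <- c('dog', TRUE)
words_and_logical
typeof(words_and_logical)
```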

Missing data


When dealing with data, there are times when a record does not have a value for a field. Imagine filling out a form, and leaving some of the fields blank. R represents missing data as NA, without quotes. Let’s make a numeric vector with an NA value:

R

ages <- c(25, 34, NA, 42)
ages

OUTPUT

[1] 25 34 NA 42

min() returns the minimum value in a vector. If we pass a vector containing NA to a numeric function like min(), R cannot know what the missing value is, so it returns NA:

R

min(ages)

OUTPUT

[1] NA

Many basic math functions accept an na.rm argument, which removes NA values from the vector before doing the calculation.

R

min(ages, na.rm = TRUE)

OUTPUT

[1] 25
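The same pattern works for other summary functions. As a small sketch using the same ages vector, is.na() can be used to find the missing values, and max() and mean() both accept na.rm.

```r
ages <- c(25, 34, NA, 42)

is.na(ages)               # TRUE where a value is missing
sum(is.na(ages))          # count the missing values

max(ages, na.rm = TRUE)   # largest value, ignoring NA
mean(ages, na.rm = TRUE)  # average of the three known values
```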

Content from Working with data


Last updated on 2024-05-24

Overview

Questions

  • How do you manipulate tabular data in R?

Objectives

  • Import CSV data into R.
  • Use pipes to link steps together into pipelines.
  • Export data to a CSV file.

R Packages


R packages are extensions to the R language. R packages contain code, data, and documentation that people can download and install to add more functionality to R.

To download and install packages onto your computer, type install.packages("package_name") in the R console. The package name must be in quotes. This function connects to the internet and downloads packages from servers that host R packages. The Comprehensive R Archive Network (cran.r-project.org) is a network of web servers around the world that store R packages.

To use a package, load it with library(package_name). Quotes are not needed here. You only need to install a package on your computer once, but you must load it with library() in each script where you use it. Generally it’s a good idea to list all the libraries at the beginning of the script.
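Because installation only needs to happen once, it can be handy to check which packages are already installed. This is a minimal sketch using base R's requireNamespace(); the object names needed, is_installed, and missing_packages are made up for the example, and the install.packages() call is left commented out so that nothing is downloaded by accident.

```r
needed <- c("readr", "dplyr", "lubridate")

# requireNamespace() returns TRUE when a package is installed and can be
# loaded, and FALSE otherwise, without attaching the package
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

missing_packages <- needed[!is_installed]
if (length(missing_packages) > 0) {
  message("Not installed yet: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to download from CRAN
}
```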

tidyverse

tidyverse is a collection of R packages that are used for analyzing data. These packages work best with data in “tidy” format, which means each column represents a single field, and each row represents a single record.

Importing data


File paths

When we reference other files from an R script, we need to give R precise instructions on where those files are. We do that using something called a file path.

There are two kinds of paths: absolute and relative. An absolute path gives the full location of a file or folder on a particular computer, whereas a relative path is given relative to a certain folder, such as the working directory. For instance, “/Users/wyk/Documents/code_stuff/CNC_coding_intro_lesson” is an absolute path, and “CNC_coding_intro_lesson” is a relative path.
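A short sketch of how the two kinds of paths look in R. It assumes the working directory is the project folder; whether file.exists() reports TRUE depends on whether the workshop CSV has been saved in data/raw on your computer.

```r
# The working directory, shown as an absolute path
getwd()

# Build a relative path piece by piece; file.path() joins the parts with "/"
csv_path <- file.path("data", "raw", "observations-397280.csv")
csv_path

# Check whether a file exists at that path, relative to the working directory
file.exists(csv_path)
```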

Read a file

Use library() to load the needed packages.

R

library(readr)
library(lubridate)
library(dplyr)

We will use the read_csv() function from the readr package to read a CSV of CNC iNaturalist observations. The argument we give it is the path to the CSV file. We will store the observations in an object named inat_raw.

R

inat_raw <- read_csv('data/raw/observations-397280.csv')

OUTPUT

Rows: 171155 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (23): observed_on_string, time_observed_at, time_zone, user_login, user...
dbl  (10): id, user_id, num_identification_agreements, num_identification_di...
lgl   (5): captive_cultivated, private_place_guess, private_latitude, privat...
date  (1): observed_on

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

inat_raw is stored in memory. It appears in the Environment tab. Double-click on inat_raw in the Environment tab to see all the data.

read_csv provides some info about the CSV.

  • the number of rows and columns
  • the delimiter of the file, which is how values are separated, a comma ","
  • the data types for the columns

Use glimpse() to see information about a data frame. It shows the number of rows and columns and, for each column, the name, the data type (dbl for numbers, chr for character, lgl for logical, date for dates), and the first few values.

R

glimpse(inat_raw)

OUTPUT

Rows: 171,155
Columns: 39
$ id                               <dbl> 2931940, 2934641, 2934854, 2934961, 2…
$ observed_on_string               <chr> "2016-04-14 12:25:00 AM PDT", "Thu Ap…
$ observed_on                      <date> 2016-04-14, 2016-04-14, 2016-04-14, …
$ time_observed_at                 <chr> "2016-04-14 19:25:00 UTC", "2016-04-1…
$ time_zone                        <chr> "Pacific Time (US & Canada)", "Pacifi…
$ user_id                          <dbl> 151043, 10814, 8510, 80445, 80445, 80…
$ user_login                       <chr> "msmorales", "smartrf", "stonebird", …
$ user_name                        <chr> "Michael Morales", "Richard Smart (he…
$ created_at                       <chr> "2016-04-14 07:28:36 UTC", "2016-04-1…
$ updated_at                       <chr> "2021-12-26 06:58:04 UTC", "2018-05-2…
$ quality_grade                    <chr> "research", "needs_id", "needs_id", "…
$ license                          <chr> "CC-BY", "CC-BY-NC", NA, NA, NA, NA, …
$ url                              <chr> "http://www.inaturalist.org/observati…
$ image_url                        <chr> "https://inaturalist-open-data.s3.ama…
$ sound_url                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ tag_list                         <chr> NA, NA, "\"Allen's Hummingbird\" \"Se…
$ description                      <chr> "Spotted on a the wall of a planter, …
$ num_identification_agreements    <dbl> 5, 2, 0, 1, 2, 2, 1, 0, 1, 2, 1, 1, 1…
$ num_identification_disagreements <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ captive_cultivated               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ oauth_application_id             <dbl> 2, 3, NA, NA, NA, NA, 3, 3, NA, NA, N…
$ place_guess                      <chr> "Olive Lane Walk Pomona, CA 91768", "…
$ latitude                         <dbl> 34.05829, 34.01742, NA, 34.13020, 34.…
$ longitude                        <dbl> -117.8219, -118.2892, NA, -118.8226, …
$ positional_accuracy              <dbl> 4, 5, 220, NA, NA, NA, NA, 17, 55, 55…
$ private_place_guess              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ private_latitude                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ private_longitude                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ public_positional_accuracy       <dbl> 4, 5, 28888, NA, NA, NA, NA, 17, 55, …
$ geoprivacy                       <chr> NA, NA, "private", NA, NA, NA, NA, NA…
$ taxon_geoprivacy                 <chr> NA, NA, "open", NA, "open", "open", N…
$ coordinates_obscured             <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FAL…
$ positioning_method               <chr> "gps", NA, NA, NA, NA, NA, NA, NA, NA…
$ positioning_device               <chr> "gps", NA, NA, NA, NA, NA, NA, NA, NA…
$ species_guess                    <chr> "Garden Snail", "Oestroidea", "Allen'…
$ scientific_name                  <chr> "Cornu aspersum", "Oestroidea", "Sela…
$ common_name                      <chr> "Garden Snail", "Bot Flies, Blow Flie…
$ iconic_taxon_name                <chr> "Mollusca", "Insecta", "Aves", "Insec…
$ taxon_id                         <dbl> 480298, 356157, 6359, 54247, 36100, 3…

nrow() returns the number of rows. ncol() returns the number of columns. dim() returns the number of rows and columns.

R

nrow(inat_raw)

OUTPUT

[1] 171155

R

ncol(inat_raw)

OUTPUT

[1] 39

R

dim(inat_raw)

OUTPUT

[1] 171155     39

names() shows the column names.

R

names(inat_raw)

OUTPUT

 [1] "id"                               "observed_on_string"              
 [3] "observed_on"                      "time_observed_at"                
 [5] "time_zone"                        "user_id"                         
 [7] "user_login"                       "user_name"                       
 [9] "created_at"                       "updated_at"                      
[11] "quality_grade"                    "license"                         
[13] "url"                              "image_url"                       
[15] "sound_url"                        "tag_list"                        
[17] "description"                      "num_identification_agreements"   
[19] "num_identification_disagreements" "captive_cultivated"              
[21] "oauth_application_id"             "place_guess"                     
[23] "latitude"                         "longitude"                       
[25] "positional_accuracy"              "private_place_guess"             
[27] "private_latitude"                 "private_longitude"               
[29] "public_positional_accuracy"       "geoprivacy"                      
[31] "taxon_geoprivacy"                 "coordinates_obscured"            
[33] "positioning_method"               "positioning_device"              
[35] "species_guess"                    "scientific_name"                 
[37] "common_name"                      "iconic_taxon_name"               
[39] "taxon_id"                        

To access one column, use $ and the name of the column.

R

inat_raw$quality_grade

OUTPUT

  [1] "research" "needs_id" "needs_id" "research" "research" "research"
  [7] "needs_id" "needs_id" "casual"   "research" "research" "research"
 [13] "needs_id" "research" "research" "research" "research" "research"
 [19] "research" "research" "needs_id" "research" "research" "casual"  
 [25] "casual"   "casual"   "casual"   "research" "casual"   "casual"  
 [31] "casual"   "research" "research" "casual"   "research" "needs_id"
 [37] "needs_id" "research" "research" "research" "research" "research"
 [43] "casual"   "needs_id" "research" "research" "research" "needs_id"
 [49] "research" "casual"   "casual"   "casual"   "research" "research"
 [55] "needs_id" "research" "research" "research" "needs_id" "needs_id"
 [61] "research" "needs_id" "research" "research" "casual"   "needs_id"
 [67] "research" "research" "needs_id" "research" "research" "research"
 [73] "needs_id" "needs_id" "research" "research" "casual"   "research"
 [79] "research" "research" "needs_id" "research" "casual"   "needs_id"
 [85] "research" "research" "research" "research" "research" "research"
 [91] "needs_id" "research" "research" "casual"   "research" "needs_id"
 [97] "needs_id" "research" "research" "needs_id"
 [ reached getOption("max.print") -- omitted 171055 entries ]

To view all the unique values in a column, use unique().

R

unique(inat_raw$quality_grade)

OUTPUT

[1] "research" "needs_id" "casual"  
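If you also want to know how often each value occurs, base R's table() counts them. The sketch below uses a short, made-up vector rather than the real quality_grade column.

```r
# A made-up sample of quality grades
grades <- c("research", "needs_id", "needs_id", "research", "research", "casual")

unique(grades)                # each distinct value once
grade_counts <- table(grades)  # how many times each value appears
grade_counts
```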

Manipulating data


One of the most important skills for working with data in R is the ability to manipulate, modify, and reshape data. The dplyr package provides a series of powerful functions for many common data manipulation tasks.

  • select()
  • filter()
  • mutate()
  • arrange()
  • count()

select()

select() picks certain columns of a data.frame. To use the select() function, the first argument is the name of the data.frame, and the rest of the arguments are unquoted names of the columns you want.

The iNaturalist data has 39 columns. We want four of them. The columns in the result are arranged in the order we specified inside select().

R

select(inat_raw, user_login, common_name, scientific_name, observed_on)

OUTPUT

# A tibble: 171,155 × 4
   user_login    common_name                         scientific_name observed_on
   <chr>         <chr>                               <chr>           <date>     
 1 msmorales     Garden Snail                        Cornu aspersum  2016-04-14 
 2 smartrf       Bot Flies, Blow Flies, and Allies   Oestroidea      2016-04-14 
 3 stonebird     Allen's Hummingbird                 Selasphorus sa… 2016-04-14 
 4 cdegroof      California Orange-winged Grasshopp… Arphia ramona   2016-04-14 
 5 cdegroof      Western Side-blotched Lizard        Uta stansburia… 2016-04-14 
 6 cdegroof      Western Fence Lizard                Sceloporus occ… 2016-04-14 
 7 ttempel       <NA>                                Coelocnemis     2016-04-14 
 8 bradrumble    House Sparrow                       Passer domesti… 2016-04-15 
 9 deedeeflower5 Amur Carp                           Cyprinus rubro… 2016-04-14 
10 deedeeflower5 Red-eared Slider                    Trachemys scri… 2016-04-14 
# ℹ 171,145 more rows
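select() can also drop columns: putting a minus sign in front of a column name keeps everything except that column. Here is a minimal sketch on a small, made-up data frame (sample_obs is not part of the workshop data).

```r
library(dplyr)

# A small, made-up data frame for illustration
sample_obs <- data.frame(
  id = c(1, 2),
  user_login = c("user_a", "user_b"),
  common_name = c("Garden Snail", "House Sparrow")
)

# Keep every column except id
select(sample_obs, -id)

# Columns come back in the order they are listed
select(sample_obs, common_name, user_login)
```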

filter()

The filter() function is used to select rows that meet certain criteria. To get all the rows where the value of common_name is equal to Western Fence Lizard, we would run the following:

R

filter(inat_raw, common_name == 'Western Fence Lizard')

OUTPUT

# A tibble: 2,970 × 39
        id observed_on_string     observed_on time_observed_at time_zone user_id
     <dbl> <chr>                  <date>      <chr>            <chr>       <dbl>
 1 2934994 2016-04-14 12:19:09    2016-04-14  2016-04-14 19:1… Pacific …   80445
 2 2935263 2016-04-14             2016-04-14  <NA>             Pacific …  216108
 3 2935420 2016-04-14             2016-04-14  <NA>             Pacific …  216108
 4 2935748 2016-04-14 14:01:29    2016-04-14  2016-04-14 21:0… Pacific …   80445
 5 2935965 Thu Apr 14 2016 12:44… 2016-04-14  2016-04-14 19:4… Pacific …  171443
 6 2938607 Thu Apr 14 2016 16:33… 2016-04-14  2016-04-14 23:3… Pacific …  146517
 7 2940103 2016-04-15 9:31:39 AM… 2016-04-15  2016-04-15 16:3… Pacific …   80984
 8 2940838 Fri Apr 15 2016 10:11… 2016-04-15  2016-04-15 17:1… Pacific …  201119
 9 2940848 Fri Apr 15 2016 10:17… 2016-04-15  2016-04-15 17:1… Pacific …  201119
10 2940855 Fri Apr 15 2016 10:42… 2016-04-15  2016-04-15 17:4… Pacific …  201119
# ℹ 2,960 more rows
# ℹ 33 more variables: user_login <chr>, user_name <chr>, created_at <chr>,
#   updated_at <chr>, quality_grade <chr>, license <chr>, url <chr>,
#   image_url <chr>, sound_url <chr>, tag_list <chr>, description <chr>,
#   num_identification_agreements <dbl>,
#   num_identification_disagreements <dbl>, captive_cultivated <lgl>,
#   oauth_application_id <dbl>, place_guess <chr>, latitude <dbl>, …

The == sign means “is equal to”. There are several other operators we can use: >, >=, <, <=, and != (not equal to).
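Each of these operators compares values one at a time and returns TRUE or FALSE for each; filter() keeps the rows where the result is TRUE. A small sketch with a made-up numeric vector:

```r
# A made-up vector of counts
obs_counts <- c(838, 1078, 2970)

obs_counts > 1000   # greater than
obs_counts <= 838   # less than or equal to
obs_counts != 1078  # not equal to
```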

The pipe: %>%


What happens if we want to both select() and filter() our data?

We use the pipe operator (%>%) to chain multiple function calls together. You can insert it by using the keyboard shortcut Shift+Cmd+M (Mac) or Shift+Ctrl+M (Windows).

Get user_login, common_name, scientific_name, observed_on for all observations where common_name is ‘Western Fence Lizard’. Use filter to select rows, then use select to select columns.

R

inat_raw %>% 
  filter(common_name == 'Western Fence Lizard') %>% 
  select(user_login, common_name, scientific_name, observed_on) 

OUTPUT

# A tibble: 2,970 × 4
   user_login    common_name          scientific_name         observed_on
   <chr>         <chr>                <chr>                   <date>     
 1 cdegroof      Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 2 deedeeflower5 Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 3 deedeeflower5 Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 4 cdegroof      Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 5 lchroman      Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 6 maiz          Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 7 kimssight     Western Fence Lizard Sceloporus occidentalis 2016-04-15 
 8 sarahwenner   Western Fence Lizard Sceloporus occidentalis 2016-04-15 
 9 sarahwenner   Western Fence Lizard Sceloporus occidentalis 2016-04-15 
10 sarahwenner   Western Fence Lizard Sceloporus occidentalis 2016-04-15 
# ℹ 2,960 more rows

The pipe operator takes the thing on the left-hand side and inserts it as the first argument of the function on the right-hand side. By putting each of our functions onto a new line, we can build a nice, readable pipeline. It can be useful to think of this as a little assembly line for our data. It starts at the top and gets piped into a filter() function, and it comes out modified somewhat. It then gets sent into the select() function, where it is further modified, and then the final product gets printed out to our console. It can also be helpful to think of %>% as meaning “and then”.
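To see what the pipe does, compare a pipeline with the same steps written as nested function calls. This sketch uses a small, made-up data frame (sample_obs) in place of inat_raw.

```r
library(dplyr)

# A small, made-up data frame for illustration
sample_obs <- data.frame(
  user_login = c("user_a", "user_b", "user_c"),
  common_name = c("Western Fence Lizard", "House Finch", "Western Fence Lizard")
)

# With the pipe: read from top to bottom
piped <- sample_obs %>%
  filter(common_name == 'Western Fence Lizard') %>%
  select(user_login)

# Without the pipe: the same steps, read from the inside out
nested <- select(filter(sample_obs, common_name == 'Western Fence Lizard'), user_login)

# Check that both versions give the same result
identical(piped, nested)
```

Both versions do the same work; the piped version is usually easier to read because the steps appear in the order they happen.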

If you want to see all the records, assign the data.frame to an object.

R

temp <- inat_raw %>% 
  filter(common_name == 'Western Fence Lizard') %>% 
  select(user_login, common_name, scientific_name, observed_on) 

We can also use multiple conditions in one filter() statement.

When researchers use iNaturalist data, they normally use research grade observations. Here we will get all observations where quality_grade is ‘research’ and common_name is ‘Western Fence Lizard’. Use & for and.

R

inat_raw %>% 
  filter( common_name == 'Western Fence Lizard' 
         & quality_grade == 'research')  %>% 
  select(user_login, common_name, scientific_name, observed_on)

OUTPUT

# A tibble: 2,942 × 4
   user_login    common_name          scientific_name         observed_on
   <chr>         <chr>                <chr>                   <date>     
 1 cdegroof      Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 2 deedeeflower5 Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 3 deedeeflower5 Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 4 cdegroof      Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 5 lchroman      Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 6 maiz          Western Fence Lizard Sceloporus occidentalis 2016-04-14 
 7 kimssight     Western Fence Lizard Sceloporus occidentalis 2016-04-15 
 8 sarahwenner   Western Fence Lizard Sceloporus occidentalis 2016-04-15 
 9 sarahwenner   Western Fence Lizard Sceloporus occidentalis 2016-04-15 
10 sarahwenner   Western Fence Lizard Sceloporus occidentalis 2016-04-15 
# ℹ 2,932 more rows

Here we will get observations where user_login is ‘natureinla’ and common_name is ‘Western Fence Lizard’.

R

inat_raw %>% 
  filter(user_login == 'natureinla' & common_name == 'Western Fence Lizard') %>% 
  select(user_login, common_name, scientific_name, observed_on) 

OUTPUT

# A tibble: 79 × 4
   user_login common_name          scientific_name         observed_on
   <chr>      <chr>                <chr>                   <date>     
 1 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-16 
 2 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-16 
 3 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-17 
 4 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-16 
 5 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-17 
 6 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-17 
 7 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-19 
 8 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-16 
 9 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-18 
10 natureinla Western Fence Lizard Sceloporus occidentalis 2016-04-16 
# ℹ 69 more rows

Here we will get observations where common_name is ‘Western Fence Lizard’ or ‘Western Honey Bee’. Use | for or.

R

inat_raw %>% 
  filter(common_name == 'Western Honey Bee' | common_name == 'Western Fence Lizard')  %>% 
  select(user_login, observed_on, common_name)

OUTPUT

# A tibble: 4,788 × 3
   user_login    observed_on common_name         
   <chr>         <date>      <chr>               
 1 cdegroof      2016-04-14  Western Fence Lizard
 2 deedeeflower5 2016-04-14  Western Fence Lizard
 3 deedeeflower5 2016-04-14  Western Fence Lizard
 4 cdegroof      2016-04-14  Western Fence Lizard
 5 lchroman      2016-04-14  Western Fence Lizard
 6 smartrf       2016-04-14  Western Honey Bee   
 7 maiz          2016-04-14  Western Fence Lizard
 8 smartrf       2016-04-15  Western Honey Bee   
 9 kimssight     2016-04-15  Western Fence Lizard
10 catherineh    2016-04-15  Western Honey Bee   
# ℹ 4,778 more rows

Sometimes we want to combine and with or. We want observations from ‘cdegroof’ or ‘deedeeflower5’ for ‘Western Fence Lizard’. You can use both & and | together in a single filter. R evaluates & before |, so put parentheses around the | conditions to group them.

R

temp <- inat_raw %>% 
  filter((user_login == 'cdegroof' 
          | user_login == 'deedeeflower5')
         & common_name == 'Western Fence Lizard')  %>% 
  select(user_login, common_name, scientific_name, observed_on)
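When & and | appear in the same filter, R evaluates & first, so parentheses decide which rows come back. The sketch below shows the difference on a made-up five-row table (obs and the user names are invented for illustration).

```r
library(dplyr)

# Toy data invented for this illustration
obs <- tibble(
  user_login  = c("ana", "ana", "ben", "ben", "cy"),
  common_name = c("Mallard", "Western Fence Lizard",
                  "Mallard", "Western Fence Lizard",
                  "Western Fence Lizard")
)

# Without parentheses R reads this as: ana | (ben & lizard)
no_parens <- obs %>%
  filter(user_login == "ana" | user_login == "ben" & common_name == "Western Fence Lizard")

# With parentheses: (ana | ben) & lizard
with_parens <- obs %>%
  filter((user_login == "ana" | user_login == "ben") & common_name == "Western Fence Lizard")

nrow(no_parens)    # 3: both of ana's rows, plus ben's lizard
nrow(with_parens)  # 2: only the lizards from ana and ben
```

Splitting the conditions across two filter() calls avoids the precedence question entirely, because each filter() is applied to the result of the previous one.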

You can also use multiple filter statements.

R

temp <- inat_raw %>% 
  filter(user_login == 'cdegroof' 
         | user_login == 'deedeeflower5') %>%
  filter(common_name == 'Western Fence Lizard')  %>% 
  select(user_login, observed_on, common_name)

Cleaning up raw data, exporting dataframe


A common step during data analysis is to clean up the raw data. We fix any obvious errors, edit column names, exclude rows we do not want, and save the cleaned up data set. We do the analysis on the cleaned data set.

We want observations that match these criteria:

  • have species information
  • have latitude and longitude
  • have ‘research’ for quality_grade

Use colSums(is.na()) to count the number of rows that have NA values for each column.

R

colSums(is.na(inat_raw))

OUTPUT

                              id               observed_on_string 
                               0                                0 
                     observed_on                 time_observed_at 
                               0                             5819 
                       time_zone                          user_id 
                               0                                0 
                      user_login                        user_name 
                               0                            63304 
                      created_at                       updated_at 
                               0                                0 
                   quality_grade                          license 
                               0                            42937 
                             url                        image_url 
                               0                             2214 
                       sound_url                         tag_list 
                          170645                           164464 
                     description    num_identification_agreements 
                          149235                                0 
num_identification_disagreements               captive_cultivated 
                               0                                0 
            oauth_application_id                      place_guess 
                           66613                              440 
                        latitude                        longitude 
                             438                              438 
             positional_accuracy              private_place_guess 
                           38319                           171155 
                private_latitude                private_longitude 
                          171155                           171155 
      public_positional_accuracy                       geoprivacy 
                           34911                           158521 
                taxon_geoprivacy             coordinates_obscured 
                          129409                                0 
              positioning_method               positioning_device 
                          156084                           154467 
                   species_guess                  scientific_name 
                           25721                             1685 
                     common_name                iconic_taxon_name 
                           11164                             1846 
                        taxon_id 
                            1685 

All rows have id, observed_on, and user_id.

1685 rows don’t have scientific_name. 438 rows don’t have latitude or longitude.
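To see why colSums(is.na()) gives one count per column, it can help to run the two steps separately on a tiny table. The obs data frame below is invented for illustration.

```r
# Toy data invented for this illustration
obs <- data.frame(
  id              = c(1, 2, 3),
  latitude        = c(34.05, NA, 34.10),
  scientific_name = c("Apis mellifera", NA, NA)
)

# Step 1: a TRUE/FALSE matrix with one cell per value
is.na(obs)

# Step 2: TRUE counts as 1, so each column sum is the number of NA values
na_counts <- colSums(is.na(obs))
na_counts
```

Here id has no missing values, latitude has one, and scientific_name has two.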

table() is a function from base R that counts how many times each unique value appears in a column. Get a count for quality_grade.

R

table(inat_raw$quality_grade)

OUTPUT


  casual needs_id research 
   23194    53875    94086 

94086 rows are research grade.
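As a minimal sketch of what table() returns, here it is applied to a short vector. The grades vector is invented for illustration; a single count can be pulled out by name.

```r
# Toy values invented for this illustration
grades <- c("research", "casual", "research", "needs_id", "research")

counts <- table(grades)
counts

# Extract one count by its name
counts[["research"]]
```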

Use filter() to select the observations we want.

!is.na() will select rows that are not NA, meaning rows that have a value. quality_grade == 'research' will select rows that are ‘research’ grade.

Save the cleaned up data in a new object called inat.

R

inat <- inat_raw %>% 
  filter(!is.na(latitude) &
           !is.na(longitude) &
           !is.na(scientific_name)) %>% 
  filter(quality_grade == 'research')

The original dataframe ‘inat_raw’ had about 171K rows; the cleaned dataframe ‘inat’ has about 94K rows.
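The same cleaning steps can be traced row by row on a small table. In this sketch the obs table is invented for illustration, and each row except the first fails exactly one of the checks.

```r
library(dplyr)

# Toy data invented for this illustration
obs <- tibble(
  id              = 1:4,
  latitude        = c(34.05, NA, 34.10, 34.20),
  longitude       = c(-118.24, NA, -118.30, -118.40),
  scientific_name = c("Apis mellifera", "Anas platyrhynchos", NA, "Sciurus niger"),
  quality_grade   = c("research", "research", "research", "needs_id")
)

cleaned <- obs %>%
  filter(!is.na(latitude) &
           !is.na(longitude) &
           !is.na(scientific_name)) %>%
  filter(quality_grade == "research")

# Row 2 has no coordinates, row 3 has no species, row 4 is not research grade
cleaned$id
```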

We can double check our work.

latitude, longitude, and scientific_name have zero NA values.

R

colSums(is.na(inat))

OUTPUT

                              id               observed_on_string 
                               0                                0 
                     observed_on                 time_observed_at 
                               0                             3167 
                       time_zone                          user_id 
                               0                                0 
                      user_login                        user_name 
                               0                            30889 
                      created_at                       updated_at 
                               0                                0 
                   quality_grade                          license 
                               0                            21944 
                             url                        image_url 
                               0                              334 
                       sound_url                         tag_list 
                           93561                            89612 
                     description    num_identification_agreements 
                           82364                                0 
num_identification_disagreements               captive_cultivated 
                               0                                0 
            oauth_application_id                      place_guess 
                           41638                                1 
                        latitude                        longitude 
                               0                                0 
             positional_accuracy              private_place_guess 
                           22494                            93950 
                private_latitude                private_longitude 
                           93950                            93950 
      public_positional_accuracy                       geoprivacy 
                           20663                            87462 
                taxon_geoprivacy             coordinates_obscured 
                           59216                                0 
              positioning_method               positioning_device 
                           84974                            84450 
                   species_guess                  scientific_name 
                             138                                0 
                     common_name                iconic_taxon_name 
                            1626                                4 
                        taxon_id 
                               0 

quality_grade only has research.

R

table(inat$quality_grade)

OUTPUT


research 
   93950 

We want to save the cleaned up data set so we can use it later. We can save a data.frame to a CSV using the write_csv() function from the readr package. The first argument is the name of the data.frame, and the second is the path to the new file we want to create, including the file extension .csv.

R

write_csv(inat, file= 'data/cleaned/observations.csv')

If we look in our data/cleaned folder, we will see this new CSV file.
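A quick way to confirm that a file was written correctly is to read it back in. The sketch below writes a made-up two-row table to a temporary file and reads it again; the obs table and the use of tempfile() are assumptions for illustration, while the workshop itself writes to 'data/cleaned/observations.csv'.

```r
library(readr)

# Toy data invented for this illustration
obs <- data.frame(
  id          = c(1, 2),
  common_name = c("Mallard", "Fox Squirrel")
)

# tempfile() gives a throwaway path, so this example leaves no files behind
path <- tempfile(fileext = ".csv")
write_csv(obs, file = path)

# Check that the file exists, then read it back
file.exists(path)
round_trip <- read_csv(path, show_col_types = FALSE)
round_trip
```

Note that write_csv() does not create missing folders. If the target folder does not exist yet, it can be created first with dir.create('data/cleaned', recursive = TRUE).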

Errors in code


We are writing instructions for the computer. If there are typos or misspellings, or if we pass the wrong arguments into functions, the code will not work and we will see errors. R will display the errors in red.

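A typo in the pipe operator, such as %> instead of %>%, causes an error. The code below is the correct version.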

R

inat %>%
  select(user_login, observed_on, common_name) %>% 
  filter(user_login == 'natureinla')

OUTPUT

# A tibble: 1,528 × 3
   user_login observed_on common_name           
   <chr>      <date>      <chr>                 
 1 natureinla 2016-04-14  Red-eared Slider      
 2 natureinla 2016-04-14  Monarch               
 3 natureinla 2016-04-14  San Diego Gopher Snake
 4 natureinla 2016-04-14  California Towhee     
 5 natureinla 2016-04-14  Cooper's Hawk         
 6 natureinla 2016-04-14  Monarch               
 7 natureinla 2016-04-14  Allen's Hummingbird   
 8 natureinla 2016-04-15  Northern Mockingbird  
 9 natureinla 2016-04-15  House Sparrow         
10 natureinla 2016-04-15  Indian Peafowl        
# ℹ 1,518 more rows

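A misspelled column name, such as user_logi instead of user_login, causes an error. The code below is the correct version.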

R

inat %>%
  select(user_login, observed_on, common_name) %>% 
  filter(user_login == 'natureinla')

OUTPUT

# A tibble: 1,528 × 3
   user_login observed_on common_name           
   <chr>      <date>      <chr>                 
 1 natureinla 2016-04-14  Red-eared Slider      
 2 natureinla 2016-04-14  Monarch               
 3 natureinla 2016-04-14  San Diego Gopher Snake
 4 natureinla 2016-04-14  California Towhee     
 5 natureinla 2016-04-14  Cooper's Hawk         
 6 natureinla 2016-04-14  Monarch               
 7 natureinla 2016-04-14  Allen's Hummingbird   
 8 natureinla 2016-04-15  Northern Mockingbird  
 9 natureinla 2016-04-15  House Sparrow         
10 natureinla 2016-04-15  Indian Peafowl        
# ℹ 1,518 more rows

Using = instead of == inside filter() causes an error. The code below is the correct version.

R

inat %>%
  select(user_login, observed_on, common_name) %>% 
  filter(user_login == 'natureinla')

OUTPUT

# A tibble: 1,528 × 3
   user_login observed_on common_name           
   <chr>      <date>      <chr>                 
 1 natureinla 2016-04-14  Red-eared Slider      
 2 natureinla 2016-04-14  Monarch               
 3 natureinla 2016-04-14  San Diego Gopher Snake
 4 natureinla 2016-04-14  California Towhee     
 5 natureinla 2016-04-14  Cooper's Hawk         
 6 natureinla 2016-04-14  Monarch               
 7 natureinla 2016-04-14  Allen's Hummingbird   
 8 natureinla 2016-04-15  Northern Mockingbird  
 9 natureinla 2016-04-15  House Sparrow         
10 natureinla 2016-04-15  Indian Peafowl        
# ℹ 1,518 more rows

An extra ) causes an error. The code below is the correct version.

R

inat %>%
  select(user_login, observed_on, common_name) %>% 
  filter(user_login == 'natureinla')

OUTPUT

# A tibble: 1,528 × 3
   user_login observed_on common_name           
   <chr>      <date>      <chr>                 
 1 natureinla 2016-04-14  Red-eared Slider      
 2 natureinla 2016-04-14  Monarch               
 3 natureinla 2016-04-14  San Diego Gopher Snake
 4 natureinla 2016-04-14  California Towhee     
 5 natureinla 2016-04-14  Cooper's Hawk         
 6 natureinla 2016-04-14  Monarch               
 7 natureinla 2016-04-14  Allen's Hummingbird   
 8 natureinla 2016-04-15  Northern Mockingbird  
 9 natureinla 2016-04-15  House Sparrow         
10 natureinla 2016-04-15  Indian Peafowl        
# ℹ 1,518 more rows
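One way to look at an error message without stopping a script is to wrap the failing call in tryCatch() from base R. The sketch below uses a made-up two-row table (obs is invented for illustration) and a deliberately misspelled column name. The exact wording of the message depends on the installed dplyr version, so no expected text is shown.

```r
library(dplyr)

# Toy data invented for this illustration
obs <- tibble(
  user_login  = c("ana", "ben"),
  common_name = c("Mallard", "Fox Squirrel")
)

# 'user_logi' is deliberately misspelled, so select() cannot find the column
result <- tryCatch(
  obs %>% select(user_logi, common_name),
  error = function(e) conditionMessage(e)
)

# result now holds the text of the error message instead of a table
cat(result)
```

Reading the message usually points to the problem: here it should mention the column name that could not be found.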

Exercise 1

Get your observations for one species.

  • The data.frame should have user_login, observed_on, and common_name.
  • Use select(), filter(), and &.

R

inat %>% 
  filter(user_login == 'natureinla' & common_name == 'Red-eared Slider') %>% 
  select(user_login, observed_on, common_name) 

OUTPUT

# A tibble: 13 × 3
   user_login observed_on common_name     
   <chr>      <date>      <chr>           
 1 natureinla 2016-04-14  Red-eared Slider
 2 natureinla 2016-04-14  Red-eared Slider
 3 natureinla 2017-04-15  Red-eared Slider
 4 natureinla 2017-04-15  Red-eared Slider
 5 natureinla 2017-04-16  Red-eared Slider
 6 natureinla 2017-04-14  Red-eared Slider
 7 natureinla 2017-04-17  Red-eared Slider
 8 natureinla 2017-04-18  Red-eared Slider
 9 natureinla 2017-04-18  Red-eared Slider
10 natureinla 2017-04-14  Red-eared Slider
11 natureinla 2018-04-30  Red-eared Slider
12 natureinla 2018-04-30  Red-eared Slider
13 natureinla 2019-04-27  Red-eared Slider

Making new columns with mutate()


Another common task is creating a new column based on values in existing columns. For example, we could add a new column for year.

Use mutate() to add a column. We pass in the name of the new column and the value of the column.

Use year() from lubridate on a date column to get the year.

This code will get the year from ‘observed_on’, and create a ‘year’ column.

R

temp <- inat %>% 
  mutate(year = year(observed_on))
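Here is a small sketch of mutate() with lubridate on a made-up table (obs and its dates are invented for illustration). It also adds a month column with month(), another lubridate function, to show that several columns can be created in one call.

```r
library(dplyr)
library(lubridate)

# Toy data invented for this illustration
obs <- tibble(
  common_name = c("Mallard", "Fox Squirrel", "Monarch"),
  observed_on = as.Date(c("2019-04-27", "2020-04-24", "2020-04-25"))
)

with_year <- obs %>%
  mutate(year  = year(observed_on),
         month = month(observed_on))

with_year

# mutate() returns a new table; obs itself still has only two columns
names(obs)
```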

Get observations for 2020.

R

inat %>% 
  mutate(year = year(observed_on)) %>%
  filter(year == 2020)

OUTPUT

# A tibble: 10,659 × 40
         id observed_on_string    observed_on time_observed_at time_zone user_id
      <dbl> <chr>                 <date>      <chr>            <chr>       <dbl>
 1 43036534 Fri Apr 24 2020 00:0… 2020-04-24  2020-04-24 07:0… Pacific …  146517
 2 43036989 Fri Apr 24 2020 00:0… 2020-04-24  2020-04-24 07:0… Pacific …   74669
 3 43037631 Fri Apr 24 2020 00:1… 2020-04-24  2020-04-24 07:1… Pacific …  403949
 4 43037703 Fri Apr 24 2020 00:1… 2020-04-24  2020-04-24 07:1… Pacific …  403949
 5 43037736 Fri Apr 24 2020 00:0… 2020-04-24  2020-04-24 07:0… Pacific …  403949
 6 43037745 Fri Apr 24 2020 00:1… 2020-04-24  2020-04-24 07:1… Pacific … 2556338
 7 43037824 2020-04-24 12:05:06 … 2020-04-24  2020-04-24 07:0… Pacific … 1628946
 8 43037956 Fri Apr 24 2020 00:1… 2020-04-24  2020-04-24 07:1… Pacific …   74669
 9 43037961 Fri Apr 24 2020 00:2… 2020-04-24  2020-04-24 07:2… Pacific … 2556338
10 43038195 Fri Apr 24 2020 00:2… 2020-04-24  2020-04-24 07:2… Pacific … 2556338
# ℹ 10,649 more rows
# ℹ 34 more variables: user_login <chr>, user_name <chr>, created_at <chr>,
#   updated_at <chr>, quality_grade <chr>, license <chr>, url <chr>,
#   image_url <chr>, sound_url <chr>, tag_list <chr>, description <chr>,
#   num_identification_agreements <dbl>,
#   num_identification_disagreements <dbl>, captive_cultivated <lgl>,
#   oauth_application_id <dbl>, place_guess <chr>, latitude <dbl>, …

Exercise 2

  1. Create a data.frame with all of your observations from the last year.
  • Use select() and filter()
  • Use mutate() and year() to add a year column
  • The data.frame should have user_login, observed_on, and common_name.

R

inat %>% 
  mutate(year = year(observed_on)) %>%
  filter(user_login == 'natureinla' & year == 2023) %>%
  select(user_login, observed_on, common_name) 

OUTPUT

# A tibble: 3 × 3
  user_login observed_on common_name             
  <chr>      <date>      <chr>                   
1 natureinla 2023-04-29  Thick-leaved Yerba Santa
2 natureinla 2023-04-29  Big Berry Manzanita     
3 natureinla 2023-04-29  chamise                 

Count the number of rows with count()


Use count() from dplyr to count the number of rows for each unique value in one or more columns.

Let’s try counting all our observations by year. Use mutate() to add a year column. Use count() to count the number of observations for each year. By default, count() will add a new column called n.

R

inat %>% 
  mutate(year = year(observed_on)) %>%
  count(year)  

OUTPUT

# A tibble: 8 × 2
   year     n
  <dbl> <int>
1  2016  5791
2  2017  9354
3  2018 10855
4  2019 17950
5  2020 10659
6  2021 13051
7  2022 11924
8  2023 14366

We can specify the name of the count column by passing the name argument to count().

R

inat %>% 
  mutate(year = year(observed_on)) %>%
  count(year, name='obs_count')  

OUTPUT

# A tibble: 8 × 2
   year obs_count
  <dbl>     <int>
1  2016      5791
2  2017      9354
3  2018     10855
4  2019     17950
5  2020     10659
6  2021     13051
7  2022     11924
8  2023     14366
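It may help to know that count() is a shortcut for grouping rows and then counting each group. The sketch below compares the two forms on a made-up table (obs is invented for illustration). group_by(), summarise(), and n() are other dplyr functions that this lesson does not otherwise cover.

```r
library(dplyr)

# Toy data invented for this illustration
obs <- tibble(year = c(2016, 2016, 2017, 2017, 2017, 2018))

# The shortcut
counted <- obs %>%
  count(year, name = "obs_count")

# The same result written out with group_by() and summarise()
summarised <- obs %>%
  group_by(year) %>%
  summarise(obs_count = n())

counted
summarised
```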

Let’s count the number of observations for each species. We will pass in both ‘common_name’ and ‘scientific_name’ because some species don’t have a common_name.

R

inat %>% 
  count(common_name, scientific_name, name='obs_count')   

OUTPUT

# A tibble: 3,675 × 3
   common_name                 scientific_name         obs_count
   <chr>                       <chr>                       <int>
 1 Abert's Thread-waisted Wasp Ammophila aberti                1
 2 Accipiters                  Accipiter                       2
 3 Acmon Blue                  Icaricia acmon                 35
 4 Acorn Woodpecker            Melanerpes formicivorus       256
 5 Acton's Brittlebush         Encelia actoni                 11
 6 Acute Bladder Snail         Physella acuta                  8
 7 Adams Mussel                Brachidontes adamsianus         4
 8 African Asparagus           Asparagus aethiopicus          22
 9 African Clawed Frog         Xenopus laevis                  1
10 African Cluster Bug         Agonoscelis puberula           13
# ℹ 3,665 more rows

It’s often useful to take a look at the results in some order, like the lowest count to highest. We can use the arrange() function for that. By default, arrange will return values from lowest to highest.

R

inat %>% 
  count(common_name, scientific_name, name='obs_count')   %>%
  arrange(obs_count)

OUTPUT

# A tibble: 3,675 × 3
   common_name                        scientific_name               obs_count
   <chr>                              <chr>                             <int>
 1 Abert's Thread-waisted Wasp        Ammophila aberti                      1
 2 African Clawed Frog                Xenopus laevis                        1
 3 African boxthorn                   Lycium ferocissimum                   1
 4 Almond                             Prunus amygdalus                      1
 5 Alpine Brown Sunken Disk Lichen    Bellemerea alpina                     1
 6 American Black-crowned Night Heron Nycticorax nycticorax hoactli         1
 7 American Dewdrop Spider            Argyrodes elevatus                    1
 8 American Dipper                    Cinclus mexicanus                     1
 9 American Softshells                Apalone                               1
10 American Sunflower Moth            Homoeosoma electella                  1
# ℹ 3,665 more rows

If we want to reverse the order, we can wrap the column name in desc():

R

inat %>% 
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(desc(obs_count)) 

OUTPUT

# A tibble: 3,675 × 3
   common_name            scientific_name         obs_count
   <chr>                  <chr>                       <int>
 1 Western Fence Lizard   Sceloporus occidentalis      2936
 2 Western Honey Bee      Apis mellifera               1803
 3 Fox Squirrel           Sciurus niger                1285
 4 House Finch            Haemorhous mexicanus         1067
 5 Mourning Dove          Zenaida macroura             1034
 6 Mallard                Anas platyrhynchos            810
 7 House Sparrow          Passer domesticus             800
 8 Convergent Lady Beetle Hippodamia convergens         788
 9 California Towhee      Melozone crissalis            747
10 Northern Mockingbird   Mimus polyglottos             719
# ℹ 3,665 more rows

Use slice() to return only a certain number of records. slice(start:end) returns the rows from position start to position end.

Top ten species with the most observations.

R

inat %>% 
  count(common_name, scientific_name, name='obs_count') %>%
  arrange(desc(obs_count))  %>% 
  slice(1:10)

OUTPUT

# A tibble: 10 × 3
   common_name            scientific_name         obs_count
   <chr>                  <chr>                       <int>
 1 Western Fence Lizard   Sceloporus occidentalis      2936
 2 Western Honey Bee      Apis mellifera               1803
 3 Fox Squirrel           Sciurus niger                1285
 4 House Finch            Haemorhous mexicanus         1067
 5 Mourning Dove          Zenaida macroura             1034
 6 Mallard                Anas platyrhynchos            810
 7 House Sparrow          Passer domesticus             800
 8 Convergent Lady Beetle Hippodamia convergens         788
 9 California Towhee      Melozone crissalis            747
10 Northern Mockingbird   Mimus polyglottos             719
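As a sketch of the sort-then-slice pattern, the example below picks the top two rows from a made-up table (species and its counts are invented for illustration). It also shows slice_max(), a dplyr function (version 1.0.0 or later) that is not used elsewhere in this lesson and combines the two steps.

```r
library(dplyr)

# Toy data invented for this illustration
species <- tibble(
  common_name = c("Mallard", "Fox Squirrel", "Monarch", "House Finch"),
  obs_count   = c(8, 12, 1, 10)
)

# Sort from highest to lowest, then keep rows 1 and 2
top2 <- species %>%
  arrange(desc(obs_count)) %>%
  slice(1:2)

# slice_max() picks the rows with the largest values directly
top2_alt <- species %>%
  slice_max(obs_count, n = 2)

top2$common_name
```

slice() works on row positions, so it only gives the "top" rows if the table has been sorted first.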

Exercise 3

  1. Create a data.frame that counts your observations by year.
  • Use filter() and count()
  • Use mutate() and year() to add a year column

R

inat %>% 
  mutate(year = year(observed_on)) %>%
  filter(user_login == 'natureinla') %>%
  count(year, name='obs_count')   

OUTPUT

# A tibble: 7 × 2
   year obs_count
  <dbl>     <int>
1  2016       490
2  2017       606
3  2018       223
4  2019       195
5  2020         9
6  2021         2
7  2023         3

Content from Data visualization with ggplot2


Last updated on 2024-05-24

Overview

Questions

  • How do we create graphs using R?

Objectives

  • Learn how to create bar and line charts using ggplot2
  • Learn how to customize the appearance of the charts

R

library(ggplot2)
library(readr)
library(dplyr)
library(lubridate)

Creating graphs


We are going to be using functions from the ggplot2 package to create visualizations. ggplot plots are built step by step by adding new layers, which allows for extensive customization of plots.

We call the ggplot() function and pass in the data and mappings. Then we add a geom_ function to create the plot.

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

Setup


First, read data from the cleaned iNaturalist observation file.

R

inat <- read_csv('data/cleaned/observations.csv')

OUTPUT

Rows: 93950 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (23): observed_on_string, time_observed_at, time_zone, user_login, user...
dbl  (10): id, user_id, num_identification_agreements, num_identification_di...
lgl   (5): captive_cultivated, private_place_guess, private_latitude, privat...
date  (1): observed_on

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Bar chart


Create a bar chart that shows the number of observations per year.

First, add a year column to the iNaturalist data.

R

inat_year <- inat %>% 
  mutate(year = year(observed_on))

Pass the data to ggplot.

R

ggplot(data = inat_year)

We need to tell ggplot how to process the data. We tell ggplot how to map the data to various plot elements, such as x/y axis, size, or color by using the aes() function.
For bar charts, we need to tell ggplot which column to use for the x axis. We want to create a plot with years on the x axis so we use mapping = aes(x = year). ggplot will count the number of rows for each year, and use the count for the y axis.

R

ggplot(data = inat_year, mapping = aes(x = year))

Next we need to specify how we want the data to be displayed. We do this using geom_ functions, which specify the type of geometry we want, such as points, lines, or bars. We use geom_bar() to create a vertical bar plot.

We can add a geom_bar() layer to our plot by using the + sign. We indent onto a new line to make it easier to read, and we have to end the first line with the + sign.

R

ggplot(data = inat_year, mapping = aes(x = year)) +
  geom_bar()

If we want year on the y axis and count on the x axis, which gives horizontal bars, add coord_flip().

R

ggplot(data = inat_year, mapping = aes(x = year)) +
  geom_bar() +
  coord_flip()

Line chart


Create a line chart that shows the number of observations per year.

For line charts, we need both x and y variables. Create a dataframe that counts the number of observations by year.

R

inat_year_count <- inat %>% 
  mutate(year = year(observed_on)) %>%
  count(year, name='obs_count')  

inat_year_count

OUTPUT

# A tibble: 8 × 2
   year obs_count
  <dbl>     <int>
1  2016      5791
2  2017      9354
3  2018     10855
4  2019     17950
5  2020     10659
6  2021     13051
7  2022     11924
8  2023     14366

We use year on the x axis and obs_count on the y axis. We use geom_line() to create a line chart.

R

ggplot(data = inat_year_count, 
       mapping = aes(x = year, y=obs_count)) +
  geom_line()

More bar plots


To create a bar chart when we already have x and y values, use geom_col().

We want year on the x axis, and count on the y axis.

R

ggplot(data = inat_year_count, 
       mapping = aes(x = year, y = obs_count)) +
  geom_col()
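The difference between geom_bar() and geom_col() is where the bar heights come from. The sketch below builds both charts from a made-up table (obs and year_counts are invented for illustration) and uses layer_data() from ggplot2 to look at the heights each plot ends up with.

```r
library(ggplot2)
library(dplyr)

# Toy data invented for this illustration: one row per observation
obs <- tibble(year = c(2016, 2016, 2017, 2017, 2017, 2018))

# One row per year, with the count already calculated
year_counts <- obs %>%
  count(year, name = "obs_count")

# geom_bar(): ggplot counts the rows for each year
p_bar <- ggplot(data = obs, mapping = aes(x = year)) +
  geom_bar()

# geom_col(): bar heights are taken from the obs_count column
p_col <- ggplot(data = year_counts, mapping = aes(x = year, y = obs_count)) +
  geom_col()

# layer_data() returns the values ggplot computed for a layer
layer_data(p_bar)$count
layer_data(p_col)$y
```

Both plots should show the same bars; the choice depends on whether the data has already been counted.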

Exercise 1

Create a chart with all your observations for each year.

  • Use mutate() and filter()

R

my_year  <- inat %>% 
  mutate(year = year(observed_on))  %>% 
  filter(user_login == 'natureinla')

  
ggplot(data = my_year, mapping = aes(x = year)) +
  geom_bar()

Changing aesthetics


Building ggplot plots is often an iterative process, so we’ll continue developing the column plot we just made. We can change the color of the bars using fill='color'.

Use colors() to get a list of the 657 colors in R.

R

colors()

OUTPUT

  [1] "white"           "aliceblue"       "antiquewhite"    "antiquewhite1"  
  [5] "antiquewhite2"   "antiquewhite3"   "antiquewhite4"   "aquamarine"     
  [9] "aquamarine1"     "aquamarine2"     "aquamarine3"     "aquamarine4"    
 [13] "azure"           "azure1"          "azure2"          "azure3"         
 [17] "azure4"          "beige"           "bisque"          "bisque1"        
 [21] "bisque2"         "bisque3"         "bisque4"         "black"          
 [25] "blanchedalmond"  "blue"            "blue1"           "blue2"          
 [29] "blue3"           "blue4"           "blueviolet"      "brown"          
 [33] "brown1"          "brown2"          "brown3"          "brown4"         
 [37] "burlywood"       "burlywood1"      "burlywood2"      "burlywood3"     
 [41] "burlywood4"      "cadetblue"       "cadetblue1"      "cadetblue2"     
 [45] "cadetblue3"      "cadetblue4"      "chartreuse"      "chartreuse1"    
 [49] "chartreuse2"     "chartreuse3"     "chartreuse4"     "chocolate"      
 [53] "chocolate1"      "chocolate2"      "chocolate3"      "chocolate4"     
 [57] "coral"           "coral1"          "coral2"          "coral3"         
 [61] "coral4"          "cornflowerblue"  "cornsilk"        "cornsilk1"      
 [65] "cornsilk2"       "cornsilk3"       "cornsilk4"       "cyan"           
 [69] "cyan1"           "cyan2"           "cyan3"           "cyan4"          
 [73] "darkblue"        "darkcyan"        "darkgoldenrod"   "darkgoldenrod1" 
 [77] "darkgoldenrod2"  "darkgoldenrod3"  "darkgoldenrod4"  "darkgray"       
 [81] "darkgreen"       "darkgrey"        "darkkhaki"       "darkmagenta"    
 [85] "darkolivegreen"  "darkolivegreen1" "darkolivegreen2" "darkolivegreen3"
 [89] "darkolivegreen4" "darkorange"      "darkorange1"     "darkorange2"    
 [93] "darkorange3"     "darkorange4"     "darkorchid"      "darkorchid1"    
 [97] "darkorchid2"     "darkorchid3"     "darkorchid4"     "darkred"        
 [ reached getOption("max.print") -- omitted 557 entries ]

R

ggplot(data = inat_year, mapping = aes(x = year)) +
  geom_bar(fill='aquamarine')

We can also use 6 digit hex colors. You can use online tools to get hex colors. https://html-color.codes

R

ggplot(data = inat_year, mapping = aes(x = year)) +
  geom_bar(fill='#75cd5e')
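Named colors and hex colors describe the same thing: amounts of red, green, and blue. As a small sketch, base R can convert a named color into its hex code with col2rgb() and rgb().

```r
# Red, green, and blue values on a 0 to 255 scale
rgb_values <- col2rgb("aquamarine")
rgb_values

# Combine the three values into a 6 digit hex code
hex <- rgb(rgb_values[1], rgb_values[2], rgb_values[3], maxColorValue = 255)
hex
```

The resulting string can be passed to fill in the same way as a color name.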

Adding another variable


iNaturalist has a field called iconic_taxon_name that assigns each taxon to a commonly known group.

R

unique(inat$iconic_taxon_name)

OUTPUT

 [1] "Mollusca"       "Insecta"        "Reptilia"       "Aves"          
 [5] "Mammalia"       "Plantae"        "Animalia"       "Arachnida"     
 [9] "Amphibia"       "Fungi"          "Chromista"      "Actinopterygii"
[13] NA               "Protozoa"      

Create charts that show the observations per year, and subdivide each year by iconic_taxon_name. Give each iconic_taxon_name a different color.

Since we’re now mapping a variable (iconic_taxon_name) to a component of the ggplot2 plot (fill), we need to put the argument inside aes().

Create a bar chart that shows iconic_taxon_name by color.

R

ggplot(data = inat_year, mapping = aes(x = year, fill=iconic_taxon_name)) +
  geom_bar()
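To illustrate the difference between setting a color and mapping a variable to color, the sketch below builds both versions from a made-up table (obs is invented for illustration) and checks how many distinct fill colors each plot uses.

```r
library(ggplot2)

# Toy data invented for this illustration
obs <- data.frame(
  year              = c(2016, 2016, 2017, 2017),
  iconic_taxon_name = c("Aves", "Insecta", "Aves", "Aves")
)

# Fixed color: set outside aes(), so every bar gets the same fill
p_fixed <- ggplot(data = obs, mapping = aes(x = year)) +
  geom_bar(fill = "aquamarine")

# Mapped color: set inside aes(), so each iconic_taxon_name gets its own fill
p_mapped <- ggplot(data = obs, mapping = aes(x = year, fill = iconic_taxon_name)) +
  geom_bar()

# One fill color in the first plot, one per group in the second
unique(layer_data(p_fixed)$fill)
unique(layer_data(p_mapped)$fill)
```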

We create a new dataframe that counts the number of observations per year and iconic_taxon_name. Use mutate() and year() to add a year column. We want to count by both year and iconic_taxon_name. We want the count column to be called obs_count.

R

inat_year_iconic_count <- inat %>% 
  mutate(year = year(observed_on))  %>%
  count(year, iconic_taxon_name, name='obs_count') 

inat_year_iconic_count

OUTPUT

# A tibble: 107 × 3
    year iconic_taxon_name obs_count
   <dbl> <chr>                 <int>
 1  2016 Actinopterygii            1
 2  2016 Amphibia                 87
 3  2016 Animalia                 87
 4  2016 Arachnida                99
 5  2016 Aves                    976
 6  2016 Chromista                 9
 7  2016 Fungi                    24
 8  2016 Insecta                1325
 9  2016 Mammalia                192
10  2016 Mollusca                183
# ℹ 97 more rows

Create a line chart that shows iconic_taxon_name by color.

R

inat_year_iconic_count %>% 
  ggplot(aes(x = year, y = obs_count, color = iconic_taxon_name)) +
  geom_line()

Changing scales


The default color scheme isn’t friendly to viewers with colorblindness. ggplot2 comes with quite a few other color scales, including viridis scales, which are designed to be colorblind and grayscale friendly. See a list of color scales. https://ggplot2.tidyverse.org/reference/scale_colour_continuous.html

We can change scales by adding scale_ functions to our plots:

R

ggplot(data = inat_year, mapping = aes(x = year, fill=iconic_taxon_name)) +
  geom_bar() +
  scale_fill_viridis_d()

Changing themes


We can assign a plot to an object.

R

myplot <- ggplot(data = inat_year, mapping = aes(x = year)) +
  geom_bar(fill='#75cd5e')

myplot

We can change the overall appearance using theme_ functions. Let’s try a black-and-white theme by adding theme_bw() to our plot:

R

myplot + 
  theme_bw()

To see a list of available themes in ggplot, visit https://ggplot2.tidyverse.org/reference/index.html#themes

To individually change parts of a plot, we can use the theme() function, which can take many different arguments to change things about the text, grid lines, background color, and more.

Let’s try changing the size of the text on our axis titles. We can do this by specifying that the axis.title should be an element_text() with size set to 14.

R

myplot +
  theme_bw() +
  theme(axis.title = element_text(size = 14))

Another change we might want to make is to remove the vertical grid lines. To do this, inside theme(), we will change panel.grid.major.x and panel.grid.minor.x to element_blank().

R

myplot +
  theme_bw() +
  theme(axis.title = element_text(size = 14), 
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

Because there are so many possible arguments to the theme() function, it can sometimes be hard to find the right one. Here are some tips for figuring out how to modify a plot element:

  • type out theme(), put your cursor between the parentheses, and hit Tab to bring up a list of arguments
    • you can scroll through the arguments, or start typing, which will shorten the list of potential matches
  • like many things in the tidyverse, similar arguments start with similar names
    • there are axis, legend, panel, plot, and strip arguments
  • arguments have hierarchy
    • text controls all text in the whole plot
    • axis.title controls the text for the axis titles
    • axis.title.x controls the text for the x axis title
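The hierarchy described above can be sketched like this. The sizes are arbitrary, and calc_element() is used only to inspect what the x axis title ends up with after inheritance.

```r
library(ggplot2)

# Build a theme that sets values at three levels of the hierarchy
my_theme <- theme_bw() +
  theme(
    text = element_text(size = 12),            # all text in the plot
    axis.title = element_text(size = 14),      # both axis titles
    axis.title.x = element_text(face = "bold") # only the x axis title
  )

# Resolve the final settings for the x axis title
x_title <- calc_element("axis.title.x", my_theme)
x_title$size
x_title$face
```

The x axis title should pick up its size from axis.title and its bold face from axis.title.x. You can add `my_theme` to any plot, for example `myplot + my_theme`.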

Changing labels


We can customize the axis labels and add a chart title with the labs() function.

R

myplot +
  theme_bw() +
  theme(axis.title = element_text(size = 14), 
        plot.title = element_text(face = "bold", size = 20)) +
  labs(title = "CNC Los Angeles",
       subtitle="Observations per year",
       x = "Year",
       y = "Observations")

Exercise 2: Customizing a plot

Take the chart from exercise 1 (your observations for each year) and assign it to an object.

Customize the appearance. Here are some ideas:

  • add a title
  • choose one of the built-in themes
  • change the axis titles
  • change the colors of the bars or lines

R

my_yearly_plot <- inat %>% 
  mutate(year = year(observed_on)) %>% 
  filter(user_login == 'natureinla') %>% 
  ggplot(mapping = aes(x = year)) +
  geom_bar(fill='#E0A90A')

my_yearly_plot +
  theme_classic() +
  labs(title = "CNC Los Angeles",  x = "Year",  y = "Observations") 

Exporting plots


Once we are happy with our plot, we can export it.

Assign the plot to an object. Then run ggsave() to save our plot. The first argument we give is the path to the file we want to save, including the correct file extension. You can save as jpg, pdf, tiff, or png. Next, we tell it the name of the plot object we want to save. We can also specify things like the width and height of the plot in inches.

R

finalplot <- myplot +
  theme_bw() +
  theme(axis.title = element_text(size = 14), 
        plot.title = element_text(face = "bold", size = 20)) +
  labs(title = "CNC Los Angeles",
       subtitle="Observations per year",
       x = "Year",
       y = "Observations")

R

ggsave(filename = 'data/cleaned/observations_per_year.jpg', plot = finalplot,  height = 6, width = 8)
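ggsave() picks the file format from the file extension. As a sketch, here is the same idea saving a PNG with an explicit resolution. The plot and the output location (a temporary folder) are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy plot
toy <- data.frame(year = c(2022, 2022, 2023))
toy_plot <- ggplot(toy, aes(x = year)) +
  geom_bar()

# Save to a temporary folder; the .png extension selects the PNG format
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(filename = out_file, plot = toy_plot,
       width = 8, height = 6, units = "in", dpi = 300)

# Check that the file was written
file.exists(out_file)
```

The dpi argument sets the resolution for image formats such as png and jpg.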

Content from Creating maps


Last updated on 2024-05-24

Overview

Questions

  • How do we create maps using R?

Objectives

  • Learn how to create static maps with ggplot2
  • Learn how to create interactive maps with mapview
  • Learn how to plot iNaturalist observations on a map

R

library(sf)
library(dplyr)
library(readr)
library(ggplot2)
library(mapview)

Geographic data


There are various file formats for geographic data. Examples include shapefiles for GIS applications, KML for Google Maps, and GeoJSON for web applications.

You can get boundaries for countries, states, cities, etc. from various sources. I googled “Los Angeles county boundary shape”, which had a link to “County Boundary | City of Los Angeles Hub - LA GeoHub” (https://geohub.lacity.org/datasets/10f1e37c065347e693cf4e8ee753c09b). I downloaded the shapefile for LA County.

You can also create your own boundaries using GIS applications or GIS web applications.

Mapping iNaturalist data


iNaturalist data includes latitude and longitude, which means we can put the observations on a map. There are several packages to create maps. We will use the ggplot2 and mapview packages.

First, read data from the cleaned iNaturalist observation file.

R

inat <- read_csv('data/cleaned/observations.csv')

OUTPUT

Rows: 93950 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (23): observed_on_string, time_observed_at, time_zone, user_login, user...
dbl  (10): id, user_id, num_identification_agreements, num_identification_di...
lgl   (5): captive_cultivated, private_place_guess, private_latitude, privat...
date  (1): observed_on

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

See all the column names. “latitude” and “longitude” are the column names we need.

R

names(inat)

OUTPUT

 [1] "id"                               "observed_on_string"              
 [3] "observed_on"                      "time_observed_at"                
 [5] "time_zone"                        "user_id"                         
 [7] "user_login"                       "user_name"                       
 [9] "created_at"                       "updated_at"                      
[11] "quality_grade"                    "license"                         
[13] "url"                              "image_url"                       
[15] "sound_url"                        "tag_list"                        
[17] "description"                      "num_identification_agreements"   
[19] "num_identification_disagreements" "captive_cultivated"              
[21] "oauth_application_id"             "place_guess"                     
[23] "latitude"                         "longitude"                       
[25] "positional_accuracy"              "private_place_guess"             
[27] "private_latitude"                 "private_longitude"               
[29] "public_positional_accuracy"       "geoprivacy"                      
[31] "taxon_geoprivacy"                 "coordinates_obscured"            
[33] "positioning_method"               "positioning_device"              
[35] "species_guess"                    "scientific_name"                 
[37] "common_name"                      "iconic_taxon_name"               
[39] "taxon_id"                        

We use the sf package to add geographic data to our dataframe. st_as_sf() from the sf package will take the longitude and latitude and add a geometry column that we can use for mapping.

  • We pass the longitude and latitude columns to the coords argument. The column names must be wrapped in quotes, with longitude first.
  • crs is the coordinate reference system. 4326 is the EPSG code for WGS 84.
  • remove=FALSE will keep the coordinate columns in the dataframe.

R

temp <- inat %>% 
  st_as_sf(coords = c("longitude", "latitude"),   crs = 4326, remove=FALSE)
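To see what st_as_sf() does on a small scale, here is a sketch with a two-row made-up data frame (`toy` and its coordinates are hypothetical).

```r
library(sf)

# Hypothetical toy data with coordinate columns
toy <- data.frame(
  id = c(1, 2),
  longitude = c(-118.2871, -118.2437),
  latitude = c(34.0170, 34.0522)
)

# Longitude (x) comes first, then latitude (y)
toy_sf <- st_as_sf(toy, coords = c("longitude", "latitude"),
                   crs = 4326, remove = FALSE)

# A geometry column was added, and the coordinate columns were kept
names(toy_sf)

# The coordinates stored in the geometry column
st_coordinates(toy_sf)
```

If you swap the order of the two column names, the points end up in the wrong place, so it is worth checking st_coordinates() when a map looks odd.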

Use names() to get a list of all the columns. A geometry column was added.

R

names(temp)

OUTPUT

 [1] "id"                               "observed_on_string"              
 [3] "observed_on"                      "time_observed_at"                
 [5] "time_zone"                        "user_id"                         
 [7] "user_login"                       "user_name"                       
 [9] "created_at"                       "updated_at"                      
[11] "quality_grade"                    "license"                         
[13] "url"                              "image_url"                       
[15] "sound_url"                        "tag_list"                        
[17] "description"                      "num_identification_agreements"   
[19] "num_identification_disagreements" "captive_cultivated"              
[21] "oauth_application_id"             "place_guess"                     
[23] "latitude"                         "longitude"                       
[25] "positional_accuracy"              "private_place_guess"             
[27] "private_latitude"                 "private_longitude"               
[29] "public_positional_accuracy"       "geoprivacy"                      
[31] "taxon_geoprivacy"                 "coordinates_obscured"            
[33] "positioning_method"               "positioning_device"              
[35] "species_guess"                    "scientific_name"                 
[37] "common_name"                      "iconic_taxon_name"               
[39] "taxon_id"                         "geometry"                        

Use select() to pick which columns to keep.

R

inat_map <- inat %>% 
  st_as_sf(coords = c("longitude", "latitude"),   crs = 4326, remove=FALSE) %>% 
  select(id, user_login, common_name, scientific_name, observed_on,  url, longitude, latitude, geometry) 

Static map


Use ggplot() to plot the observations. geom_sf() will use the geometry column to produce a map.

R

ggplot() +
  geom_sf(data = inat_map)  

There are some observations that are outside of Los Angeles. Use filter() to keep only the observations with a latitude below 40, which drops those outliers.

R

inat_map <- inat_map %>% 
  filter(latitude < 40)  

Create the map again with ggplot().

R

ggplot() +
  geom_sf(data = inat_map)  

Use dim() to show the number of rows and columns. There are over 90K rows.

R

dim(inat_map)

OUTPUT

[1] 93948     9

Interactive map


Use the mapview package to create interactive maps.

Since there are over 90K rows, an interactive map will be very slow. I suggest not using mapview if there are lots of rows.

To speed up the interactive map, let’s filter the list of observations. Get all observations for Western Fence Lizard.

R

inat_lizard <- inat_map %>% 
  filter(common_name == 'Western Fence Lizard')

Use dim() to get the number of rows. There are about 3,000 rows.

R

dim(inat_lizard)

OUTPUT

[1] 2936    9

Create an interactive map. You can zoom in and out. Click on an observation to see its info.

R

mapview(inat_lizard)

Working with other geographic files


Let’s add LA county boundaries to the map.

I downloaded the LA county boundaries from https://geohub.lacity.org/datasets/lacounty::county-boundaries/explore

Use read_sf() from the sf package to read the shapefile.

R

la_county <- read_sf('data/raw/County_Boundary/County_Boundary.shp')

la_county

OUTPUT

Simple feature collection with 7 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -118.9446 ymin: 32.79521 xmax: -117.6464 ymax: 34.8233
Geodetic CRS:  WGS 84
# A tibble: 7 × 18
   CITY CITY_ID CITY_TYPE      CITY_NAME     CITY_LABEL COLOR_CODE ABBR  CITY_NO
  <int>   <int> <chr>          <chr>         <chr>           <int> <chr>   <int>
1   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
2   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
3   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
4   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
5   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
6   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
7   250      31 Unincorporated Unincorporat… Unincorpo…          1 UNIN        0
# ℹ 10 more variables: DESCRIPTN <chr>, URL <chr>, PHONE <chr>,
#   OF_AREA_SM <int>, FEAT_TYPE <chr>, COMMENT <chr>, SUB_TYPE <int>,
#   COLOR <chr>, OBJECTID <int>, geometry <MULTIPOLYGON [°]>

Add the LA County boundary to the static and interactive maps.

R

ggplot() +
  geom_sf(data = la_county)  +
  geom_sf(data = inat_lizard) 

R

mapview(la_county) +
  mapview(inat_lizard) 

Exploring iNaturalist data


Let’s look for all iNaturalist observations made in Exposition Park.

I downloaded the boundaries for Exposition Park using this site https://wykhuh.github.io/draw-map-boundaries/

R

expo_park <- st_read('data/raw/boundaries_expo_park_area.geojson')  %>% 
  st_transform(4326)

OUTPUT

Reading layer `boundaries_expo_park_area' from data source 
  `/Users/wyk/Development/science/city_nature_challenge/NHMLA_workshop/CNC-coding-workshop/site/built/data/raw/boundaries_expo_park_area.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 2 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -118.2915 ymin: 34.01096 xmax: -118.2829 ymax: 34.01806
Geodetic CRS:  WGS 84

Plot a map of Expo Park.

R

ggplot() +
  geom_sf(data = expo_park) 

R

mapview(expo_park) 

We want to get the observations inside Expo Park.

First, check that the CRS for the iNaturalist data and the Expo Park boundary are the same.

R

st_crs(expo_park) == st_crs(inat_map)

OUTPUT

[1] TRUE
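If the check had returned FALSE, one way to fix it would be to transform one layer to the CRS of the other with st_transform(). Here is a sketch using a single made-up point. EPSG 3857 (Web Mercator) is chosen only as an example of a different CRS.

```r
library(sf)

# Hypothetical point in WGS 84 (EPSG 4326)
pt_wgs84 <- st_sfc(st_point(c(-118.2871, 34.0170)), crs = 4326)

# The same point in a different CRS (EPSG 3857, Web Mercator)
pt_projected <- st_transform(pt_wgs84, 3857)

# The two layers no longer share a CRS
st_crs(pt_wgs84) == st_crs(pt_projected)

# Transform back so both layers use the same CRS
pt_back <- st_transform(pt_projected, st_crs(pt_wgs84))
st_crs(pt_wgs84) == st_crs(pt_back)
```

Passing st_crs() of the other layer to st_transform() avoids having to look up the EPSG code yourself.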

Use st_intersection() to get all observations that are inside Exposition Park.

R

inat_expo <- inat_map %>% st_intersection(expo_park)

WARNING

Warning: attribute variables are assumed to be spatially constant throughout
all geometries
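st_intersection() also copies the columns from the boundary onto each observation, which is why the result has extra columns. If you only want to keep the rows that fall inside the boundary, st_filter() from sf is an alternative. This sketch compares the two on made-up data (`toy_park` and `toy_obs` are hypothetical).

```r
library(sf)

# Hypothetical square "park" with one attribute column
toy_park <- st_sf(
  name = "toy park",
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    crs = 4326
  )
)

# Three hypothetical observations: two inside the square, one outside
toy_obs <- st_as_sf(
  data.frame(id = 1:3, x = c(0.5, 0.2, 2), y = c(0.5, 0.8, 2)),
  coords = c("x", "y"), crs = 4326
)

# Keeps the points inside and adds the park's columns (may show the same warning)
inside_intersection <- st_intersection(toy_obs, toy_park)

# Keeps the points inside and leaves the columns unchanged
inside_filter <- st_filter(toy_obs, toy_park)

names(inside_intersection)
names(inside_filter)
```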

Use dim() to get the row and column counts. There are about 93K observations in LA County and 1,191 observations in Expo Park.

R

dim(inat_map)

OUTPUT

[1] 93948     9

R

dim(inat_expo)

OUTPUT

[1] 1191   11

Create static and interactive maps of all observations in Expo Park.

R

ggplot() +
  geom_sf(data = expo_park)  +
  geom_sf(data = inat_expo) 

R

mapview(expo_park) +
  mapview(inat_expo) 

Use alpha.regions to set the opacity of the boundary. Use col.regions to set its color.

R

mapview(expo_park,   alpha.regions=0.3, col.regions="#333333") +
  mapview(inat_expo)
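The static ggplot2 map can be styled in a similar way by setting fill and alpha in geom_sf(). This is a sketch on made-up data (`toy_park` and `toy_obs` are hypothetical), reusing colors from earlier in the lesson.

```r
library(sf)
library(ggplot2)

# Hypothetical square boundary
toy_park <- st_sf(
  name = "toy park",
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    crs = 4326
  )
)

# Hypothetical observations inside the boundary
toy_obs <- st_as_sf(
  data.frame(id = 1:2, x = c(0.3, 0.7), y = c(0.4, 0.6)),
  coords = c("x", "y"), crs = 4326
)

# fill sets the polygon color and alpha sets its opacity
p_map <- ggplot() +
  geom_sf(data = toy_park, fill = "#333333", alpha = 0.3) +
  geom_sf(data = toy_obs, color = "#75cd5e")

p_map
```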