Exploring and understanding data
Last updated on 2024-05-24
Overview
Questions
- How does R store and represent data?
Objectives
- Solve simple arithmetic operations in R.
- Use comments to document a script.
- Assign values to objects in R.
- Call functions and use arguments to change their default options.
- Understand vector types and missing data.
Setup
Simple arithmetic operations
You can use R to do simple calculations:
R
3 * 5
OUTPUT
[1] 15
R
3 + 5
OUTPUT
[1] 8
The results will be shown in the console.
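The other arithmetic operators work the same way. The short sketch below is an illustration rather than part of the original examples. It shows subtraction, division, exponentiation, and how parentheses change the order in which R evaluates an expression.

```r
10 - 4        # subtraction, prints 6
10 / 4        # division, prints 2.5
2 ^ 3         # exponentiation, prints 8
3 + 5 * 2     # multiplication is done before addition, prints 13
(3 + 5) * 2   # parentheses are evaluated first, prints 16
```

Text after a # sign is a comment and is ignored by R.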
Creating objects in R
A fundamental part of programming is assigning values to named
objects. The value is stored in memory, and we can refer to the value using
the name of the object. To create an object, we need to give it a name
followed by the assignment operator <-, and the value we want to give it.
R
rectangle_length <- 3
What we are doing here is taking the result of the code on the right
side of the arrow, and assigning it to an object whose name is on the
left side of the arrow. So, after executing rectangle_length <- 3, the value of
rectangle_length is 3.
In RStudio, typing Alt + - (pressing Alt at the same time as the - key) writes <-
in a single keystroke on a PC, while typing Option + - (pressing Option at the
same time as the - key) does the same on a Mac.
Objects are displayed in the Environment panel. They are stored in R's memory and can be accessed by typing the name of the object. If you restart R or RStudio, all objects are deleted from memory.
R
rectangle_length
OUTPUT
[1] 3
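You can also inspect and clean up memory from the console. The sketch below is an illustration, not part of the lesson's own examples. It uses a throwaway object with the hypothetical name scratch_value, so the objects used in the rest of this lesson are left alone. ls() lists the names of the objects in the current environment, rm() deletes an object, and exists() reports whether a name refers to an object.

```r
scratch_value <- 10                        # hypothetical throwaway object
was_listed <- "scratch_value" %in% ls()    # TRUE: ls() includes the new name
rm(scratch_value)                          # delete the object from memory
still_there <- exists("scratch_value")     # FALSE, assuming no other object has this name
was_listed
still_there
```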
Let’s create a second object.
R
rectangle_width <- 5
Now that R has rectangle_length and rectangle_width in memory, we can do
arithmetic with them.
R
rectangle_length * rectangle_width
OUTPUT
[1] 15
R
rectangle_length + rectangle_width
OUTPUT
[1] 8
We can also store the results in an object.
R
rectangle_area <- rectangle_length * rectangle_width
When assigning a value to an object, R does not print anything. You can force R to print the value by typing the object name:
R
rectangle_area <- rectangle_length * rectangle_width # doesn't print anything
rectangle_area # typing the name of the object prints the value of the object
OUTPUT
[1] 15
We can also change an object’s value by assigning it a new one:
R
rectangle_length <- 4
rectangle_length
OUTPUT
[1] 4
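Changing one object does not update other objects that were calculated from it earlier. The sketch below repeats the assignments from this section to illustrate that: rectangle_area keeps its old value until it is calculated again.

```r
rectangle_length <- 3
rectangle_width <- 5
rectangle_area <- rectangle_length * rectangle_width

rectangle_length <- 4    # change the length
rectangle_area           # still 15: the area was computed with the old length

rectangle_area <- rectangle_length * rectangle_width   # recalculate
rectangle_area           # now 20
```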
You will be naming a lot of objects in R, and there are a few common naming rules and conventions:
- make names clear without being too long
- names cannot start with a number
- names are case sensitive. rectangle_length is different from Rectangle_length.
- you cannot use R's reserved words, like if, else, or for
- avoid dots . in names
- two common formats are snake_case and camelCase
- be consistent, at least within a script, ideally within a whole project
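Two of these rules can be checked directly in the console. The sketch below is an illustration and assumes that no object called Rectangle_length exists in your session. The name 2nd_length is a made-up example of an invalid name. parse() only checks whether the code is valid R, and tryCatch() replaces the resulting error with a short message.

```r
rectangle_length <- 3

# Names are case sensitive: only the lowercase name was created above
lower_exists <- exists("rectangle_length")   # TRUE
upper_exists <- exists("Rectangle_length")   # FALSE

# A name that starts with a number is rejected as invalid code
starts_with_number <- tryCatch(
  parse(text = "2nd_length <- 3"),
  error = function(e) "not a valid name"
)
starts_with_number   # "not a valid name"
```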
Functions
Functions are lines of code that are grouped together to do something. The R language has many built-in functions. You can also install and load R packages, which contain functions and data written by other people, or create your own functions.
A function usually gets one or more inputs called arguments. Functions will do something with the arguments. Functions often (but not always) return a value. Executing a function (‘running it’) is called calling the function.
R has a function round() that rounds a number to a certain number of
decimal places. If we pass in 3.14159, it returns the value 3, because the
default is to round to the nearest whole number.
R
round(3.14159)
OUTPUT
[1] 3
To learn more about a function, you can type a ? in front of the name of
the function, which will bring up the official documentation for that function:
R
?round
Function documentation is written by the authors of the functions, so it can vary pretty widely in style and readability. The Description section describes what the function does. The Arguments section defines all the arguments for the function and is usually worth reading thoroughly. The Examples section at the end often has helpful examples that you can run to get a sense of what the function is doing.
args() will show the arguments of a function.
R
args(round)
OUTPUT
function (x, digits = 0)
NULL
round() takes two arguments: x and digits. If we want a different number
of digits, we can type digits = 2.
R
round(x = 3.14159, digits = 2)
OUTPUT
[1] 3.14
If you provide the arguments in exactly the same order as they are defined, you don’t have to name them:
R
round(3.14159, 2)
OUTPUT
[1] 3.14
And if you do name the arguments, you can switch their order:
R
round(digits = 2, x = 3.14159)
OUTPUT
[1] 3.14
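The same ideas apply to functions you write yourself. As a minimal sketch, the hypothetical function area_of_rectangle() below (its name and argument names are invented for this illustration) takes two arguments. The second argument has a default value, just as digits does in round().

```r
# Hypothetical function: width_value defaults to 1 when it is not supplied
area_of_rectangle <- function(length_value, width_value = 1) {
  length_value * width_value
}

area_of_rectangle(3)                                   # uses the default width, returns 3
area_of_rectangle(3, 5)                                # arguments matched by position, returns 15
area_of_rectangle(width_value = 5, length_value = 3)   # named arguments in any order, returns 15
```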
Data types in R
Objects can store different types of values such as numbers, letters, etc. These different types of data are called data types.
The function typeof() indicates the type of an object.
The three common data types we will use in this class are:
- numeric, aka double - all numbers with and without decimals.
R
my_number <- 1
typeof(my_number)
OUTPUT
[1] "double"
R
my_number_2 <- 2.2
typeof(my_number_2)
OUTPUT
[1] "double"
- character - text values. The text must be wrapped in quotes ("" or '').
R
my_character <- 'dog'
typeof(my_character)
OUTPUT
[1] "character"
- logical - can only have two values: TRUE and FALSE. Both must be written in capital letters.
R
my_logical <- TRUE
typeof(my_logical)
OUTPUT
[1] "logical"
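Besides typeof(), base R has functions such as is.numeric(), is.character(), and is.logical() that test for a specific type. The sketch below is an illustration with made-up object names. It also shows that a number typed with an L suffix is stored as an integer rather than a double, and that quotes turn anything into a character value.

```r
my_integer <- 1L           # the L suffix asks for an integer
typeof(my_integer)         # "integer"
is.numeric(my_integer)     # TRUE: integers and doubles both count as numeric

quoted_number <- '1'       # quotes make this a character value, not a number
typeof(quoted_number)      # "character"
is.numeric(quoted_number)  # FALSE

is.logical(TRUE)           # TRUE
is.logical('TRUE')         # FALSE: with quotes it is a character value
```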
Vectors
A vector is a collection of values. We can assign a series of values
to a vector using the c() function. All values in a vector must be the
same data type.
Create a numeric vector.
R
my_numbers <- c(1, 2, 5)
my_numbers
OUTPUT
[1] 1 2 5
R
typeof(my_numbers)
OUTPUT
[1] "double"
Create a character vector.
R
my_words <- c('the', 'dog')
my_words
OUTPUT
[1] "the" "dog"
R
typeof(my_words)
OUTPUT
[1] "character"
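Because a vector holds several values of the same type, many functions and operators work on all of its values at once. The sketch below is an illustration that reuses the my_numbers vector from above.

```r
my_numbers <- c(1, 2, 5)

length(my_numbers)   # number of values in the vector: 3
my_numbers * 2       # each value is multiplied by 2: 2 4 10
sum(my_numbers)      # all values added together: 8
```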
If you try to create a vector with multiple types, R will coerce all the values to the same type.
When there are numbers and characters in a vector, all values are coerced to character.
R
mixed <- c(1, 2, 'three')
mixed
OUTPUT
[1] "1" "2" "three"
R
typeof(mixed)
OUTPUT
[1] "character"
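Coercion also applies to logical values. The sketch below is an illustration with made-up object names: mixed with numbers, TRUE and FALSE become 1 and 0, and mixed with characters they become text. Going the other way, as.numeric() converts character values that look like numbers back into numbers.

```r
with_logical <- c(1, TRUE, FALSE)
with_logical               # 1 1 0
typeof(with_logical)       # "double"

logical_and_text <- c(TRUE, 'dog')
logical_and_text           # "TRUE" "dog"
typeof(logical_and_text)   # "character"

back_to_numbers <- as.numeric(c('1', '2'))
back_to_numbers            # 1 2
```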
Missing data
When dealing with data, there are times when a record does not have a
value for a field. Imagine filling out a form, and leaving some of the
fields blank. R represents missing data as NA, without quotes. Let’s make
a numeric vector with an NA value:
R
ages <- c(25, 34, NA, 42)
ages
OUTPUT
[1] 25 34 NA 42
min() returns the minimum value in a vector. If we pass a vector containing
NA to a numeric function like min(), R cannot know what the minimum is, so
it returns NA:
R
min(ages)
OUTPUT
[1] NA
Many basic math functions have an na.rm argument that removes NA values
from the vector before doing the calculation.
R
min(ages, na.rm = TRUE)
OUTPUT
[1] 25
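To find out where the missing values are, base R provides is.na(), which returns TRUE for every NA in a vector. The sketch below is an illustration that reuses the ages vector. It counts the missing values and shows that other functions, such as max() and sum(), accept na.rm in the same way as min().

```r
ages <- c(25, 34, NA, 42)

is.na(ages)               # FALSE FALSE TRUE FALSE
sum(is.na(ages))          # number of missing values: 1

max(ages, na.rm = TRUE)   # 42
sum(ages, na.rm = TRUE)   # 101
```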
Comments
All programming languages allow the programmer to include comments in their code to explain the code.
To do this in R we use the # character. Anything to the right of the # sign and up to the end of the line is treated as a comment and is ignored by R. You can start lines with comments or include them after any code on the line.
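As a small illustration (a sketch reusing the rectangle objects from earlier, not the page's original example), the code below has a comment on its own line, comments after code, and a line of code that has been commented out so that R skips it.

```r
# Calculate the area of a rectangle (this whole line is a comment)
rectangle_length <- 3    # a comment can follow code on the same line
rectangle_width <- 5
# rectangle_width <- 10  # this line is commented out, so the width stays 5
rectangle_length * rectangle_width   # prints 15
```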
RStudio makes it easy to comment or uncomment several lines at once: select the lines you want to comment, then press Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor anywhere on that line (no need to select the whole line) and press Ctrl + Shift + C.