Basics of R

In this section, we review the basic structure of R. R has many built-in functions, ranging from basic arithmetic operations to regression models. They are like the basic apps that come preinstalled on a new smartphone: you can use them right away, without going to the app store and downloading anything. Of course, you can also install more advanced “packages” in R, just like you can download other apps on your phone! (More on this in a bit.)

First, let’s start with very basic arithmetic operations.

1) Basic Arithmetic Operations

# +: addition
5 + 3

[1] 8

# -: subtraction
5 - 3

[1] 2

# *: multiplication
5 * 3

[1] 15

# /: division
5 / 3

[1] 1.666667

# ^: exponentiation
5 ^ 3

[1] 125

2) Assigning Values to Variables

R is an object-based language, which means that we can save values into “objects”. To do so, you can use either = or <-, although <- is the preferred style. An object name can be anything of your choice, as long as it does NOT start with a number or contain a space.

# use "<-" or "=" to assign 
result <- 5 + 3 
result = 5 + 3  # "=" also works, but "<-" is preferred for assignments
class(result)   # check class of the object

[1] "numeric"
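Once a value is stored, the object can be reused in later calculations. The short sketch below also illustrates the naming rules; the object name total_2 and the two invalid example names are made up for illustration, and make.names() is a base R helper that shows how R would repair names that break the rules.

```r
result <- 5 + 3
result * 2               # 16: the stored value is reused

total_2 <- result + 1    # digits are fine as long as they are not first
total_2                  # 9

# make.names() repairs names that start with a number or contain a space
make.names(c("2nd_try", "my result"))   # "X2nd_try" "my.result"
```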

You can also assign text as objects (use quotation marks). If you forget the quotation marks, R will look for an object with that name and give you an error when it cannot find one.

fruit <- "apple"
fruit

[1] "apple"

class(fruit)    # character

[1] "character"
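To see what the missing-quotes error looks like without stopping your script, here is a small sketch that wraps the assignment in tryCatch(). It assumes that no object called apple exists in your session; the name msg is made up for illustration. In an English locale the message typically reads: object 'apple' not found.

```r
# Assumption: no object named `apple` exists in the current session.
# Without quotation marks, R treats apple as an object name, not as text.
msg <- tryCatch(
  {
    fruit <- apple        # quotation marks forgotten on purpose
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg   # the error message R produced
```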

Reassigning a value to an existing object replaces the original.

fruit <- "banana"
fruit

[1] "banana"

3) Logical Operators

5 > 3     # TRUE

[1] TRUE

5 < 3     # FALSE

[1] FALSE

5 <= 5    # TRUE

[1] TRUE

5 == 5    # equal to

[1] TRUE

5 != 5    # not equal to

[1] FALSE

x <- 5 > 3
x

[1] TRUE

class(x)  # logical

[1] "logical"
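Each comparison above returns a single TRUE or FALSE. As a short sketch that goes one step beyond the examples above, comparisons can be combined with & (and), | (or), and ! (not). The object names both, either, and negate are made up for illustration.

```r
both   <- (5 > 3) & (5 == 5)   # TRUE only if both comparisons are TRUE
either <- (5 > 3) | (5 < 3)    # TRUE if at least one comparison is TRUE
negate <- !(5 > 3)             # flips TRUE to FALSE

both     # TRUE
either   # TRUE
negate   # FALSE
```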

4) Vectors

A vector combines multiple values of the same type.

# numeric vector
num.vec <- c(25, 21, 18, 29, 35)
num.vec

[1] 25 21 18 29 35

class(num.vec)

[1] "numeric"

length(num.vec)

[1] 5

# character vector
cha.vec <- c("cake", "banana", "dog", "apple")
cha.vec

[1] "cake"   "banana" "dog"    "apple"

class(cha.vec)

[1] "character"

length(cha.vec)

[1] 4

# logical vector
log.vec <- c(5 > 3, 5 < 3, 5 == 5, 5 != 5)
log.vec

[1]  TRUE FALSE  TRUE FALSE

class(log.vec)

[1] "logical"

length(log.vec)

[1] 4

Combine multiple vectors:

metavector <- c(num.vec, cha.vec, log.vec)
metavector

 [1] "25"     "21"     "18"     "29"     "35"     "cake"   "banana" "dog"   
 [9] "apple"  "TRUE"   "FALSE"  "TRUE"   "FALSE"

class(metavector)    # character (coercion)

[1] "character"

length(metavector)

[1] 13

Reminder: A vector must have a single type. Mixed types will be coerced.
Coercion hierarchy: character > numeric > logical

metavector2 <- c(num.vec, log.vec)
metavector2    # TRUE → 1, FALSE → 0

[1] 25 21 18 29 35  1  0  1  0

class(metavector2)  # numeric

[1] "numeric"
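Because TRUE is coerced to 1 and FALSE to 0, arithmetic functions can count logical values. The sketch below recreates log.vec and num.vec with the same values as above so that it runs on its own, and also shows explicit coercion with the as.*() functions.

```r
log.vec <- c(5 > 3, 5 < 3, 5 == 5, 5 != 5)   # TRUE FALSE TRUE FALSE
num.vec <- c(25, 21, 18, 29, 35)

sum(log.vec)        # number of TRUE values: 2
mean(log.vec)       # share of TRUE values: 0.5
sum(num.vec < 30)   # how many elements are below 30: 4

# Explicit coercion
as.numeric(log.vec)     # 1 0 1 0
as.character(num.vec)   # "25" "21" "18" "29" "35"
```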

Subset a vector:

num.vec[1]          # first element

[1] 25

num.vec[-1]         # all except first

[1] 21 18 29 35

num.vec[2:4]        # 2nd to 4th

[1] 21 18 29

num.vec[num.vec < 30]  # elements < 30

[1] 25 21 18 29
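The last example works because the condition inside the brackets is itself a logical vector, and R keeps only the positions that are TRUE. Here is a brief sketch of that mechanism, plus selecting several positions at once with c(). The name keep is made up for illustration.

```r
num.vec <- c(25, 21, 18, 29, 35)

keep <- num.vec < 30    # TRUE TRUE TRUE TRUE FALSE
num.vec[keep]           # same as num.vec[num.vec < 30]: 25 21 18 29
which(keep)             # positions that are TRUE: 1 2 3 4

num.vec[c(1, 5)]        # 1st and 5th elements: 25 35
num.vec[num.vec > 20 & num.vec < 30]   # both conditions must hold: 25 21 29
```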

Create numeric sequences:

1:9    # the colon operator already returns a vector, so c() is not needed

[1] 1 2 3 4 5 6 7 8 9

rep(x = 1, times = 9)

[1] 1 1 1 1 1 1 1 1 1

seq(from = 1, to = 9, by = 1)

[1] 1 2 3 4 5 6 7 8 9

seq(1, 9, 2)

[1] 1 3 5 7 9

seq(9, 1, -2)

[1] 9 7 5 3 1

Basic vector calculations:

max(num.vec)

[1] 35

min(num.vec)

[1] 18

range(num.vec)

[1] 18 35

sum(num.vec)

[1] 128

mean(num.vec)

[1] 25.6

prod(num.vec)

[1] 9591750

sd(num.vec)

[1] 6.69328

var(num.vec)

[1] 44.8

sqrt(num.vec)

[1] 5.000000 4.582576 4.242641 5.385165 5.916080

sort(num.vec)

[1] 18 21 25 29 35

sort(cha.vec)

[1] "apple"  "banana" "cake"   "dog"

sort(log.vec)

[1] FALSE FALSE  TRUE  TRUE
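Like sqrt(), ordinary arithmetic is applied to every element of a vector. The sketch below shows this, checks that sd() is the square root of var(), and sorts in descending order with the decreasing argument of sort().

```r
num.vec <- c(25, 21, 18, 29, 35)

num.vec * 2                        # 50 42 36 58 70
num.vec - mean(num.vec)            # deviation of each element from the mean

# The standard deviation is the square root of the variance
all.equal(sd(num.vec), sqrt(var(num.vec)))   # TRUE

sort(num.vec, decreasing = TRUE)   # 35 29 25 21 18
```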

5) Packages and Libraries

Just like smartphones, R has many ready-to-use applications. We call them “packages”. Inside a package, you can find different functions and datasets.

Just as you have to install and open an app on your smartphone, you need to install and load a package before using it.

⚠️ You have to “load” the packages every time you start a new R session (for example, each time you open RStudio), using the library() function.

library(MASS)

If you get an error saying that the package cannot be found, install it first and then run library(MASS) again:

install.packages("MASS")

You only have to install a package ONCE. (Just like you only need to download an app ONCE.)

# Example: convert decimal to fraction
x <- 5 / 3
fractions(x)

[1] 5/3
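If you are unsure whether a package is already installed, one common pattern is to check first and install only when needed. The sketch below uses requireNamespace() for the check and also shows the package::function() form, which calls a single function without loading the whole package. The object name frac is made up for illustration.

```r
# Install MASS only if it is not already available
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}

# Call one function without library(): package::function()
frac <- MASS::fractions(5 / 3)
frac
```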

6) Uploading Data

Now we finally introduce the tidyverse package, one of the most commonly used packages for cleaning, manipulating, and visualizing data. It is actually a collection of multiple packages, such as dplyr and ggplot2. If you want to know more, please refer to the tidyverse website.

Let’s first install the tidyverse package and read in the data. ✅ If you would like to follow along, the data used here is available on the Topics page.

#install.packages("tidyverse")  # install it once
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::select() masks MASS::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dta <- read_csv("data/example.csv", col_names = TRUE)

Rows: 44 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): id, gender, division, year, participation, homework, midterm, final...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
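If you do not have the data file at hand, the reading step can be tried on a tiny file you create yourself. The sketch below is an assumption-laden stand-in, not the course data: it writes three rows (copied from the first rows of the head(dta) output in this section, with only four of the eight columns) to a temporary file and reads them back with read.csv(), base R's counterpart to read_csv(). The names toy, tmp, and toy_read are made up for illustration.

```r
# Build a tiny data frame and save it as a CSV in a temporary location
toy <- data.frame(
  id         = c(1, 2, 3),
  gender     = c(0, 0, 0),
  midterm    = c(89, 85, 74),
  final_exam = c(64, 82, 81)
)
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read the file back; header = TRUE treats the first line as column names
toy_read <- read.csv(tmp, header = TRUE)
dim(toy_read)        # 3 rows, 4 columns
colnames(toy_read)
```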

Below is a set of basic R functions that are useful to explore your data. (In the next section, we will start to use other functions in tidyverse to manipulate and wrangle the data.)

head(dta) # first few rows

# A tibble: 6 × 8
     id gender division  year participation homework midterm final_exam
  <dbl>  <dbl>    <dbl> <dbl>         <dbl>    <dbl>   <dbl>      <dbl>
1     1      0        1     2           100       99      89         64
2     2      0        1     2           100       97      85         82
3     3      0        1     2           100       98      74         81
4     4      0        1     2           100       99      85         89
5     5      1        1     2           100      100      90         80
6     6      0        1     2           100      100      94         96

tail(dta) # last few rows

# A tibble: 6 × 8
     id gender division  year participation homework midterm final_exam
  <dbl>  <dbl>    <dbl> <dbl>         <dbl>    <dbl>   <dbl>      <dbl>
1    39      0        3     4            25       56      40         65
2    40      0        3     4            50       95      65         76
3    41      1        3     4            25       62      60         70
4    42      0        4     2           100       94      78         74
5    43      1        4     3           100      100      90         90
6    44      1        4     4           100      100      98         85

colnames(dta) # column names

[1] "id"            "gender"        "division"      "year"         
[5] "participation" "homework"      "midterm"       "final_exam"

dim(dta) # dimension of the data

[1] 44  8

summary(dta) # summary of all variables

       id            gender          division          year      
 Min.   : 1.00   Min.   :0.0000   Min.   :1.000   Min.   :2.000  
 1st Qu.:11.75   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:2.000  
 Median :22.50   Median :0.0000   Median :1.000   Median :2.000  
 Mean   :22.50   Mean   :0.3182   Mean   :1.932   Mean   :2.523  
 3rd Qu.:33.25   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:3.000  
 Max.   :44.00   Max.   :1.0000   Max.   :4.000   Max.   :4.000  
 participation       homework         midterm         final_exam   
 Min.   :  0.00   Min.   : 56.00   Min.   : 40.00   Min.   :46.00  
 1st Qu.: 75.00   1st Qu.: 82.75   1st Qu.: 70.00   1st Qu.:66.50  
 Median :100.00   Median : 97.00   Median : 81.00   Median :80.50  
 Mean   : 84.09   Mean   : 90.39   Mean   : 78.43   Mean   :76.39  
 3rd Qu.:100.00   3rd Qu.: 99.00   3rd Qu.: 89.00   3rd Qu.:86.25  
 Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :97.00

You can access variables (columns) with the $ operator, which links a dataset to one of its column names (dataset$column).

dta$gender

 [1] 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
[39] 0 0 1 0 1 1

dta$participation

 [1] 100 100 100 100 100 100 100 100  75  75 100 100  75 100 100 100  75  75   0
[20] 100  75  50  75 100  50 100 100 100 100 100 100 100 100  75  50  75  75 100
[39]  25  50  25 100 100 100
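Because the $ operator returns an ordinary vector, the vector functions from section 4 work on it directly. A minimal sketch on a toy data frame (the name toy is made up; its six rows copy the id, gender, and midterm values from the head(dta) output above):

```r
toy <- data.frame(
  id      = 1:6,
  gender  = c(0, 0, 0, 0, 1, 0),
  midterm = c(89, 85, 74, 85, 90, 94)
)

mean(toy$midterm)               # average midterm score
table(toy$gender)               # how many rows have each gender code
toy$midterm[toy$gender == 1]    # midterm scores where gender equals 1
```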

You should also remember the “brackets” method, dta[row, column], which extracts a specific row, column, or cell.

dta[1,]

# A tibble: 1 × 8
     id gender division  year participation homework midterm final_exam
  <dbl>  <dbl>    <dbl> <dbl>         <dbl>    <dbl>   <dbl>      <dbl>
1     1      0        1     2           100       99      89         64

dta[,1]

# A tibble: 44 × 1
      id
   <dbl>
 1     1
 2     2
 3     3
 4     4
 5     5
 6     6
 7     7
 8     8
 9     9
10    10
# ℹ 34 more rows

dta[1,1]

# A tibble: 1 × 1
     id
  <dbl>
1     1
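Notice that dta[1,1] still returns a 1 × 1 tibble rather than a bare number. The sketch below uses a toy base R data frame (the name toy is made up; its rows copy values from the head(dta) output above) to show the general [row, column] pattern, selection by column name and by condition, and [[ ]] for pulling out a single value. One caveat: a base data frame simplifies toy[, "midterm"] to a plain vector, whereas a tibble keeps a one-column tibble, as in the dta[,1] output above.

```r
toy <- data.frame(
  id         = 1:6,
  midterm    = c(89, 85, 74, 85, 90, 94),
  final_exam = c(64, 82, 81, 89, 80, 96)
)

toy[2, ]                              # second row, all columns
toy[, "midterm"]                      # one column by name (a vector here)
toy[toy$midterm >= 90, ]              # rows that meet a condition
toy[c(1, 3), c("id", "final_exam")]   # rows 1 and 3, two columns
toy[[1, "midterm"]]                   # a single value: 89
```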