Data Wrangling & Summarizing

Note that throughout this tutorial, I have “comments” on my code, which is non-coding language included in the script. They are marked with a # sign at the beginning. Some codes, like the installation codes have been commented out, but please feel free to uncomment them as you need.

In this section, we will be using a pre-existing dataset in ‘nycflights13’ package. (Yes! Some R packages include data that you can play around with.) The dataset that we will be using is named flights. As we learned in the last section, you first need to install the package, if you have not done so already. Then, you need to bring it up, using the library code.

#install.packages("nycflights13")
library(nycflights13)

?flights #data documentation 
#uncomment below to look at the data
#head(flights) 
#summary(flights)
#colnames(flights)

Tidyverse

There are many ways to play around with data on R, but we will use the tidyverse package. Recall that tidyverse is a collection of packages that includes lots of cool tools: ggplot2, dplyr, tidyr, readr. You can also individually download each package. For more information, please see here.

#install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Filtering Data

What to do if you want to look at only a subset of data? Well, filter() is the function that you are looking for!

flights %>% filter(month > 7) 
# A tibble: 141,193 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     1      447            500       -13      614            648
 2  2013    10     1      522            517         5      735            757
 3  2013    10     1      536            545        -9      809            855
 4  2013    10     1      539            545        -6      801            827
 5  2013    10     1      539            545        -6      917            933
 6  2013    10     1      544            550        -6      912            932
 7  2013    10     1      549            600       -11      653            716
 8  2013    10     1      550            600       -10      648            700
 9  2013    10     1      550            600       -10      649            659
10  2013    10     1      551            600        -9      727            730
# ℹ 141,183 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

For those who have never seen %\>% sign before, please note that we will be using this a lot. We call it “a pipe”, and we use it to link two codes. Instead of typing one % sign, >, and another %, you can press ctrl + shift + M (Windows) or cmd + shift + M (Mac).

You can even have multiple filters.

flights %>% filter(month == 2) %>% 
  filter(day == 14) #Flights on Valentine's Day
# A tibble: 956 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     2    14        7           2352        15      448            437
 2  2013     2    14       59           2339        80      205            106
 3  2013     2    14      454            500        -6      641            648
 4  2013     2    14      510            515        -5      750            814
 5  2013     2    14      531            530         1      828            831
 6  2013     2    14      541            540         1      850            850
 7  2013     2    14      542            545        -3     1014           1023
 8  2013     2    14      551            600        -9      831            906
 9  2013     2    14      552            600        -8      657            708
10  2013     2    14      553            600        -7      902            856
# ℹ 946 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(dest == "ATL" | dest == "SFO") #Flights going to ATL OR SFO
# A tibble: 30,546 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      554            600        -6      812            837
 2  2013     1     1      558            600        -2      923            937
 3  2013     1     1      600            600         0      837            825
 4  2013     1     1      606            610        -4      837            845
 5  2013     1     1      611            600        11      945            931
 6  2013     1     1      615            615         0      833            842
 7  2013     1     1      655            700        -5     1037           1045
 8  2013     1     1      658            700        -2      944            939
 9  2013     1     1      729            730        -1     1049           1115
10  2013     1     1      734            737        -3     1047           1113
# ℹ 30,536 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(dest == "ATL" & day == "14") #Flights going to ATL and on the 14th day
# A tibble: 562 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    14      553            600        -7      813            815
 2  2013     1    14      553            600        -7      818            825
 3  2013     1    14      555            600        -5      827            837
 4  2013     1    14      612            615        -3      839            842
 5  2013     1    14      624            630        -6      935            904
 6  2013     1    14      649            615        34      950            855
 7  2013     1    14      655            659        -4      927            938
 8  2013     1    14      728            730        -2     1004           1004
 9  2013     1    14      806            759         7     1031           1039
10  2013     1    14      807            810        -3     1039           1044
# ℹ 552 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

So far, we’ve looked at subsets of data. You can also save these subsets into separate objects for future reference.

late_xmas <- flights %>%
  filter(arr_time > sched_arr_time) %>%
  filter(month == 12, day == 25) #always use a different name for newly created dataset!
#Note that when you save the data as an object, R will not automatically show the result.
head(late_xmas)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013    12    25      600            600         0      850            846
2  2013    12    25      610            610         0      922            915
3  2013    12    25      612            615        -3      859            852
4  2013    12    25      620            615         5      818            817
5  2013    12    25      627            625         2     1010           1009
6  2013    12    25      628            630        -2      949            919
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Summarizing Data

summarise()

Remember summary() from basic R? There is a similar function in the tidyverse package. summarise() allows you to look at the mean, maximum, and minimum value of a variable. For instance, let’s see how late flights are in their departure time.

flights %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))
# A tibble: 1 × 2
  avg_late most_late
     <dbl>     <dbl>
1     12.6      1301
#Note that na.rm = TRUE argument removes all the missing values.

group_by()

What if we wanted to know these statistics for each carrier in our dataset? group_by lets you reshape the data by variables of your choice.

carrier_late <- flights %>%
  group_by(carrier) %>%
  summarise(avg_late = mean(dep_delay, na.rm = TRUE),
            most_late = max(dep_delay, na.rm = TRUE))
carrier_late
# A tibble: 16 × 3
   carrier avg_late most_late
   <chr>      <dbl>     <dbl>
 1 9E         16.7        747
 2 AA          8.59      1014
 3 AS          5.80       225
 4 B6         13.0        502
 5 DL          9.26       960
 6 EV         20.0        548
 7 F9         20.2        853
 8 FL         18.7        602
 9 HA          4.90      1301
10 MQ         10.6       1137
11 OO         12.6        154
12 UA         12.1        483
13 US          3.78       500
14 VX         12.9        653
15 WN         17.7        471
16 YV         19.0        387

Mutating Data

mutate()

Not only can we look at parts of existing data, we can add a new variable to the existing data. What you are looking for is mutate() function, which allows you to combine or alter existing variables.

Let’s see some examples.

#converting distance to thousands of miles
fl.2 <- flights %>% mutate(dist_miles = distance/1000)
head(fl.2)
# A tibble: 6 × 20
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, dist_miles <dbl>
#For the flights departing from JFK, how late did each plane arrive at the destination? 
fl.3 <- flights %>%
  filter(origin == "JFK") %>% 
  mutate(late = arr_time - sched_arr_time) 
head(fl.3)
# A tibble: 6 × 20
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      542            540         2      923            850
2  2013     1     1      544            545        -1     1004           1022
3  2013     1     1      557            600        -3      838            846
4  2013     1     1      558            600        -2      849            851
5  2013     1     1      558            600        -2      853            856
6  2013     1     1      558            600        -2      924            917
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, late <int>

tally()

If you are simply curious about how many flights each carrier has, you can use a function called tally() in combination with group_by. This is a useful function to calculate percentages as well.

flights %>% group_by(carrier) %>% tally()
# A tibble: 16 × 2
   carrier     n
   <chr>   <int>
 1 9E      18460
 2 AA      32729
 3 AS        714
 4 B6      54635
 5 DL      48110
 6 EV      54173
 7 F9        685
 8 FL       3260
 9 HA        342
10 MQ      26397
11 OO         32
12 UA      58665
13 US      20536
14 VX       5162
15 WN      12275
16 YV        601
flights %>% group_by(origin) %>% tally()
# A tibble: 3 × 2
  origin      n
  <chr>   <int>
1 EWR    120835
2 JFK    111279
3 LGA    104662
flights %>% group_by(carrier, origin) %>% tally()
# A tibble: 35 × 3
# Groups:   carrier [16]
   carrier origin     n
   <chr>   <chr>  <int>
 1 9E      EWR     1268
 2 9E      JFK    14651
 3 9E      LGA     2541
 4 AA      EWR     3487
 5 AA      JFK    13783
 6 AA      LGA    15459
 7 AS      EWR      714
 8 B6      EWR     6557
 9 B6      JFK    42076
10 B6      LGA     6002
# ℹ 25 more rows
flights %>% group_by(carrier, origin) %>% tally() %>% mutate(percentage = n/sum(n))
# A tibble: 35 × 4
# Groups:   carrier [16]
   carrier origin     n percentage
   <chr>   <chr>  <int>      <dbl>
 1 9E      EWR     1268     0.0687
 2 9E      JFK    14651     0.794 
 3 9E      LGA     2541     0.138 
 4 AA      EWR     3487     0.107 
 5 AA      JFK    13783     0.421 
 6 AA      LGA    15459     0.472 
 7 AS      EWR      714     1     
 8 B6      EWR     6557     0.120 
 9 B6      JFK    42076     0.770 
10 B6      LGA     6002     0.110 
# ℹ 25 more rows

If the tally() function is empty, R will just count up the number of observations and create a new variable named n. You can also add up a variable of your choice by putting something inside the tally() function.

Section 3 Task:

Below is a set of assignments that you can try on your own. The answers are below the questions, but I highly encourage that you try to answer the questions on your own without looking at the solutions.

Use flights dataset from ‘nycflights13’ package for this assignment

  1. Subset the data to include only United Airlines flights.

    Hint: Subset the data that meets the following condition [carrier == “UA”].

Answer:

flights %>% filter(carrier == "UA")
# A tibble: 58,665 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      554            558        -4      740            728
 4  2013     1     1      558            600        -2      924            917
 5  2013     1     1      558            600        -2      923            937
 6  2013     1     1      559            600        -1      854            902
 7  2013     1     1      607            607         0      858            915
 8  2013     1     1      611            600        11      945            931
 9  2013     1     1      623            627        -4      933            932
10  2013     1     1      628            630        -2     1016            947
# ℹ 58,655 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  1. Save the above subset of data as “flights_UA”.

Answer:

flights_UA <- flights %>% filter(carrier == "UA")
  1. Using the data from Q2 (flights_UA), calculate the mean distance traveled, and show it by their destination.

    Hint(1): Destination is saved as “dest” in the dataset, and distance is “distance”.

    Hint(2): First do “group_by”, and then use “summarise”.

    Hint(3): Your code should have “mean(distance, na.rm = TRUE)” somewhere.

Answer:

flights_UA %>% 
      group_by(dest) %>%
      summarise(mean_dist = mean(distance, na.rm = TRUE))
# A tibble: 47 × 2
   dest  mean_dist
   <chr>     <dbl>
 1 ANC       3370 
 2 ATL        746 
 3 AUS       1504 
 4 BDL        116 
 5 BOS        200 
 6 BQN       1585 
 7 BZN       1882 
 8 CHS        628 
 9 CLE        406.
10 CLT        529 
# ℹ 37 more rows