#install.packages("nycflights13")
library(nycflights13)
?flights #data documentation
#uncomment below to look at the data
#head(flights)
#summary(flights)
#colnames(flights)Data Wrangling & Summarizing
Note that throughout this tutorial, I have “comments” on my code, which is non-coding language included in the script. They are marked with a # sign at the beginning. Some codes, like the installation codes have been commented out, but please feel free to uncomment them as you need.
In this section, we will be using a pre-existing dataset in ‘nycflights13’ package. (Yes! Some R packages include data that you can play around with.) The dataset that we will be using is named flights. As we learned in the last section, you first need to install the package, if you have not done so already. Then, you need to bring it up, using the library code.
Tidyverse
There are many ways to play around with data on R, but we will use the tidyverse package. Recall that tidyverse is a collection of packages that includes lots of cool tools: ggplot2, dplyr, tidyr, readr. You can also individually download each package. For more information, please see here.
#install.packages("tidyverse")
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Filtering Data
What to do if you want to look at only a subset of data? Well, filter() is the function that you are looking for!
flights %>% filter(month > 7) # A tibble: 141,193 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 10 1 447 500 -13 614 648
2 2013 10 1 522 517 5 735 757
3 2013 10 1 536 545 -9 809 855
4 2013 10 1 539 545 -6 801 827
5 2013 10 1 539 545 -6 917 933
6 2013 10 1 544 550 -6 912 932
7 2013 10 1 549 600 -11 653 716
8 2013 10 1 550 600 -10 648 700
9 2013 10 1 550 600 -10 649 659
10 2013 10 1 551 600 -9 727 730
# ℹ 141,183 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
For those who have never seen %\>% sign before, please note that we will be using this a lot. We call it “a pipe”, and we use it to link two codes. Instead of typing one % sign, >, and another %, you can press ctrl + shift + M (Windows) or cmd + shift + M (Mac).
You can even have multiple filters.
flights %>% filter(month == 2) %>%
filter(day == 14) #Flights on Valentine's Day# A tibble: 956 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 2 14 7 2352 15 448 437
2 2013 2 14 59 2339 80 205 106
3 2013 2 14 454 500 -6 641 648
4 2013 2 14 510 515 -5 750 814
5 2013 2 14 531 530 1 828 831
6 2013 2 14 541 540 1 850 850
7 2013 2 14 542 545 -3 1014 1023
8 2013 2 14 551 600 -9 831 906
9 2013 2 14 552 600 -8 657 708
10 2013 2 14 553 600 -7 902 856
# ℹ 946 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(dest == "ATL" | dest == "SFO") #Flights going to ATL OR SFO# A tibble: 30,546 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 554 600 -6 812 837
2 2013 1 1 558 600 -2 923 937
3 2013 1 1 600 600 0 837 825
4 2013 1 1 606 610 -4 837 845
5 2013 1 1 611 600 11 945 931
6 2013 1 1 615 615 0 833 842
7 2013 1 1 655 700 -5 1037 1045
8 2013 1 1 658 700 -2 944 939
9 2013 1 1 729 730 -1 1049 1115
10 2013 1 1 734 737 -3 1047 1113
# ℹ 30,536 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>% filter(dest == "ATL" & day == "14") #Flights going to ATL and on the 14th day# A tibble: 562 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 14 553 600 -7 813 815
2 2013 1 14 553 600 -7 818 825
3 2013 1 14 555 600 -5 827 837
4 2013 1 14 612 615 -3 839 842
5 2013 1 14 624 630 -6 935 904
6 2013 1 14 649 615 34 950 855
7 2013 1 14 655 659 -4 927 938
8 2013 1 14 728 730 -2 1004 1004
9 2013 1 14 806 759 7 1031 1039
10 2013 1 14 807 810 -3 1039 1044
# ℹ 552 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
So far, we’ve looked at subsets of data. You can also save these subsets into separate objects for future reference.
late_xmas <- flights %>%
filter(arr_time > sched_arr_time) %>%
filter(month == 12, day == 25) #always use a different name for newly created dataset!
#Note that when you save the data as an object, R will not automatically show the result.
head(late_xmas)# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 12 25 600 600 0 850 846
2 2013 12 25 610 610 0 922 915
3 2013 12 25 612 615 -3 859 852
4 2013 12 25 620 615 5 818 817
5 2013 12 25 627 625 2 1010 1009
6 2013 12 25 628 630 -2 949 919
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Summarizing Data
summarise()
Remember summary() from basic R? There is a similar function in the tidyverse package. summarise() allows you to look at the mean, maximum, and minimum value of a variable. For instance, let’s see how late flights are in their departure time.
flights %>%
summarise(avg_late = mean(dep_delay, na.rm = TRUE),
most_late = max(dep_delay, na.rm = TRUE))# A tibble: 1 × 2
avg_late most_late
<dbl> <dbl>
1 12.6 1301
#Note that na.rm = TRUE argument removes all the missing values.group_by()
What if we wanted to know these statistics for each carrier in our dataset? group_by lets you reshape the data by variables of your choice.
carrier_late <- flights %>%
group_by(carrier) %>%
summarise(avg_late = mean(dep_delay, na.rm = TRUE),
most_late = max(dep_delay, na.rm = TRUE))
carrier_late# A tibble: 16 × 3
carrier avg_late most_late
<chr> <dbl> <dbl>
1 9E 16.7 747
2 AA 8.59 1014
3 AS 5.80 225
4 B6 13.0 502
5 DL 9.26 960
6 EV 20.0 548
7 F9 20.2 853
8 FL 18.7 602
9 HA 4.90 1301
10 MQ 10.6 1137
11 OO 12.6 154
12 UA 12.1 483
13 US 3.78 500
14 VX 12.9 653
15 WN 17.7 471
16 YV 19.0 387
Mutating Data
mutate()
Not only can we look at parts of existing data, we can add a new variable to the existing data. What you are looking for is mutate() function, which allows you to combine or alter existing variables.
Let’s see some examples.
#converting distance to thousands of miles
fl.2 <- flights %>% mutate(dist_miles = distance/1000)
head(fl.2)# A tibble: 6 × 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, dist_miles <dbl>
#For the flights departing from JFK, how late did each plane arrive at the destination?
fl.3 <- flights %>%
filter(origin == "JFK") %>%
mutate(late = arr_time - sched_arr_time)
head(fl.3)# A tibble: 6 × 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 542 540 2 923 850
2 2013 1 1 544 545 -1 1004 1022
3 2013 1 1 557 600 -3 838 846
4 2013 1 1 558 600 -2 849 851
5 2013 1 1 558 600 -2 853 856
6 2013 1 1 558 600 -2 924 917
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, late <int>
tally()
If you are simply curious about how many flights each carrier has, you can use a function called tally() in combination with group_by. This is a useful function to calculate percentages as well.
flights %>% group_by(carrier) %>% tally()# A tibble: 16 × 2
carrier n
<chr> <int>
1 9E 18460
2 AA 32729
3 AS 714
4 B6 54635
5 DL 48110
6 EV 54173
7 F9 685
8 FL 3260
9 HA 342
10 MQ 26397
11 OO 32
12 UA 58665
13 US 20536
14 VX 5162
15 WN 12275
16 YV 601
flights %>% group_by(origin) %>% tally()# A tibble: 3 × 2
origin n
<chr> <int>
1 EWR 120835
2 JFK 111279
3 LGA 104662
flights %>% group_by(carrier, origin) %>% tally()# A tibble: 35 × 3
# Groups: carrier [16]
carrier origin n
<chr> <chr> <int>
1 9E EWR 1268
2 9E JFK 14651
3 9E LGA 2541
4 AA EWR 3487
5 AA JFK 13783
6 AA LGA 15459
7 AS EWR 714
8 B6 EWR 6557
9 B6 JFK 42076
10 B6 LGA 6002
# ℹ 25 more rows
flights %>% group_by(carrier, origin) %>% tally() %>% mutate(percentage = n/sum(n))# A tibble: 35 × 4
# Groups: carrier [16]
carrier origin n percentage
<chr> <chr> <int> <dbl>
1 9E EWR 1268 0.0687
2 9E JFK 14651 0.794
3 9E LGA 2541 0.138
4 AA EWR 3487 0.107
5 AA JFK 13783 0.421
6 AA LGA 15459 0.472
7 AS EWR 714 1
8 B6 EWR 6557 0.120
9 B6 JFK 42076 0.770
10 B6 LGA 6002 0.110
# ℹ 25 more rows
If the tally() function is empty, R will just count up the number of observations and create a new variable named n. You can also add up a variable of your choice by putting something inside the tally() function.
Section 3 Task:
Below is a set of assignments that you can try on your own. The answers are below the questions, but I highly encourage that you try to answer the questions on your own without looking at the solutions.
Use flights dataset from ‘nycflights13’ package for this assignment
Subset the data to include only United Airlines flights.
Hint: Subset the data that meets the following condition [carrier == “UA”].
Answer:
flights %>% filter(carrier == "UA")# A tibble: 58,665 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 554 558 -4 740 728
4 2013 1 1 558 600 -2 924 917
5 2013 1 1 558 600 -2 923 937
6 2013 1 1 559 600 -1 854 902
7 2013 1 1 607 607 0 858 915
8 2013 1 1 611 600 11 945 931
9 2013 1 1 623 627 -4 933 932
10 2013 1 1 628 630 -2 1016 947
# ℹ 58,655 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
- Save the above subset of data as “flights_UA”.
Answer:
flights_UA <- flights %>% filter(carrier == "UA")Using the data from Q2 (flights_UA), calculate the mean distance traveled, and show it by their destination.
Hint(1): Destination is saved as “dest” in the dataset, and distance is “distance”.
Hint(2): First do “group_by”, and then use “summarise”.
Hint(3): Your code should have “mean(distance, na.rm = TRUE)” somewhere.
Answer:
flights_UA %>%
group_by(dest) %>%
summarise(mean_dist = mean(distance, na.rm = TRUE))# A tibble: 47 × 2
dest mean_dist
<chr> <dbl>
1 ANC 3370
2 ATL 746
3 AUS 1504
4 BDL 116
5 BOS 200
6 BQN 1585
7 BZN 1882
8 CHS 628
9 CLE 406.
10 CLT 529
# ℹ 37 more rows