Grouping

A very common task is to split a data frame into subgroups, operate somehow on each group, and then bind the results back into one data frame: split-do-bind.

Let’s do something simple with the flight data: split into groupings by origin and carrier and compute the mean dep_delay.

1 data.frame

The classic way to do this involves the split, lapply and rbind (with help from do.call) functions. To split onto two groups we need to make a charcater (or factor) vector with compound meaning, origin_carrier, and split on that.

df = read_flights(type = "data.frame")
o_c = paste(df$origin, df$carrier, sep = "_")
df = split(df, o_c) |>
  lapply(
    function(grp){
      data.frame(origin = grp$origin[1],
                 carrier = grp$carrier[1],
                 mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE))
    }
  )
df = do.call(rbind, df)
df

       origin carrier mean_dep_delay
EWR_9E    EWR      9E       5.951667
EWR_AA    EWR      AA      10.035419
EWR_AS    EWR      AS       5.804775
EWR_B6    EWR      B6      13.100262
EWR_DL    EWR      DL      12.084592
EWR_EV    EWR      EV      20.164931
EWR_MQ    EWR      MQ      17.467268
EWR_OO    EWR      OO      20.833333
EWR_UA    EWR      UA      12.522869
EWR_US    EWR      US       3.735104
EWR_VX    EWR      VX      11.927378
EWR_WN    EWR      WN      17.864376
JFK_9E    JFK      9E      19.001517
JFK_AA    JFK      AA      10.302155
JFK_B6    JFK      B6      12.757453
JFK_DL    JFK      DL       8.333188
JFK_EV    JFK      EV      18.520362
JFK_HA    JFK      HA       4.900585
JFK_MQ    JFK      MQ      13.199971
JFK_UA    JFK      UA       7.900000
JFK_US    JFK      US       5.866959
JFK_VX    JFK      VX      13.279441
LGA_9E    LGA      9E       8.894182
LGA_AA    LGA      AA       6.705769
LGA_B6    LGA      B6      14.805738
LGA_DL    LGA      DL       9.572997
LGA_EV    LGA      EV      19.125500
LGA_F9    LGA      F9      20.215543
LGA_FL    LGA      FL      18.726075
LGA_MQ    LGA      MQ       8.528569
LGA_OO    LGA      OO      10.434783
LGA_UA    LGA      UA      12.087916
LGA_US    LGA      US       3.306505
LGA_WN    LGA      WN      17.557000
LGA_YV    LGA      YV      18.996330

2 tibble

Here we simply tag the tibble with the desired grouping. Our first attempt is to apply an anonymous function to each group using group_map.

tbl = read_flights(type = "tibble")
tbl = dplyr::group_by(tbl, origin, carrier) |>
  dplyr::group_map(
    function(grp, key){
      key |>
        dplyr::mutate(mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE))
    }
  ) |>
  dplyr::bind_rows()
tbl

# A tibble: 35 × 3
   origin carrier mean_dep_delay
   <chr>  <chr>            <dbl>
 1 EWR    9E                5.95
 2 EWR    AA               10.0 
 3 EWR    AS                5.80
 4 EWR    B6               13.1 
 5 EWR    DL               12.1 
 6 EWR    EV               20.2 
 7 EWR    MQ               17.5 
 8 EWR    OO               20.8 
 9 EWR    UA               12.5 
10 EWR    US                3.74
# ℹ 25 more rows

But keep in mind that many of the dplyr functions are “group-aware”, so we could simplify…

tbl = read_flights(type = "tibble")

tbl = dplyr::group_by(tbl, origin, carrier) |>
  dplyr::mutate(mean_dep_delay = mean(.data$dep_delay, na.rm = TRUE)) |>
  dplyr::slice(1) |>
  dplyr::select(origin, carrier, mean_dep_delay) |>
  dplyr::ungroup()
tbl

# A tibble: 35 × 3
   origin carrier mean_dep_delay
   <chr>  <chr>            <dbl>
 1 EWR    9E                5.95
 2 EWR    AA               10.0 
 3 EWR    AS                5.80
 4 EWR    B6               13.1 
 5 EWR    DL               12.1 
 6 EWR    EV               20.2 
 7 EWR    MQ               17.5 
 8 EWR    OO               20.8 
 9 EWR    UA               12.5 
10 EWR    US                3.74
# ℹ 25 more rows

3 data.table

This is quite different. I’m wobbly enough at data.table to know how to do this, but not know why I do this step this particular way.

dt = read_flights(type = "data.table")

dt[, 
    mean(.SD[[1]], na.rm = TRUE), 
    by = .(origin, carrier), 
    .SDcols = "dep_delay"]

    origin carrier        V1
    <char>  <char>     <num>
 1:    EWR      UA 12.522869
 2:    LGA      UA 12.087916
 3:    JFK      AA 10.302155
 4:    JFK      B6 12.757453
 5:    LGA      DL  9.572997
 6:    EWR      B6 13.100262
 7:    LGA      EV 19.125500
 8:    LGA      AA  6.705769
 9:    JFK      UA  7.900000
10:    LGA      B6 14.805738
11:    LGA      MQ  8.528569
12:    EWR      AA 10.035419
13:    JFK      DL  8.333188
14:    EWR      MQ 17.467268
15:    EWR      DL 12.084592
16:    EWR      US  3.735104
17:    EWR      EV 20.164931
18:    JFK      US  5.866959
19:    LGA      WN 17.557000
20:    JFK      VX 13.279441
21:    LGA      FL 18.726075
22:    EWR      AS  5.804775
23:    LGA      US  3.306505
24:    JFK      MQ 13.199971
25:    JFK      9E 19.001517
26:    LGA      F9 20.215543
27:    EWR      WN 17.864376
28:    JFK      HA  4.900585
29:    JFK      EV 18.520362
30:    EWR      9E  5.951667
31:    LGA      9E  8.894182
32:    LGA      YV 18.996330
33:    LGA      OO 10.434783
34:    EWR      VX 11.927378
35:    EWR      OO 20.833333
    origin carrier        V1
    <char>  <char>     <num>

4 tidytable

This starts out quite similar to tibble, but you’ll note that there is no need to create an anonymous function to use within group_map(). Instead, the tidytable helper functions are “group-aware”.

tt = read_flights(type = "tidytable")
tt = tidytable::group_by(tt, origin, carrier) |>
  tidytable::mutate(mean_dep_delay = mean(.data$dep_delay, na.rm = TRUE)) |>
  tidytable::slice(1) |>
  tidytable::select(origin, carrier, mean_dep_delay) |>
  tidytable::ungroup()
tt

# A tidytable: 35 × 3
   origin carrier mean_dep_delay
   <chr>  <chr>            <dbl>
 1 EWR    UA               12.5 
 2 LGA    UA               12.1 
 3 JFK    AA               10.3 
 4 JFK    B6               12.8 
 5 LGA    DL                9.57
 6 EWR    B6               13.1 
 7 LGA    EV               19.1 
 8 LGA    AA                6.71
 9 JFK    UA                7.9 
10 LGA    B6               14.8 
# ℹ 25 more rows

If you have a more complex function to apply to each step, you can revert back to the more basic split-do-bind approach seen with data.frame.

tt = read_flights(type = "tidytable")
tt = tidytable::group_by(tt, origin, carrier) |>
  tidytable::group_split() |>
  lapply(
    function(grp){
      tidytable::tidytable(
        origin = grp$origin[1],
        carrier = grp$carrier[1],
        mean_dep_delay = mean(grp$dep_delay, na.rm = TRUE)
      )
    }
  ) |>
  tidytable::bind_rows()
tt

# A tidytable: 35 × 3
   origin carrier mean_dep_delay
   <chr>  <chr>            <dbl>
 1 EWR    UA               12.5 
 2 LGA    UA               12.1 
 3 JFK    AA               10.3 
 4 JFK    B6               12.8 
 5 LGA    DL                9.57
 6 EWR    B6               13.1 
 7 LGA    EV               19.1 
 8 LGA    AA                6.71
 9 JFK    UA                7.90
10 LGA    B6               14.8 
# ℹ 25 more rows