Iterations

Iteration - just what the heck is it really? The word comes from the Latin ‘iterare’ which means to repeat. Computer programs are perfect for doing repetitious things which releases the coder from mundane tasks so we can spend more time checking the internet. But they do have an initial cost - somebody has to code up the iteration. That’s what this tutorial is about: how to code up iterations.

The most famous iteration framework is the for-loop, but it has cousins in the repeat-until and while-do constructs. The R language has additional tools built in for iterating. These are in the apply family, including lapply, sapply, mapply, tapply and vapply. Third-party packages, such as those from the Tidyverse and data.table, also provide tools for iteration.

The for-loop

The for-loop has a very simple design…

for (each of these-things){
  do this
}

Technically, we don’t need to add the curly-braced code block, it could be simply…

for (each of these-things) do this

… but it is always better to err on the side of readability - a code block is simply easier to read than a one-liner.
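To make the two layouts concrete, here is a minimal runnable sketch. The counters total_a and total_b are made up for illustration; both loops do the same work and differ only in layout.

```r
# One-liner form: the body is a single expression, no braces
total_a = 0
for (i in 1:3) total_a = total_a + i

# Code-block form: same loop, same result, easier to extend later
total_b = 0
for (i in 1:3){
  total_b = total_b + i
}

total_a == total_b
```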

Some example data

Let’s make a simple named vector to work with.

x = c(A = "Alan", B = "Beryl", C = "Carlos")

How to iterate over these-things

What goes into the (each of these-things)? You have three choices here: a positional index (1,2,3), a name (A, B, C) or the items themselves (Alan, Beryl, Carlos). Each time the loop runs, it selects one item from your these-things and “delivers” it into the code block.

Iterate by position

Here’s an example using positional indices. Note that we chose to use seq_along(x) to generate our sequence: 1, 2, 3. Using seq_along(x) or seq_len(length(x)) is considered good practice because, unlike the commonly seen 1:length(x), each produces an empty sequence if you happen to have an x with no elements (it happens!), so the loop is simply skipped instead of running with bogus indices.

for (i in seq_along(x)){
  cat("Hello,", x[i], "you are number", i, "\n")
}
Hello, Alan you are number 1 
Hello, Beryl you are number 2 
Hello, Carlos you are number 3 
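Why fuss over seq_along()? The sketch below compares it with 1:length(), which you will often see in the wild, on an empty vector. The names x_empty, n_bad and n_good are made up for this illustration.

```r
x_empty = character(0)

# 1:length(x_empty) is 1:0, which counts DOWN and yields two indices: 1, 0
bad_index = 1:length(x_empty)

# seq_along() yields a zero-length sequence instead
good_index = seq_along(x_empty)

# Count how many times each loop body runs
n_bad = 0
for (i in bad_index) n_bad = n_bad + 1

n_good = 0
for (i in good_index) n_good = n_good + 1

cat("1:length() ran the body", n_bad, "times; seq_along() ran it", n_good, "times\n")
```

With 1:length() the body runs twice even though there is nothing to iterate over; with seq_along() the loop is skipped.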

Using positional indices is often very useful, but keep in mind you can control the order of the sequence if you like. Note here how we reverse the order.

for (i in rev(seq_along(x))){
  cat("Hello,", x[i], "you are number", i, "\n")
}
Hello, Carlos you are number 3 
Hello, Beryl you are number 2 
Hello, Alan you are number 1 

Iterate by name

Of course our object with three elements is a named vector. So we could use the names to iterate.

for (letter in names(x)){
  cat("Your name is", x[[letter]], "and your letter is", letter, "\n")
}
Your name is Alan and your letter is A 
Your name is Beryl and your letter is B 
Your name is Carlos and your letter is C 

Here, too, you may arrange the order as you please.

for (letter in c("B", "A", "C")){
  cat("Your name is", x[[letter]], "and your letter is", letter, "\n")
}
Your name is Beryl and your letter is B 
Your name is Alan and your letter is A 
Your name is Carlos and your letter is C 

Warning

Is your vector or list unnamed? What do you think will happen if you try to iterate by name? Will it throw an error? Skip the loop? Burst into flames and delete your hard drive?
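If you would rather check than guess, here is a small sketch you can run. The unnamed vector y and the counter n_runs are made up for this illustration.

```r
# The same people, but without names
y = c("Alan", "Beryl", "Carlos")

# An unnamed vector has no names attribute at all
names(y)

# Count how many times the loop body runs
n_runs = 0
for (letter in names(y)){
  n_runs = n_runs + 1
}
n_runs
```

names(y) is NULL, and a for-loop over NULL has nothing to deliver, so the body never runs. No error and no flames, which also means no warning that anything was skipped.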

Iterate by item

Finally, we can iterate by item instead of by position or name.

for (this_person in x){
  cat("Hi there,", this_person, "I don't know your letter or position\n")
}
Hi there, Alan I don't know your letter or position
Hi there, Beryl I don't know your letter or position
Hi there, Carlos I don't know your letter or position

Note that iterating by item means we lose the context of order - we don’t know in the code block anything about where in x we come from, and often it doesn’t matter.

Iteration with lapply and sapply

lapply and sapply are very similar functions. Each accepts a multi-element object over which it iterates. At each iteration it applies a function of your choosing to the current element. The function you choose must accept one argument. Once again, we can iterate over positional indices, names (if any) or over the items themselves.

Unlike for-loops, the *apply (where * is “l”, “s”, etc) functions actually return something to you, which opens up all sorts of programming possibilities. Here’s a bit of pseudo-code using a generic *apply. The value of result will hold one modified_thing for every element of these_things.

result = *apply(these_things,
                function(this_thing){
                  modified_thing = do_something(this_thing)
                  return(modified_thing)
                })

lapply()

Let’s try it by iterating over names in our example data.

result = lapply(names(x),
                function(name){
                  r = paste("Your name is", x[[name]], "and your name has", 
                            nchar(x[[name]]), "characters", sep = " ")
                  return(r)
                })

So what is in result?

class(result)
[1] "list"

How many things does it have?

length(result)
[1] 3

What are those things?

print(result)
[[1]]
[1] "Your name is Alan and your name has 4 characters"

[[2]]
[1] "Your name is Beryl and your name has 5 characters"

[[3]]
[1] "Your name is Carlos and your name has 6 characters"

The function used doesn’t have to be constructed on the fly (which we call an anonymous or unnamed function), but could be any function listed only by name. This makes it look a bit weird to the untrained eye. But as long as the function accepts one argument, R will deliver that argument for you. Below we used a previously defined function, nchar(), as an example. Note that we switch to iterating over the elements rather than the names of the elements.

result = lapply(x, nchar)
print(result)
$A
[1] 4

$B
[1] 5

$C
[1] 6

That seems to work, but how did the result get the names A, B and C while it didn’t before? Result naming happens because R tries hard to be useful. When you iterate over a named vector or list, R will transfer those names to the resulting output (very helpful!). Earlier we iterated over names(x), which is itself an unnamed character vector, so there were no names to transfer.
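Here is a small sketch of that naming behavior side by side. The variable names by_name, by_item and by_name_named are made up, and using setNames() to carry the names along is just one possible approach.

```r
x = c(A = "Alan", B = "Beryl", C = "Carlos")

# Iterating over names(x): the input is an unnamed character vector,
# so the output list is unnamed too
by_name = lapply(names(x), function(name) nchar(x[[name]]))
names(by_name)

# Iterating over x itself: the names are carried over to the output
by_item = lapply(x, nchar)
names(by_item)

# One way to get both: give the vector of names its own names first
by_name_named = lapply(stats::setNames(names(x), names(x)),
                       function(name) nchar(x[[name]]))
names(by_name_named)
```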

So, lapply produces a list. What about sapply()?

sapply()

sapply() operates just like lapply() except it tries very hard to simplify the output into a vector or matrix if possible. If R can’t figure out how to simplify the output (for example, when the function returns results of differing lengths), then it returns a list.

result = sapply(x, nchar)

Let’s query this output the same way.

So what is in result?

class(result)
[1] "integer"

How many things does it have?

length(result)
[1] 3

What are those things?

print(result)
A B C 
4 5 6 

Oh, interesting, unlike lapply which produced a named list, sapply produced a named vector - thus “simplified”!

You can suppress the simplification to make sapply behave just like lapply.

result = sapply(x, nchar, simplify = FALSE) |>
  print()
$A
[1] 4

$B
[1] 5

$C
[1] 6

So, sapply can produce a simple vector, but it can also produce a list. What about producing a matrix? And do you even want that? (Answer: sometimes!) A matrix is produced when the function returns a vector of more than one element, and every call returns a vector of the same length. Below we define an anonymous function that returns a two-element vector (name and value). Note that we are now iterating over the names of x rather than the elements of x.

result = sapply(names(x), 
                function(name){
                  return(c(name, x[[name]]))
                })

Let’s query the output…

str(result)
 chr [1:2, 1:3] "A" "Alan" "B" "Beryl" "C" "Carlos"
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "A" "B" "C"

Well, there we go, a 2-row, 3-column matrix.

print(result)
     A      B       C       
[1,] "A"    "B"     "C"     
[2,] "Alan" "Beryl" "Carlos"

Tip

If you are like most mortals you may be disappointed that the matrix looks sideways. It feels like it should be a column of names (A, B and C) followed by a column of items (Alan, Beryl and Carlos) rather than a row of each. You can switch things to make it look righter using the matrix transpose function t().

t(result)
  [,1] [,2]    
A "A"  "Alan"  
B "B"  "Beryl" 
C "C"  "Carlos"

That feels better somehow.
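Because sapply() decides the output shape for you, it can surprise you. The sketch below shows it falling back to a list when the pieces have different lengths, and then shows vapply() (mentioned at the top of this tutorial), which makes you declare the shape of each result up front. The helper letters_of and the variable names are made up for this illustration.

```r
x = c(A = "Alan", B = "Beryl", C = "Carlos")

# A helper that splits a string into its individual characters
letters_of = function(s) strsplit(s, "")[[1]]

# The results have lengths 4, 5 and 6, so sapply() cannot simplify
# and hands back a list
ragged = sapply(x, letters_of)
class(ragged)

# vapply() is told that each result is exactly one integer
lens = vapply(x, nchar, FUN.VALUE = integer(1))
lens

# If a result does not match the declared shape, vapply() throws an error
# rather than quietly changing the output type
outcome = tryCatch(vapply(x, letters_of, FUN.VALUE = character(1)),
                   error = function(e) "vapply refused")
outcome
```

Whether the stricter behavior is worth the extra typing is a judgment call; it tends to pay off inside functions where you rely on the output type.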

Groups in a table

It is very common to have a flat table (data.frame) where rows are grouped, not in contiguous rows perhaps, but by a common value in a particular column. In fact, you may need multiple groupings - nested groups, where we first group by “this” and then within each of those groups we group by “that”. There’s no limit on how many nested groups you might envision in a table. The dplyr package and friends provide easy utilities for iterating over these groups. In this case, the process is loosely described as mapping a portion of a table (a group) to a function. There are a number of these utilities, but we’ll focus on group_map().

An example table

We have created a function that produces a table in the E pluribus unum tutorial. The table is synthetic data about a study on fictitious guppies.

source("setup.R")
here() starts at /Users/ben/Library/CloudStorage/Dropbox/code/projects/handytandy
x = read_guppy() |>
  print()
# A tibble: 105 × 11
   site_id       x        y time                researcher shade gps_codes    id
   <chr>     <dbl>    <dbl> <dttm>              <chr>      <dbl> <chr>     <dbl>
 1 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         1
 2 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         2
 3 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         3
 4 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         4
 5 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         5
 6 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         6
 7 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         7
 8 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         8
 9 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         9
10 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q        10
# ℹ 95 more rows
# ℹ 3 more variables: treatment <chr>, count <dbl>, dose <dbl>

Set up the groups

Let’s do something very simple; “for each treatment-shade group, let’s craft a table of the minimum and maximum doses, as well as the mean count.” Note that we use the same iterative language we used above, “for each thing do this”, where in this case “thing” means treatment-shade group.

First, let’s see how many treatment-shade groups we should expect.

ngroups = dplyr::count(x, treatment, shade) |>
  print()
# A tibble: 19 × 3
   treatment shade     n
   <chr>     <dbl> <int>
 1 A            20     3
 2 A            30     3
 3 A            40     3
 4 A            50     9
 5 B            20     1
 6 B            30     7
 7 B            40     5
 8 B            50    12
 9 C            20     4
10 C            30     3
11 C            40     6
12 C            50    11
13 D            30     3
14 D            40     6
15 D            50    10
16 E            20     2
17 E            30     2
18 E            40     5
19 E            50    10

OK, there are 19 treatment-shade groups so we should expect our summary table to have 19 rows (one for each treatment-shade group). This is good info to have to help check our work.

Next we assign to the table the identity of the grouping variables - that is, which columns contain the treatment and shade identifiers.

x = dplyr::group_by(x, treatment, shade) |>
  print()
# A tibble: 105 × 11
# Groups:   treatment, shade [19]
   site_id       x        y time                researcher shade gps_codes    id
   <chr>     <dbl>    <dbl> <dttm>              <chr>      <dbl> <chr>     <dbl>
 1 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         1
 2 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         2
 3 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         3
 4 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         4
 5 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         5
 6 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         6
 7 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         7
 8 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         8
 9 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q         9
10 site_01 452032. 4857765. 2020-05-16 07:26:25 JPO           30 c-q-q        10
# ℹ 95 more rows
# ℹ 3 more variables: treatment <chr>, count <dbl>, dose <dbl>

The only change here is that we see a notation, in the header, about the groupings. We can access information about the groupings, including the names of the grouping variables…

dplyr::groups(x)
[[1]]
treatment

[[2]]
shade

… a table of the keys (the values of each group)…

dplyr::group_keys(x)
# A tibble: 19 × 2
   treatment shade
   <chr>     <dbl>
 1 A            20
 2 A            30
 3 A            40
 4 A            50
 5 B            20
 6 B            30
 7 B            40
 8 B            50
 9 C            20
10 C            30
11 C            40
12 C            50
13 D            30
14 D            40
15 D            50
16 E            20
17 E            30
18 E            40
19 E            50

… or even a listing of the row numbers for each group…

dplyr::group_data(x)
# A tibble: 19 × 3
   treatment shade       .rows
   <chr>     <dbl> <list<int>>
 1 A            20         [3]
 2 A            30         [3]
 3 A            40         [3]
 4 A            50         [9]
 5 B            20         [1]
 6 B            30         [7]
 7 B            40         [5]
 8 B            50        [12]
 9 C            20         [4]
10 C            30         [3]
11 C            40         [6]
12 C            50        [11]
13 D            30         [3]
14 D            40         [6]
15 D            50        [10]
16 E            20         [2]
17 E            30         [2]
18 E            40         [5]
19 E            50        [10]

This is perfect! Just what we might need to iterate over each grouping. The first group (treatment = A, shade = 20) has three records which we find at rows 73, 74 and 75. Now we could iterate over each row of the group_data(x) output, fishing out the rows for each and then making our computations. But…

The dplyr package provides a set of tools that perform much of what we have seen with lapply() and sapply() - running an iterator, applying a function and returning something. These functions are group_map(), group_walk() and group_modify(). We’ll focus only upon group_map(), but do check out the others. Also, keep in mind there are hundreds of examples and tutorials on the web that you can follow.

Just like with lapply() and sapply(), we need a function to operate upon each group. Unlike lapply() and sapply(), the function we use (or create) must accept two positional arguments. The first, which we call tbl (short for table), contains the data for the group, while the second, which we call key, contains a tiny table of the key for that group. In our example, the key for each iteration would be one row of the output of group_keys(x) that you see above.

Note

You can call those two arguments whatever you like. We adopt tbl and key only as a convention that we like. You can devise your own convention, or use different argument names every time (as long as your head doesn’t explode.)
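To see what actually arrives in those two arguments, here is a small sketch on a made-up table (toy is not the guppy data, and describe_group is a hypothetical helper). It assumes the dplyr package is installed.

```r
# A tiny made-up table with two treatment-shade groups
toy = data.frame(treatment = c("A", "A", "B"),
                 shade     = c(20, 20, 30),
                 dose      = c(0.1, 0.5, 0.9))

# Report what the function receives for each group
describe_group = function(tbl, key){
  paste0("treatment ", key$treatment, ", shade ", key$shade, ": ",
         nrow(tbl), " row(s) with columns ",
         paste(colnames(tbl), collapse = ", "))
}

descriptions = dplyr::group_by(toy, treatment, shade) |>
  dplyr::group_map(describe_group)

print(descriptions)
```

Note that, by default, tbl holds only the non-grouping columns (here just dose); the grouping values travel separately in key.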

Here’s a reminder of our goal: “for each treatment-shade group, let’s craft a table of the minimum and maximum doses, as well as the mean count.” So, our function needs to compute the min and max doses, as well as the mean count. Since key will be a little table with treatment and shade, we can simply mutate that, adding columns for min_dose, max_dose and mean_count. Of course, the actual data for dose and count will be in the first argument, tbl.

summarize_one_group = function(tbl, key){
  key = key |>
    dplyr::mutate(min_dose = min(tbl$dose),
                  max_dose = max(tbl$dose),
                  mean_count = mean(tbl$count))
  return(key)
}

OK, so how do we use the function? It’s easy: much as you did for, say, lapply(), you provide the thing to iterate over and the function to apply at each iteration.

result = dplyr::group_map(x, summarize_one_group)

But what is in the result? (Hopefully 19!)

length(result) 
[1] 19

Let’s look at just the first three items returned.

result[seq_len(3)]
[[1]]
# A tibble: 1 × 5
  treatment shade min_dose max_dose mean_count
  <chr>     <dbl>    <dbl>    <dbl>      <dbl>
1 A            20        0      0.8       56.7

[[2]]
# A tibble: 1 × 5
  treatment shade min_dose max_dose mean_count
  <chr>     <dbl>    <dbl>    <dbl>      <dbl>
1 A            30      0.4      0.9       49.7

[[3]]
# A tibble: 1 × 5
  treatment shade min_dose max_dose mean_count
  <chr>     <dbl>    <dbl>    <dbl>      <dbl>
1 A            40        0      0.3         52

OK - so we got back 19 tables. Well, those are easy to bind into one table.

result = dplyr::bind_rows(result) |>
  print()
# A tibble: 19 × 5
   treatment shade min_dose max_dose mean_count
   <chr>     <dbl>    <dbl>    <dbl>      <dbl>
 1 A            20      0        0.8       56.7
 2 A            30      0.4      0.9       49.7
 3 A            40      0        0.3       52  
 4 A            50      0        1         49.9
 5 B            20      1        1         31  
 6 B            30      0        1         57.1
 7 B            40      0        0.8       48.6
 8 B            50      0        1         45.9
 9 C            20      0        0.9       46  
10 C            30      0.7      1         40.3
11 C            40      0.2      1         40.8
12 C            50      0.2      1         47.7
13 D            30      0.5      0.7       42.3
14 D            40      0.1      0.9       57.7
15 D            50      0.1      1         49.2
16 E            20      0.1      0.7       51.5
17 E            30      0.8      0.8       54.5
18 E            40      0.7      0.9       63.4
19 E            50      0        1         41.6

Well, that was easy. It’s not unusual to compress these steps into one convenient workflow.

result = dplyr::group_by(x, treatment, shade) |>
  dplyr::group_map(summarize_one_group) |>
  dplyr::bind_rows() |>
  print()
# A tibble: 19 × 5
   treatment shade min_dose max_dose mean_count
   <chr>     <dbl>    <dbl>    <dbl>      <dbl>
 1 A            20      0        0.8       56.7
 2 A            30      0.4      0.9       49.7
 3 A            40      0        0.3       52  
 4 A            50      0        1         49.9
 5 B            20      1        1         31  
 6 B            30      0        1         57.1
 7 B            40      0        0.8       48.6
 8 B            50      0        1         45.9
 9 C            20      0        0.9       46  
10 C            30      0.7      1         40.3
11 C            40      0.2      1         40.8
12 C            50      0.2      1         47.7
13 D            30      0.5      0.7       42.3
14 D            40      0.1      0.9       57.7
15 D            50      0.1      1         49.2
16 E            20      0.1      0.7       51.5
17 E            30      0.8      0.8       54.5
18 E            40      0.7      0.9       63.4
19 E            50      0        1         41.6
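We said we would focus on group_map(), but for the curious here is a rough sketch of one of its siblings, group_modify(), on a made-up table (toy is not the guppy data, and summarize_without_key is a hypothetical helper). With group_modify() the function returns only the new columns, and dplyr attaches the group keys and binds the pieces into one table for you. It assumes the dplyr package is installed.

```r
# A tiny made-up table with two treatment-shade groups
toy = data.frame(treatment = c("A", "A", "B", "B", "B"),
                 shade     = c(20, 20, 30, 30, 30),
                 dose      = c(0.1, 0.5, 0.9, 0.2, 0.4),
                 count     = c(10, 20, 30, 40, 50))

# Return only the summary columns; the grouping columns must NOT be
# included because group_modify() adds them back itself
summarize_without_key = function(tbl, key){
  data.frame(min_dose   = min(tbl$dose),
             max_dose   = max(tbl$dose),
             mean_count = mean(tbl$count))
}

result = dplyr::group_by(toy, treatment, shade) |>
  dplyr::group_modify(summarize_without_key) |>
  dplyr::ungroup() |>
  print()
```

The trade-off is flexibility: group_map() can return anything in its list, while group_modify() insists that every group returns a data frame.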