Many files

It can be a challenge to face many files at once. Often they are scattered across a number of directories, and it can be tempting for a novice to manually read each one in a script and then try to bind them together. Even worse, sometimes files can be less than cooperative for new coders by having complicated or non-standard layouts.

And, of course, we invariably want to handle the contents of the many files as one object, rather than as many objects, one per file.

In this tutorial we walk through some of the techniques available to simplify importing multi-file datasets into R (or Python!).

Mocked data

We have prepared a mock dataset; it consists of studies of guppies from 5 sites. At each site we generate two files: a “gup” data file and a YAML configuration file. Each is a text file, but the .gup file consists of two parts: a header followed by a CSV-style table of data.
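To make that concrete, here is a hypothetical sketch of what a .gup file might look like. The header field names and their layout are invented for illustration; the table values shown match site_01 as read in below.

researcher: JPO
time: 2020-05-16 07:26:25
shade: 30
gps_codes: c-q-q

id,treatment,count,dose
1,C,32,0.7
2,B,71,0.5
...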

Finding files

The R language provides convenient tools for locating files. If you find these don’t suit your needs, consider the fs R package, which has more features. For this project we’ll stick to base R functionality as much as possible.
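For comparison, here is what the same listing might look like with fs; its dir_ls() function takes a glob pattern and a recurse flag directly. This is just an aside (we don’t run it in this tutorial); the path mirrors the one list_guppy() uses below.

# an fs equivalent of the base R listing used in this tutorial (not run)
fs::dir_ls(here::here("data", "guppy"), glob = "*.gup", recurse = TRUE)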

For this tutorial we use the list_guppy() helper function. Here’s what it finds…

source("setup.R")
here() starts at /Users/jevanilla/Documents/Bigelow/CODE/handytandy
gup_files = list_guppy() |>
  print()
[1] "/Users/jevanilla/Documents/Bigelow/CODE/handytandy/data/guppy/site_01/site_01.gup"
[2] "/Users/jevanilla/Documents/Bigelow/CODE/handytandy/data/guppy/site_02/site_02.gup"
[3] "/Users/jevanilla/Documents/Bigelow/CODE/handytandy/data/guppy/site_03/site_03.gup"
[4] "/Users/jevanilla/Documents/Bigelow/CODE/handytandy/data/guppy/site_04/site_04.gup"
[5] "/Users/jevanilla/Documents/Bigelow/CODE/handytandy/data/guppy/site_05/site_05.gup"

A closer look at the function

Let’s take a closer look at the function. The built-in list.files() function is the workhorse here.

list_guppy = function(path = here::here("data", "guppy"),
                      pattern = glob2rx("*.gup"),
                      recursive = TRUE){
  list.files(path, pattern = pattern, recursive = recursive, full.names = TRUE)
}

The function accepts three arguments, all of which are handed to list.files() (along with a hard-coded full.names = TRUE):

  • path this is a string providing the path to the data files
  • pattern this is a regular expression. We converted a “glob” (wildcard notation) to a “regex” (regular expression) using the handy glob2rx() function, requesting the pattern “any characters followed by ‘.gup’ at the very end” (see the quick check just after this list).
  • recursive tells list.files() to search deeply into subdirectories of path.
  • full.names results in fully-formed filenames including the path.
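If you are curious, you can check the glob-to-regex conversion yourself in the console; glob2rx() ships with base R (in the utils package).

glob2rx("*.gup")
[1] "^.*\\.gup$"

In plain English: the regex anchors the match at both ends, allows any characters (.*), and requires a literal “.gup” suffix.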

Reading the files

We have already written a convenience function, read_guppy(), which reads both the configuration file and the “gup” data file, and then merges them into one table.

Basic use

We pass in the listing of 5 filenames, and the function returns a single table.

x = read_guppy(gup_files) |>
  glimpse()
Rows: 105
Columns: 11
$ site_id    <chr> "site_01", "site_01", "site_01", "site_01", "site_01", "sit…
$ x          <dbl> 452032.1, 452032.1, 452032.1, 452032.1, 452032.1, 452032.1,…
$ y          <dbl> 4857765, 4857765, 4857765, 4857765, 4857765, 4857765, 48577…
$ time       <dttm> 2020-05-16 07:26:25, 2020-05-16 07:26:25, 2020-05-16 07:26…
$ researcher <chr> "JPO", "JPO", "JPO", "JPO", "JPO", "JPO", "JPO", "JPO", "JP…
$ shade      <dbl> 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,…
$ gps_codes  <chr> "c-q-q", "c-q-q", "c-q-q", "c-q-q", "c-q-q", "c-q-q", "c-q-…
$ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ treatment  <chr> "C", "B", "D", "C", "C", "A", "B", "B", "A", "B", "B", "A",…
$ count      <dbl> 32, 71, 54, 57, 32, 55, 66, 61, 60, 33, 67, 34, 42, 60, 49,…
$ dose       <dbl> 0.7, 0.5, 0.6, 0.7, 1.0, 0.4, 0.0, 0.0, 0.6, 0.4, 0.0, 0.9,…

Let’s double check that the number of sites matches the number of files.

dplyr::count(x, site_id)
# A tibble: 5 × 2
  site_id     n
  <chr>   <int>
1 site_01    18
2 site_02    23
3 site_03    29
4 site_04    10
5 site_05    25

We can also read the same data in, but this time request a spatial sf object, where each row’s x and y coordinates are transformed into a spatial POINT.

x = read_guppy(gup_files, form = "sf") |>
  print()
Simple feature collection with 105 features and 9 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 452032.1 ymin: 4857756 xmax: 452066.9 ymax: 4857774
Projected CRS: NAD83 / UTM zone 19N
# A tibble: 105 × 10
   site_id time                researcher shade gps_codes    id treatment count
 * <chr>   <dttm>              <chr>      <dbl> <chr>     <dbl> <chr>     <dbl>
 1 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         1 C            32
 2 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         2 B            71
 3 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         3 D            54
 4 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         4 C            57
 5 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         5 C            32
 6 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         6 A            55
 7 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         7 B            66
 8 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         8 B            61
 9 site_01 2020-05-16 07:26:25 JPO           30 c-q-q         9 A            60
10 site_01 2020-05-16 07:26:25 JPO           30 c-q-q        10 B            33
# ℹ 95 more rows
# ℹ 2 more variables: dose <dbl>, geometry <POINT [m]>
Note

Whether you request a tibble (aka table or data.frame) or an sf object, you can use all of the tidyverse tools for subsequent analyses. If you do your work with spatial analyses in mind, then sf is the way to go.
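For example, everyday dplyr verbs work on the sf object just as they do on a tibble, and the geometry column tags along for the ride. A small illustration using the columns shown above (not run here):

x |>
  dplyr::filter(count > 50) |>
  dplyr::select(site_id, treatment, count)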

A closer look

But how does the function work? How does it merge the metadata file with the tabular data?

The process is the same whether you provide one filename or many. We have written the read_guppy() function with comments matching the pseudo-code you see below; in the source, each comment is flagged with a double hash ##. You can study the code alongside the comments to follow along.

for each filename
  check that the file at filename exists
  use string processing to "make" the companion metadata filename
  read the metadata
    check that the metadata file exists
    read the metadata as a YAML file
    return the metadata
  scan all of the text lines in the file
    parse the header extracting the bits of info we want
    read the table directly from the text
    mutate the table to include columns of data from the metadata and the header
  add the crs info as an attribute
  return the table

bind all of the tables by row

convert to sf-object if the user so requests

return the table/sf-object
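To make the pseudo-code concrete, here is a minimal sketch of how such a reader might be written, using the ## comment convention mentioned above. This is not the actual read_guppy() source: the helper names, the header layout (the “key: value” lines sketched earlier), and the metadata fields (site_id, x, y, crs) are all assumptions for illustration.

## a minimal sketch of the pseudo-code above -- NOT the actual read_guppy()
## source; helper names, header layout, and metadata fields are assumptions
read_one_guppy = function(filename){
  ## check that the file at filename exists
  stopifnot(file.exists(filename))
  ## use string processing to "make" the companion metadata filename
  meta_file = sub("\\.gup$", ".yaml", filename)
  ## check that the metadata file exists, then read it as a YAML file
  stopifnot(file.exists(meta_file))
  meta = yaml::read_yaml(meta_file)
  ## scan all of the text lines in the file
  txt = readLines(filename)
  ## parse the header, extracting the bits of info we want (we assume
  ## "key: value" lines separated from the table by a single blank line)
  blank = which(txt == "")[1]
  hdr = txt[seq_len(blank - 1)]
  header = as.list(setNames(trimws(sub("^[^:]+:", "", hdr)),
                            sub(":.*$", "", hdr)))
  ## read the table directly from the text
  read.csv(text = txt[seq(blank + 1, length(txt))]) |>
    dplyr::as_tibble() |>
    ## mutate the table to include columns from the metadata and the header
    dplyr::mutate(site_id = meta$site_id,
                  x = meta$x,
                  y = meta$y,
                  researcher = header$researcher,
                  .before = 1) |>
    ## add the crs info as an attribute, then return the table
    structure(crs = meta$crs)
}

read_guppy_sketch = function(filenames, form = "tibble"){
  ## read each file, then bind all of the tables by row
  xx = lapply(filenames, read_one_guppy)
  x = dplyr::bind_rows(xx)
  ## convert to sf-object if the user so requests
  if (form == "sf"){
    x = sf::st_as_sf(x, coords = c("x", "y"), crs = attr(xx[[1]], "crs"))
  }
  ## return the table/sf-object
  x
}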
Tip

Reading and binding tables is such a common task that it is well worth learning to write your own functions rather than waiting for a solution to appear online. Once you have done one or two of these, it gets easier each time, as you combine little bites into a whole sandwich.
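One shortcut worth knowing about: when your files are plain CSVs with no custom header, readr can read and row-bind them in a single call, recording the source file for each row. Here csv_files stands for a hypothetical character vector of CSV filenames; our .gup files still need the custom reader because of their two-part layout.

# hypothetical: read-and-bind a vector of plain CSV files in one call
readr::read_csv(csv_files, id = "source_file")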