Tricky files

Some files are easy to read like CSV or Excel or TEXT files. They provide clean tabular data, often with meaningful column names, and we can generally rely upon existing software to read them for us.

But not all files. Sometimes you will encounter files that look easy to read but have a funny hitch like a header section before the table, or a header section followed by binary table (we are looking at you, flow cytometrists!)

Here we have a file that is simply a small header section followed by a plain old CSV table with 18 records. Here are the first few lines followed by the last 2.

## Guppy studies
SITE: site_01
COORDS (x y crs): 452032.1 4857765.1 EPSG:26919
TIME: 2020-05-16T07:26:25 UTC
## data
id,treatment,count,dose
1,C,32,0.7
2,B,71,0.5
3,D,54,0.6
4,C,57,0.7
    .
    .
    .
17,D,37,0.5
18,E,60,0.8

Now we have table readers galore to chose among including read.csv and reader::read_csv. Each includes a skip argument to skip over the header which we can use to read the table alone.

Read the table

source("setup.R")

here() starts at /Users/ben/Library/CloudStorage/Dropbox/code/projects/handytandy

x = readr::read_csv("data/guppy/site_01/site_01.gup",
                    show_col_types = FALSE,
                    skip = 5,
                    col_names = TRUE) |>
  dplyr::glimpse()

Rows: 18
Columns: 4
$ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
$ treatment <chr> "C", "B", "D", "C", "C", "A", "B", "B", "A", "B", "B", "A", …
$ count     <dbl> 32, 71, 54, 57, 32, 55, 66, 61, 60, 33, 67, 34, 42, 60, 49, …
$ dose      <dbl> 0.7, 0.5, 0.6, 0.7, 1.0, 0.4, 0.0, 0.0, 0.6, 0.4, 0.0, 0.9, …

So that’s pretty easy, but what happens if it is important to attach some of the header info to the table we read? How do we read that and attach it as a attribute to the table?

Read the header

We could follow a different path by reading the first 4 lines as a character (string) vector and parsing what we need for the text. Then we could use the above to read the remainder of the file as a table.

ss = readLines("data/guppy/site_01/site_01.gup", n = 4) |>
  stringr::str_split(stringr::fixed(": "), n = 2)
ss = ss[-1]    # we don't need the first line
ss

[[1]]
[1] "SITE"    "site_01"

[[2]]
[1] "COORDS (x y crs)"              "452032.1 4857765.1 EPSG:26919"

[[3]]
[1] "TIME"                    "2020-05-16T07:26:25 UTC"

str_split() always produces a list with one element for each line of text.

Parse the header

Now we have a list with one element per line of text, where each element has 2 elements itself: the string before “:” and the string after “:” . Extracting the site is easy,

site = ss[[1]][2]  # the second element of the first element
site

[1] "site_01"

and the time should be a straight forward conversion to a date-time class.

time = ss[[3]][2] |>  # the second element of the third element
  as.POSIXct(format = "%Y-%m-%dT%H:%M:%S UTC", tz = "UTC")
time

[1] "2020-05-16 07:26:25 UTC"

But what of the second line? It has three bits of info: two numerics (x and y) and one string (crs). Well, first let’s extract them.

                         # the second element of the second element
xyc = stringr::str_split(ss[[2]][2], stringr::fixed(" "), n = 3)
xyc

[[1]]
[1] "452032.1"   "4857765.1"  "EPSG:26919"

Keep in mind the str_split() returns a list with one element per line of input. So this should be easy to make a list we can subsequently attach to the table as an attribute.

xcoord = xyc[[1]][1] |> as.numeric()  # first element of first element
ycoord = xyc[[1]][2] |> as.numeric()  # second element of first element
crs = xyc[[1]][3]                     # third element of first element

Add as an attribute

You can have anything as an attribute to an object using the attr() function. Attributes are small bits of metadata associated with the object to which it is attached. Whether it is a good idea add an attribute is often worth debating especially if the attribute is large or you need ready access to it. But in the case it is a small bit of info and may not be useful in an immediate sense. Below we attach a “metadata” attribute to our table, x.

my_atts = list(site_id = site,
               time = time,
               x = xcoord,
               y = ycoord,
               crs = crs)

attr(x, "metadata") = my_atts

You can get all of the attributes (prepare yourself - it can be a lot!) using attributes(). Keep in mind that other steps may have added attributes to your table, too.

attributes(x)

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

$names
[1] "id"        "treatment" "count"     "dose"     

$spec
cols(
  id = col_double(),
  treatment = col_character(),
  count = col_double(),
  dose = col_double()
)

$problems
<pointer: 0x600000280e50>

$class
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

$metadata
$metadata$site_id
[1] "site_01"

$metadata$time
[1] "2020-05-16 07:26:25 UTC"

$metadata$x
[1] 452032.1

$metadata$y
[1] 4857765

$metadata$crs
[1] "EPSG:26919"

But, there at the end, you can see your attributes in ‘metadata’. We can get those back if needed.

my_atts = attr(x, "metadata")
my_atts

$site_id
[1] "site_01"

$time
[1] "2020-05-16 07:26:25 UTC"

$x
[1] 452032.1

$y
[1] 4857765

$crs
[1] "EPSG:26919"

In another context

In the E pluribus unum tutorial we will read these files in a slightly different way, but we use the same approach - read the header as a block of text and then read the remainder as a table. What we do with the header info in that other context is different than here. In that other context we only add the crs as an attribute. site_id, time, x and y are not added as attributes, but instead are added as new variables (columns) in the table.

x = x |>
  dplyr::mutate(
    site_id = site,
    time = time,
    x = xcoord,
    y = ycoord, 
    .before = 1) |>
  dplyr::glimpse()

Rows: 18
Columns: 8
$ site_id   <chr> "site_01", "site_01", "site_01", "site_01", "site_01", "site…
$ time      <dttm> 2020-05-16 07:26:25, 2020-05-16 07:26:25, 2020-05-16 07:26:…
$ x         <dbl> 452032.1, 452032.1, 452032.1, 452032.1, 452032.1, 452032.1, …
$ y         <dbl> 4857765, 4857765, 4857765, 4857765, 4857765, 4857765, 485776…
$ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
$ treatment <chr> "C", "B", "D", "C", "C", "A", "B", "B", "A", "B", "B", "A", …
$ count     <dbl> 32, 71, 54, 57, 32, 55, 66, 61, 60, 33, 67, 34, 42, 60, 49, …
$ dose      <dbl> 0.7, 0.5, 0.6, 0.7, 1.0, 0.4, 0.0, 0.0, 0.6, 0.4, 0.0, 0.9, …

It’s just a different approach and may seem, at this time, like a silly waste of data space since every row has the same values for site_id, time, x and y. But in that other context it is a gold mine.