Observations

1 Obtaining observational data

Follow this wiki page on obtaining data from OBIS. Keep in mind that you will probably want a species with sufficient number of records in the northwest Atlantic. Just what constitutes “sufficient” is probably subject to some debate, but a couple of hundred as a minumum will be helpful for learning. One thing that might help is to be on alert species that are only congregate in one area such as right along the shoreline or only appear in a few months of the year. It isn’t that those species are not worthy of study, but they may make the learning process harder.

You should feel free to get the data for a couple of different species, if one becomes a headache with our given resources, then you can switch easily to another.

2 What we need

We need a dataset that covers the same area and time period that the Brickman data covers. We need to have some confidence that the observations are of living creatures in their natural habitat. OBIS serves a curated data set, but that doesn’t mean it doesn’t have errors and it certainly doesn’t mean it is properly vetted for our purposes. What follows is a tour through your data with a series of pauses to look at different variables in your data set. At some of these pauses, we may decide to drop some records which will make the data set shrink in size.

3 Tour of your data

It is SO IMPORTANT to have a really good handle on your data. To get that handle you have to explore it. There is a branch of data science devoted to data exploration called Exploratory Data Analysis. We’ll explore your data here, but we assume that you have reviewed and tried your hand with the examples in the wiki for tabular data, observations, the coastlines and the Brickman data. Even if you have walked through these tutorials you may find yourself stumped and stymied. That’s all part of the learning process - just keep moving, inquiring and trying.

4 Setup

As always, we start by running our setup function. Start RStudio/R, and relaod your project with the menu File > Recent Projects. Then source setup.R. We’ll also assign a new variable with our species name - we do that so it’s easy to substitute in another species if needed. We make it ALL CAPS so that it reminds us that it is more like a constant than a variable.

source("setup.R")
SPECIES = "Mola mola"

5 Observations

Next is to read in the observations you have already downloaded for that species.

obs = read_obis(SPECIES)
obs
Simple feature collection with 9526 features and 7 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
Geodetic CRS:  WGS 84
# A tibble: 9,526 × 8
   id             basisOfRecord eventDate   year month eventTime individualCount
 * <chr>          <chr>         <date>     <dbl> <chr> <chr>               <dbl>
 1 00040fa1-7acd… HumanObserva… 2016-08-07  2016 Aug   16:15:00                1
 2 000b42de-fbb4… HumanObserva… 2008-07-11  2008 Jul   14:50:38Z               1
 3 0011e88f-7e7e… HumanObserva… 1980-05-23  1980 May   12:39:00                1
 4 0025033b-f57a… HumanObserva… 1979-09-16  1979 Sep   14:13:00                1
 5 00390e97-9607… HumanObserva… 2017-06-15  2017 Jun   10:07:13                1
 6 003abd48-a98a… PreservedSpe… 2017-11-15  2017 Nov   <NA>                   NA
 7 00406908-529c… HumanObserva… 2015-06-26  2015 Jun   16:46:47Z               1
 8 004dbc2b-2236… HumanObserva… 2015-06-27  2015 Jun   16:12:33Z               1
 9 0050a981-803c… HumanObserva… 2006-09-17  2006 Sep   15:19:16Z               1
10 0053298d-2d6a… HumanObserva… 2005-07-21  2005 Jul   <NA>                    1
# ℹ 9,516 more rows
# ℹ 1 more variable: geom <POINT [°]>

The print out of the table only shows the first 10 rows (so your screen doesn’t get filled up), and it tells you how many records you have. A simple way to keep track of the number of records is to use the dim() functions which returns the number of rows and number of columns. I’m going to save the outout so we can compare after all of the filtering.

dim_start = dim(obs)
dim_start
[1] 9526    8

5.1 basisOfRecord

Next we should examine the basisOfRecord variable to get an understanding of how these observations were made.

obs |> count(basisOfRecord)
Simple feature collection with 4 features and 2 fields
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
Geodetic CRS:  WGS 84
# A tibble: 4 × 3
  basisOfRecord              n                                              geom
* <chr>                  <int>                                    <GEOMETRY [°]>
1 HumanObservation        9354 MULTIPOINT ((-65.07 42.68), (-65.067 42.65), (-6…
2 NomenclaturalChecklist     1                        POINT (-65.80602 44.97985)
3 Occurrence                 1                          POINT (-65.2852 42.6243)
4 PreservedSpecimen        170 MULTIPOINT ((-67.05534 45.09908), (-66.35 45.133…

If you are using a different species you may have different values for basisOfRecord. Let’s take a closer look at the complete records for one from each group.

human = obs |>
  filter(basisOfRecord == "HumanObservation") |>
  slice(1) |>
  browse_obis()
Please point your browser to the following url: 
https://api.obis.org/v3/occurrence/00040fa1-7acd-4731-bf1e-6dc16e30c7d4
preserved = obs |>
  filter(basisOfRecord == "PreservedSpecimen") |>
  slice(1) |>
  browse_obis()
Please point your browser to the following url: 
https://api.obis.org/v3/occurrence/003abd48-a98a-4c2f-adc2-8f1d6f71dfa1
checklist = obs |>
  filter(basisOfRecord == "NomenclaturalChecklist") |>
  slice(1) |>
  browse_obis()
Please point your browser to the following url: 
https://api.obis.org/v3/occurrence/1b967631-4d90-44d0-b57e-cf71c554ee5c
occurrence = obs |>
  filter(basisOfRecord == "Occurrence") |>
  slice(1) |>
  browse_obis()
Please point your browser to the following url: 
https://api.obis.org/v3/occurrence/d6e7882e-a850-435d-a546-73adaf625031

Next let’s think about what our minimum requirements might be in oirder to build a model. To answer that we need to think about our environmental covariates in the Brickman data](https://github.com/BigelowLab/ColbyForecasting2025/wiki/Brickman). That data has dimensions of x (longitude), y (latitude) and month. In order to match obseravtions with that data, our observations must be complete in those three variables. Let’s take a look at a summary of the observations which will indicate the number of elements missing in each variable.

summary(obs)
      id            basisOfRecord        eventDate               year     
 Length:9526        Length:9526        Min.   :1932-09-15   Min.   :1932  
 Class :character   Class :character   1st Qu.:2003-10-02   1st Qu.:2003  
 Mode  :character   Mode  :character   Median :2009-07-11   Median :2009  
                                       Mean   :2006-10-02   Mean   :2006  
                                       3rd Qu.:2016-11-05   3rd Qu.:2016  
                                       Max.   :2021-10-14   Max.   :2021  
                                       NA's   :7            NA's   :7     
    month            eventTime         individualCount             geom     
 Length:9526        Length:9526        Min.   : 1.000   POINT        :9526  
 Class :character   Class :character   1st Qu.: 1.000   epsg:4326    :   0  
 Mode  :character   Mode  :character   Median : 1.000   +proj=long...:   0  
                                       Mean   : 1.112                       
                                       3rd Qu.: 1.000                       
                                       Max.   :25.000                       
                                       NA's   :318                          

5.2 eventDate

For Mola mola there are some rows where eventDate is NA. We need to filter those. The filter function looks for a vector of TRUE/FALSE values - one for each row. In our case, we test the eventDate column to see if it is NA, but then we reverse the TRUE/FALSE logical with the preceding ! (pronounded “bang!”). This we retain only the rows where eventDate is notNA`, and then we print the summary again.

obs = obs |>
  filter(!is.na(eventDate))
summary(obs)
      id            basisOfRecord        eventDate               year     
 Length:9519        Length:9519        Min.   :1932-09-15   Min.   :1932  
 Class :character   Class :character   1st Qu.:2003-10-02   1st Qu.:2003  
 Mode  :character   Mode  :character   Median :2009-07-11   Median :2009  
                                       Mean   :2006-10-02   Mean   :2006  
                                       3rd Qu.:2016-11-05   3rd Qu.:2016  
                                       Max.   :2021-10-14   Max.   :2021  
                                                                          
    month            eventTime         individualCount             geom     
 Length:9519        Length:9519        Min.   : 1.000   POINT        :9519  
 Class :character   Class :character   1st Qu.: 1.000   epsg:4326    :   0  
 Mode  :character   Mode  :character   Median : 1.000   +proj=long...:   0  
                                       Mean   : 1.112                       
                                       3rd Qu.: 1.000                       
                                       Max.   :25.000                       
                                       NA's   :315                          

5.3 individualCount

That’s better, but we still have 315 NA values for individualCount. Let’s look at at least one record of those in detail; filter out one, and browse it.

obs |>
  filter(is.na(individualCount)) |>
  slice(1) |>
  browse_obis()
Please point your browser to the following url: 
https://api.obis.org/v3/occurrence/003abd48-a98a-4c2f-adc2-8f1d6f71dfa1

Eeek! It’s a carcas that washed up on shore! We checked a number of others, and they are all carcases. Is that a presence? Is that what we model are modeling? If not then we should filer those out.

obs = obs |>
  filter(!is.na(individualCount))
summary(obs)
      id            basisOfRecord        eventDate               year     
 Length:9204        Length:9204        Min.   :1932-09-15   Min.   :1932  
 Class :character   Class :character   1st Qu.:2003-07-26   1st Qu.:2003  
 Mode  :character   Mode  :character   Median :2009-07-11   Median :2009  
                                       Mean   :2006-08-17   Mean   :2006  
                                       3rd Qu.:2016-11-05   3rd Qu.:2016  
                                       Max.   :2021-10-14   Max.   :2021  
    month            eventTime         individualCount             geom     
 Length:9204        Length:9204        Min.   : 1.000   POINT        :9204  
 Class :character   Class :character   1st Qu.: 1.000   epsg:4326    :   0  
 Mode  :character   Mode  :character   Median : 1.000   +proj=long...:   0  
                                       Mean   : 1.112                       
                                       3rd Qu.: 1.000                       
                                       Max.   :25.000                       

Well now one has to wonder about a single observation of 25 animals. Let’s check that out.

obs |>
  filter(individualCount == 25) |>
  browse_obis()
Please point your browser to the following url: 
https://api.obis.org/v3/occurrence/c907349a-2c52-4a51-a69a-5a338c5d492a

OK, that seems legitmate. And it is possible, Mola mola can congregate for feeding, mating and possibly for karaoke parties.

5.4 year

We know that the “current” climate scenario for the Brickman model data define “current” as the 1982-2013 window. It’s just an average, and if you have values from 1970 to the current year, you probably are safe in including them. But do your observations fall into those years? Let’s make a plot of the counts per year, with dashed lines shown the Brickman “current” cliamtology period.

ggplot(data = obs,
       mapping = aes(x = year)) + 
  geom_bar() + 
  geom_vline(xintercept = c(1982, 2013), linetype = "dashed") + 
  labs(title = "Counts per year")

For this species, it seem like it is only the record from 1932 that might be a stretch, so let’s filter that out by rejecting records before 1970. This time, instead of asking for a sumamry, we’ll print the dimensions (rows, columns) of the table.

obs = obs |>
  filter(year >= 1970)
dim(obs)
[1] 9203    8

That’s still a lot of records. Now let’s check out the distribution across the months of the year.

5.5 month

We will be making models and predictions for each month of the for the 4 future projection climates. Species and observers do show some seasonality, but it that seasonality so extreme that it might be impossible to model some months because of sparse data? Let’s make a plot of the counts per month.

ggplot(data = obs,
       mapping = aes(x = month)) + 
  geom_bar() + 
  labs(title = "Counts per month")

Oh, rats! By default ggplot plots in alpha-numeric order, which scrambles our month order. To fix that we have to convert the month in a factor type while specifying the order of the factors, and we’ll use the mutate() function to help us.

obs = obs |>
  mutate(month = factor(month, levels = month.abb))

ggplot(data = obs,
       mapping = aes(x = month)) + 
  geom_bar() + 
  labs(title = "Counts per month")

That’s better! So, it may be the for Mola mola we might not be able to successfully model in the cold winter months. That’s good to keep in mind.

5.6 geometry

Last, but certainly not least, we should consider the possibility that some observations might be on shore. It happens! We already know that some records included fish that were washed up on shore. It’s possible someone mis-keyed the longitude or latitude when entering the vaklues into the database. It’s alos possible that some observations fall just outside the areas where the Brickman data has values. To look for these points, we’ll load the Brickman mask (defines land vs water. Well, really it defines data vs no-data), and use that for further filtering.

We need to load the Brickman database, and then filter it for the static variable called “mask”.

db = brickman_database() |>
  filter(scenario == "STATIC", var == "mask")
mask = read_brickman(db)
mask
stars object with 2 dimensions and 1 attribute
attribute(s):
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
mask     1       1      1    1       1    1 4983
dimension(s):
  from  to offset    delta refsys point x/y
x    1 121 -74.93  0.08226 WGS 84 FALSE [x]
y    1  89  46.08 -0.08226 WGS 84 FALSE [y]

Let’s see what our mask looks like with the observations drizzled on top. Because the mask only has values of 1 (data) or NA (no-data). You’ll note that we only want to plot the locations of the observations, so we strip obs of everyhting except its geometery.

plot(mask, breaks = "equal", axes = TRUE, reset = FALSE)
plot(st_geometry(obs), pch = ".", add = TRUE)

Maybe with proper with squinting we can see some that faal into no-data areas. The sure-fire way to tell is to extract the mask values at the point locations.

hitOrMiss = extract_brickman(mask, obs)
hitOrMiss
# A tibble: 9,203 × 3
   point name  value
   <chr> <chr> <dbl>
 1 p0001 mask      1
 2 p0002 mask      1
 3 p0003 mask      1
 4 p0004 mask      1
 5 p0005 mask      1
 6 p0006 mask      1
 7 p0007 mask      1
 8 p0008 mask      1
 9 p0009 mask      1
10 p0010 mask      1
# ℹ 9,193 more rows

OK, let’s tally the “value” variable.

count(hitOrMiss, value)
# A tibble: 2 × 2
  value     n
  <dbl> <int>
1     1  9170
2    NA    33

Ooooo, 33 records in obs don’t line up with values in the mask (or in any Brickman data). We should filter those out; we’ll do so with a filter(). Note that we a “reaching” into the hitOrMiss table to access the value column when we use this hitOrMiss$value. Let’s figure out how many records we have dropped with all of this filtering.

obs = obs |>
  filter(!is.na(hitOrMiss$value))
dim_end = dim(obs)

dropped_records = dim_start[1] - dim_end[1]
dropped_records
[1] 356

So, we dropped 356 records which is about 3.7% of the raw OBIS data. Is it worth all that to drop just 4% of the data? Yes! Models are like all things computer… if you put garbage in you should expect to get garbage back out.

6 Recap

We have explored a data set, in particular for Mola mola; your species may present you with unique challenges. Our goal is to winnow the original data set down to just the most reliable observations for modeling.

Coding Assignment

We went through many steps to filter out records that won’t help use model. We’ll need that filtered data many-many-many times in the days ahead. Wouldn’t it be nice if we could sweep all of those filtering steps into a single function, call it read_observations(), that simple took care of it all for us? Yes - that would be really nice!

Open the “read_observations.R” file you’ll find in the “functions” directory. We have started it for you. Edit the function so that it appropriately filters your species data set by adding optional arguments (like minimum_year has been added). And then adding the code steps needed to implement that filter.

Not every filter needs user input. For instance, eventDate can’t be NA, and all points must fall within the area covered by the Brickman data. So you can automatically add those filters without any user options.

On the other hand, filtering by basisOfRecord or individualCount might need more flexibility, especially if you might switch to other species.

SPeaking of which, we provided scientificname with a default value - we chose “Mola mola” because we are a bit lazy. If you are feeling lazy, you can change the default to your own species.

As you build your function, pause every so often and run the following to test things out.

source("setup.R")
obs = read_observations()
Back to top