source("setup.R")
= "Mola mola" SPECIES
Observations
1 Obtaining observational data
Follow this wiki page on obtaining data from OBIS. Keep in mind that you will probably want a species with sufficient number of records in the northwest Atlantic. Just what constitutes “sufficient” is probably subject to some debate, but a couple of hundred as a minumum will be helpful for learning. One thing that might help is to be on alert species that are only congregate in one area such as right along the shoreline or only appear in a few months of the year. It isn’t that those species are not worthy of study, but they may make the learning process harder.
You should feel free to get the data for a couple of different species, if one becomes a headache with our given resources, then you can switch easily to another.
2 What we need
We need a dataset that covers the same area and time period that the Brickman data covers. We need to have some confidence that the observations are of living creatures in their natural habitat. OBIS serves a curated data set, but that doesn’t mean it doesn’t have errors and it certainly doesn’t mean it is properly vetted for our purposes. What follows is a tour through your data with a series of pauses to look at different variables in your data set. At some of these pauses, we may decide to drop some records which will make the data set shrink in size.
3 Tour of your data
It is SO IMPORTANT to have a really good handle on your data. To get that handle you have to explore it. There is a branch of data science devoted to data exploration called Exploratory Data Analysis. We’ll explore your data here, but we assume that you have reviewed and tried your hand with the examples in the wiki for tabular data, observations, the coastlines and the Brickman data. Even if you have walked through these tutorials you may find yourself stumped and stymied. That’s all part of the learning process - just keep moving, inquiring and trying.
4 Setup
As always, we start by running our setup function. Start RStudio/R, and relaod your project with the menu File > Recent Projects
. Then source setup.R
. We’ll also assign a new variable with our species name - we do that so it’s easy to substitute in another species if needed. We make it ALL CAPS so that it reminds us that it is more like a constant than a variable.
5 Observations
Next is to read in the observations you have already downloaded for that species.
= read_obis(SPECIES)
obs obs
Simple feature collection with 9526 features and 7 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
Geodetic CRS: WGS 84
# A tibble: 9,526 × 8
id basisOfRecord eventDate year month eventTime individualCount
* <chr> <chr> <date> <dbl> <chr> <chr> <dbl>
1 00040fa1-7acd… HumanObserva… 2016-08-07 2016 Aug 16:15:00 1
2 000b42de-fbb4… HumanObserva… 2008-07-11 2008 Jul 14:50:38Z 1
3 0011e88f-7e7e… HumanObserva… 1980-05-23 1980 May 12:39:00 1
4 0025033b-f57a… HumanObserva… 1979-09-16 1979 Sep 14:13:00 1
5 00390e97-9607… HumanObserva… 2017-06-15 2017 Jun 10:07:13 1
6 003abd48-a98a… PreservedSpe… 2017-11-15 2017 Nov <NA> NA
7 00406908-529c… HumanObserva… 2015-06-26 2015 Jun 16:46:47Z 1
8 004dbc2b-2236… HumanObserva… 2015-06-27 2015 Jun 16:12:33Z 1
9 0050a981-803c… HumanObserva… 2006-09-17 2006 Sep 15:19:16Z 1
10 0053298d-2d6a… HumanObserva… 2005-07-21 2005 Jul <NA> 1
# ℹ 9,516 more rows
# ℹ 1 more variable: geom <POINT [°]>
The print out of the table only shows the first 10 rows (so your screen doesn’t get filled up), and it tells you how many records you have. A simple way to keep track of the number of records is to use the dim()
functions which returns the number of rows and number of columns. I’m going to save the outout so we can compare after all of the filtering.
= dim(obs)
dim_start dim_start
[1] 9526 8
5.1 basisOfRecord
Next we should examine the basisOfRecord
variable to get an understanding of how these observations were made.
|> count(basisOfRecord) obs
Simple feature collection with 4 features and 2 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
Geodetic CRS: WGS 84
# A tibble: 4 × 3
basisOfRecord n geom
* <chr> <int> <GEOMETRY [°]>
1 HumanObservation 9354 MULTIPOINT ((-65.07 42.68), (-65.067 42.65), (-6…
2 NomenclaturalChecklist 1 POINT (-65.80602 44.97985)
3 Occurrence 1 POINT (-65.2852 42.6243)
4 PreservedSpecimen 170 MULTIPOINT ((-67.05534 45.09908), (-66.35 45.133…
If you are using a different species you may have different values for basisOfRecord
. Let’s take a closer look at the complete records for one from each group.
= obs |>
human filter(basisOfRecord == "HumanObservation") |>
slice(1) |>
browse_obis()
Please point your browser to the following url:
https://api.obis.org/v3/occurrence/00040fa1-7acd-4731-bf1e-6dc16e30c7d4
= obs |>
preserved filter(basisOfRecord == "PreservedSpecimen") |>
slice(1) |>
browse_obis()
Please point your browser to the following url:
https://api.obis.org/v3/occurrence/003abd48-a98a-4c2f-adc2-8f1d6f71dfa1
= obs |>
checklist filter(basisOfRecord == "NomenclaturalChecklist") |>
slice(1) |>
browse_obis()
Please point your browser to the following url:
https://api.obis.org/v3/occurrence/1b967631-4d90-44d0-b57e-cf71c554ee5c
= obs |>
occurrence filter(basisOfRecord == "Occurrence") |>
slice(1) |>
browse_obis()
Please point your browser to the following url:
https://api.obis.org/v3/occurrence/d6e7882e-a850-435d-a546-73adaf625031
Next let’s think about what our minimum requirements might be in oirder to build a model. To answer that we need to think about our environmental covariates in the Brickman data](https://github.com/BigelowLab/ColbyForecasting2025/wiki/Brickman). That data has dimensions of x (longitude), y (latitude) and month. In order to match obseravtions with that data, our observations must be complete in those three variables. Let’s take a look at a summary of the observations which will indicate the number of elements missing in each variable.
summary(obs)
id basisOfRecord eventDate year
Length:9526 Length:9526 Min. :1932-09-15 Min. :1932
Class :character Class :character 1st Qu.:2003-10-02 1st Qu.:2003
Mode :character Mode :character Median :2009-07-11 Median :2009
Mean :2006-10-02 Mean :2006
3rd Qu.:2016-11-05 3rd Qu.:2016
Max. :2021-10-14 Max. :2021
NA's :7 NA's :7
month eventTime individualCount geom
Length:9526 Length:9526 Min. : 1.000 POINT :9526
Class :character Class :character 1st Qu.: 1.000 epsg:4326 : 0
Mode :character Mode :character Median : 1.000 +proj=long...: 0
Mean : 1.112
3rd Qu.: 1.000
Max. :25.000
NA's :318
5.2 eventDate
For Mola mola there are some rows where eventDate
is NA
. We need to filter those. The filter function looks for a vector of TRUE/FALSE values - one for each row. In our case, we test the eventDate
column to see if it is NA
, but then we reverse the TRUE/FALSE logical with the preceding !
(pronounded “bang!”). This we retain only the rows where eventDate is not
NA`, and then we print the summary again.
= obs |>
obs filter(!is.na(eventDate))
summary(obs)
id basisOfRecord eventDate year
Length:9519 Length:9519 Min. :1932-09-15 Min. :1932
Class :character Class :character 1st Qu.:2003-10-02 1st Qu.:2003
Mode :character Mode :character Median :2009-07-11 Median :2009
Mean :2006-10-02 Mean :2006
3rd Qu.:2016-11-05 3rd Qu.:2016
Max. :2021-10-14 Max. :2021
month eventTime individualCount geom
Length:9519 Length:9519 Min. : 1.000 POINT :9519
Class :character Class :character 1st Qu.: 1.000 epsg:4326 : 0
Mode :character Mode :character Median : 1.000 +proj=long...: 0
Mean : 1.112
3rd Qu.: 1.000
Max. :25.000
NA's :315
5.3 individualCount
That’s better, but we still have 315 NA
values for individualCount
. Let’s look at at least one record of those in detail; filter out one, and browse it.
|>
obs filter(is.na(individualCount)) |>
slice(1) |>
browse_obis()
Please point your browser to the following url:
https://api.obis.org/v3/occurrence/003abd48-a98a-4c2f-adc2-8f1d6f71dfa1
Eeek! It’s a carcas that washed up on shore! We checked a number of others, and they are all carcases. Is that a presence? Is that what we model are modeling? If not then we should filer those out.
= obs |>
obs filter(!is.na(individualCount))
summary(obs)
id basisOfRecord eventDate year
Length:9204 Length:9204 Min. :1932-09-15 Min. :1932
Class :character Class :character 1st Qu.:2003-07-26 1st Qu.:2003
Mode :character Mode :character Median :2009-07-11 Median :2009
Mean :2006-08-17 Mean :2006
3rd Qu.:2016-11-05 3rd Qu.:2016
Max. :2021-10-14 Max. :2021
month eventTime individualCount geom
Length:9204 Length:9204 Min. : 1.000 POINT :9204
Class :character Class :character 1st Qu.: 1.000 epsg:4326 : 0
Mode :character Mode :character Median : 1.000 +proj=long...: 0
Mean : 1.112
3rd Qu.: 1.000
Max. :25.000
Well now one has to wonder about a single observation of 25 animals. Let’s check that out.
|>
obs filter(individualCount == 25) |>
browse_obis()
Please point your browser to the following url:
https://api.obis.org/v3/occurrence/c907349a-2c52-4a51-a69a-5a338c5d492a
OK, that seems legitmate. And it is possible, Mola mola can congregate for feeding, mating and possibly for karaoke parties.
5.4 year
We know that the “current” climate scenario for the Brickman model data define “current” as the 1982-2013 window. It’s just an average, and if you have values from 1970 to the current year, you probably are safe in including them. But do your observations fall into those years? Let’s make a plot of the counts per year, with dashed lines shown the Brickman “current” cliamtology period.
ggplot(data = obs,
mapping = aes(x = year)) +
geom_bar() +
geom_vline(xintercept = c(1982, 2013), linetype = "dashed") +
labs(title = "Counts per year")
For this species, it seem like it is only the record from 1932 that might be a stretch, so let’s filter that out by rejecting records before 1970. This time, instead of asking for a sumamry, we’ll print the dimensions (rows, columns) of the table.
= obs |>
obs filter(year >= 1970)
dim(obs)
[1] 9203 8
That’s still a lot of records. Now let’s check out the distribution across the months of the year.
5.5 month
We will be making models and predictions for each month of the for the 4 future projection climates. Species and observers do show some seasonality, but it that seasonality so extreme that it might be impossible to model some months because of sparse data? Let’s make a plot of the counts per month.
ggplot(data = obs,
mapping = aes(x = month)) +
geom_bar() +
labs(title = "Counts per month")
Oh, rats! By default ggplot
plots in alpha-numeric order, which scrambles our month order. To fix that we have to convert the month
in a factor type while specifying the order of the factors, and we’ll use the mutate()
function to help us.
= obs |>
obs mutate(month = factor(month, levels = month.abb))
ggplot(data = obs,
mapping = aes(x = month)) +
geom_bar() +
labs(title = "Counts per month")
That’s better! So, it may be the for Mola mola we might not be able to successfully model in the cold winter months. That’s good to keep in mind.
5.6 geometry
Last, but certainly not least, we should consider the possibility that some observations might be on shore. It happens! We already know that some records included fish that were washed up on shore. It’s possible someone mis-keyed the longitude or latitude when entering the vaklues into the database. It’s alos possible that some observations fall just outside the areas where the Brickman data has values. To look for these points, we’ll load the Brickman mask (defines land vs water. Well, really it defines data vs no-data), and use that for further filtering.
We need to load the Brickman database, and then filter it for the static variable called “mask”.
= brickman_database() |>
db filter(scenario == "STATIC", var == "mask")
= read_brickman(db)
mask mask
stars object with 2 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
mask 1 1 1 1 1 1 4983
dimension(s):
from to offset delta refsys point x/y
x 1 121 -74.93 0.08226 WGS 84 FALSE [x]
y 1 89 46.08 -0.08226 WGS 84 FALSE [y]
Let’s see what our mask looks like with the observations drizzled on top. Because the mask only has values of 1 (data) or NA
(no-data). You’ll note that we only want to plot the locations of the observations, so we strip obs
of everyhting except its geometery.
plot(mask, breaks = "equal", axes = TRUE, reset = FALSE)
plot(st_geometry(obs), pch = ".", add = TRUE)
Maybe with proper with squinting we can see some that faal into no-data areas. The sure-fire way to tell is to extract the mask values at the point locations.
= extract_brickman(mask, obs)
hitOrMiss hitOrMiss
# A tibble: 9,203 × 3
point name value
<chr> <chr> <dbl>
1 p0001 mask 1
2 p0002 mask 1
3 p0003 mask 1
4 p0004 mask 1
5 p0005 mask 1
6 p0006 mask 1
7 p0007 mask 1
8 p0008 mask 1
9 p0009 mask 1
10 p0010 mask 1
# ℹ 9,193 more rows
OK, let’s tally the “value” variable.
count(hitOrMiss, value)
# A tibble: 2 × 2
value n
<dbl> <int>
1 1 9170
2 NA 33
Ooooo, 33 records in obs
don’t line up with values in the mask (or in any Brickman data). We should filter those out; we’ll do so with a filter()
. Note that we a “reaching” into the hitOrMiss
table to access the value
column when we use this hitOrMiss$value
. Let’s figure out how many records we have dropped with all of this filtering.
= obs |>
obs filter(!is.na(hitOrMiss$value))
= dim(obs)
dim_end
= dim_start[1] - dim_end[1]
dropped_records dropped_records
[1] 356
So, we dropped 356 records which is about 3.7% of the raw OBIS data. Is it worth all that to drop just 4% of the data? Yes! Models are like all things computer… if you put garbage in you should expect to get garbage back out.
6 Recap
We have explored a data set, in particular for Mola mola; your species may present you with unique challenges. Our goal is to winnow the original data set down to just the most reliable observations for modeling.
We went through many steps to filter out records that won’t help use model. We’ll need that filtered data many-many-many times in the days ahead. Wouldn’t it be nice if we could sweep all of those filtering steps into a single function, call it read_observations()
, that simple took care of it all for us? Yes - that would be really nice!
Open the “read_observations.R” file you’ll find in the “functions” directory. We have started it for you. Edit the function so that it appropriately filters your species data set by adding optional arguments (like minimum_year
has been added). And then adding the code steps needed to implement that filter.
Not every filter needs user input. For instance, eventDate
can’t be NA
, and all points must fall within the area covered by the Brickman data. So you can automatically add those filters without any user options.
On the other hand, filtering by basisOfRecord
or individualCount
might need more flexibility, especially if you might switch to other species.
SPeaking of which, we provided scientificname
with a default value - we chose “Mola mola” because we are a bit lazy. If you are feeling lazy, you can change the default to your own species.
As you build your function, pause every so often and run the following to test things out.
source("setup.R")
obs = read_observations()