source("setup.R")
Modeling
- There are no passengers on spaceship earth. We are all crew.
- Marshall McLuhan
Modeling starts with a collection of observations (presence and background for us!) and ends up with a collection of coeefficients that can be used with one or more formulas to make a predicition for the past, the present or the future. We are using modeling specifically to make habitat suitability maps for select species under two climate scenarios (RCP45 and RCP85) at two different times (2055 and 2075) in the future.
Models can be used in isolation, or they can be used in combinations, called ensembles. When working with ensembles, we are chosing to create two or more models. When we use that ensemble to make a prediction, each model produces its own prediction and then we combine them (usually something like the average.)
1 Setup
As always, we start by running our setup function. Start RStudio/R, and relaod your project with the menu File > Recent Projects
.
2 Load data - choose a month and sampling approach
Let’s load what we need to build a model for August using the greedy sampling technique. We’ll also need the model configuration (which is “g008”). And we’ll need the covariate data.
= read_model_input(scientificname = "Mola mola",
model_input approach = "greedy",
mon = "Aug")
= read_configuration(version = "g_Aug")
cfg = brickman_database()
db = read_brickman(db |> filter(scenario == "STATIC", var == "depth"))
depth = read_brickman(db |> filter(scenario == "PRESENT", interval == "mon")) covars
Of course we need covariates for August only, for this we can use a function we prepared earlier, prep_model_data()
.
= prep_model_data(model_input, month = "Aug",
variables covars = covars, depth = depth)
variables
Simple feature collection with 7272 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -74.89169 ymin: 38.8678 xmax: -65.02004 ymax: 45.21401
Geodetic CRS: WGS 84
# A tibble: 7,272 × 9
class MLD Sbtm SSS SST Tbtm U V
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 presence 5.17 35.0 31.6 23.3 7.50 -0.00161 -0.00340
2 presence 4.25 32.8 30.6 21.6 8.15 -0.00420 -0.00206
3 presence 4.64 34.0 30.7 20.2 7.05 0.00168 0.00148
4 presence 5.58 34.6 30.7 18.8 7.55 0.00267 -0.000410
5 presence 5.04 34.7 30.7 19.0 7.43 -0.00619 -0.00121
6 presence 4.01 32.4 30.6 22.0 8.22 -0.00344 -0.000859
7 presence 4.10 32.9 30.5 21.8 8.34 -0.00565 -0.00226
8 presence 3.82 32.4 30.3 18.2 3.56 -0.00702 -0.00431
9 presence 3.20 32.4 30.6 17.9 5.73 0.000275 -0.00101
10 presence 4.02 32.9 30.6 22.0 8.62 -0.000900 -0.00148
# ℹ 7,262 more rows
# ℹ 1 more variable: geom <POINT [°]>
3 Set up subsampling
We shall set up spatial cross-folds for our data so that we can have the model run repeatedly (without a lot of effort). Spatial cross-folds are just subsamples of the input data set. They are easy to make, and easy to understand with a map.
<- spatial_block_cv(data = variables,
crossfolds v = 3,
n = 5)
autoplot(crossfolds)
You can see that the domain is cut into a 5 x 5 grid with each grid cell assigned to one of three groups (“folds”) randomly. Cross-folds can have varying numbers of points.
4 Building the ensemble of models
Now we have all we need to build a model. We first make what is called a recipe… it’s just a placeholder. Then we build a work flow where we specify the types of models we want along with the crossfold sampling.
4.1 Make a recipe
A recipe at a bare minimum needs to know two things: what is that data it has to work with and what is the relationship among the variables. The latter is expressed as a formula, very similar to how we specify the formula of a line with y = mx + b
or a parabola y ~ ax^2 + bx + c
.
We often think of formulas as left-hand side (LHS) and right-hand side (RHS) equalities. And usually, the LHS is the outcome while the RHS is about the inputs. For our modeling, the outcome is to predict the across the entire domain. We can generalize the idea with the “is a function of” operator ~
(the tilde). For yhe classic formula for a line it like this… y ~ x
.
Consider a situation where we have reduced all of the suitabke variables to Sbtm
, Tbtm
, MLD
andXbtm
, which we have in a table along with a class
variable. In our case we have the outcome is an prediction of class
it is a function of variables like Sbtm
, Tbtm
, MLD
, Xbtm
, etc. This formula would look like y ~ Sbtm + Tbtm + MLD + Xbtm
. Unlike the specific equation for a line or parabola, we don’t pretend to know what coefficients, powers and that sort of stuff looks like. We are just saying that class
is a function of all of those variables (somehow).
In the case where table where the outcome (class
) is a function of all other variables in the table, we have a nice short hand. class ~ .
where the dot means “every other variable”.
Below we make a recipe which is like template. We provide the table of variables (with spatial info) - techinally we need only need to provide one row as it is building a template, it doesn’t store the data (yet). We also provide the formula which we read as “class as a funtion of everything else”. The recipe()
knows detects the geometry column and handles that specially.
= recipe(variables, formula = class ~ .)
aug_recipe aug_recipe
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 7
coords: 2
4.2 Make a workflow set
Here’s where we specify which model(s) we want, and that we want them to be treated as an ensemble rather than as a individual models.
<-
models # create the workflow_set
workflow_set(
preproc = list(default = aug_recipe),
models = list(
# the standard glm specs - no tuneable parameters available
glm = sdm_spec_glm(),
# rf specs with tuning
rf = sdm_spec_rf(),
# boosted tree model (gbm) specs with tuning
gbm = sdm_spec_boost_tree(),
# maxent specs with tuning
maxent = sdm_spec_maxent()
),# make all combinations of preproc and models,
cross = TRUE ) |>
# tweak controls to store information needed later to create the ensemble
option_add(control = control_ensemble_grid())
We have chosen these 4 models types to be memebers of our ensemble because they are common models types used for species distribution models. We aren’t limited to these, but we will use these for the tutorial.
When we print the models
object we see that it, too, is a table, but some of it’s columns are list
types, and some have empty elements. That’s because we still have not actually modeled anything… we are still in the building phase.
models
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 default_glm <tibble [1 × 4]> <opts[1]> <list [0]>
2 default_rf <tibble [1 × 4]> <opts[1]> <list [0]>
3 default_gbm <tibble [1 × 4]> <opts[1]> <list [0]>
4 default_maxent <tibble [1 × 4]> <opts[1]> <list [0]>
4.3 Auto-tuning the ensemble models
Now we can allow the underlying software to run multiple times as it looks to optimize the different parameters that each model has (if the model has tuneable parameters). Here we ask for the software iterate over the crossfolds
sampling set for each model in the ensemble. We then request the standard diagnostics metrics used in species distribution modeling:Boyce’s Continuous Index, the area under the Receiver Operator Curve (auc), and True Skills Statistic (tss).
Depending upon the complexity of your set up this can take a while. We no longer are working with a template, but rather with full-blown datasets.
<- models |>
models workflow_map("tune_grid",
resamples = crossfolds,
grid = 3,
metrics = sdm_metric_set(),
verbose = TRUE)
i No tuning parameters. `fit_resamples()` will be attempted
i 1 of 4 resampling: default_glm
✔ 1 of 4 resampling: default_glm (609ms)
i 2 of 4 tuning: default_rf
i Creating pre-processing data to finalize unknown parameter: mtry
✔ 2 of 4 tuning: default_rf (42.3s)
i 3 of 4 tuning: default_gbm
i Creating pre-processing data to finalize unknown parameter: mtry
✔ 3 of 4 tuning: default_gbm (31.5s)
i 4 of 4 tuning: default_maxent
✔ 4 of 4 tuning: default_maxent (13.7s)
We get some messages as it processes, but it is hard for the novice to know what they mean, but at least they aren’t error messages. But we can print the updated models and compare to what we had before.
models
# A workflow set/tibble: 4 × 4
wflow_id info option result
<chr> <list> <list> <list>
1 default_glm <tibble [1 × 4]> <opts[4]> <rsmp[+]>
2 default_rf <tibble [1 × 4]> <opts[4]> <tune[+]>
3 default_gbm <tibble [1 × 4]> <opts[4]> <tune[+]>
4 default_maxent <tibble [1 × 4]> <opts[4]> <tune[+]>
We can see that there is more information in there than there was before. Note that the result
for default_glm
is labeled differently than the others; that’s because that particular model doesn’t have any tuneable parameters so the software couldn’t tune anything.
4.4 Ensemble model metrics
Let’s plot the models in the ensemble, which will reveal some of the diagnostic metrics.
autoplot(models)
Notice that there are three replicates for each model except for logistic_reg
which is the glm
that had no tuneable parameters. Also note that the order they are represented in each plot is established by the order of the first plot and then carried through to the others. We can probably agree by examining the left hand plot, that the maxent
model has the highest Boyce Continuous Index which means it’s rank is 1. Confusingly higher rank values means lower performance; maybe we should think of it a “first place”.
But what decides that Boyce Continuous Index should be allowed to establish the order? It’s just alphabetical order! To be frank, the workflow rank is not all that meaningful.
At this point we can decide of we are satisfied with how the individual models perform with the crossfolds. If we found them lacking we could go back to add or remove models, adjust our sampling scheme, change the covariate selection or anything else we felt might improve the performance. We won’t do that now, but we’ll accept the tuned parameters “as is” and push ahead.
4.5 Establish the ensemble
Now we come to point we make a leap and build the actual ensemble - this is the thing we shall be able to use to make predictions. We only need to specify which of the metrics to preferentially use when selecting the optimized parameters. Are you ready for this?
<-
ensemble simple_ensemble() |> # make an empty ensemble
add_member(models, metric = "boyce_cont") # populate it
That’s it - tahdah! We’re done! We can plot it.
autoplot(ensemble)
Notice that we are no longer looking at the crossfold replicates but instead we are looking at summary of the individual models run on the complete data sets. Again, “Workflow Rank” in an arbitrary order. We can see that maxent
performs well on all of the metrics, but Random Forests and Boosted Regression Trees are pretty good, too. The Generalize Linear Model seems to be the odd one out, which isn’t to say that it doesn’t work.
We can also pull out a table of these metric results.
|>
ensemble collect_metrics()
# A tibble: 12 × 5
wflow_id .metric mean std_err n
<chr> <chr> <dbl> <dbl> <int>
1 default_glm boyce_cont 0.403 0.412 3
2 default_glm roc_auc 0.606 0.0419 3
3 default_glm tss_max 0.217 0.0345 3
4 default_rf boyce_cont 0.869 0.0854 3
5 default_rf roc_auc 0.663 0.0300 3
6 default_rf tss_max 0.250 0.0488 3
7 default_gbm boyce_cont 0.803 0.0932 3
8 default_gbm roc_auc 0.680 0.0262 3
9 default_gbm tss_max 0.297 0.0286 3
10 default_maxent boyce_cont 0.790 0.171 3
11 default_maxent roc_auc 0.692 0.0400 3
12 default_maxent tss_max 0.336 0.0420 3
Useful, but prettier as a plot!
5 Saving an ensemble for later use
We can save the ensemble to load back in later, we just need to provide the version information.
write_ensemble(ensemble, version = "g_Aug")
6 Recap
We have built an ensemble of 4 models using a recipe and workflow. Models were partially designed using variables we selected in an earlier step. We allowed the model building process to “self-tune” by dividing the sample set into multi-folds which are orderly subsamples of our input data usually call “cross-folds”. We then looked at some diagnostic metrics to help us understand model performance, before building an ensemble of models. We saved the ensemble of models for restoring later.
Prepare and save an ensemble for each month for both “greedy” and “conservative” sampling approaches.