source("setup.R")
Models
- All models are wrong, but some are useful.
- George Box
Modeling starts with a collection of observations (presence and background for us!) and ends with a collection of coefficients that can be used with one or more formulas to make a prediction for the past, the present or the future. We are using modeling specifically to make habitat suitability maps for select species under two climate scenarios (RCP45 and RCP85) at two different times (2055 and 2075) in the future.
We can choose from a number of different models: random forest “rf”, maximum entropy “maxent” or “maxnet”, boosted regression trees “brt”, general linear models “glm”, etc. The point of each is to make a mathematical representation of natural occurrences. It is important to consider what those occurrences might be - categorical like labels? likelihoods like probabilities? continuous like measurements? Here are examples of each…
- Categorical
- two class labels: “present/absence”, “red/green”, “shell/no shell”, “alive/dead”
- multi-class labels: “vanilla/chocolate/strawberry”, “immature/mature/aged”
- Likelihood and Probability
- probability: “50% chance of rain”, “80% chance of a fatal fall”
- relativity: “low likelihood of encounter”, “some likelihood of encounter”
- Continuous
- abundance: “48.2 mice per km^2”, “10,500 copepods per m^3”
- rate: “50 knot winds”, “28.2 Sverdrups”
- measure: “3.2 cm of rain”, “12.1 grams of carbon”
We are modeling with known observations (presences) and a sampling of the background, so we are trying to model the likelihood that a species will be encountered (and reported) relative to the environmental conditions. We are looking for a model that can produce the relative likelihood of an encounter that results in a report.
We’ll be using a random forest model (rf). We were inspired to follow this route by this tidymodels tutorial prepared by our colleague Omi Johnson.
1 Setup
As always, we start by running our setup function. Start RStudio/R, and reload your project with the menu File > Recent Projects.
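If you are starting a fresh session, that means re-running the setup script (the same call shown at the top of this page).
source("setup.R")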
2 Load data - choose a month and sampling approach
Let’s load what we need to build a model for August using the greedy sampling technique. We’ll also need the model configuration (which is “g_Aug”). And we’ll need the covariate data. Notice that we select the covariates that are included in our configuration.
model_input = read_model_input(scientificname = "Mola mola",
                               approach = "greedy",
                               mon = "Aug")
cfg = read_configuration(version = "g_Aug")
db = brickman_database()
covars = read_brickman(db |> filter(scenario == "PRESENT", interval == "mon")) |>
  select(all_of(cfg$keep_vars))
Of course we need covariates for August only; for this we can use a function we prepared earlier, prep_model_data(). Note that we specifically ask for a plain table, which means we are dropping the spatial information for now. Also, we select only the variables required in the configuration, plus the class label.
all_data = prep_model_data(model_input,
                           month = "Aug",
                           covars = covars,
                           form = "table") |>
  select(all_of(c("class", cfg$keep_vars)))
all_data
# A tibble: 7,272 × 8
class MLD Sbtm SSS SST Tbtm U V
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 presence 5.17 35.0 31.6 23.3 7.50 -0.00161 -0.00340
2 presence 4.25 32.8 30.6 21.6 8.15 -0.00420 -0.00206
3 presence 4.64 34.0 30.7 20.2 7.05 0.00168 0.00148
4 presence 5.58 34.6 30.7 18.8 7.55 0.00267 -0.000410
5 presence 5.04 34.7 30.7 19.0 7.43 -0.00619 -0.00121
6 presence 4.01 32.4 30.6 22.0 8.22 -0.00344 -0.000859
7 presence 4.10 32.9 30.5 21.8 8.34 -0.00565 -0.00226
8 presence 3.82 32.4 30.3 18.2 3.56 -0.00702 -0.00431
9 presence 3.20 32.4 30.6 17.9 5.73 0.000275 -0.00101
10 presence 4.02 32.9 30.6 22.0 8.62 -0.000900 -0.00148
# ℹ 7,262 more rows
3 Split the data set into testing and training data sets
We will randomly split our dataset into a larger set used for training the model and a smaller set we withhold for later testing of the model. Since we have labeled data (“presence” and “background”) we want to be sure we sample these in proportion; for that we’ll indicate that the data are stratified (into just two groups). Let’s first determine what the proportion is before splitting.
# A little function to compute the ratio of presences to background
# @param x table with a "class" column
# @return numeric ratio presences/background
get_ratio = function(x){
  counts = count(x, class)
  np = filter(counts, class == "presence") |> pull(n)
  nb = filter(counts, class == "background") |> pull(n)
  return(np/nb)
}
cat("ratio of presence/background in full data set:", get_ratio(all_data), "\n")
ratio of presence/background in full data set: 0.50934
Now let’s make the split with the training set comprising 75% of all_data. Note that we specifically identify class as the strata (or grouping) variable.
split_data = initial_split(all_data,
                           prop = 3/4,
                           strata = class)
split_data
<Training/Testing/Total>
<5453/1819/7272>
It prints the counts of the training data, the testing data and the entire data set. We can extract the training and testing data using the training() and testing() functions. Let’s check the ratios for those.
cat("ratio of presence/background in training data:",
training(split_data) |> get_ratio(), "\n")
ratio of presence/background in training data: 0.5092721
cat("ratio of presence/background in testing data:",
testing(split_data) |> get_ratio(), "\n")
ratio of presence/background in testing data: 0.5095436
OK! The samples preserved the original proportion of presence/background.
Note! Did you notice that the function is called initial_split(), which implies a subsequent split - what do you suppose that is about?
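One common answer (a hint only - we don’t use it in this tutorial) is that the training portion is often split again into resampling folds, for example with rsample’s vfold_cv().
# hypothetical: resample the training data into 10 folds, stratified by class
folds = vfold_cv(training(split_data), v = 10, strata = class)
folds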
4 Create a workflow
Workflows are containers for storing the data pre-processing steps and model specifications. Not too long ago it was quite a challenge to keep track of all the bits and pieces required to make good forecasts. The advent of workflows greatly simplifies the process. A workflow will house two important items for us: a recipe and a model. For now, we’ll create an empty workflow, then add to it as needed. At the very end, we’ll save the workflow.
wflow = workflow()
That’s it!
5 Build a recipe
The first thing we’ll add to the workflow is a recipe.
A recipe is a blueprint that guides the data handling and modeling process. At a bare minimum a recipe needs to know two things: what data it has to work with and what the relationship is among the variables within the data. The latter is expressed as a formula, very similar to how we specify the formula of a line with y = mx + b or a parabola with y = ax^2 + bx + c.
We often think of formulas as left-hand side (LHS) and right-hand side (RHS) equalities. Usually the LHS is the outcome while the RHS is about the inputs. For our modeling, the outcome is to predict the class across the entire domain. We can generalize the idea with the “is a function of” operator ~ (the tilde). The classic formula for a line looks like this: y ~ x, and a parabola is also y ~ x.
Consider a situation where we have reduced all of the suitable variables to Sbtm, Tbtm, MLD and Xbtm, which we have in a table along with a class variable. In our case the outcome is a prediction of class, and it is a function of variables like Sbtm, Tbtm, MLD, Xbtm, etc. That formula would look like class ~ Sbtm + Tbtm + MLD + Xbtm. Unlike the specific equation for a line or parabola, we don’t pretend to know what the coefficients, powers and that sort of stuff look like. We are just saying that class is a function of all of those variables (somehow).
In the case here where the outcome (class) is a function of all other variables in the table, we have a nice shorthand: class ~ . where the dot means “every other variable”.
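To make the shorthand concrete, here is a small sketch (assuming a table that contains only class plus those four predictors - recall that Xbtm is hypothetical here):
# these two formulas describe the same relationship for such a table
f_long  = class ~ Sbtm + Tbtm + MLD + Xbtm
f_short = class ~ .   # "." means every column other than class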
First we fish the training data out of our split data, and then drop the spatial information.
tr_data = training(split_data)
tr_data
# A tibble: 5,453 × 8
class MLD Sbtm SSS SST Tbtm U V
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 background 3.53 29.5 29.0 17.7 15.6 -0.00355 0.000511
2 background 3.92 31.1 28.9 17.9 7.64 -0.0000966 0.00158
3 background 4.51 28.6 28.6 19.8 19.9 0.00193 0.00234
4 background 4.56 28.4 28.3 20.2 20.3 0.00202 -0.00158
5 background 3.63 30.1 29.3 17.6 13.6 -0.000100 0.00167
6 background 2.84 31.6 29.3 17.6 5.80 -0.00135 0.00327
7 background 4.02 31.9 29.3 17.8 5.38 -0.000583 0.00207
8 background 4.46 32.1 29.3 18.0 5.33 -0.000284 0.00229
9 background 4.34 32.4 29.2 18.1 5.50 -0.000321 0.00283
10 background 4.01 32.4 29.1 18.1 5.53 0.000172 0.00369
# ℹ 5,443 more rows
Now we make the recipe. Note that no computation takes place.
Technically, recipe() only needs a small subset of the data set to establish the names and data types of the predictor and outcome variables. Just one row would suffice. That underscores that a recipe is simply building a template.
rec = recipe(class ~ ., data = slice(tr_data, 1))
rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 7
This printout provides a very high level summary - all of the details are glossed over. To get a more detailed summary use the summary() function.
summary(rec)
# A tibble: 8 × 4
variable type role source
<chr> <list> <chr> <chr>
1 MLD <chr [2]> predictor original
2 Sbtm <chr [2]> predictor original
3 SSS <chr [2]> predictor original
4 SST <chr [2]> predictor original
5 Tbtm <chr [2]> predictor original
6 U <chr [2]> predictor original
7 V <chr [2]> predictor original
8 class <chr [3]> outcome original
Each variable in the input is assigned a role: “outcome” or “predictor”. The latter are the variables used in the creation of the model. There are other roles (see ?recipe), including “case_weight” and “ID”, and others can be defined as needed. Some are used in building the model; others simply ride along and don’t change the model outcome.
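For example, if our table carried a station identifier that we wanted to keep for bookkeeping but not use as a predictor, a sketch like this would assign it an “ID” role with update_role() (station_id is hypothetical - our data has no such column):
# hypothetical: assumes tr_data had a station_id column
rec_with_id = recipe(class ~ ., data = slice(tr_data, 1)) |>
  update_role(station_id, new_role = "ID")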
5.1 Modifying the recipe with steps
Steps are cumulative modifications, and that means the order in which they are added matters. These steps make up the bulk of the pre-processing.
Some modifications are applied row-by-row. For example, rows of the input modeling data that have one or more missing values (NAs) can be problematic and they should be removed.
Other modifications manipulate entire columns. Sometimes the recipe requires additional steps before the modeling begins in earnest. For example, we know from experience that it is often useful to log scale (base 10) depth when working with biological models. If depth and Xbtm have made it this far, you’ll note that each ranges over 4 or more orders of magnitude. That’s not a problem by itself, but it can introduce a bias toward larger values whenever the mean is computed. So, we’ll add a step for log scaling these, but only if depth and Xbtm have made it this far (this may vary by species).
rec = rec |>
  step_naomit()
if ("depth" %in% cfg$keep_vars){
  rec = rec |>
    step_log(depth, base = 10)
}
if ("Xbtm" %in% cfg$keep_vars){
  rec = rec |>
    step_log(Xbtm, base = 10)
}
rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 7
── Operations
• Removing rows with NA values in: <none>
Next we state that we want to remove variables that might be highly correlated with other variables. If two variables are highly correlated, they will not provide the modeling system with more information, just redundant information, which doesn’t necessarily help. step_corr() accepts a variety of arguments specifying which variables to test for correlation, including some convenience selectors like all_numeric(), all_string() and friends. We want all predictors, which happen to all be numeric, so we could use all_predictors() or all_numeric_predictors(). Specificity is better than generality, so let’s choose the numeric predictors.
We have already tested variables for high collinearity, but here we can add a slightly different filter, high correlation, for the same issue. Since we have dealt with this already we shouldn’t expect that this step will change the preprocessing very much. But it is instructive to see it in action.
rec = rec |>
  step_corr(all_numeric_predictors())
rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 7
── Operations
• Removing rows with NA values in: <none>
• Correlation filter on: all_numeric_predictors()
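By default step_corr() filters at a correlation threshold of 0.9. If you wanted a stricter filter you could set the threshold explicitly; this is a sketch only - we accept the default in this tutorial.
# example of a stricter correlation filter (not used below)
rec_strict = rec |>
  step_corr(all_numeric_predictors(), threshold = 0.8)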
5.2 Add the recipe to the workflow
wflow = wflow |>
  add_recipe(rec)
wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: None
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_corr()
6 Build a model
We are going to build a random forest “rf” model in classification mode, which means for us that we have predictions of “presence” or “background”. That’s just two classes, but random forests can predict multiple classes, too. Also, random forests can make regression models, which are used for continuous data. Below we start the model, declare its mode and assign an engine (the package we prefer to use). We’ll be using the ranger R package.
6.1 Create the model
We create a random forest model, declare that it should be run in classification mode (not regression mode), and then specify that we want to use the ranger modeling engine (as opposed to, say, the randomForest engine). We additionally specify that it should be able to produce probabilities of a class, not just the class label. We also request that it save bits of info so that we can compare the relative importance of the covariates.
model = rand_forest() |>
  set_mode("classification") |>
  set_engine("ranger", probability = TRUE, importance = "permutation")
model
Random Forest Model Specification (classification)
Engine-Specific Arguments:
probability = TRUE
importance = permutation
Computational engine: ranger
Well, that feels underwhelming. We can pass arguments unique to the engine using the set_args() function, but for now we’ll accept the defaults.
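For instance, parsnip’s standardized arguments let us ask for more trees or a different minimum node size; this is a sketch only - we keep the defaults above.
# example only: adjust standard random forest arguments via set_args()
model_alt = model |>
  set_args(trees = 1000, min_n = 5)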
6.2 Add the model to the workflow
Now we simply add the model to the workflow.
wflow = wflow |>
  add_model(model)
wflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_corr()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Engine-Specific Arguments:
probability = TRUE
importance = permutation
Computational engine: ranger
7 Fit the model
fitted_wflow = fit(wflow, data = tr_data)
fitted_wflow
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_naomit()
• step_corr()
── Model ───────────────────────────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, probability = ~TRUE, importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Probability estimation
Number of trees: 500
Sample size: 5453
Number of independent variables: 7
Mtry: 2
Target node size: 10
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.2109653
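Because we asked ranger to record permutation importance, we can dig the fitted ranger object out of the workflow and peek at it; a small sketch using workflows’ extract_fit_engine().
# peek at the permutation importance stored by ranger
extract_fit_engine(fitted_wflow)$variable.importance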
8 Making predictions
Predicting is easy with this pattern: predictions = predict(model, newdata, ...). We want to specify that we want probabilities of a particular class being predicted. In each case we bind our original classification, class, to the prediction.
8.1 Predict with the training data
First we shall predict with the same data we trained with. The results will not really tell us much about our model, as it is circular to predict using the very data used to build the model. So this next section is more about a first pass at using the tools at your disposal.
train_pred = predict_table(fitted_wflow, tr_data, type = "prob")
train_pred
# A tibble: 5,453 × 4
.pred_presence .pred_background .pred class
<dbl> <dbl> <fct> <fct>
1 0.0126 0.987 background background
2 0.0821 0.918 background background
3 0.336 0.664 background background
4 0.223 0.777 background background
5 0.00834 0.992 background background
6 0.0647 0.935 background background
7 0.0380 0.962 background background
8 0.0311 0.969 background background
9 0.0466 0.953 background background
10 0.0853 0.915 background background
# ℹ 5,443 more rows
Here the variables prepended with a dot (.) are computed, while the class variable is our original. There are many metrics we can use to determine how well this model predicts. Let’s start with the simplest thing… we can make a simple tally of .pred and class.
count(train_pred, .pred, class)
# A tibble: 4 × 3
.pred class n
<fct> <fct> <int>
1 presence presence 1389
2 presence background 351
3 background presence 451
4 background background 3262
There are false positives and false negatives, but many predictions are correct. Of course, this is predicting with the very data we used to train the model; knowing that this is prediction on training data with so many misses might not inspire confidence. But let’s explore more.
8.2 Assess the model
Here we walk through a number of common assessment tools. We want to assess a model to ascertain how closely it models reality (or not!). Using the tools is always easy; interpreting the metrics is not.
8.2.1 Confusion matrix
The confusion matrix is the next step beyond a simple tally that we made above.
train_confmat = conf_mat(train_pred, class, .pred)
train_confmat
Truth
Prediction presence background
presence 1389 351
background 451 3262
You’ll see this is the same as the simple tally we made, but it comes with handy plotting functionality (shown below). Note that a perfect model would have the upper left and lower right quadrants fully accounting for all points. The lower left quadrant shows us the number of false-negatives while the upper right quadrant shows the number of false-positives.
autoplot(train_confmat, type = "heatmap")
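The same confusion matrix object can also be boiled down into a whole table of derived metrics (accuracy, sensitivity, specificity and friends) with yardstick’s summary() method; a quick sketch:
# a table of metrics derived from the confusion matrix
summary(train_confmat)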
8.2.2 ROC and AUC
The area under the curve (AUC) of the receiver-operator curve (ROC) is a common metric. AUC values range from 0 to 1, with 1 reflecting a model that faithfully predicts correctly. Technically an AUC value of 0.5 represents a random model (yup, the result of a coin flip!), so values greater than 0.5 and less than 1.0 are expected.
First we can plot the ROC.
plot_roc(train_pred, class, .pred_presence)
We can assure you from practical experience that this is an atypical ROC. Typically they are not smooth, but this smoothness is an artifact of our use of training data. If you really only need the AUC, you can use the roc_auc() function directly.
roc_auc(train_pred, class, .pred_presence)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.940
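plot_roc() is a convenience helper (presumably loaded by our setup script); if you ever want to build the curve with yardstick directly, something like this should work:
# compute the ROC curve with yardstick and plot it
roc_curve(train_pred, class, .pred_presence) |>
  autoplot()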
8.2.3 Accuracy
Accuracy, much like our simple tally above, tells us what fraction of the predictions are correct. Note that here we explicitly provide the predicted class label (not the probability).
accuracy(train_pred, class, .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.853
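yardstick can also bundle several class-label metrics into a single call with metric_set(); a small sketch:
# accuracy, sensitivity and specificity in one pass
my_metrics = metric_set(accuracy, sens, spec)
my_metrics(train_pred, truth = class, estimate = .pred)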
8.2.4 Partial dependence plot
Partial dependence reflects the relative contribution of each variable’s influence over its full range of values. The output is a grid of plots showing the relative distribution of the variable (bars) as well as the relative influence of the variable (line).
partial_dependence_plot(fitted_wflow, data = tr_data)
8.3 Predict with the testing data
Finally, we can repeat these steps with the testing data. This should give us better information than using the training data.
8.3.1 Predict
test_data = testing(split_data)
test_pred = predict_table(fitted_wflow, test_data, type = "prob")
test_pred
# A tibble: 1,819 × 4
.pred_presence .pred_background .pred class
<dbl> <dbl> <fct> <fct>
1 0.210 0.790 background presence
2 0.472 0.528 background presence
3 0.847 0.153 presence presence
4 0.269 0.731 background presence
5 0.729 0.271 presence presence
6 0.0251 0.975 background presence
7 0.690 0.310 presence presence
8 0.00902 0.991 background presence
9 0.151 0.849 background presence
10 0.399 0.601 background presence
# ℹ 1,809 more rows
8.3.2 Confusion matrix
test_confmat = conf_mat(test_pred, class, .pred)
autoplot(test_confmat, type = "heatmap")
8.3.3 ROC/AUC
plot_roc(test_pred, class, .pred_presence)
This ROC is more typical of what we see in regular practice.
8.3.4 Accuracy
accuracy(test_pred, class, .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.695
8.3.5 Partial Dependence
partial_dependence_plot(fitted_wflow, data = test_data)
9 Saving recipes and models to disk as a workflow
We can (and should!) save recipes and models to disk for later recall. We need the recipe because it handles the pre-processing of our covariates, while the model specifies both the form of the model as well as the necessary coefficients. When they are bundled together for later use we can be assured that the data pre-processing steps and model specifications will be available. A workflow is a container for recipes, models and other parts of the modeling process.
Now we can save the workflow container.
write_workflow(fitted_wflow, version = cfg$version)
You can read it back later with read_workflow().
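Later, in another session, the saved workflow can be read back and used for prediction; a sketch assuming read_workflow() accepts the same version argument as write_workflow():
# in a later session (after source("setup.R"))
wf = read_workflow(version = cfg$version)
predict_table(wf, test_data, type = "prob")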
10 Recap
We have built a random forest model using tools from the tidymodels universe. After reading in a suite of data, we split our data into training and testing sets, withholding the testing set until the very end. We looked at a variety of metrics including a simple tally, a confusion matrix, ROC and AUC, accuracy and partial dependence. We saved the recipe and model together in a special container, called a workflow, to a file.
11 Coding Assignment
Use the iterations tutorial to build a workflow for each month using one or both of your background selection methods. Save each workflow in the models directory. If you chose to do both background selection methods then you should end up with 24 workflows (12 months x 2 background sampling methods).
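One possible shape for that iteration; this is a sketch only - it assumes the helper functions used above and that your configuration versions follow the same "g_Mon" naming pattern as "g_Aug".
# sketch: one workflow per month for the greedy sampling approach
for (mon in month.abb){
  version = paste0("g_", mon)
  cfg = read_configuration(version = version)
  model_input = read_model_input(scientificname = "Mola mola",
                                 approach = "greedy",
                                 mon = mon)
  # ... prep the data, split, build the recipe, model and workflow as above ...
  # write_workflow(fitted_wflow, version = version)
}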