source("setup.R")
Covariates
- “In the end that was the choice you made, and it doesn’t matter how hard it was to make it. It matters that you did.”
- Cassandra Clare
Now we turn our attention to what we know and guess about the environments. We are using the Brickman data to make habitat suitability maps for select species under two climate scenarios (RCP45 and RCP85) at two different times (2055 and 2075) in the future. Each variable we might use is called covariate or predictor. Our covariates are nicely packaged up and tidy, but the reality is that it often requires a good deal of data wrangling if the data are messy.
Our step here is to make sure that two or more covariates are not highly correlated if they are, then we would likely want to drop all but one.
1 Setup
As always, we start by running our setup function. Start RStudio/R, and relaod your project with the menu File > Recent Projects
.
2 A broad approach - looking for correlation across the domain
We can look at the entire domain, the complete spatial extent of our arrays of data, to look for correlated variables. For example, we might wonder if sea surface temperature(SST) and sea floor temperature (Tbtm) vary together, when one goes up the other goes up. That sort of thing. We have ways of getting at those correlations.
2.1 Reading in the covariates
We’ll read in the Brickman database, then filter two different subsets to read: “STATIC” covariate bathymetry that apply across all scenarios and times and monthly covariates for the “PRESENT” period.
= brickman_database()
db = read_brickman(filter(db, var == 'depth'))
depth = read_brickman(filter(db, scenario == "PRESENT", interval == "mon")) present
We should pick one month, and join that month’s model data with depth. We have used August before as our example, let’s continue with August.
= present |>
aug ::slice("month", "Aug")
dplyr= bind_attrs(aug, depth) aug
2.2 Make a pairs
plot
A pairs
plot is a plot often used in exploratory data analysis. It makes a grid of mini-plots of a set of variables, and reveals the relationships among the variables pair-by-pair. It’s easy to make.
pairs(aug)
In the lower left portion of the plot we see paired scatter plots, at upper right we see the correlation values of the pairs, and long the diagonal we see a histogram of each variable. Some pairs are highly correlated, say over 0.7
, and to include both in the modeling might not provide us with greater predictive power. It may feel counterintuitive to remove any variables - more data means more information, right? And more information means more informed models. Consider two measurements, human height and inseam. We might use these to predict if a person can dunk a basketball, but since they are probably strongly correlated do you really need both?
2.3 Identify the most independent variables (and the most collinear)
We have a function that can help use select which variables to remove. filter_collinear()
returns a listing of variables it suggests we keep. It attaches to the return value an attribute (like a post-it note stuck on a box) that lists the complementary variables that it suggests we drop. We are choosing a particular method you can learn more about using R’s help ?filter_collinear
.
= filter_collinear(aug, method = "vif_step")
keep keep
[1] "MLD" "Sbtm" "SSS" "SST" "Tbtm" "U" "V"
attr(,"to_remove")
[1] "Xbtm" "depth"
Of course, we can decide to ignore this advice, and pick which ever ones we want including keeping them all.
Whatever selection of variable we decide to model with, we will save this listing to a file. That way we can refer to it progammatically. But that comes later.
2.4 A closer look at the model input data
Before we do commit to a selection of variables, let’s turn our attention back to our presence-background points, and look at just those chosen values rather than at values drawn form across the entire domain. Let’s open the file that contains the “greedy” model input for August during the PRESENT climate scenario.
= read_model_input(scientificname = "Mola mola",
model_input approach = "greedy",
mon = "Aug")
model_input
Simple feature collection with 7277 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -74.89169 ymin: 38.805 xmax: -65.02004 ymax: 45.21401
Geodetic CRS: WGS 84
# A tibble: 7,277 × 2
class geom
<chr> <POINT [°]>
1 presence (-72.8074 39.056)
2 presence (-71.343 40.52)
3 presence (-68.7691 41.52448)
4 presence (-67.79 43.32)
5 presence (-68.44324 42.61177)
6 presence (-72.4328 40.213)
7 presence (-71.8784 40.3569)
8 presence (-65.78 43.195)
9 presence (-70.5 42.767)
10 presence (-72.3024 40.1862)
# ℹ 7,267 more rows
Next we’ll extract data values from our August covariates.
= extract_brickman(aug, model_input, form = "wide")
variables variables
# A tibble: 7,277 × 10
point MLD Sbtm SSS SST Tbtm U V Xbtm depth
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 p0001 5.17 35.0 31.6 23.3 7.50 -0.00161 -0.00340 0.00133 304.
2 p0002 4.25 32.8 30.6 21.6 8.15 -0.00420 -0.00206 0.00166 71.6
3 p0003 4.64 34.0 30.7 20.2 7.05 0.00168 0.00148 0.000793 138.
4 p0004 5.58 34.6 30.7 18.8 7.55 0.00267 -0.000410 0.000957 234.
5 p0005 5.04 34.7 30.7 19.0 7.43 -0.00619 -0.00121 0.00224 205.
6 p0006 4.01 32.4 30.6 22.0 8.22 -0.00344 -0.000859 0.00126 62.6
7 p0007 4.10 32.9 30.5 21.8 8.34 -0.00565 -0.00226 0.00216 71.3
8 p0008 3.82 32.4 30.3 18.2 3.56 -0.00702 -0.00431 0.00293 81.6
9 p0009 3.20 32.4 30.6 17.9 5.73 0.000275 -0.00101 0.000372 70.6
10 p0010 4.02 32.9 30.6 22.0 8.62 -0.000900 -0.00148 0.000614 64.9
# ℹ 7,267 more rows
We are going to call a plotting function, plot_pres_vs_bg()
, that wants some of the data from model_input
and some of the data in variables
. So, we have to do some data wrangling to combine those; we’ll add class
to variables
and then drop the point
column.
= variables |>
variables mutate(class = model_input$class) |> # the $ extracts a column
select(-point) # the - means "deselect" or "drop"
variables
# A tibble: 7,277 × 10
MLD Sbtm SSS SST Tbtm U V Xbtm depth class
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 5.17 35.0 31.6 23.3 7.50 -0.00161 -0.00340 0.00133 304. presence
2 4.25 32.8 30.6 21.6 8.15 -0.00420 -0.00206 0.00166 71.6 presence
3 4.64 34.0 30.7 20.2 7.05 0.00168 0.00148 0.000793 138. presence
4 5.58 34.6 30.7 18.8 7.55 0.00267 -0.000410 0.000957 234. presence
5 5.04 34.7 30.7 19.0 7.43 -0.00619 -0.00121 0.00224 205. presence
6 4.01 32.4 30.6 22.0 8.22 -0.00344 -0.000859 0.00126 62.6 presence
7 4.10 32.9 30.5 21.8 8.34 -0.00565 -0.00226 0.00216 71.3 presence
8 3.82 32.4 30.3 18.2 3.56 -0.00702 -0.00431 0.00293 81.6 presence
9 3.20 32.4 30.6 17.9 5.73 0.000275 -0.00101 0.000372 70.6 presence
10 4.02 32.9 30.6 22.0 8.62 -0.000900 -0.00148 0.000614 64.9 presence
# ℹ 7,267 more rows
Finally, can make a specialized plot comparing our variables for each class: presence
and background
.
plot_pres_vs_bg(variables, "class")
How does this inform our thinking about reducing the number of variables? For which variables do presence
and background
values mirror each other? Which have the least overlap? We know that the model works by finding optimal combinations of covariates for the species. If there is never a difference between the conditions for presences
and background
then how will it find the optimal niche conditions?
2.5 Saving a file to keep track of modeling choices
You may have noticed that we write a lot of things to files (aka, “writing to disk”). It’s a useful practice especially when working with a multi-step process. One particular file, a configuration file, is used frequently in data science to store information about the choices we make as we work through our project. Configuration files generally are simple text files that we can easily get the computer to read and write.
In R, a confguration is treated as a named list. Each element of a list is named, but beyond that there aren’t any particular rules about confugurations. You can learn more about configurations in this tutorial.
Let’s make a confuguration list that holds 4 items: version identifier, species name, sampling approach and the names of the variables to model with.
= list(
cfg version = "g_Aug", # g for greedy!
scientificname = "Mola mola",
approach = "greedy",
mon = "Aug",
keep_vars = keep)
We can access by name three ways using what is called “indexing” : using the [[
indexing brackets, using the $
indexing operator or using the getElement()
function.
'scientificname']] cfg[[
[1] "Mola mola"
2]] cfg[[
[1] "Mola mola"
$scientificname cfg
[1] "Mola mola"
getElement(cfg, "scientificname")
[1] "Mola mola"
getElement(cfg, 2)
[1] "Mola mola"
Now we’ll write this list to a file. First let’s set up a pathwy where we might store these configurations, and for that matter, to store our modeling files. We’ll make a new directory, models/g008
and write the configuration there. We’ll use the famous “YAML” format to store the file. See the file functions/configuration.R
for documentation on reading and writing.
= make_path(data_path("models")) # make a directory for models
ok write_configuration(cfg)
Use the Files
pane to navigate to your personal data directory. Open the g_Aug.yaml
file - this is what you configuration looks like in YAML. Fortunately we don’t mess manually with these much.
3 Recap
We loaded the covariates for the “PRESENT” climate scenario and looked at collinearity across the entore study domain. We invoked a function that suggests which variables to keep and which to drop based upon collinearity. We examined the covariates at just the presence
and background
locations. We then saved a configuration for later reuse.
Open nd edit the file called functions/select_covariates.R
. Within the file write the function(s) you need to select the “keeper” variables for a given approach (greedy or conservative) and a given month (Jan - Dec). Have the function return an appropriate configuration list. The function shoulkd start out approximately like this…
#' Given a species, month and sampling approach select variabkes for each month
#'
#' @param approach chr one of "greedy" (default) or "conservative"
#' @param mon chr month abbreviation ("Jan" default)
#' @param scientificname chr the species studied (default "Mola mola")
#' @param path chr file path to the personal data directory
#' @return a configuration list
select_covariates = function(approach = "greedy",
mon = "Jan",
scientificname = "Mola mola",
path = data_path()){
ret = list(
version = <something you make goes here>,
scientificname = scientificname,
approach = approach,
mon = month,
keep_vars = <something you make goes here>)
}
Use the iterations tutorial to apply your select_covariates()
for each month using each approach. At each iteration write the configuration. When you are done, you should have 12 YAML files for each approach - so 24 YAML files written all together for each species.