library(psptools)
Configuration files
All of the parameters used in the forecast model are stored in a configuration file.
cfg <- list(
configuration="test",
image_list = list(tox_levels = c(0,10,30,80),
forecast_steps = 1,
n_steps = 3,
minimum_gap = 4,
maximum_gap = 10,
multisample_weeks="last",
toxins = c("gtx4", "gtx1", "dcgtx3", "gtx5", "dcgtx2", "gtx3",
"gtx2", "neo", "dcstx", "stx", "c1", "c2")),
model = list(balance_val_set=FALSE,
downsample=FALSE,
use_class_weights=FALSE,
dropout1 = 0.3,
dropout2 = 0.3,
batch_size = 32,
units1 = 32,
units2 = 32,
epochs = 128,
validation_split = 0.2,
shuffle = TRUE,
num_classes = 4,
optimizer="adam",
loss_function="categorical_crossentropy",
model_metrics=c("categorical_accuracy")),
train_test = list(split_by="year_region_species",
train = list(
year = c("2015", "2016", "2017", "2018", "2019", "2020", "2021"),
region = c("maine"),
species = c("mytilus")),
test = list(
year = c("2014"),
region= c("maine"),
species = c("mytilus")))
)
Model configurations are list objects saved as .yaml files in ‘inst/configurations’. The configuration files can be read using read_config(filename).
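To illustrate how a configuration list maps to a .yaml file, the round trip below may help. It is a minimal sketch that assumes the `yaml` package is installed and uses it directly; it does not call `read_config()`, whose internals are not shown here, and the temporary file path is purely illustrative.

```r
library(yaml)

# A trimmed-down configuration list (a subset of the fields shown above)
cfg <- list(
  configuration = "test",
  image_list = list(tox_levels = c(0, 10, 30, 80),
                    n_steps = 3))

# Write the list to a temporary .yaml file (illustrative path, not 'inst/configurations')
path <- tempfile(fileext = ".yaml")
write_yaml(cfg, path)

# Read it back as a nested list
cfg2 <- read_yaml(path)
cfg2$image_list$tox_levels
```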
configuration
The name of the configuration.
image_list
The section of the configuration file that defines how images will be mined out of the raw toxin measurement data.
tox_levels
A vector of integers that define the bins of total toxicity used to classify images.
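As a rough illustration of how tox_levels could turn a total toxicity value into a class, the sketch below uses base R's `findInterval()`. The handling of bin edges (left-closed bins, classes numbered from 0) is an assumption made for this example, not something confirmed by the package.

```r
tox_levels <- c(0, 10, 30, 80)

# Hypothetical total toxicity values for a handful of samples
total_toxicity <- c(0, 5, 10, 29.9, 30, 80, 150)

# findInterval() counts how many breaks are <= each value;
# subtracting 1 gives classes 0..3 (assumed numbering)
classes <- findInterval(total_toxicity, tox_levels) - 1
classes
# 0 0 1 1 2 3 3
```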
forecast_steps
An integer defining how many steps ahead of the image the label is taken from (for training) or the prediction is made for (for forecasting).
n_steps
An integer defining the number of steps used to make an image.
minimum_gap
An integer defining the minimum number of days between samples in order to use them to make an image.
maximum_gap
An integer defining the maximum number of days between samples in order to use them to make an image.
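The interaction of n_steps, forecast_steps, minimum_gap and maximum_gap can be sketched as a window search over one site's sample dates. The helper `find_windows()` below is hypothetical and written only for illustration; it assumes each window needs n_steps + forecast_steps consecutive samples and that every gap inside the window must lie between minimum_gap and maximum_gap days.

```r
# Hypothetical helper: return the start index of each valid window
find_windows <- function(dates, n_steps = 3, forecast_steps = 1,
                         minimum_gap = 4, maximum_gap = 10) {
  dates <- sort(as.Date(dates))
  gaps <- as.numeric(diff(dates))            # days between consecutive samples
  ok <- gaps >= minimum_gap & gaps <= maximum_gap
  span <- n_steps + forecast_steps           # samples needed per window (image + label)
  if (length(dates) < span) return(integer(0))
  starts <- seq_len(length(dates) - span + 1)
  # A window starting at sample i uses gaps i .. i + span - 2
  keep <- vapply(starts, function(i) all(ok[i:(i + span - 2)]), logical(1))
  starts[keep]
}

# Weekly samples with one 19-day break
dates <- as.Date("2021-06-01") + c(0, 7, 14, 21, 40, 47)
windows <- find_windows(dates)
windows
# 1
```

Only the window starting at the first sample survives, because every other candidate window spans the 19-day gap.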
multisample_weeks
A character defining the method used to pick a sample when a site has been visited multiple times in a week. Options are ‘first’, ‘last’, ‘minimum’, ‘maximum’.
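One way to picture the four multisample_weeks options is the sketch below. The helper `pick_weekly()`, the column names `date` and `total_toxicity`, and the use of total toxicity to decide ‘minimum’ and ‘maximum’ are all assumptions made for illustration.

```r
# Hypothetical helper: keep one sample per calendar week
pick_weekly <- function(samples,
                        method = c("first", "last", "minimum", "maximum")) {
  method <- match.arg(method)
  samples <- samples[order(samples$date), ]
  week <- format(samples$date, "%Y-%U")      # year plus week of year (weeks start on Sunday)
  pick <- function(rows) {
    switch(method,
           first   = rows[1],
           last    = rows[length(rows)],
           minimum = rows[which.min(samples$total_toxicity[rows])],
           maximum = rows[which.max(samples$total_toxicity[rows])])
  }
  idx <- vapply(split(seq_len(nrow(samples)), week), pick, integer(1))
  samples[sort(idx), ]
}

# Two visits in the same week, then one the following week
samples <- data.frame(
  date = as.Date(c("2021-06-07", "2021-06-09", "2021-06-15")),
  total_toxicity = c(12, 40, 5))

pick_weekly(samples, "last")$total_toxicity
# 40 5
```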
toxins
A character vector naming the toxins (or other variables) to keep from each sample.
model
balance_val_set
Logical indicating whether or not to balance the class distribution in the validation set. The fit() function can create the validation set by randomly sampling a fraction of the training set. Alternatively, the samples in the validation set can be specifically assigned. balance_val_set = TRUE will withhold the validation_split fraction from each of the possible classes.
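A per-class validation split of this kind could look roughly like the following. The helper `split_validation()` is hypothetical, and rounding the per-class count down with `floor()` is an assumption of this sketch.

```r
# Hypothetical helper: withhold the same fraction from every class
split_validation <- function(labels, validation_split = 0.2, seed = 1) {
  set.seed(seed)
  val_idx <- unlist(lapply(split(seq_along(labels), labels), function(rows) {
    n_val <- floor(length(rows) * validation_split)
    rows[sample.int(length(rows), n_val)]    # sample positions, then map to row indices
  }))
  sort(unname(val_idx))
}

# Imbalanced labels across four classes
labels <- c(rep(0, 50), rep(1, 20), rep(2, 10), rep(3, 5))
val <- split_validation(labels)
table(labels[val])
```

With these labels, the validation set holds 10, 4, 2 and 1 samples from classes 0 through 3, so each class contributes 20% of its members.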
downsample
Logical indicating whether or not to balance the class distribution in the training set. downsample=TRUE will randomly sample each of the other three classes down to the number of samples in the smallest class.
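The downsampling described here can be sketched in base R as follows. The helper `downsample_classes()` is hypothetical and stands in for whatever the package does internally.

```r
# Hypothetical helper: sample every class down to the size of the smallest one
downsample_classes <- function(labels, seed = 1) {
  set.seed(seed)
  n_min <- min(table(labels))                # size of the smallest class
  idx <- unlist(lapply(split(seq_along(labels), labels), function(rows) {
    rows[sample.int(length(rows), n_min)]
  }))
  sort(unname(idx))
}

labels <- c(rep(0, 50), rep(1, 20), rep(2, 8), rep(3, 5))
kept <- downsample_classes(labels)
table(labels[kept])
```

Every class ends up with 5 samples, the size of the smallest class in this example.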
use_class_weights
Logical indicating whether or not to use class weights during training.
dropout1
Double defining the dropout rate for the input layer.
dropout2
Double defining the dropout rate for the hidden layer.
batch_size
Integer defining the batch size.
units1
Integer defining the number of units in the input layer.
units2
Integer defining the number of units in the hidden layer.
epochs
Integer defining the number of training epochs.
validation_split
Double defining the portion of the training set to withhold for validation during training.
shuffle
Logical indicating whether or not to shuffle the training set before splitting off the validation set.
num_classes
Integer defining the number of classes.
optimizer
Character defining the optimizer.
loss_function
Character defining the loss function to use during training.
model_metrics
Character defining the metric to use during training.
train_test
Defines how to split the data being used for training and testing (or forecasting).
split_by
The package offers three options for splitting the train and test sets: “year_region_species”, “fraction” or “function”. If forecasting, assign split_by = "forecast_mode" and prediction samples will be made using the most recent samples in the test configuration (year, region, species).
split_by = "year_region_species"
If using year_region_species, all split fields (year, region, species) must have a value defined for both train and test.
year
One or multiple years as a character vector
region
One or multiple regions as a character vector
species
One or multiple species as a character vector
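A year_region_species split amounts to filtering samples on all three fields at once. The sketch below assumes a data frame with `year`, `region` and `species` columns; the helper `select_rows()` and the toy data are illustrative only.

```r
split_cfg <- list(
  train = list(year = c("2015", "2016"), region = c("maine"), species = c("mytilus")),
  test  = list(year = c("2014"),         region = c("maine"), species = c("mytilus")))

# Toy sample table (hypothetical column names)
samples <- data.frame(
  year    = c("2014", "2015", "2016", "2016"),
  region  = c("maine", "maine", "maine", "nh"),
  species = c("mytilus", "mytilus", "mytilus", "mytilus"))

# A row is selected only if it matches on year, region and species
select_rows <- function(samples, spec) {
  samples$year %in% spec$year &
    samples$region %in% spec$region &
    samples$species %in% spec$species
}

train <- samples[select_rows(samples, split_cfg$train), ]
test  <- samples[select_rows(samples, split_cfg$test), ]
```

Here the "nh" row falls into neither set because its region is not listed under train or test.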
split_by = "fraction"
test_fraction
The fraction of the data to use for testing. It must be provided when splitting the data using a fraction.
seed
A seed must be provided when testing on a fraction of the data.
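The role of the seed is to make the random test set reproducible. A minimal sketch, assuming the split is a simple random draw of row indices (the actual sampling scheme used by the package may differ):

```r
split_cfg <- list(split_by = "fraction", test_fraction = 0.2, seed = 123)

n <- 100                                     # number of available samples (illustrative)

# Hypothetical helper: draw the test indices after fixing the seed
draw_test_idx <- function(n, test_fraction, seed) {
  set.seed(seed)
  sort(sample.int(n, size = round(n * test_fraction)))
}

test_idx  <- draw_test_idx(n, split_cfg$test_fraction, split_cfg$seed)
train_idx <- setdiff(seq_len(n), test_idx)

# Repeating the draw with the same seed gives the same test set
identical(test_idx, draw_test_idx(n, split_cfg$test_fraction, split_cfg$seed))
# TRUE
```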