Configuration files

All of the parameters used by the forecast model are stored in a configuration, such as the following list:

```r
library(psptools)

cfg <- list(
  configuration="test",
  image_list = list(tox_levels = c(0,10,30,80),
                    forecast_steps = 1,
                    n_steps = 3,
                    minimum_gap = 4,
                    maximum_gap = 10,
                    multisample_weeks="last",
                    toxins = c("gtx4", "gtx1", "dcgtx3", "gtx5", "dcgtx2", "gtx3",
                               "gtx2", "neo", "dcstx", "stx", "c1", "c2")),
  model = list(balance_val_set=FALSE,
               downsample=FALSE,
               use_class_weights=FALSE,
               dropout1 = 0.3,
               dropout2 = 0.3,
               batch_size = 32,
               units1 = 32,
               units2 = 32,
               epochs = 128,
               validation_split = 0.2,
               shuffle = TRUE,
               num_classes = 4,
               optimizer="adam",
               loss_function="categorical_crossentropy",
               model_metrics=c("categorical_accuracy")),
  train_test = list(split_by="year_region_species",
                    train = list(
                      year = c("2015", "2016", "2017", "2018", "2019", "2020", "2021"),
                      region = c("maine"),
                      species = c("mytilus")),
                    test = list(
                      year = c("2014"),
                      region= c("maine"),
                      species = c("mytilus")))
)
```

Model configurations are list objects saved as .yaml files in ‘inst/configurations’. The configuration files can be read using read_config(filename).
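
As a minimal sketch of the list-to-.yaml round trip, assuming the `yaml` package is installed: a trimmed-down configuration list is written to a file and read back. How `read_config()` works internally is not shown here, so this only illustrates the general idea; the temporary file path and the name `cfg_small` are made up for the example.

```r
library(yaml)

# A trimmed-down configuration list (hypothetical, for illustration only)
cfg_small <- list(
  configuration = "test",
  image_list = list(tox_levels = c(0, 10, 30, 80),
                    n_steps = 3),
  model = list(epochs = 128,
               optimizer = "adam")
)

# Write the list to a temporary .yaml file, then read it back as a list
path <- tempfile(fileext = ".yaml")
write_yaml(cfg_small, path)
cfg_back <- read_yaml(path)

cfg_back$configuration      # "test"
cfg_back$model$optimizer    # "adam"
```

Nested list elements become nested YAML mappings, so the sections of the configuration keep their names after the round trip.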

`configuration`

The name of the configuration.

`image_list`

The section of the configuration file that defines how images will be mined out of the raw toxin measurement data.

`tox_levels`

A vector of integers that define the bins of total toxicity used to classify images.
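
One way such bins could be applied is sketched below with base R's `findInterval()`. The toxicity values are invented, and the treatment of the bin edges (lower bound included, top bin open-ended) is an assumption rather than the package's documented behaviour.

```r
tox_levels <- c(0, 10, 30, 80)

# Hypothetical total-toxicity values for seven samples
total_toxicity <- c(0, 5, 10, 29.9, 30, 80, 120)

# findInterval() returns 1-based bin indices; subtract 1 so classes run 0..3.
# Assumption: each bin includes its lower bound and the top bin is open-ended.
classes <- findInterval(total_toxicity, tox_levels) - 1
classes
# 0 0 1 1 2 3 3
```

With four levels this yields four classes, which matches `num_classes = 4` in the example configuration.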

`forecast_steps`

An integer defining how many steps ahead of the image the label is taken from (for training) or the prediction is made for (for forecasting).

`n_steps`

An integer defining the number of steps used to make an image.
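
The relationship between `n_steps` and `forecast_steps` can be sketched with plain indices. This assumes a series of consecutive samples at one site and a simple sliding window; the actual image-building code may differ.

```r
n_steps <- 3
forecast_steps <- 1

# Hypothetical series of six consecutive samples at one site
n_samples <- 6

# Index of the first sample in each image
starts <- seq_len(n_samples - n_steps - forecast_steps + 1)

# Each image covers n_steps samples; its label sits forecast_steps later
windows <- lapply(starts, function(s) {
  image_idx <- s:(s + n_steps - 1)
  list(image = image_idx,
       label = max(image_idx) + forecast_steps)
})

windows[[1]]$image   # 1 2 3
windows[[1]]$label   # 4
length(windows)      # 3
```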

`minimum_gap`

An integer defining the minimum number of days between samples in order to use them to make an image.

`maximum_gap`

An integer defining the maximum number of days between samples in order to use them to make an image.
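
A small sketch of how the two gap limits might be checked, using invented sampling dates. Treating both bounds as inclusive is an assumption.

```r
minimum_gap <- 4
maximum_gap <- 10

# Hypothetical sampling dates at one site
dates <- as.Date(c("2021-05-03", "2021-05-10", "2021-05-12",
                   "2021-05-19", "2021-06-02"))

# Days between each sample and the one before it
gaps <- as.numeric(diff(dates))
gaps
# 7 2 7 14

# Assumption: both bounds are inclusive
usable <- gaps >= minimum_gap & gaps <= maximum_gap
usable
# TRUE FALSE TRUE FALSE
```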

`multisample_weeks`

A character defining the method used to pick a sample when a site has been visited multiple times in a week. Options are ‘first’, ‘last’, ‘minimum’, ‘maximum’.
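
The four options could be implemented along the following lines. This is a base R sketch only: the function `pick_weekly_sample()`, the column names, and the week definition (year plus Sunday-start week number) are all hypothetical.

```r
# Hypothetical samples: site A was visited twice in the first week
samples <- data.frame(
  site = c("A", "A", "A"),
  date = as.Date(c("2021-05-03", "2021-05-06", "2021-05-10")),
  total_toxicity = c(12, 40, 8),
  stringsAsFactors = FALSE
)

pick_weekly_sample <- function(x, method = c("first", "last", "minimum", "maximum")) {
  method <- match.arg(method)
  # Assumption: weeks are labelled by year and week number (Sunday start)
  week <- format(x$date, "%Y-%U")
  groups <- split(x, list(x$site, week), drop = TRUE)
  # Keep exactly one row per site and week
  picked <- lapply(groups, function(g) {
    i <- switch(method,
                first   = which.min(as.numeric(g$date)),
                last    = which.max(as.numeric(g$date)),
                minimum = which.min(g$total_toxicity),
                maximum = which.max(g$total_toxicity))
    g[i, , drop = FALSE]
  })
  out <- do.call(rbind, picked)
  rownames(out) <- NULL
  out
}

pick_weekly_sample(samples, "last")$total_toxicity     # 40 8
pick_weekly_sample(samples, "minimum")$total_toxicity  # 12 8
```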

`toxins`

A character vector naming the toxins (or other variables) to keep from each sample.

`model`

The section of the configuration file that defines the model architecture and how it is trained.

`balance_val_set`

Logical indicating whether or not to balance the class distribution in the validation set. The fit() function accepts either a fraction of the training set to sample at random to create the validation set or an explicitly assigned validation set. balance_val_set = TRUE will withhold validation_split from each of the possible classes.
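
A rough base R sketch of a per-class hold-out, assuming invented class labels and rounding the per-class count to the nearest integer. It illustrates the idea, not the package's implementation.

```r
set.seed(1)
validation_split <- 0.2

# Hypothetical class labels for 100 training images (imbalanced)
labels <- rep(0:3, times = c(60, 20, 10, 10))

# Withhold validation_split of the indices from each class
val_idx <- unlist(lapply(split(seq_along(labels), labels), function(i) {
  n_val <- round(length(i) * validation_split)
  i[sample.int(length(i), n_val)]
}), use.names = FALSE)

table(labels[val_idx])
# counts per class: 12, 4, 2, 2
```

Every class contributes the same share of its samples, so rare classes are guaranteed to appear in the validation set.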

`downsample`

Logical indicating whether or not to balance the class distribution in the training set. downsample = TRUE will randomly sample, from each of the other classes, as many samples as there are in the smallest class.
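
Downsampling to the size of the smallest class might look like this in base R. The labels are invented and the code is a sketch under that assumption.

```r
set.seed(1)

# Hypothetical class labels for an imbalanced training set
labels <- rep(0:3, times = c(60, 20, 10, 5))

# Size of the smallest class
n_min <- min(table(labels))

# Draw n_min indices at random from every class
keep_idx <- unlist(lapply(split(seq_along(labels), labels), function(i) {
  i[sample.int(length(i), n_min)]
}), use.names = FALSE)

table(labels[keep_idx])
# 5 samples in each of the four classes
```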

`use_class_weights`

Logical indicating whether or not to use class weights during training.

`dropout1`

Double defining the dropout rate for the input layer.

`dropout2`

Double defining the dropout rate for the hidden layer.

`batch_size`

Integer defining the batch size.

`units1`

Integer defining the number of units in the input layer.

`units2`

Integer defining the number of units in the hidden layer.

`epochs`

Integer defining the number of training epochs.

`validation_split`

Double defining the portion of the training set to withhold for validation during training.

`shuffle`

Logical indicating whether or not to shuffle the training set before splitting off the validation set.

`num_classes`

Integer defining the number of classes.

`optimizer`

Character defining the optimizer.

`loss_function`

Character defining the loss function to use during training.

`model_metrics`

Character defining the metric to use during training.

`train_test`

Defines how to split the data being used for training and testing (or forecasting).

`split_by`

The package offers three options for splitting the train and test sets: “year_region_species”, “fraction” or “function”. When forecasting, set split_by = "forecast_mode" and prediction samples will be made using the most recent samples in the test configuration (year, region, species).

`split_by = "year_region_species"`

If using year_region_species, all split fields (year, region, species) must have a value defined for both train and test.

`year`

One or more years as a character vector.

`region`

One or more regions as a character vector.

`species`

One or more species as a character vector.
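
To make the selection concrete, the sketch below filters a small invented table of samples with the train and test fields from the example configuration. The data frame, its column names, and the helper `select_rows()` are hypothetical.

```r
# Hypothetical table of samples; column names are assumptions
samples <- data.frame(
  year    = c("2014", "2015", "2016", "2016", "2021"),
  region  = c("maine", "maine", "maine", "nh", "maine"),
  species = c("mytilus", "mytilus", "mya", "mytilus", "mytilus"),
  stringsAsFactors = FALSE
)

train_test <- list(
  split_by = "year_region_species",
  train = list(year = c("2015", "2016", "2017", "2018", "2019", "2020", "2021"),
               region = c("maine"),
               species = c("mytilus")),
  test = list(year = c("2014"),
              region = c("maine"),
              species = c("mytilus"))
)

# Keep rows whose year, region and species all appear in the selection
select_rows <- function(x, sel) {
  x[x$year %in% sel$year &
      x$region %in% sel$region &
      x$species %in% sel$species, , drop = FALSE]
}

train <- select_rows(samples, train_test$train)
test  <- select_rows(samples, train_test$test)

nrow(train)  # 2
nrow(test)   # 1
```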

`split_by = "fraction"`

`test_fraction`

The fraction of the data to withhold for testing. It must be provided when splitting the data by fraction.

`seed`

A seed must be provided when testing on a fraction of the data.
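
The role of the seed can be shown with a short base R sketch. The number of images and the seed value are invented, and the package may draw the test set differently.

```r
test_fraction <- 0.2
seed <- 123

# Hypothetical number of images available
n_images <- 50

# Draw the test indices at random; the rest are used for training
set.seed(seed)
test_idx <- sample.int(n_images, size = round(n_images * test_fraction))
train_idx <- setdiff(seq_len(n_images), test_idx)

length(test_idx)   # 10
length(train_idx)  # 40

# Re-using the seed reproduces the same split
set.seed(seed)
test_idx_again <- sample.int(n_images, size = round(n_images * test_fraction))
identical(test_idx, test_idx_again)
# TRUE
```

Fixing the seed means that two runs with the same configuration are evaluated on the same test images.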