Configuration files

All of the parameters used in the forecast model are stored in a configuration file.

library(psptools)
cfg <- list(
  configuration="test",
  image_list = list(tox_levels = c(0,10,30,80),
                    forecast_steps = 1,
                    n_steps = 3,
                    minimum_gap = 4,
                    maximum_gap = 10,
                    multisample_weeks="last",
                    toxins = c("gtx4", "gtx1", "dcgtx3", "gtx5", "dcgtx2", "gtx3", 
                               "gtx2", "neo", "dcstx", "stx", "c1", "c2")),
  model = list(balance_val_set=FALSE,
               downsample=FALSE,
               use_class_weights=FALSE,
               dropout1 = 0.3,
               dropout2 = 0.3,
               batch_size = 32, 
               units1 = 32, 
               units2 = 32, 
               epochs = 128, 
               validation_split = 0.2,
               shuffle = TRUE,
               num_classes = 4,
               optimizer="adam",
               loss_function="categorical_crossentropy",
               model_metrics=c("categorical_accuracy")),
  train_test = list(split_by="year",
                    train = c("2015", "2016", "2017", "2018", "2019", "2020", "2021"), 
                    test = c("2014"))
)

cfg
$configuration
[1] "test"

$image_list
$image_list$tox_levels
[1]  0 10 30 80

$image_list$forecast_steps
[1] 1

$image_list$n_steps
[1] 3

$image_list$minimum_gap
[1] 4

$image_list$maximum_gap
[1] 10

$image_list$multisample_weeks
[1] "last"

$image_list$toxins
 [1] "gtx4"   "gtx1"   "dcgtx3" "gtx5"   "dcgtx2" "gtx3"   "gtx2"   "neo"   
 [9] "dcstx"  "stx"    "c1"     "c2"    


$model
$model$balance_val_set
[1] FALSE

$model$downsample
[1] FALSE

$model$use_class_weights
[1] FALSE

$model$dropout1
[1] 0.3

$model$dropout2
[1] 0.3

$model$batch_size
[1] 32

$model$units1
[1] 32

$model$units2
[1] 32

$model$epochs
[1] 128

$model$validation_split
[1] 0.2

$model$shuffle
[1] TRUE

$model$num_classes
[1] 4

$model$optimizer
[1] "adam"

$model$loss_function
[1] "categorical_crossentropy"

$model$model_metrics
[1] "categorical_accuracy"


$train_test
$train_test$split_by
[1] "year"

$train_test$train
[1] "2015" "2016" "2017" "2018" "2019" "2020" "2021"

$train_test$test
[1] "2014"

Model configurations are .yaml files stored in ‘inst/configurations’. The configuration files can be read using read_config(filename).
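Because a configuration is an ordinary nested list once it is in memory, individual values can be reached with `$`, and basic consistency checks are easy to write. The sketch below uses a trimmed-down list with the same layout as `cfg`. The helper `check_config()` is hypothetical and not part of psptools, and the rule that `num_classes` equals the number of `tox_levels` is an assumption based on the example values.

```r
# A trimmed-down configuration with the same nested layout as `cfg` above.
cfg <- list(
  configuration = "test",
  image_list = list(tox_levels = c(0, 10, 30, 80),
                    minimum_gap = 4,
                    maximum_gap = 10),
  model = list(num_classes = 4, validation_split = 0.2),
  train_test = list(split_by = "year",
                    train = c("2015", "2016"),
                    test = c("2014"))
)

# Nested values are reached with `$`.
cfg$image_list$tox_levels
cfg$model$num_classes

# Hypothetical helper (not part of psptools): stops with an error if a
# basic consistency rule is violated, otherwise returns TRUE invisibly.
check_config <- function(cfg) {
  stopifnot(
    # Assumed rule: one class per toxicity bin.
    cfg$model$num_classes == length(cfg$image_list$tox_levels),
    cfg$image_list$minimum_gap <= cfg$image_list$maximum_gap,
    cfg$model$validation_split > 0,
    cfg$model$validation_split < 1,
    # Train and test years should not overlap.
    length(intersect(cfg$train_test$train, cfg$train_test$test)) == 0
  )
  invisible(TRUE)
}

check_config(cfg)
```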

configuration

The name of the configuration.

image_list

The section of the configuration file that defines how images will be mined out of the raw toxin measurement data.

tox_levels

A vector of integers that define the bins of total toxicity used to classify images.
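As an illustration of how such bins can turn a total toxicity value into a class, here is a minimal base R sketch. It assumes that each value in `tox_levels` is the inclusive lower edge of a bin and that classes are numbered from 0. The package may treat the boundaries differently.

```r
tox_levels <- c(0, 10, 30, 80)

# Hypothetical total toxicity measurements.
total_toxicity <- c(0, 4.2, 10, 29.9, 30, 79, 80, 150)

# findInterval() returns i such that tox_levels[i] <= x < tox_levels[i + 1];
# subtracting 1 gives zero-based class labels (an assumption of this sketch).
tox_class <- findInterval(total_toxicity, tox_levels) - 1
tox_class
```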

forecast_steps

An integer defining how many steps ahead of the image the label is taken from (for training) or the prediction is made for (for forecasting).

n_steps

An integer defining the number of steps used to make an image.
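The interaction between `n_steps` and `forecast_steps` can be sketched with a sliding window over the samples from one site. This is only an illustration of the idea, with made-up class values. It is not the package's image-building code.

```r
n_steps <- 3
forecast_steps <- 1

# Hypothetical toxicity classes for eight consecutive samples at one site.
sample_class <- c(0, 0, 1, 2, 3, 2, 1, 0)
n_samples <- length(sample_class)

# First index of every window that still has a label available.
starts <- seq_len(n_samples - n_steps - forecast_steps + 1)

windows <- lapply(starts, function(i) {
  image_idx <- i:(i + n_steps - 1)                # samples forming the image
  label_idx <- i + n_steps - 1 + forecast_steps   # sample supplying the label
  list(image = image_idx, label = sample_class[label_idx])
})

length(windows)
windows[[1]]
```

With eight samples, three steps per image and a one-step forecast, five labelled images can be formed.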

minimum_gap

An integer defining the minimum number of days between samples in order to use them to make an image.

maximum_gap

An integer defining the maximum number of days between samples in order to use them to make an image.
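A minimal sketch of the gap check follows, assuming both limits are inclusive and that gaps are measured between consecutive sample dates. The dates are invented for illustration.

```r
minimum_gap <- 4
maximum_gap <- 10

# Hypothetical sampling dates at one site.
sample_dates <- as.Date(c("2021-06-01", "2021-06-08", "2021-06-10",
                          "2021-06-17", "2021-07-01"))

# Days between consecutive samples.
gaps <- as.numeric(diff(sample_dates), units = "days")

# TRUE where a pair of consecutive samples is close enough, but not too
# close, to be used in the same image.
gap_ok <- gaps >= minimum_gap & gaps <= maximum_gap

data.frame(from = head(sample_dates, -1),
           to = tail(sample_dates, -1),
           gaps, gap_ok)
```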

multisample_weeks

A character defining the method used to pick a sample when a site has been visited multiple times in a week. Options are ‘first’, ‘last’, ‘minimum’, ‘maximum’.
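The four options can be sketched in base R as follows. The helper `pick_weekly()` is hypothetical. It assumes that ‘minimum’ and ‘maximum’ refer to total toxicity and that weeks start on Sunday; the package may define both differently.

```r
# Hypothetical samples from one site; the first three fall in the same week.
samples <- data.frame(
  date = as.Date(c("2021-06-07", "2021-06-09", "2021-06-11", "2021-06-15")),
  total_toxicity = c(12, 35, 20, 8)
)

# Hypothetical helper: keep one sample per week according to `method`.
pick_weekly <- function(samples,
                        method = c("first", "last", "minimum", "maximum")) {
  method <- match.arg(method)
  week <- format(samples$date, "%Y-%U")  # year plus Sunday-based week number
  picked <- lapply(split(samples, week), function(rows) {
    i <- switch(method,
      first   = which.min(as.numeric(rows$date)),
      last    = which.max(as.numeric(rows$date)),
      minimum = which.min(rows$total_toxicity),
      maximum = which.max(rows$total_toxicity)
    )
    rows[i, ]
  })
  do.call(rbind, picked)
}

pick_weekly(samples, "last")
pick_weekly(samples, "maximum")
```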

toxins

A character vector naming the toxins (or other variables) to keep from each sample.

model

balance_val_set

Logical indicating whether or not to balance the class distribution in the validation set. The fit() function can either randomly sample a fraction of the training set to create the validation set or accept validation samples that have been assigned explicitly. balance_val_set = TRUE withholds the validation_split fraction from each of the possible classes.
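Withholding the same fraction from every class amounts to a stratified split. The following base R sketch shows the idea with invented class counts. It is not the package's implementation.

```r
set.seed(1)
validation_split <- 0.2

# Hypothetical class labels for 100 training images (unbalanced on purpose).
train_class <- rep(0:3, times = c(50, 30, 15, 5))

# Group image indices by class, then withhold the same fraction from each.
idx_by_class <- split(seq_along(train_class), train_class)
val_idx <- unlist(lapply(idx_by_class, function(idx) {
  n_val <- round(length(idx) * validation_split)
  idx[sample.int(length(idx), n_val)]
}), use.names = FALSE)

# Every class contributes 20% of its images to the validation set.
table(train_class[val_idx])
```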

downsample

Logical indicating whether or not to balance the class distribution in the training set. downsample = TRUE randomly samples each of the other classes down to the number of samples in the smallest class.
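Downsampling of this kind can be sketched in a few lines of base R. The class counts are invented, and the code only illustrates the technique.

```r
set.seed(1)

# Hypothetical class labels for 100 training images.
train_class <- rep(0:3, times = c(50, 30, 15, 5))

# Size of the smallest class.
n_min <- min(table(train_class))

# Randomly keep n_min images from every class.
keep_idx <- unlist(lapply(split(seq_along(train_class), train_class),
                          function(idx) idx[sample.int(length(idx), n_min)]),
                   use.names = FALSE)

# All classes now have the same number of images.
table(train_class[keep_idx])
```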

use_class_weights

Logical indicating whether or not to use class weights during training.

dropout1

Double defining the dropout rate for the input layer.

dropout2

Double defining the dropout rate for the hidden layer.

batch_size

Integer defining the batch size.

units1

Integer defining the number of units in the input layer.

units2

Integer defining the number of units in the hidden layer.

epochs

Integer defining the number of training epochs.

validation_split

Double defining the portion of the training set to withhold for validation during training.

shuffle

Logical indicating whether or not to shuffle the training set before splitting off the validation set.

num_classes

Integer defining the number of classes.

optimizer

Character defining the optimizer.

loss_function

Character defining the loss function to use during training.

model_metrics

Character defining the metric to use during training.

train_test

Defines how to split the data being used for training and testing (or forecasting). If the model is in forecast_mode, the test set should be set to FORECAST_IMAGE.

split_by

The package offers two options for splitting the train and test sets: “year” or “fraction”.

train

A character vector of the years to train the model on.

test

One or more years to test the model on, or FORECAST_IMAGE to predict using the last image that can be created at each site in the training set.

test_fraction

The fraction of the data to use for testing. It must be provided when split_by is “fraction”.

seed

A seed must be provided when testing on a fraction of the data.
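The two splitting options can be illustrated with a small base R sketch. The vector `image_year` and the index names are hypothetical. The second half shows why a seed matters for a fractional split: the same seed reproduces the same test set.

```r
# Hypothetical year of each image.
image_year <- c("2014", "2015", "2015", "2016", "2017", "2014", "2018", "2016")

# split_by = "year": membership in the train and test year vectors.
train_years <- c("2015", "2016", "2017", "2018")
test_years <- c("2014")
train_idx <- which(image_year %in% train_years)
test_idx <- which(image_year %in% test_years)

# split_by = "fraction": a seeded random draw of test_fraction of the images.
test_fraction <- 0.25
seed <- 123
n <- length(image_year)
set.seed(seed)
test_idx_frac <- sort(sample.int(n, size = round(n * test_fraction)))
train_idx_frac <- setdiff(seq_len(n), test_idx_frac)
```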