library(psptools)
Configuration files
All of the parameters used in the forecast model are stored in a configuration file.
cfg <- list(
  configuration = "test",
  image_list = list(tox_levels = c(0, 10, 30, 80),
                    forecast_steps = 1,
                    n_steps = 3,
                    minimum_gap = 4,
                    maximum_gap = 10,
                    multisample_weeks = "last",
                    toxins = c("gtx4", "gtx1", "dcgtx3", "gtx5", "dcgtx2", "gtx3",
                               "gtx2", "neo", "dcstx", "stx", "c1", "c2")),
  model = list(balance_val_set = FALSE,
               downsample = FALSE,
               use_class_weights = FALSE,
               dropout1 = 0.3,
               dropout2 = 0.3,
               batch_size = 32,
               units1 = 32,
               units2 = 32,
               epochs = 128,
               validation_split = 0.2,
               shuffle = TRUE,
               num_classes = 4,
               optimizer = "adam",
               loss_function = "categorical_crossentropy",
               model_metrics = c("categorical_accuracy")),
  train_test = list(split_by = "year",
                    train = c("2015", "2016", "2017", "2018", "2019", "2020", "2021"),
                    test = c("2014"))
)
cfg
$configuration
[1] "test"
$image_list
$image_list$tox_levels
[1] 0 10 30 80
$image_list$forecast_steps
[1] 1
$image_list$n_steps
[1] 3
$image_list$minimum_gap
[1] 4
$image_list$maximum_gap
[1] 10
$image_list$multisample_weeks
[1] "last"
$image_list$toxins
[1] "gtx4" "gtx1" "dcgtx3" "gtx5" "dcgtx2" "gtx3" "gtx2" "neo"
[9] "dcstx" "stx" "c1" "c2"
$model
$model$balance_val_set
[1] FALSE
$model$downsample
[1] FALSE
$model$use_class_weights
[1] FALSE
$model$dropout1
[1] 0.3
$model$dropout2
[1] 0.3
$model$batch_size
[1] 32
$model$units1
[1] 32
$model$units2
[1] 32
$model$epochs
[1] 128
$model$validation_split
[1] 0.2
$model$shuffle
[1] TRUE
$model$num_classes
[1] 4
$model$optimizer
[1] "adam"
$model$loss_function
[1] "categorical_crossentropy"
$model$model_metrics
[1] "categorical_accuracy"
$train_test
$train_test$split_by
[1] "year"
$train_test$train
[1] "2015" "2016" "2017" "2018" "2019" "2020" "2021"
$train_test$test
[1] "2014"
Model configurations are .yaml files stored in ‘inst/configurations’. The configuration files can be read using read_config(filename).
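To illustrate how a nested list like cfg maps onto a .yaml file, here is a minimal round-trip sketch. It assumes the separate yaml package is installed and uses its write_yaml() and read_yaml() functions; whether read_config() is built on that package is not stated here, and the file path is a temporary one created only for the example.

```r
# Assumption: the 'yaml' package is available; it is used here for illustration only
library(yaml)

# A trimmed-down configuration with the same nested structure as cfg
cfg_small <- list(
  configuration = "test",
  image_list = list(tox_levels = c(0, 10, 30, 80), n_steps = 3),
  model = list(epochs = 128, optimizer = "adam")
)

# Write the list to a temporary .yaml file, then read it back into a list
path <- tempfile(fileext = ".yaml")
write_yaml(cfg_small, path)
cfg_in <- read_yaml(path)

# Nested values are reached with the same $ syntax as before
cfg_in$model$optimizer
```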
configuration
The name of the configuration.
image_list
The section of the configuration file that defines how images will be mined out of the raw toxin measurement data.
tox_levels
A vector of integers that define the bins of total toxicity used to classify images.
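As a rough sketch of how bin edges like these could turn total toxicity into a class, the base R function findInterval() can be used. The toxicity values are made up, and treating each bin as closed on its lower edge is an assumption; the package may handle boundary values differently.

```r
# Bin edges taken from the configuration shown above
tox_levels <- c(0, 10, 30, 80)

# Hypothetical total toxicity values, one per sample
total_toxicity <- c(2.5, 10, 45, 120)

# findInterval() returns the 1-based index of the bin each value falls into;
# subtracting 1 gives zero-based class labels (0 to 3 for four bins)
tox_class <- findInterval(total_toxicity, tox_levels) - 1
tox_class
```

With four bin edges this yields four classes, which is consistent with num_classes = 4 in the model section.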
forecast_steps
An integer defining how many steps ahead of the image the label comes from (for training) and the predictions will be made for (for forecasting).
n_steps
An integer defining the number of steps used to make an image.
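The way forecast_steps and n_steps interact can be sketched as a sliding window over the samples at one site. This is an illustration under assumptions, not the package's implementation: the class sequence is made up, and each image is taken to be n_steps consecutive samples with its label drawn forecast_steps samples after the last one.

```r
n_steps <- 3
forecast_steps <- 1

# Hypothetical class labels for six consecutive samples at one site
classes <- c(0, 0, 1, 2, 3, 1)

# First row of every window that still has a label available after it
n_windows <- max(0, length(classes) - n_steps - forecast_steps + 1)
starts <- seq_len(n_windows)

windows <- lapply(starts, function(i) {
  image_rows <- i:(i + n_steps - 1)                 # samples that form the image
  label_row <- max(image_rows) + forecast_steps     # sample that supplies the label
  list(image_rows = image_rows, label_row = label_row, label = classes[label_row])
})

# Rows used by the first image and the label assigned to it
windows[[1]]
```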
minimum_gap
An integer defining the minimum number of days between samples in order to use them to make an image.
maximum_gap
An integer defining the maximum number of days between samples in order to use them to make an image.
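A minimal sketch of the gap check implied by minimum_gap and maximum_gap, using made-up sampling dates. Treating both bounds as inclusive is an assumption.

```r
minimum_gap <- 4
maximum_gap <- 10

# Hypothetical sampling dates for one site
dates <- as.Date(c("2019-06-03", "2019-06-10", "2019-06-12",
                   "2019-06-19", "2019-07-08"))

# Days elapsed between each sample and the one before it
gap_days <- as.numeric(diff(dates))

# TRUE where the gap falls inside the allowed range (bounds inclusive here)
gap_ok <- gap_days >= minimum_gap & gap_days <= maximum_gap

data.frame(from = head(dates, -1), to = tail(dates, -1), gap_days, gap_ok)
```

In this example the 2-day and 19-day gaps fall outside the range, so the samples on either side of them would not be combined into the same image.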
multisample_weeks
A character defining the method used to pick a sample when a site has been visited multiple times in a week. Options are ‘first’, ‘last’, ‘minimum’, ‘maximum’.
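The four options can be sketched with a small helper that reduces each week to a single sample. The function pick_sample(), the column names, and the use of "%U" week numbers are all hypothetical choices made for this illustration.

```r
# Hypothetical samples from one site; the first two visits fall in the same week
samples <- data.frame(
  date = as.Date(c("2019-06-03", "2019-06-05", "2019-06-11")),
  total_toxicity = c(12, 35, 8)
)
samples$week <- format(samples$date, "%Y-%U")

# Hypothetical helper: keep one row from the samples taken in a single week
pick_sample <- function(x, method = c("first", "last", "minimum", "maximum")) {
  method <- match.arg(method)
  x <- x[order(x$date), ]
  switch(method,
         first   = x[1, ],
         last    = x[nrow(x), ],
         minimum = x[which.min(x$total_toxicity), ],
         maximum = x[which.max(x$total_toxicity), ])
}

# Apply the rule week by week and stack the results
weekly <- do.call(rbind, lapply(split(samples, samples$week),
                                pick_sample, method = "last"))
weekly
```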
toxins
A character vector of the names of the toxins (or other variables) to keep from each sample.
model
balance_val_set
Logical indicating whether or not to balance the class distribution in the validation set. The fit() function accepts a fraction of the training set to randomly sample to create the validation set. Alternatively, the samples in the validation set can be specifically assigned. balance_val_set = TRUE will withhold the validation_split fraction from each of the possible classes.
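Withholding the same fraction from every class amounts to a stratified split. The sketch below shows the idea in base R with made-up labels and an arbitrary seed; it is not the package's own code.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
validation_split <- 0.2

# Hypothetical, imbalanced class labels for 100 training images
labels <- rep(0:3, times = c(60, 20, 10, 10))

# For each class, draw validation_split of that class's row indices
val_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * validation_split))]
}))

# Number of validation samples withheld from each class
table(labels[val_idx])
```

Each class contributes 20% of its own samples, so the validation set covers every class rather than depending on where a single random draw happens to land.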
downsample
Logical indicating whether or not to balance the class distribution in the training set. downsample = TRUE will randomly sample, from each of the other three classes, as many samples as there are in the least-represented class.
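A minimal sketch of this kind of downsampling, with made-up labels and an arbitrary seed:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical, imbalanced class labels for the training set
labels <- rep(0:3, times = c(60, 20, 10, 5))

# Size of the least-represented class
n_min <- min(table(labels))

# Keep n_min randomly chosen samples from every class
keep_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = n_min)]
}))

# Every class now has the same number of samples
table(labels[keep_idx])
```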
use_class_weights
Logical indicating whether or not to use class weights during training.
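How the weights are computed is not described here. One common choice, shown only as an assumption, is to weight each class inversely to its frequency so that rare classes count for more in the loss.

```r
# Hypothetical, imbalanced class labels for the training set
labels <- rep(0:3, times = c(60, 20, 10, 10))

# Number of samples in each class
counts <- as.vector(table(labels))

# Inverse-frequency weights: total / (number of classes * class count)
weights <- length(labels) / (length(counts) * counts)

# A named list keyed by class label, a format often used for class weights
class_weight <- setNames(as.list(weights), 0:3)
class_weight
```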
dropout1
Double defining the dropout rate for the input layer.
dropout2
Double defining the dropout rate for the hidden layer.
batch_size
Integer defining the batch size.
units1
Integer defining the number of units in the input layer.
units2
Integer defining the number of units in the hidden layer.
epochs
Integer defining the number of training epochs.
validation_split
Double defining the portion of the training set to withhold for validation during training.
shuffle
Logical indicating whether or not to shuffle the training set before splitting off the validation set.
num_classes
Integer defining the number of classes.
optimizer
Character defining the optimizer.
loss_function
Character defining the loss function to use during training.
model_metrics
Character defining the metric to use during training.
train_test
Defines how to split the data being used for training and testing (or forecasting). If the model is in forecast_mode, the test set should be set to FORECAST_IMAGE.
split_by
The package offers two options for splitting the train and test sets - “year” or “fraction”.
train
A list of years as character.
test
One or more years to test the model on, or FORECAST_IMAGE to predict using the last image that can be created at each site in the training set.
test_fraction
If splitting the data using a fraction, a fraction must be provided.
seed
A seed must be provided when testing on a fraction of the data.
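The two splitting options can be sketched as follows. The function split_images() and the image metadata are hypothetical and exist only for this illustration; the FORECAST_IMAGE case is not covered.

```r
# Hypothetical image metadata: one row per image with its sampling year
images <- data.frame(id = 1:10,
                     year = as.character(rep(2014:2018, each = 2)))

# Hypothetical helper that mirrors the train_test options described above
split_images <- function(images, split_by = c("year", "fraction"),
                         train = NULL, test = NULL,
                         test_fraction = NULL, seed = NULL) {
  split_by <- match.arg(split_by)
  if (split_by == "year") {
    is_train <- images$year %in% train
    is_test <- images$year %in% test
  } else {
    # Both a fraction and a seed are required for a reproducible random split
    stopifnot(!is.null(test_fraction), !is.null(seed))
    set.seed(seed)
    test_rows <- sample.int(nrow(images),
                            size = round(nrow(images) * test_fraction))
    is_test <- seq_len(nrow(images)) %in% test_rows
    is_train <- !is_test
  }
  list(train = images[is_train, ], test = images[is_test, ])
}

by_year <- split_images(images, "year", train = c("2015", "2016"), test = "2014")
by_frac <- split_images(images, "fraction", test_fraction = 0.2, seed = 42)
```

Splitting by year keeps whole seasons together, while the fraction option relies on the seed to make the random draw repeatable.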