Spartan: Expedited and Enriched Analyses Using Emulations & Ensembles

Kieran Alden, Jason Cosgrove


This vignette is a short guide on using thew new functionality in Spartan 3 to generate emulations of a simulator from a set of simulation results generated using latin-hypercube sampling, and from those emulations an ensemble. Once this is created, enriched analyses can be performed using Approximate Bayesian Computation and Multi Objective Evolutionary Computation algorithms. This vignette continues the description of the spartan techniques in the sensitivity and netlogo vignettes.

The case study

Our demonstration utilises data from our previously described agent-based lymphoid tissue development simulator (Patel et al., 2012; Alden et al., 2012). In this system two types of cells migrate into the developing gut and form patches of tissue at varying locations along the gut tract 72 hours later. These mature to form lymphoid organs that trigger an adaptive immune response to gut-based pathogens. Lab experimentation has revealed that cells behave in a different manner (in terms of their velocity and displacement over a 1 hour period) around an initial forming patch than elsewhere in the gut, for reasons not currently understood. To aid our understanding of tissue development, we adopted an agent-based simulation approach. In the example analyses demonstrated here, we are interested in two emergent cell behaviour responses: velocity and displacement, and how these are affected by the values assigned to simulation parameters that capture certain aspects of the biological system. These are captured at simulated hour twelve of tissue development: a time-point where we have experimental data which can be compared to simulation response. The simulator has six parameters for which there is uncertainty in parameter value, each previously described in depth (Alden et al., 2012).


The following are required to run the spartan methods described here:

Spartan Technique 6: Enriching Model Analysis Through Emulation

Using a surrogate tool in place of an original simulator, an emulator, can reduce resource requirements for model analysis and thus enrich understanding of how a model functions and relates back to the problem domain. Such an emulator would take as input a parameter value set and predict the simulation response, e.g. the median simulation response achieved over a high number of replicates. The saving in simulation run time, especially when performing complex sensitivity analyses such as those described in the techniques above, could have a significant impact on a research project. If the accuracy can be quantified and understood, the emulator provides a means of not only performing complex statistical analyses, but may also guide other experiments that could require the original simulation to be reparameterised.

To create an intial dataset for emulator development, use the Latin-Hypercube sampling function described in Technique 3. This dataset should then be partitioned. For any machine learning algorithm it is vital that the dataset being used is partitioned into training, test, and validation sets. This method does below just that. The default split is 75,15,10 respectively, however this can be altered as required. It is also vital for the techniques in this package that the data is normalised such that all data is between 0 and 1. This method also permits data normalisation prior to data partitioning, if normaliseFirst is set to TRUE. Where the parameter data is to be normalised, it is vital that the minimum and maximum values used in sampling for each parameter are provided: this cannot come from the minimum and maximum of the sample as this may not then be representive of the intended parameter space. One object is returned containing training, testing, and validation sets, and the specified minimum and maximum values used in normalisation to ensure normalised predictions generated by an emulator can be rescaled correctly.

# Read in the dataset. For ease of this tutorial, an exemplar set has been provided in the package,
# Simulation parameters
# Output measures
# Mins and max values used in sampling
sampleMaxes <- cbind(100,0.9,0.5,0.08,1,5)
sampleMins <-cbind(0,0.1,0.1,0.015,0.1,0.25)

## Partition the dataset, in this case normalising the data
partitionedData <- partition_dataset(sim_data_for_emulation, parameters, percent_train=75, percent_test=15, 
    percent_validation=10, normalise=TRUE, sample_mins = sampleMins, sample_maxes = sampleMaxes)
# If you want to specify a seed to use when randomly paritioning the data, add seed=[desired seed],
# eg seed=10

The visualise_data_distribution method can be used to provide diagnostic information for the dataset on which the emulators are being trained. A plot is produced that shows any skew in the data, with kurtosis statistics, to indicate any issues this could cause in emulation training. For example a positive skewed dataset may mean that simulation measures on the upper end of the scale may be less well predicted by the emulation techniques included in spartan. The method takes a dataset of a single simulation response as input and produces a graph in the working directory.

# If you wish, you can specify the number of bins to use in the plot. A default of 30 is used if not specified
visualise_data_distribution(partitionedData$training,"Displacement","Displacement_Diagnostic", num_bins = 10)

The training data is then used to aid supervised learning of simulation outputs by a number of machine learning algorithms. Currently spartan includes implementations of a neural network, random forest, support vector machine, general linear model, and gaussian process maps. Once these models have been trained, accuracy of prediction is assessed using the testing and validation subsets. Spartan includes methods to generate output and diagnostic plots to ease assessment of emulation performance.

Firstly, you may need to examine the method emulation_algorithm_settings. Some of the machine learning algorithms incorporated have their own specific settings (neural network and random forest, for example). To keep the methods generic to multiple algorithms and to save passing multiple objects around, these are declared in this function and returned as an algorithmSettings object. Where the user wishes to use the default values, there is no need to run this function: this will be created during emulator generation. However, should the user wish to change any of the settings, such as the number of generations in the neural network or the number of trees in random forest, they should run this method with the new values. In addition, this object will also contain two other settings: whether graphs should be plotted of the accuracy of the emulator against the training and test sets, and whether the emulator object that is created should be stored in the working directory. Parameters this object stores are detailed in the arguments section. However, for neural network emulation, the user is required to initialise this object with a list of neural network hidden layer structures to evaluate (see example). Should this not be done, an error message will be produced and the call will terminate.

# Set the algorithm settings. Let's say in this case we're generating a neural 
# network, so we need to feed in some potential structures to examine
# So change the default in the algorithm settings

The generate_requested_emulations method generates an emulator model from a training set for a specified technique, and generates performance statistics from the test set. The currently implemented techniques are a neural network (using the neuralnet package), a random forest (from the randomforest package), a support vector machine (from package e1071), a gaussian process model (from package mlegp), and a general linear model. Where a neural network is desired, the hyper-parameters are determined using k-fold cross validation from a set of specified network structures. Where a simulation has multiple outputs, an emulator model is created for each output response. This method provides capacity to save the generated emulator models to file, in Rda format, and plot a comparison of the predicted responses to a set of those of the training and test sets, giving correlation of determination (R-squared) and mean squared error values. The method returns a list of emulators of a specified technique, one for each simulation output, and the performance statistics for each measure, including the time taken to generate these emulators. If the training data has been normalised, minimum and maximum sampling values for each parameter are also returned such that any predictions generated using this emulation can be rescaled correctly. If plots are desired (by setting a flag in emulation_algorithm settings), plots produced are stored as PDF’s in the working directory. The same applies to saving the generated emulator, set by the saveEmulation flag in emulation_algorithm_settings. Note that it must be specified as to whether the data being provided in partitionedData has been normalised or not: this affects the output of the plots (as the data is rescaled back to its original scale if the data was normalised). Similarly to the rest of spartan, this method can create emulations for multiple timepoints.

# Run the algorithms, here we've given an example for all model types:
glm_fit <- generate_requested_emulations(c("GLM"),partitionedData, parameters,measures, normalised=TRUE)
svm_fit <- generate_requested_emulations(c("SVM"),partitionedData, parameters,measures, normalised=TRUE)
rf_fit <- generate_requested_emulations(c("RF"),partitionedData, parameters,measures, normalised=TRUE)
nn_fit <- generate_requested_emulations(c("NNET"),partitionedData, parameters,measures,algorithmSettings, normalised=TRUE)
gp_fit <- generate_requested_emulations(c("GP"),partitionedData, parameters,measures, normalised=TRUE)

The generate_predictions_from_emulator method generates predictions on unseen data. This method is called with the emulation object, parameters, meaasures, and unseen data. A flag should also be set as to whether the unseen data, and thus the generated prediction, need to be normalised and rescaled accordingly. Unseen data being input into the emulator must be scaled between 0 and 1, with predictions rescaled after generation. If using the partitioned data object created earlier, the validation set will have already been normalised.

# Generate some predictions for unseen data, let's use the validation set
# This has already been normalised:
predictions <- emulator_predictions(rf_fit, parameters, measures,   partitionedData$validation, normalise=FALSE, normalise_result=TRUE)

With these emulators generated, Technique 7 can be used to generate combinations of these machine learning algorithms, to generate what is more commonly known as an ensemble model.

Spartan Technique 7: Ensemble Generation

Technique six detailed the methods through which a series of emulators could be created from a simulation training set. Each emulator development method has the potential to perform very differently for the same dataset. As such, generating a prediction from a combination of emulators, rather than a single emulation alone, may provide an increase in performance. Technique 7 provides a means of creating this combination, an ensemble. This can either be completed from emulators that have been previously developed using Technique 6, or emulations can be created from scratch. In each case, the emulators performance over a dataset is assessed and the emulator weighted, to ensure more accurate emulators are given greater consideration when these are integrated into one ensemble.

This guide will detail the generation of an ensemble from previously generated emulators first, then show how this can be done from scratch in one go.

Where emulations have already been created (see the examples in Technique 6), generate_ensemble_from_existing_emulations combines these to form one ensemble. This takes as input a list of the emulator objects, the simulation parameters and output response labels, and a set of test data from which the performance weights will be evolved. We would recommend providing the testing set of the output from the partition_dataset method of Technique 6. An option is given, by setting these within emulation_algorithm_settings as in Technique 6, to save the ensemble object to file, as well as produce plots showing the accuracy of the generated ensemble for the test data set.

# Parameters, measures, partitionedData, and algorithmSettings as declared in Technique 6.
# List of the existing emulations created using Technique 6
existing_simulations <- list(glm_fit,svm_fit,rf_fit,nn_fit,gp_fit)
# Generate an ensemble from these
# Note the training on the test set, as the original emulators were trained using the partitioned data training set
generated_ensemble <- generate_ensemble_from_existing_emulations(existing_simulations,parameters, measures, partitionedData$testing,algorithm_settings=algorithmSettings, normalise=TRUE, timepoint=NULL)

The method generate_emulators_and_ensemble generates all emulators then combines these into one ensemble. This takes as input a list of the emulation objects to create (could be random forest, support vector machine, neural network, general linear model, and gaussian process model), the simulation parameters and output response labels, an object created by the partitioned_dataset method of Technique 6 (training, testing, and validation datasets), and an object created by method emulation_algorithm_settings. The latter sets key arguments used in emulation creation, and is detailed in Technique 6.

# List of model types to include:
modelList <- c("SVM","RF","GLM","NNET","GP")
# Simulation dataset, again using the provided data
# Partition the dataset (mins, maxes as declared in Technique 6)
partitionedData <- partition_dataset(sim_data_for_emulation, parameters, percent_train=75, percent_test=15, 
    percent_validation=10, normalise=TRUE, sample_mins = sampleMins, sample_maxes = sampleMaxes)
# Make the ensemble, specifying that the data above has been normalised. Parameters and measures as declared in Technique 6
# Recall you must have created an algorithmSettings object with the neural network structures to examine
generated_ensemble<-generate_emulators_and_ensemble(modelList, parameters, measures, partitionedData, 
                                                    algorithm_settings = algorithmSettings, normalised=TRUE)

With the ensemble generated, a final method is provided to make predictions using the new tool, use_ensemble_to_generate_predictions. Provide this method with the ensemble object, a set of unseen data (parameter values as rows), the simulation parameter and output response labels, and the predicted outputs for that parameter set will be generated. A flag is also set as to whether the unseen data needs to be normalised. Note that any unseen data must be scaled between 0 and 1 for the ensemble to generate a reliable prediction. If using the data partitioning method and using the validation data to make predictions, as below, the data will have already been normalised (so normalise_values is FALSE), yet the predictions should be translated back into their original scale (so normalise_result is TRUE).

# params is a vector of unseen parameter values for which a prediction is being sought.
predictions <- use_ensemble_to_generate_predictions(generated_ensemble,params,
    parameters,measures,normalise_values = FALSE, normalise_result = TRUE)

Spartan Technique 8: Spartan Analysis with Ensemble

Technique 6 detailed how a number of emulators could be constructed that may accurately predict a set of responses of a simulation, with Technique 7 detailing how these were integrated and their performance weighted in order to construct an ensemble. This provides a tool that can be used in place of the original simulation in performing a sensitivity analysis, using the techniques detailed in Techniques 3 and 4. This part of the manual demonstrates how this process could be undertaken, from sampling through running the ensemble and statistical analysis. Note that it is not possible to emulate the results of Technique 2, which require a distribution of results for each parameter set and not a single value prediction.

There are two additional methods that have been added to spartan 3.0 for this process. Both methods returns simulation output predictions in the required format to be analysed by Techniques 3 and 4 of this package.

emulate_lhc_sampled_parameters provides a wrapper to run the ensemble for latin-hypercube generated parameter sets in spartan Technique 3.

#### The examples below build on those detailed in Techniques 3-4, and 
#### as such that detail is referred to here rather than repeated.

#### Step 1: Use Spartan Technique 3 to generate samples, i.e:
#  Filepath: where the produced parameter sample file should be stored
filepath <- getwd()
# Declare parameters
parameters <- c("stableBindProbability","chemokineExpressionThreshold",
num_samples <- 500
# Mins and Maxes
pmaxes <- cbind(100,0.9,0.5,0.08,1,5)
pmins <-cbind(0,0.1,0.1,0.015,0.1,0.25)
# Normal or optimal sampling. Recall from Technique 3 that optimal is
# time intensive
algorithm <- "normal"

# Sampling call

# However, for the below we do provide an example set of parameters 
# as an object that can be loaded in

#### Step 2: Run the wrapper functions detailed above to emulate the 
#### simulation, predicting the output for each parameter set
# Tutorial parameter value sets, constructed using spartan
# Here we assume that Technique 7 has been used to generate an ensemble,
# and this is read in from file. This will be read into the object
# "built_ensemble"
# We haven't put an exemplar ensemble in the package due to the file size 
# of the object, but this can be downloaded from the package website
# However if you're following along from the code above, you could 
# instantiate built_ensemble to be generated_ensemble from the previous
# technique

# Declare responses:
measure_scale <- c("microns","microns/min")

# Now generate predictions:
# You can do this two ways:
# 1: Provide an object containing all the parameter value sets (i.e. emulated_lhc_values), as below:
# If you have loaded the data from within the package in, use the line below and ignore method 2
emulate_lhc_sampled_parameters(filepath, built_ensemble, parameters, measures, measure_scale, dataset = emulated_lhc_values, normalise = TRUE)
# 2: Or, if you have generated a CSV file with all the values, give the name of that file (should be in the specified filepath)
parameterFileName <- "LHC_Parameters_for_Runs.csv"
emulate_lhc_sampled_parameters(filepath, ensemble, parameters, measures, measure_scale, param_file = parameterFileName)

# The above will calculate all partial rank correlation coefficiejnts and graph the results into the present working directory

# If you wanted, you could do an lhc for a single emulation object, not just an ensemble. 
# If you wanted to do that, say for the neural network object nn_fit created in previous steps in this vignette, you would do:
emulate_lhc_sampled_parameters(filepath, nn_fit, parameters, measures, measure_scale, dataset = emulated_lhc_values, 
                               ensemble_set = FALSE, normalise = TRUE)

emulate_efast_sampled_parameters provides a wrapper to run the ensemble for parameter sets generated for performing the extended fourier amplitude sampling test, spartan Technique 4.

# Number of curves for eFAST
numCurves <- 3
# Number of samples to take from each curve
numSamples <- 65
# Where samples are stored
filepath <- getwd()
# Recall we need to add a dummy parameter for statistical comparison
parameters <- c("stableBindProbability","chemokineExpressionThreshold",
        "adhesionFactorExpressionSlope", "dummy")
# Max of all parameters. Include a max value for dummy (1 here)
pmax <- cbind(100,0.9,0.5,0.08,1,5,1)
# Min of all parameters. Include a min value for dummy (0 here)
pmin <-cbind(0,0.1,0.1,0.015,0.1,0.25,0)

#### Step 1: Use Spartan Technique 4 to generate samples
efast_generate_sample(filepath, numCurves,numSamples,parameters, pmin, pmax)
# As this method produces several CSV files (one per parameter/curve), and it is these files which are 
# read in to make the predictions, these have not been included as an object that can be loaded in. 
# However an example set is available on the project website. Extract that into the working folder and then run the below.

# Make predictions and plot the results
emulate_efast_sampled_parameters(filepath, built_ensemble, parameters, measures, numCurves, normalise = TRUE)

# Analyse:
efast_run_Analysis(filepath,measures,parameters,numCurves, numSamples,1:2,0.95,TRUE,"eFAST_Analysis.csv")

# If you wanted to use a single emulator rather than the ensemble, say nn_fit created earlier, you would do:
emulate_efast_sampled_parameters(filepath, nn_fit, parameters, measures, numCurves, normalise = TRUE, ensemble_set = FALSE)

Spartan Technique 9: Using Ensemble for Approximate Bayesian Computation with EasyABC

The ensemble model created using Technique 7 enables the performance of complex statistical analyses that could not be performed on more intensive simulations being emulated. The application of Approximate Bayesian Computation, to deduce the posterior distribution of the parameters, is one such technique, and one that could provide an increased understanding of simulation behaviour. The EasyABC package in R permits the performance of ABC techniques, and is well supported and documented. In running any of the numerous techniques in EasyABC, one has to provide a model, in which generated parameter value sets are processed and output responses generated. In order to make the ensemble models produced by spartan compatible with EasyABC, we have provided a wrapper which acts as the model input to the EasyABC call, which in turn produces predictions using the ensemble, normalising the EasyABC generated parameter set first if need be then rescaling the predictions to their original scale (recall the ensemble must take data scaled between 0 and 1).

The EasyABC model function, that utilises the wrapper detailed here to run the ensemble, can only take one parameter input: the parameter values. This is problematic as to generate a prediction for those values, we must provide the names of the simulation parameters and measures, the built ensemble, and whether or not the parameter set and responses have been normalised. To get around that problem, this method creates an object in the working directory that contains these values, and the ensemble abc wrapper provided in spartan can then read these in. Thus, this method MUST be run before using the EasyABC package with the ensemble:

## Four examples here of applying an ensemble to perform the  sequential ABC methods in EasyABC
# Firstly make sure the parameters and measures are declared
parameters <- c("stableBindProbability","chemokineExpressionThreshold",
# Load in an ensemble generated in Technique 7, this loads in as built_ensemble
# Whether the parameter sets generated by the ABC algorithm should be normalised 
# for input into the ensemble
normalise_values = TRUE
# Whether the predictions for the parameter set should be normalised
normalise_result = TRUE

# Establish the priors for each parameter

# Summary statistics to be targetted

# Create the file that will be read in by the wrapper:
create_abc_settings_object(parameters, measures, built_ensemble, normalise_values, 

The ensemble_abc_wrapper then provides a means of running the ensemble within the EasyABC methods. This method should be stated as the “model” argument of EasyABC methods such as ABC_sequential. Should the method above not have been run first, an error message will be produced. Here are a variety of examples of performing ABC analysis using the wrapper.

Finally, graph_Posteriors_All_Parameters provides a means of plotting the produced posterior distribution for all parameters for which the value is being explored.

# Beaumont method using spartan ensemble wrapper
    model=ensemble_abc_wrapper, prior=prior, 
    tolerance_tab=tolerance, verbose=TRUE)

# Graph the result
# Ranges:
sampleMins <- c(0,0.1,0.1,0.015,0.1,0.25)
sampleMaxes<- c(100,0.9,0.5,0.015,0.08,1.0,5.0)
    parameters, sampleMins, sampleMaxes)

# Delmoral method using spartan ensemble wrapper
decreaseSampleSizeEachStep <- 0.5
numberSimsPerformedPerParticle <- 1
minSampleSizeForResampling <- initialSims/2 
finalToleranceLevel <- 0.05

    model=ensemble_abc_wrapper, prior=prior, nb_simul=initialSims, 

# Graph the result
    parameters, sampleMins, sampleMaxes)

# Drovandi method:
finalToleranceLevel = 0.05
    alpha = proportionBestFitToUpdate, 
    nb_simul = initialSims, 

# Graph the result
    parameters, sampleMins, sampleMaxes)

Spartan Technique 10: Optimising Emulator Outputs with NSGA2

To determine emulator inputs that correspond to a desired output configuration we use the non-dominated sorting genetic algorithm II (nsgaII), a multiobjective genetic algorithm. In this scheme a solution is called nondominated, Pareto optimal, Pareto efficient or noninferior, if none of the objective functions can be improved in value without degrading some of the other objective values.

If the Pareto front comprises more members than the population size, a subset composed of those Pareto members having the largest fitness differences between their immediate neighbours summed for all objectives is selected. If the Pareto front comprises fewer members than the population size then members of the next front (those dominated by only one other solution) are selected in the same manner, and so on until the entire population has been selected. New solutions are generated through crossover of parents with mutation. Each candidate solution is assessed by a user defined fitness function, which nsga2 seeks to minimise

There are four methods to this process:

nsga2_set_user_params: Creates an object of the analysis parameters that will be used to evolve parameter sets or screen parameters for NSGA-2. The user should ensure this is called first, establishing this object such that it can be passed in to the relevant method.

set.nsga_sensitivity_params: Establish the parameters for the NSGA-2 sensitivity analysis, creating an object that is used within the method that screens NSGA-2 parameters.

screen_nsga2_parameters: This method performs a sensitvity analysis of key settings for the nsga2 algorithm. Different values of generation number, crossover and mutation rate are assessed and the values for each objective, along with the variance of the parameter inputs are written out to file in .csv format so the user can assess which settings are best suited to the chosen application. Values for the crossover and mutation distribution indices,used in simulated binary crossover are left at their default settings, but can be overwritten when running the emulator_parameter_evolution method.

emulator_parameter_evolution: This method takes a user specified fitness function and runs the nsga2 algorithm on an emulator using the nsga2 implementation provided in the mco package. The method requires the user to input parameter and algorithm settings as shown in the example below. The method outputs a list describing the final population in terms of the input parameters (par), the values for each objective (res), an evolved set of parameter inputs (par), and a boolean stating whether the candidate is pareto optimal (pareto.optimal)

# Declare a fitness function in a R script. Fitness function below can be downloaded from the website
fitness_function <- "eval_function.R"
# Read in the fitness function
# Load a ensemble generated by Technique 7
# The target values for objectives that the ngsa2 should try to produce using the emulator (minimising the error between prediction and these observed values).
desiredResponses <- c(4.3,24)
# Measures, parameters, sampleMins, sampleMaxes as those declared previously above
# Set the parameters and settings
sensitivtyParams <- set.nsga_sensitivity_params(generation_min=100, crossover_min=0.1,mutation_min=0.1, 
                                                generation_max=500,crossover_max=1.0, mutation_max=1.0, seed=500)

nsga2_params <- nsga2_set_user_params(built_ensemble, parameters, measures, desiredResponses, sampleMins, sampleMaxes)

nsga2_settings <- list("popsize"=100,"generations" = 100, "cprob" = 0.3, "mprob" = 0.1)

# Screen the parameters using nsga2
parameter_sensitivities <- screen_nsga2_parameters(evalfunction, nsga2_params, sensitivityParams, nsga2_settings)

# Evolve parameters
emulator_parameter_evolution(fitness_function, nsga2_params, nsga2_settings)