Multi Objective Optimization tutorial

This tutorial is an in depth example of the use of this package in the context of an evolutionary optimization approach. It also shows the use of gene_eval to calculate a reaction speed from gene expressions.

First, we need to load the appropriate libraries. You may also need to load your optimization library of choice.

library(dplyr)
library(purrr)
library(stringr)
library(fbar)

This code block loads a model, then extracts the list of genes from the model. The model takes the form of a tabular list of reactions.

data("ecoli_core")
model <- ecoli_core

genes_in_model <- model$geneAssociation %>%
  str_split('[()|& ]+') %>%
  flatten_chr() %>%
  discard(is.na) %>%
  discard(~ str_length(.x)==0)

The evaluation function is where the actual metabolic simulations are performed. This has four main stages:

The technique of fixing the biomass followed by maximizing the synthetic objective is important because there could still be slack in the model after the first optimization stage, and we wish to have a reliable synthetic objective estimate.

evaluation_function <- function(genome){

  res <- model %>%
    mutate(activation = gene_eval(geneAssociation, names(genome), genome),
           activation = coalesce(activation, 1),
           uppbnd = uppbnd*activation,
           lowbnd = lowbnd*activation) %>%
    find_fluxes_df(do_minimization = FALSE) %>%
    mutate(lowbnd = ifelse(abbreviation=='Biomass_Ecoli_core_w/GAM', flux*0.99, lowbnd),
           uppbnd = ifelse(abbreviation=='Biomass_Ecoli_core_w/GAM', flux*1.01, uppbnd),
           obj_coef = 1*(abbreviation=='EX_ac(e)')) %>%
    find_fluxes_df(do_minimization = FALSE)

  return(list(bm = filter(res, abbreviation=='Biomass_Ecoli_core_w/GAM')$flux, 
              synth = filter(res, abbreviation=='EX_ac(e)')$flux))
}

Non-domination sorting is the first stage of the selection procedure in NSGA-II. The code might be quite opaque, but the idea is as follows:

non_dom_sort <- function(input){
  input_long <- input %>%
    gather(property, value, -id) %>%
    mutate(front=NA)

  currentfront <- 1

  while(any(is.na(input_long$front))){

    input_long <- input_long %>%
      inner_join(.,., by='property') %>%
      group_by(id.x,id.y) %>%
      mutate(dominance = ifelse(all(value.x>=value.y), 
                                'xdomy', 
                                ifelse(all(value.y>=value.x), 
                                       'ydomx', 
                                       'nondom'
                                       )
                                )
      ) %>%
      group_by(id.x) %>%
      mutate(front = ifelse(all(dominance[is.na(front.y)] %in% c('xdomy', 'nondom')), 
                            pmin(currentfront, front.x, na.rm=TRUE), 
                            NA
                            )
      ) %>%
      group_by(id = id.x, property = property, front, value = value.x) %>%
      summarise()

    currentfront <- currentfront + 1
  }

  return(
    input_long %>%
      spread(property, value)
  )

}

The second part of the NSGA-II evaluation procedure is finding the crowding distance. This is used to break ties between points in the same non-dominated front. In for each front, for each dimension, this function sorts the points into order along the dimension, and finds the normalized distance between the proceeding point and succeeding point. These values are summed up across each dimension to find the value for the point.

crowding_distance <- function(input){
  return(
    input %>%
      gather(property, value, -id, -front) %>%
      group_by(front, property) %>%
      arrange(value) %>%
      mutate(crowding = (lead(value)-lag(value))/(max(value)-min(value)),
             crowding = ifelse(is.na(crowding),Inf, crowding)) %>%
      group_by(id) %>%
      mutate(crowding = sum(crowding)) %>%
      spread(property, value)
  )
}

This is the genetic loop of the algorithm. It is explained by code comments, but follows a normal pattern of evaluating, sorting, selecting from and mutating the population.

start_genome <- set_names(rep_along(genes_in_model, TRUE), genes_in_model)
pop <- list(start_genome)

popsize = 50

for(i in 1:50){
  results <- map_df(pop, evaluation_function) %>% # Evaluate all the genomes
    mutate(bm=signif(bm), synth=signif(synth)) %>% # Round results
    unique() %>% # Throw away duplicates
    mutate(id = 1:n()) %>% # label the results
    sample_frac() %>% # Shuffle
    non_dom_sort() %>% # Find the non-dominated fronts
    crowding_distance %>% # Find the crowding distances
    arrange(front, desc(crowding)) # Sort by front, breaking ties by crowding distance

  selected <- results %>%
    filter(row_number() <= popsize/2) %>% # Keep the best half of the population
    getElement('id')

  kept_pop <- pop[selected]
  altered_pop <- kept_pop %>%
    sample(popsize-length(selected), TRUE) %>% # Select a random portion of the population as parents
    map(function(genome){xor(genome, runif(length(genome))>0.98)}) # Mutate a small number of genes from the parent population.

  pop <- unique(c(kept_pop, altered_pop)) # Combine the ofspring and parent populations
}

Now that we have results, the set of all non-dominated points, known as the Pareto Front. This describes the tradeoff between biomass production and our synthetic objective.

library(ggplot2)

results %>%
  filter(front==1) %>%
  ggplot(aes(x=bm, y=synth, colour=front)) + 
  geom_point() + 
  geom_step(direction='vh') +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw()