# Backtesting with strand

## Introduction

Note: this document assumes familiarity with the notion of investment strategy backtesting. For an introduction, see the article Backtests in R-News Volume 7/1.

Evaluating an investment strategy is a multi-step process. Often the first step is to scrutinize a strategy’s underlying signal, or alpha, by running a top-bottom quartile spread analysis using a tool like the R package backtest. A quartile analysis gives a good idea as to whether the signal is predictive of future returns. However, the analysis ignores many real-world aspects of strategy implementation and trading. For example, a spread analysis assumes that we can trade immediately, regardless of actual liquidity. As a result, it is difficult to learn from a spread analysis how the performance of a strategy degrades as investment capital and portfolio size increase. A more sophisticated backtest attempts to mimic how the strategy would be implemented in practice. This includes using optimization-based portfolio construction and trade selection, and making conservative assumptions about what trades could actually have been made in the market.

The strand package provides a framework for running this more realistic type of backtest. Once a strategy is defined in terms of its alpha, risk constraints, and position and turnover limits, the system simulates how the strategy would be operated day-by-day, including daily order generation and realistic trade filling.

The purpose of this vignette is to describe how to set up and run a strand simulation.

## System overview

The strand system is meant to mimic a daily professional-level portfolio management process. The process involves the following steps:

• Prepare input data.
• Specify the strategy:
  • The universe of stocks in which to invest.
  • The level of capital to which we want to trade.
  • The alpha to which we want to maximize exposure.
  • Any exposure constraints we want to impose, e.g., that the portfolio’s exposure to any one sector must be within the range +/- 5%, or that the portfolio’s exposure to a numeric factor like beta must fall within the range +/- 1%.
  • Any liquidity constraints on trading individual stocks, e.g., that we don’t want to submit an order for a stock that is greater than 5% of the volume we expect to see for that stock in the market.

### Position size constraints

Position sizes are controlled by the parameters position_limit_pct_lmv and position_limit_pct_smv:

strategies:
  strategy_1:
    position_limit_pct_lmv: 1
    position_limit_pct_smv: 1

These parameters express limits on the size of positions as a percentage of the target long and short market values for the portfolio (calculated in the previous example). In our example both are set to 1, which means that the maximum size for a long position is 1% times $1mm = $10,000, and the maximum size for a short position is 1% times -$1mm = -$10,000.
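The arithmetic above can be sketched in a few lines of base R. This is only an illustration: `target_lmv`, `target_smv`, and the two result variables are names made up for this example, not strand internals, and the $1mm target values are taken from the text.

```r
# Illustrative sketch only. Apart from the two position_limit_pct_*
# parameter names, every variable here is an assumption for this example.
target_lmv <- 1e6     # target long market value ($1mm)
target_smv <- -1e6    # target short market value (-$1mm)

position_limit_pct_lmv <- 1   # percent of target LMV
position_limit_pct_smv <- 1   # percent of target SMV

# Convert the percentage limits into dollar limits.
max_long_position  <- position_limit_pct_lmv / 100 * target_lmv
max_short_position <- position_limit_pct_smv / 100 * target_smv

print(c(long = max_long_position, short = max_short_position))
```

Under these assumptions the long limit works out to $10,000 and the short limit to -$10,000, matching the figures in the text.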

### Liquidity constraints

There are two ways in which the rc_vol measure in our inputs data is used to impose liquidity constraints on our portfolio construction process.

First, the position_limit_pct_adv parameter is used to impose a position size constraint in addition to the constraints discussed in the previous section. This parameter limits the position size, in absolute value, to a percentage of the rc_vol measure in the current day’s input data. In sample.yaml we have:

strategies:
  strategy_1:
    position_limit_pct_adv: 30

which means a position in our simulation can be no greater than 30% of our rc_vol value. For example, suppose the average volume measure for security ABC is $10M. This means that the liquidity constraint imposes a limit on the size of long positions in ABC of $3M and a limit on the size of short positions of -$3M.

Second, the trading_limit_pct_adv parameter limits the size of the order that can be generated for a stock. The idea is to size orders to be in line with the amount of trading expected in the market and any limitation we are planning on imposing on participation. It’s difficult to control exposures effectively if we don’t do this! In the example configuration file, we have:

strategies:
  strategy_1:
    trading_limit_pct_adv: 5

which means that we can buy at most 5% or sell at most 5% of a security’s rc_vol measure on a given day. Continuing our example above, if our measurement on a given day is that ABC is trading on average $10M per day, we can buy at most $500,000 or sell at most $500,000 on that day.
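As a rough illustration of the two liquidity limits, the following base R sketch reproduces the ABC example. The variable names other than `rc_vol` and the two `*_pct_adv` parameters are assumptions made for this example. The $10,000 value-based limit comes from the position size section, and combining the two limits with `min()` assumes that both apply at once, which is how the text describes them.

```r
# Illustrative sketch only; not strand code.
rc_vol <- 10e6                  # ABC's average volume measure ($10M)
position_limit_pct_adv <- 30    # percent of rc_vol
trading_limit_pct_adv  <- 5     # percent of rc_vol

# Largest position (in absolute value) allowed by the liquidity limit.
max_abs_position_adv <- position_limit_pct_adv / 100 * rc_vol

# Largest order (buy or sell) allowed on a given day.
max_order_size <- trading_limit_pct_adv / 100 * rc_vol

# Assumption: the value-based limit from the previous section ($10,000)
# also applies, so the tighter of the two limits binds.
max_abs_position_value <- 10000
effective_position_limit <- min(max_abs_position_adv, max_abs_position_value)

print(c(adv_limit = max_abs_position_adv,
        order_limit = max_order_size,
        effective = effective_position_limit))
```

For a liquid name like ABC the value-based limit is the binding one. For a thinly traded stock the liquidity limit would bind instead.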

### Factor constraints

Factor constraints limit the amount of exposure our optimized portfolio can have to a given numeric value. That is, we impose an upper and/or lower bound on the sum, across positions, of the product of each position weight and the numeric value. In the context of exposure constraints, a position weight means the signed market value of a position divided by the strategy’s strategy_capital value.

In this vignette’s example we impose factor constraints on size:

strategies:
  strategy_1:
    constraints:
      size:
        type: factor
        in_var: size
        upper_bound: 0.01
        lower_bound: -0.01

Recall that size must be a column present in the daily inputs data. Each constraint is configured in a separate entry in the constraints section that contains the following key/value pairs:

• type: in this case factor because we are constraining exposure to a numeric value.
• in_var: the name of the column in the input data that contains the numeric value.
• upper_bound: the upper bound on exposure to in_var.
• lower_bound: the lower bound on exposure to in_var.

In our example we are limiting exposure to size to be within +/-1%.
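To make the definition of factor exposure concrete, here is a minimal base R sketch. It uses a hypothetical four-stock portfolio. All market values and size values are invented for illustration, and the calculation is a reading of the definition given above rather than code taken from strand.

```r
# Hypothetical portfolio; every number below is made up for illustration.
strategy_capital <- 1e6
positions <- data.frame(
  id           = c("A", "B", "C", "D"),
  market_value = c(10000, 8000, -9000, -7000),  # signed market values
  size         = c(1.2, -0.5, 0.8, -1.1)        # factor values (in_var)
)

# Position weight: signed market value divided by strategy capital.
weights <- positions$market_value / strategy_capital

# Factor exposure: sum of weight times factor value across positions.
size_exposure <- sum(weights * positions$size)

lower_bound <- -0.01
upper_bound <- 0.01
within_bounds <- size_exposure >= lower_bound && size_exposure <= upper_bound

print(size_exposure)
print(within_bounds)
```

In this made-up portfolio the exposure to size is 0.0085, which sits inside the +/-1% range.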

### Category exposure constraints

Category exposure constraints are similar to factor exposure constraints. A category constraint imposes a limit on the exposure (i.e., the sum of the position weights) within each level of a category. In our example we have a single constraint on sector:

strategies:
  strategy_1:
    constraints:
      sector:
        type: category
        in_var: sector
        upper_bound: 0.02
        lower_bound: -0.02

Here, sector must be a column that appears in the security reference or the simulation’s input data. As with factor constraints, the category constraint is defined in its own entry in the constraints section of the configuration file for strategy_1 with the following key/value pairs:

• type: in this case category because we are constraining exposure to the levels of a category.
• in_var: the name of the column in the security reference that contains the category level for each security.
• upper_bound: the upper bound on exposure for each level of in_var.
• lower_bound: the lower bound on exposure for each level of in_var.

There are 11 levels in sector in our security reference:

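library(strand)
library(dplyr)

# Count the securities in each sector of the sample security reference
data(sample_secref)
sample_secref %>%
  group_by(sector) %>%
  summarise(count = n()) %>%
  print(n = Inf)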
#> # A tibble: 11 x 2
#>    sector                 count
#>    <chr>                  <int>
#>  1 Communication Services    24
#>  2 Consumer Discretionary    58
#>  3 Consumer Staples          31
#>  4 Energy                    26
#>  5 Financials                62
#>  6 Health Care               62
#>  7 Industrials               71
#>  8 Information Technology    71
#>  9 Materials                 28
#> 10 Real Estate               31
#> 11 Utilities                 28

The constraint above indicates that in our optimization we may have no more than +/-2% of exposure in any one of these levels.
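The per-level exposure calculation can be sketched in base R as follows. The positions and market values are hypothetical, and the check is an illustration of the definition above (the sum of position weights within each level), not strand’s own implementation.

```r
# Hypothetical positions; all numbers are made up for illustration.
strategy_capital <- 1e6
positions <- data.frame(
  sector       = c("Energy", "Energy", "Utilities", "Financials", "Financials"),
  market_value = c(10000, 8000, -9000, 10000, -10000)
)

# Position weight: signed market value divided by strategy capital.
weights <- positions$market_value / strategy_capital

# Exposure per sector: sum of position weights within each level.
sector_exposure <- tapply(weights, positions$sector, sum)

# Levels whose exposure falls outside the +/-2% bounds.
upper_bound <- 0.02
lower_bound <- -0.02
violations <- sector_exposure[sector_exposure > upper_bound |
                              sector_exposure < lower_bound]

print(sector_exposure)
print(length(violations))
```

Here Energy carries 1.8% of exposure, Financials nets to zero, and Utilities is at -0.9%, so no level violates the constraint.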

## Running the simulation

At this point we have covered all of the setup required to run the backtest. We have prepared our data, including the security reference, daily inputs, and market data. We have filled in the configuration file to specify the strategy and control different aspects of the simulator. We are ready to run the backtest.

The strand package is implemented using the R6 OOP system. To run the backtest, we create a Simulation object by passing the path to the yaml configuration file and the three data sets discussed above to the constructor. Then we call the method run():

data(sample_inputs)
data(sample_pricing)
data(sample_secref)

sim <- Simulation$new(config = "sample.yaml",
                      raw_input_data = sample_inputs,
                      raw_pricing_data = sample_pricing,
                      security_reference_data = sample_secref)
sim$run()

### Viewing summary statistics

When the backtest is finished, we can call methods to summarize and plot the results. For example, the overallStatsDf() method returns a data frame of key statistics:

sim$overallStatsDf()

#### Market values

The plotMarketValue() method plots gross market value (GMV), net market value (NMV), long market value (LMV) and short market value (SMV) over time:

sim$plotMarketValue()

#### Category exposures

The plotCategoryExposure() method shows the exposure over time within each level of a given category. Below we plot the exposure within the levels of sector, which in our backtest has an exposure constraint of +/-2%:

sim$plotCategoryExposure("sector")

Note that there are some cases where the exposure to a level of sector falls outside of +/-2% despite the category exposure constraint we impose during portfolio construction. This can be due to the following:

• Prices for securities in the level of the category are rising (in the case of an exposure that is too positive) or falling (in the case of an exposure that is too negative). The portfolio construction step uses prices as of the start of the day, while the plot above shows exposures at the end of the day. So even if constraints are within bounds using starting market values they could be out of bounds at the end of the day due to price movement.
• Lack of liquidity. The trades that we need to make to bring an exposure back within bounds could be left unfilled due to a lack of liquidity. Recall that we configured our backtest to only allow fills up to 4% of the number of shares traded in the market.
• Loosened constraints. It could be the case that no set of trades can be found to bring an exposure that has drifted back within bounds and that the constraint needed to be loosened.

Exploring these scenarios is possible by looking at lower-level backtest results but is outside the scope of this vignette.
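The first scenario above (price movement during the day) can be illustrated with a small base R sketch. The two positions and their one-day returns are invented for this example. The point is only that an exposure within bounds at start-of-day prices can end the day outside them.

```r
# Hypothetical positions in a single sector; all numbers are made up.
strategy_capital <- 1e6
start_mv   <- c(10000, 9500)   # start-of-day market values
day_return <- c(0.06, 0.05)    # one-day price returns

upper_bound <- 0.02

# Exposure as seen by portfolio construction (start-of-day prices).
start_exposure <- sum(start_mv) / strategy_capital

# Exposure as shown in the plot (end-of-day prices), assuming no trades.
end_exposure <- sum(start_mv * (1 + day_return)) / strategy_capital

print(c(start = start_exposure, end = end_exposure))
print(c(start_ok = start_exposure <= upper_bound,
        end_ok   = end_exposure <= upper_bound))
```

With these numbers the sector starts the day at 1.95% and drifts above the 2% bound by the close without any trading.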

#### Factor exposures

The plotFactorExposure() method shows the portfolio exposure over time to one or more factors. Below we plot the exposure to size, which in our backtest has an exposure constraint of +/-1%:

sim$plotFactorExposure(c("size"))

Here we can also see spikes of exposure outside the constrained range +/-1%. As discussed in the previous section, price movement, lack of liquidity and constraint loosening are possible explanations for these spikes in end-of-day exposure. In the case of factor constraints, another possible explanation is a significant day-over-day change in factor values. Again, exploring these scenarios is possible by diving more deeply into the backtest’s result data, but is outside the scope of this document.
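The effect of a day-over-day change in factor values can be sketched in the same way. In this hypothetical base R example the position weights are held fixed, meaning no trades are filled, while one stock’s size value changes. All numbers are invented for illustration.

```r
# Hypothetical position weights, unchanged from one day to the next.
weights <- c(0.01, 0.008, -0.009, -0.007)

# Factor values on two consecutive days; only the first stock changes.
size_yesterday <- c(1.2, -0.5, 0.8, -1.1)
size_today     <- c(1.6, -0.5, 0.8, -1.1)

upper_bound <- 0.01

exposure_yesterday <- sum(weights * size_yesterday)
exposure_today     <- sum(weights * size_today)

print(c(yesterday = exposure_yesterday, today = exposure_today))
print(exposure_today > upper_bound)
```

The same holdings that had an exposure of 0.0085 yesterday have an exposure of 0.0125 today. If the trades needed to correct this are not fully filled, the end-of-day exposure stays outside the +/-1% range.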

## Appendix: file-based inputs

The first part of the vignette showed how to run a simulation where all data is supplied using objects in memory. In this appendix we discuss setting up a simulation where all data comes from binary feather files stored on disk. This approach is useful for running simulations over many periods with a small memory footprint.

In this section we assume we have a directory called sample_data in the vignettes directory that contains security reference, pricing, and alpha/factor input data in feather format. An archive of sample data that matches the configuration below is available for download from the package’s GitHub repository for experimentation.

### Security reference

The simulation’s configuration file should be set as follows for file-based security reference input:

simulator:
  secref_data:
    type: file
    filename: sample_data/secref.feather

The type field specifies that secref information should come from a file and not be passed to the simulator as a constructor parameter. The filename field gives the location of the file.

### Alpha and factor inputs

When using file-based data, strand expects alpha and factor input data for each day to be stored in its own file. The /simulator/input_data/directory and /simulator/input_data/prefix configuration options specify where the system should find these files:

simulator:
  input_data:
    type: file
    directory: sample_data/inputs
    prefix: inputs

This entry indicates that the input data should be retrieved from files located in sample_data/inputs with filename prefix inputs. By convention the data for YYYYmmdd should have filename prefix_YYYYmmdd.feather. Therefore the file sample_data/inputs/inputs_20190104.feather contains the alpha and risk values that will be used for trading on 2019-01-04.
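The naming convention can be expressed as a small base R helper. Note that `daily_file_path()` is a hypothetical function written for this illustration. It is not part of strand, which resolves these paths internally from the configuration.

```r
# Hypothetical helper, not part of strand: builds the expected path of a
# daily file following the prefix_YYYYmmdd.feather convention.
daily_file_path <- function(directory, prefix, date) {
  date_str <- format(as.Date(date), "%Y%m%d")
  file.path(directory, paste0(prefix, "_", date_str, ".feather"))
}

inputs_path <- daily_file_path("sample_data/inputs", "inputs", "2019-01-04")
print(inputs_path)
```

A helper like this can be handy when writing daily input files to disk, so that the names line up with what the simulator expects.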

### Market data

Like alpha and factor inputs, strand expects pricing data for each day to be stored in its own file. The /simulator/pricing_data/directory and /simulator/pricing_data/prefix configuration options specify where the system should find these files:

simulator:
  pricing_data:
    type: file
    directory: sample_data/pricing
    prefix: pricing

The /simulator/pricing_data/directory value specifies the file system location for the market data files. The value of /simulator/pricing_data/prefix indicates the prefix for each file name. So in our example, the market data for 2019-01-04 will be found in sample_data/pricing/pricing_20190104.feather.
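Before launching a long file-based simulation, it may be worth checking that a file exists for every day in both directories. The base R sketch below does this for a hypothetical set of three dates. The date list and variable names are assumptions for this example, and the check is not a strand feature.

```r
# Hypothetical pre-flight check; the dates are assumptions for this example.
dates <- as.Date(c("2019-01-02", "2019-01-03", "2019-01-04"))
date_str <- format(dates, "%Y%m%d")

# Expected files, following the prefix_YYYYmmdd.feather convention.
expected_inputs  <- file.path("sample_data/inputs",
                              paste0("inputs_", date_str, ".feather"))
expected_pricing <- file.path("sample_data/pricing",
                              paste0("pricing_", date_str, ".feather"))

# Any expected file that is not present on disk.
expected_all  <- c(expected_inputs, expected_pricing)
missing_files <- expected_all[!file.exists(expected_all)]

if (length(missing_files) > 0) {
  message("Missing files:\n", paste(missing_files, collapse = "\n"))
}
```

An empty `missing_files` vector means every day in the range has both an inputs file and a pricing file.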