LARGE-SCALE BACKTESTING IN 5 MINUTES
Stop Loss, Trailing Stop, or Take Profit? 2 Million Backtests Shed Light
In this article, we will utilize large-scale backtesting with vectorbt to explore the performance of the most common stop signals for different cryptocurrencies, time periods, and stop values.
A trading strategy is just a grain of sand when compared to the whole universe of possible strategies; only the big picture can reveal its quality.
Our goal is to use large-scale backtesting to compare the performance of trading with and without stop loss (SL), trailing stop (TS), and take profit (TP) signals. To make this comparison representative, we will run a huge number of experiments across three dimensions (instruments, time, and hyperparameters):
- First, we will pick 10 of the largest cryptocurrencies by market capitalization (excluding stablecoins such as USDT) and fetch 3 years of their daily pricing data. In particular, we aim at backtesting the time period from 2018 to 2021, as it contains periods of sharp price drops (e.g., the corrections after the all-time high (ATH) in December 2017 and during the coronavirus crash in March 2020) as well as surges (ATH in December 2020), which keeps things balanced.
- For each instrument, we will split this time period into 400 smaller (and overlapping) time windows, each 6 months long. We will run our tests on each of these windows to account for different market regimes.
- For each instrument and time window, we will then generate an entry signal at the very first bar and find an exit signal according to the stop configuration. We will test 100 stop values with a 1% increment and compare the performance of each one to that of trading randomly and holding within this particular time window.
In total, we will conduct 2,000,000 backtests.
All we need is Jupyter Notebook/Lab with Python ≥ 3.6, yfinance, vectorbt, and packages required by them. We will use yfinance to download pricing data, and vectorbt to both run 2 million backtests in under 5 minutes and analyze the results visually.
vectorbt is a next-generation backtesting library for Python that applies various backtesting and data science techniques to technical analysis. The way it works is by representing trading data — from time series to order records — as nd-arrays, and processing them using NumPy and Numba. This in turn enables use cases such as blazingly fast hyperparameter optimization, which is otherwise mainly done using distributed and cloud computing. Another advantage is integration of Plotly and ipywidgets to display interactive charts and dashboards right in the Jupyter notebook.
The first step is to define the parameters of the analysis pipeline. As discussed above, we will backtest 3 years of pricing data, 400 time windows, 10 cryptocurrencies, and 100 stop values. We will also set fees and slippage both to 0.25% and initial capital to $100 (the amount per se doesn’t matter, but it must be the same for all assets to be comparable). Feel free to change any parameter of interest.
Start date 2018-01-01 00:00:00
End date 2021-01-01 00:00:00
Time period (days) 1096
Window length 180 days, 0:00:00
Exit types 5
Stop values 100
Tests per asset 200000
Tests per window 5000
Tests per exit type 400000
Tests per stop type and value 4000
Tests total 2000000
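The counts above follow directly from the configuration. Here is a minimal sketch of the arithmetic, using only the standard library; the variable names are illustrative and not taken from the original notebook:

```python
from datetime import datetime, timedelta

# Values taken from the text; the names are illustrative.
start_date = datetime(2018, 1, 1)
end_date = datetime(2021, 1, 1)
window_len = timedelta(days=180)
num_assets = 10
num_windows = 400
exit_types = ["SL", "TS", "TP", "Random", "Holding"]
num_stop_values = 100  # 1% to 100% in 1% steps

time_period_days = (end_date - start_date).days

# Every combination of asset, window, exit type and stop value is one backtest.
tests_total = num_assets * num_windows * len(exit_types) * num_stop_values
tests_per_asset = tests_total // num_assets
tests_per_window = tests_total // num_windows
tests_per_exit_type = tests_total // len(exit_types)
tests_per_stop_type_and_value = tests_per_exit_type // num_stop_values

print("Time period (days)", time_period_days)
print("Tests per asset", tests_per_asset)
print("Tests per window", tests_per_window)
print("Tests per exit type", tests_per_exit_type)
print("Tests per stop type and value", tests_per_stop_type_and_value)
print("Tests total", tests_total)
```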
Our configuration yields sample sizes with enough statistical power to analyze four variables: assets (200k tests per asset), time (5k tests per time window), exit types (400k tests per exit type), and stop values (4k tests per stop type and value). Similar to how Tableau handles dimensions and measures, we will be able to group our performance by each of these variables, but we will mainly focus on 5 exit types: SL exits, TS exits, TP exits, random exits, and holding exits (placed at the last bar).
Getting daily pricing data of each cryptocurrency is straightforward using yfinance:
dict_keys(['BTC-USD', 'ETH-USD', 'XRP-USD', 'BCH-USD', 'LTC-USD', 'BNB-USD', 'EOS-USD', 'XLM-USD', 'XMR-USD', 'ADA-USD'])
ohlcv_by_symbol now contains OHLCV data by cryptocurrency name. Each DataFrame has 1083 rows (days) and 5 columns (O, H, L, C, and V). You can plot a DataFrame as follows:
Since assets are one of the dimensions we want to analyze, vectorbt expects us to pack them as columns into a single DataFrame and label them accordingly. To do so, we simply swap assets and features to get a dictionary of DataFrames (with assets now as columns) keyed by feature name, such as “Open”.
dict_keys(['Open', 'Low', 'High', 'Close', 'Volume'])
Generate time windows
Next, we will move a 6-month sliding window over the whole time period and take 400 “snapshots” of each price DataFrame within this window. Each snapshot will correspond to a subset of data that should be independently backtested. As with assets and other variables, snapshots also need to be stacked horizontally as columns. As a result, we will get 180 rows (window length in days) and 4000 columns (10 assets x 400 windows); that is, one column will correspond to the price of one asset within one particular time window.
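One way to picture the sliding window is as a list of evenly spaced start positions. The sketch below is plain Python rather than vectorbt's own splitting routine; the function name is hypothetical, and the 1083 rows are the data size quoted earlier:

```python
def window_start_indices(n_rows, window_len, n_windows):
    """Evenly spaced start rows for overlapping windows that all fit in n_rows."""
    if window_len > n_rows:
        raise ValueError("window is longer than the data")
    if n_windows == 1:
        return [0]
    last_start = n_rows - window_len
    return [round(i * last_start / (n_windows - 1)) for i in range(n_windows)]


n_rows, window_len, n_windows, n_assets = 1083, 180, 400, 10

starts = window_start_indices(n_rows, window_len, n_windows)
windows = [(s, s + window_len) for s in starts]  # half-open row ranges [start, end)

# One column per (asset, window) pair after stacking the snapshots horizontally.
n_columns = n_assets * len(windows)
print(starts[:3], windows[-1], n_columns)
```

Because the windows overlap heavily (consecutive starts are only two or three days apart), neighboring columns are strongly correlated, which is worth keeping in mind when reading the aggregate statistics.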
A nice feature of vectorbt is that it makes use of hierarchical indexing to store valuable information on each backtest. It also ensures that this column hierarchy is preserved across the whole backtesting pipeline — from signal generation to performance modeling — and can be extended easily. Currently, our columns have the following hierarchy:
('BTC-USD', '2017-12-31', '2018-06-28'),
('BTC-USD', '2018-01-02', '2018-06-30'),
('BTC-USD', '2018-01-05', '2018-07-03'),
...
('ADA-USD', '2020-06-16', '2020-12-12'),
('ADA-USD', '2020-06-19', '2020-12-15'),
('ADA-USD', '2020-06-21', '2020-12-17')
This multi-index captures three parameters: the symbol, the start date of the time window, and its end date. Later, we will extend this multi-index with exit types and stop values such that each of the 2 million backtests has its own price series.
Generate entry signals
In contrast to most other backtesting libraries, vectorbt does not store signals as a signed integer array but splits them into two boolean arrays, entries and exits, which makes manipulation a lot easier.
At the beginning of each time window, let’s generate an entry signal indicating a buy order. The DataFrame will have the same shape, index, and columns as the price DataFrame so that vectorbt can link their elements together.
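To illustrate the two-array representation, here is a small sketch using plain Python lists in place of DataFrames. Both helper names are hypothetical and only show the idea:

```python
def split_signals(signed):
    """Split a signed series (+1 buy, -1 sell, 0 nothing) into boolean entries and exits."""
    entries = [s > 0 for s in signed]
    exits = [s < 0 for s in signed]
    return entries, exits


def first_bar_entries(n_rows, n_cols):
    """Boolean grid (rows = bars, columns = backtests) with an entry only at the first bar."""
    return [[row == 0] * n_cols for row in range(n_rows)]


entries, exits = split_signals([1, 0, 0, -1])
grid = first_bar_entries(n_rows=3, n_cols=2)
print(entries, exits, grid)
```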
Generate exit signals
For each of the entry signals we generated, we will find an exit signal according to our 5 exit types: SL, TS, TP, random, and holding. We will also concatenate their DataFrames into a single (huge) DataFrame with 180 rows and 2,000,000 columns, each representing a separate backtest. Since exit signals are boolean, their memory footprint is tolerable.
Let’s generate exit signals according to stop conditions first. We want to test 100 different stop values with a 1% increment, starting from 1% and ending with 100% (i.e., find a timestamp where the price exceeds the entry price by 100%). Usually, when OHLC data is checked against such conditions, the position is closed at (or shortly after) the time of hitting the particular stop, but we will simplify things and use the “Close” price to exit any position.
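The three stop conditions can be sketched for a single column of close prices as follows. This is a plain-Python illustration under the simplification described above (entry at the first close, stops checked against closes only), not vectorbt's vectorized implementation; the function name is hypothetical:

```python
def first_stop_exit(close, stop, kind):
    """Index of the first bar whose close triggers the stop, or None if never hit.

    close: list of close prices; the entry is assumed at close[0].
    stop:  stop value as a fraction (0.1 means 10%).
    kind:  "SL" (stop loss), "TS" (trailing stop) or "TP" (take profit).
    """
    if kind not in ("SL", "TS", "TP"):
        raise ValueError(f"unknown stop kind: {kind}")
    entry_price = close[0]
    peak = entry_price  # highest close seen so far, used by the trailing stop
    for i in range(1, len(close)):
        price = close[i]
        peak = max(peak, price)
        if kind == "SL" and price <= entry_price * (1 - stop):
            return i
        if kind == "TP" and price >= entry_price * (1 + stop):
            return i
        if kind == "TS" and price <= peak * (1 - stop):
            return i
    return None


close = [100, 112, 120, 105, 95, 88]
for kind in ("SL", "TS", "TP"):
    print(kind, first_stop_exit(close, 0.10, kind))
```

In this toy series the trailing stop fires before the stop loss because it measures the drop from the running peak rather than from the entry price.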
(180, 400000) (180, 400000) (180, 400000)
This also extended our column hierarchy with a new column level indicating the stop value; we only have to make it consistent across all DataFrames:
(0.01, 'BTC-USD', '2017-12-31', '2018-06-28'),
(0.01, 'BTC-USD', '2018-01-02', '2018-06-30'),
(0.01, 'BTC-USD', '2018-01-05', '2018-07-03'),
...
( 1.0, 'ADA-USD', '2020-06-16', '2020-12-12'),
( 1.0, 'ADA-USD', '2020-06-19', '2020-12-15'),
( 1.0, 'ADA-USD', '2020-06-21', '2020-12-17')
One major feature of vectorbt is that it places a strong focus on data science, and so it allows us to apply popular analysis tools to almost any part of the backtesting pipeline. For example, let’s explore how the number of exit signals depends upon the stop type and value:
We see that TS is by far the most frequent exit signal. The SL and TP curves go hand in hand up to a stop value of 50% and then diverge in favor of TP. While it might seem that bulls are mostly in charge, especially for bigger price movements, remember that it is much easier to post a 50% profit than a 50% loss because the latter requires a 100% profit to recover; thus, downward spikes seem to dominate small to medium price movements (and potentially shake out weak hands). These are well-known cryptocurrency dynamics.
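The asymmetry between losses and the gains needed to recover from them is simple arithmetic. A short sketch (the function name is illustrative):

```python
def required_recovery(loss):
    """Fractional gain needed to return to the starting value after a fractional loss."""
    if not 0 <= loss < 1:
        raise ValueError("loss must be in [0, 1)")
    return 1 / (1 - loss) - 1


print(required_recovery(0.5))  # 1.0, i.e. a 100% gain after a 50% loss
```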
To simplify the analysis that follows, we should ensure that each column has at least one exit signal to close the position, which means that if a column has no exit signal now, it should get one at the last timestamp. This is done by combining the stop exits with the last-bar exit using the OR rule and selecting the one that comes first:
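For a single boolean column, the "OR with a last-bar exit, then keep the first" rule might look like this. It is a plain-Python sketch with a hypothetical helper name, not the vectorbt call used in the notebook:

```python
def ensure_single_exit(exits):
    """Add a fallback exit at the last bar, then keep only the earliest exit."""
    if not exits:
        raise ValueError("empty column")
    combined = list(exits)
    combined[-1] = True           # OR with an exit at the last bar
    first = combined.index(True)  # earliest exit wins
    return [i == first for i in range(len(combined))]


print(ensure_single_exit([False, True, False, True]))
print(ensure_single_exit([False, False, False]))
```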
Next, we will generate signals of the two remaining exit types: random and holding — they will act as benchmarks to compare SL, TS, and TP against.
“Holding” exit signals are signals placed at the very last bar of each time series. In most cases, we wouldn’t need to place them at all, since we can simply assess open positions. The reason we do it anyway is consistency: we want to ensure that each column has exactly one signal. The other consideration is shape and columns: they should match those of the stop signals so we can concatenate all DataFrames later.
To generate random exit signals, just shuffle any signal array. The only requirement is that each column should contain exactly one signal.
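A minimal sketch of the shuffling idea for one column, using the standard `random` module. The helper name is hypothetical, and the seed is fixed only to make the example repeatable:

```python
import random


def shuffle_exits(column, rng):
    """Return a shuffled copy of a boolean column; the number of signals is preserved."""
    shuffled = list(column)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(42)
holding = [False] * 179 + [True]  # one exit at the last bar
random_exits = shuffle_exits(holding, rng)
print(sum(random_exits), random_exits.index(True))
```

Note that a plain shuffle can also place the exit on the entry bar itself; this sketch does not treat that case specially.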
The last step is the concatenation of all DataFrames along the column axis:
The exits array now contains 2,000,000 columns, one per backtest. The column hierarchy is also complete, with one tuple of hyperparameters per backtest.
( 'SL', 0.01, 'BTC-USD', '2017-12-31', '2018-06-28'),
( 'SL', 0.01, 'BTC-USD', '2018-01-02', '2018-06-30'),
( 'SL', 0.01, 'BTC-USD', '2018-01-05', '2018-07-03'),
...
('Holding', 1.0, 'ADA-USD', '2020-06-16', '2020-12-12'),
('Holding', 1.0, 'ADA-USD', '2020-06-19', '2020-12-15'),
('Holding', 1.0, 'ADA-USD', '2020-06-21', '2020-12-17')
This allows us to group signals by one or multiple levels and conveniently analyze them in one go. For example, let’s compare different exit types and stop values by the average distance between the exit signal and the entry signal (in days):
This scatterplot gives us a more detailed view of the distribution of exit signals. As expected, exit signals of plain holding have a distance of exactly 179 days after entry (the maximum possible), while random exit signals are evenly distributed over the time window and do not depend on any stop value. But we are more interested in the stop curves, which are flat and thus hint at high volatility of price movements within our timeframe: the lower the curve, the higher the chance of hitting a stop. To give an example, a TS of 20% is hit after just 30 days on average, while it would take 72 days for SL and 81 days for TP. But does an early exit do any good?
Here comes the actual backtesting part:
Fairly easy, right?
The simulation took 4 minutes to finish on my MacBook Air and generated a total of 3,995,570 orders that are ready to be analyzed (it should be 4 million, but some price data points seem to be missing). Notice, however, that any array produced by the portfolio object with the same shape as our exit signals, such as portfolio value or returns, requires 8 * 180 * 2000000 bytes, or 2.88 GB (almost 3 GB), of RAM, and it is automatically cached to be reused by other portfolio components. Thus, we will disable caching to release memory as soon as the calculation of portfolio performance is over.
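The memory estimate is easy to verify. The sketch below assumes 8 bytes per element (64-bit floats) and also shows what halving the element size to 4 bytes would save:

```python
n_rows, n_cols = 180, 2_000_000
bytes_per_element = 8  # 64-bit float

total_bytes = bytes_per_element * n_rows * n_cols
total_gb = total_bytes / 1e9      # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

# A 32-bit float needs half the space per element.
float32_bytes = 4 * n_rows * n_cols

print(total_bytes, round(total_gb, 2), round(total_gib, 2), float32_bytes)
```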
We can analyze anything from trades to Sharpe ratio, but given the amount of data, we will stick to a fast-to-calculate metric — total return.
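To make the total return metric concrete, here is a sketch for a single entry and exit pair using the 0.25% fees and slippage from our configuration. The cost model is an assumption made for illustration (slippage worsens the fill price, fees are charged on the traded value) and may differ from how vectorbt applies them:

```python
def total_return(entry_price, exit_price, fees=0.0025, slippage=0.0025, init_cash=100.0):
    """Total return of one all-in long trade, as a fraction of init_cash.

    Assumed cost model (illustrative only): slippage moves the fill price
    against us, and fees are a fraction of the traded value.
    """
    buy_price = entry_price * (1 + slippage)
    sell_price = exit_price * (1 - slippage)
    size = init_cash / (buy_price * (1 + fees))  # units bought with all cash
    final_cash = size * sell_price * (1 - fees)  # proceeds after selling everything
    return final_cash / init_cash - 1


print(total_return(100, 110))                     # a 10% price gain, net of costs
print(total_return(100, 110, init_cash=1000.0))   # same return for a different capital
```

Under this model a round trip at an unchanged price loses roughly 1%, which is one reason why stops that trigger frequently are penalized.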
If your computer takes a substantial amount of time to simulate, you have several options:
- Use Google Colab
- Reduce the number of stop values (e.g., by increasing the increment from 1% to 2%)
- Cast to np.float32 or even below (if supported)
- Split the exit signal array into chunks and simulate per chunk. Just make sure each chunk has a shape compatible with that of the price and entries (remember to delete the previous portfolio if simulated):
100%|██████████| 5/5 [01:19<00:00, 16.98s/it]
That’s much better.
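The chunking idea can be sketched as a generator of column ranges. The helper name and the chunk size are assumptions: 400,000 is chosen here because it is a multiple of the 4,000 price columns, so each chunk lines up with whole repetitions of the price and entries.

```python
def iter_chunks(n_cols, chunk_size):
    """Yield (start, stop) column ranges that cover n_cols without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cols, chunk_size):
        yield start, min(start + chunk_size, n_cols)


n_price_cols = 4_000
chunk_size = 400_000
assert chunk_size % n_price_cols == 0  # keeps chunks aligned with the price columns

chunks = list(iter_chunks(2_000_000, chunk_size))
print(len(chunks), chunks[0], chunks[-1])
```

Each range would then be used to slice the exit signals, simulate that slice, keep only the small per-column metric, and discard the portfolio before moving on.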
The first step is always taking a look at the distribution of the baseline:
The distribution of holding performance across time windows is highly right-skewed (positively skewed). On the one hand, this indicates prolonged sideways and bearish regimes within our timeframe. On the other hand, the price of any asset can climb without limit but cannot fall below 0, which by nature makes the distribution denser on the left and sparser on the right. Every second return is a loss of more than 7%, but thanks to bull runs the strategy still manages to post an average profit of 7%.
Let’s include other strategies into the analysis:
Exit type        Mean     Median        Std
SL           0.035266  -0.168662   0.750711
TS           0.038908  -0.112648   0.685156
TP           0.038539   0.070077   0.477094
Random       0.016484  -0.083610   0.570806
Holding      0.069122  -0.148971   0.814248
None of the strategies beat the average return of the baseline. The TP strategy is the most consistent one though — although it introduces an upper bound that limits huge profits (see missing outliers), its trade returns are less volatile and mostly positive. The reason why SL and TS are unbounded at the top is that some of the stops haven’t been hit, and so their columns fall back to plain holding. The random strategy is also interesting: while it’s inferior in terms of average return, it finishes second after TP in terms of median return and returns volatility.
To confirm the picture above, let’s calculate the win rate of each strategy:
Almost 57% of trades with TP are profitable, a stark contrast to the other strategies. But a high win rate doesn’t necessarily guarantee longer-term trading success if your winning trades are often much smaller than your losing trades. Thus, let’s aggregate by stop type and value and compute the expectancy:
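Here is a small sketch of both metrics using one common definition of expectancy (win rate times average win, minus loss rate times average loss), which equals the mean profit or loss per trade. The function names and the toy numbers are illustrative and not taken from the backtest:

```python
def win_rate(pnls):
    """Fraction of trades with a positive profit."""
    return sum(p > 0 for p in pnls) / len(pnls)


def expectancy(pnls):
    """win_rate * avg_win - loss_rate * avg_loss (the mean P&L per trade)."""
    wins = [p for p in pnls if p > 0]
    losses = [-p for p in pnls if p <= 0]
    wr = len(wins) / len(pnls)
    avg_win = sum(wins) / len(wins) if wins else 0.0
    avg_loss = sum(losses) / len(losses) if losses else 0.0
    return wr * avg_win - (1 - wr) * avg_loss


# Three small wins and one large loss, in dollars per $100 invested.
pnls = [5.0, 5.0, 5.0, -20.0]
print(win_rate(pnls), expectancy(pnls))  # 0.75 -1.25
```

The toy example wins three trades out of four and still loses money on average, which is exactly the situation the win rate alone cannot reveal.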
Each strategy is able to add gradually to our account in the long run, with the holding strategy being the clear winner here: we can expect to add to our account an average of $7 out of $100 invested after every 6 months of holding. The only configuration that beats the baseline is TS with stop values ranging from 20% to 40%. The worst-performing configurations are SL and TS with stop values around 40% and 55%, respectively; both seem to get triggered once most corrections find the bottom, which is even worse than exiting randomly. The TP strategy, on the other hand, beats the random exit strategy for stop values above 30%. Generally, waiting seems to pay off for cryptocurrencies.
Finally, let’s take a look at how our strategies perform under different market conditions. We will consider a simplified form of regime classification that divides holding returns into 20 bins and calculates the expectancy of each strategy within the boundaries of each bin (we leave out the last bin for the sake of chart readability). Note that due to the highly skewed distribution of holding returns, we need to take into account the density of observations and choose the bin edges so that each bin contains roughly the same number of observations.
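Equal-frequency binning can be sketched with the standard library as follows. The helper names are hypothetical, the input is synthetic skewed data, and the sketch assumes many more observations than bins and mostly distinct values:

```python
import bisect


def equal_frequency_edges(values, n_bins):
    """Bin edges chosen so that each bin holds roughly the same number of observations."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[min(n - 1, (i * n) // n_bins)] for i in range(n_bins)]
    edges.append(ordered[-1])
    return edges


def assign_bin(value, edges):
    """Index of the bin containing value; the last bin is closed on the right."""
    idx = bisect.bisect_right(edges, value) - 1
    return max(0, min(idx, len(edges) - 2))


# Synthetic, strongly skewed sample standing in for holding returns.
sample = [x ** 2 for x in range(100)]
edges = equal_frequency_edges(sample, n_bins=20)

counts = [0] * 20
for v in sample:
    counts[assign_bin(v, edges)] += 1
print(counts)
```

With equal-width bins, most of this skewed sample would fall into the first few bins and the rest would be nearly empty; equal-frequency edges avoid that.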
The chart above confirms the general intuition behind the behavior of stop orders: SL and TS limit the trader’s loss during downtrends, TP is beneficial for short-term traders interested in profiting from a quick bump in price, and holding performs best in top-growth markets. Surprisingly, while random exits perform poorly in sideways and bull markets, they match and often outperform stop exits in bear markets.
Bonus: Jupyter dashboard
Dashboards can be a really powerful way of interacting with the data.
First, let’s define the components of our dashboard. We have two types of components: controls, such as asset dropdown, and graphs. Controls define parameters and trigger updates for graphs.
The second step is the definition of the update function, which is triggered once any control has been changed. We also manually call this function to initialize the graphs with default parameters.
In the last step, we will define the layout of the dashboard and finally run it:
The use of large-scale backtesting is not limited to hyperparameter optimization; when properly utilized, it gives us a vehicle to explore complex phenomena related to trading. In particular, the use of multi-dimensional arrays, dynamic compilation, and integration with pandas, as done by vectorbt, allows us to quickly gain new insights by applying popular data science tools to each component of a backtesting pipeline.
In this particular example, we conducted 2 million backtests to observe how different stop values impact the performance of stop signals and how different stop signals compare to holding and trading randomly. On the one hand, the findings confirm what we already know about the behavior of stop signals under various market conditions. On the other hand, they reveal optimal configurations that might have worked well for the last couple of years of trading cryptocurrencies.
But there is always more to it than the couple of charts we drew above. If you’re curious to apply the analysis to your own data or to try different hyperparameters, feel free to run the notebook.