Updated: Apr 1
We continue adapting the methods presented in Marcos Lopez de Prado, Advances in Financial Machine Learning (1st ed., Wiley, 2018). Our code already implements standard fractional differentiation and fixed-width window differentiation. These two procedures generate statistically sound time-series data that conserve the maximum possible amount of memory, which covers the generation of features; we now need to define the target values, or labels.
We use the triple barrier method for this, adapted to our needs. For each point in time, we look at a window of t periods (days, for daily prices) into the future and assign one of three labels:
(1) if, before the time limit, the price of the instrument reaches a set level of positive returns.
(-1) if, before the time limit, it hits a set level of negative returns.
(0) if the time expires without touching either of the return limits.
The diagram shows a simple case in which the lower limit is hit first. Either our long position was stopped out by a stop-loss (or any other loss-control method), or our short position was profitable enough to close.
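The three outcomes can be sketched for a single look-ahead window as follows. This is a minimal illustration, not the function developed in this post: the helper name first_touch_label and the sample returns are assumptions made up for the example, and the label goes to whichever barrier is touched first.

```python
import numpy as np

def first_touch_label(window_returns, upper, lower):
    """Label one look-ahead window as 1, -1, or 0 (hypothetical helper)."""
    path = np.cumsum(window_returns)        # summed returns along the window
    ups = np.flatnonzero(path >= upper)     # steps where the upper limit is touched
    downs = np.flatnonzero(path <= lower)   # steps where the lower limit is touched
    first_up = ups[0] if ups.size else len(path)
    first_down = downs[0] if downs.size else len(path)
    if first_up < first_down:
        return 1
    if first_down < first_up:
        return -1
    return 0                                # time expired, no limit touched

# Made-up windows of three periods with limits at +2% and -2%
falling = first_touch_label([-0.01, -0.02, 0.06], 0.02, -0.02)   # lower limit hit first
rising = first_touch_label([0.01, 0.015, -0.05], 0.02, -0.02)    # upper limit hit first
flat = first_touch_label([0.005, -0.005, 0.001], 0.02, -0.02)    # neither limit hit
print(falling, rising, flat)
```

Note that the first window later recovers above the upper limit, yet it is still labeled -1 because the lower limit was touched earlier.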
We will apply this labeling method to every point in the series. First, we define a function that returns the exponentially weighted standard deviation of a price series's returns; we will use it later to set expected return limits from a measure of realized volatility. We have to be careful with this function: applied to a complete series, it returns a volatility estimate for every date, so reading values dated after the point being labeled would leak information into our labeling system. The function is relatively simple and sets a default exponential span of 100 for the decay calculation:
import numpy as np
import pandas as pd


def compute_vol(df: pd.DataFrame, span: int = 100) -> pd.DataFrame:
    '''
    Compute period volatility of returns as exponentially weighted
    moving standard deviation:
    Args:
        df (pd.DataFrame): Dataframe with price series in a single column.
        span (int): Span for exponential weighting.
    Returns:
        pd.DataFrame: Dataframe containing volatility estimates.
    '''
    # Fill gaps forward on a copy, leaving the caller's data untouched.
    df = df.ffill()
    r = df.pct_change()
    return r.ewm(span=span).std()
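As a quick check on the leakage concern, the sketch below compares the volatility at one date computed from the full series with the value computed from a series truncated at that date. The prices are synthetic random data and ew_vol is a hypothetical stand-in that repeats the same calculation; both are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (assumption: random walk with 1% daily moves)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Close": 100 * np.cumprod(1 + rng.normal(0, 0.01, 300))},
    index=pd.bdate_range("2020-01-01", periods=300),
)

def ew_vol(df, span=100):
    # Same calculation as the volatility function in this post
    return df.ffill().pct_change().ewm(span=span).std()

cut = 200
full = ew_vol(prices)                         # volatility for every date
truncated = ew_vol(prices.iloc[:cut + 1])     # only data up to position `cut`

# The exponentially weighted estimate at a date uses data up to that date only
same = bool(np.isclose(full.iloc[cut, 0], truncated.iloc[cut, 0]))
print(same)
```

The estimate at a given date does not change when later data is removed, so the risk lies in which row of the result is read, not in the weighting itself.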
We will flexibly define the labeling function:
def triple_barrier_labels(
        df: pd.DataFrame,
        t: int,
        upper: float = None,
        lower: float = None,
        devs: float = 2.5,
        join: bool = False,
        span: int = 100) -> pd.DataFrame:
    '''
    Compute the triple barrier label for a price time series:
    Args:
        df (pd.DataFrame): Dataframe with price series in a single column.
        t (int): Future periods to obtain the label for.
        upper (float): Returns for upper limit.
        lower (float): Returns for lower limit.
        devs (float): Standard deviations to set the upper and lower
                      return limits when no limits are passed.
        join (bool): If True, the input dataframe and the labels are
                     returned joined.
        span (int): Span for exponential weighting.
    Returns:
        pd.DataFrame: Dataframe containing labels and optionally
                      (join=True) input values.
    '''
The function takes the dataframe with the price in a single column and the future period t to obtain the label for, that is, the width of the time window. It takes upper and lower as return limits. To compute a limit from realized volatility instead, we leave it unset (either one or both) and use devs to set how many standard deviations away the limit sits. This lets us set the upper and lower barriers as fixed values, as computed values, or as a mixture of both.
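The rule for mixing fixed and volatility-based limits can be sketched on its own. The helper resolve_limits and the 1% volatility figure are hypothetical, introduced only to show the three combinations.

```python
import math

def resolve_limits(vol, upper=None, lower=None, devs=2.5):
    """Return the (upper, lower) return limits for one point (hypothetical helper)."""
    u = vol * devs if upper is None else upper     # computed unless a fixed value is given
    l = -vol * devs if lower is None else lower
    return u, l

vol = 0.01  # assumed 1% volatility per period

both_computed = resolve_limits(vol)                        # roughly +2.5% and -2.5%
mixed = resolve_limits(vol, upper=0.05)                    # fixed +5% gain, computed loss
both_fixed = resolve_limits(vol, upper=0.05, lower=-0.03)  # volatility is ignored
print(both_computed, mixed, both_fixed)
```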
Some preliminary testing and handling of input data:
    if t < 1:
        raise ValueError("Look ahead time invalid, t<1.")
    # Fill gaps forward on a copy, leaving the caller's data untouched.
    df = df.ffill()
    lims = np.array([upper, lower])
    labels = pd.DataFrame(index=df.index, columns=['Label'])
    returns = df.pct_change()
The window cannot be smaller than 1, so the function raises an error in that case. Any missing data points are filled forward. The "lims" variable records whether the limits were passed as inputs or not. Then we initialize the dataframe for the results and compute the returns of the input prices.
It is time for the function to iterate over all the values:
    r = range(0, len(df) - 1 - t)
    for idx in r:
        # Summed returns over the next t periods. The return at idx is
        # already realized, so the window starts at idx + 1.
        path = returns.iloc[idx + 1:idx + t + 1, 0].cumsum().values
        if not np.all(np.isfinite(path)):
            labels.iloc[idx, 0] = np.nan
            continue
        if upper is None or lower is None:
            # Volatility is read at idx only, so limits use past data.
            vol = float(compute_vol(df.iloc[:idx + t], span).iloc[idx, 0])
        u = vol * devs if upper is None else upper
        l = -vol * devs if lower is None else lower
        if not (np.isfinite(u) and np.isfinite(l)):
            labels.iloc[idx, 0] = np.nan
            continue
        # Label by whichever limit is touched first inside the window.
        ups = np.flatnonzero(path >= u)
        downs = np.flatnonzero(path <= l)
        first_up = ups[0] if ups.size else t
        first_down = downs[0] if downs.size else t
        if first_up < first_down:
            labels.iloc[idx, 0] = 1
        elif first_down < first_up:
            labels.iloc[idx, 0] = -1
        else:
            labels.iloc[idx, 0] = 0
The cumulative sums of the returns tell us whether, within this specific window, a limit is breached. Points whose window cannot be evaluated for lack of data are labeled NaN. If any of the limits has to be set from volatility, we compute it using only as much data as needed, to minimize the leak of future values into the label determination. For most equity prices, volatility can be assumed stationary, so we are not distorting the mean values in the standard deviation calculation too much. Just a little.
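One detail worth keeping in mind is that a cumulative sum of simple returns only approximates the compounded return over the window. The worked example below uses four made-up daily returns to show the size of the gap; it also shows that summing log returns reproduces the compounded figure exactly, which is an alternative and not what the code in this post does.

```python
import numpy as np

daily = np.array([0.02, 0.03, -0.01, 0.04])    # made-up simple returns

summed = daily.cumsum()                         # what the labeling loop accumulates
compounded = np.cumprod(1 + daily) - 1          # actual return of holding the position
gap = compounded - summed

# Log returns add up exactly; converting back recovers the compounded path
from_logs = np.expm1(np.log1p(daily).cumsum())

print(summed[-1], compounded[-1], gap[-1])
```

Here the sum gives 8% while compounding gives about 8.17%. For short windows and small limits the difference is minor, but it grows with the look-ahead and with the size of the moves.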
Finally, if joining is needed, join both input and labels:
    if join:
        df = df.join(labels)
        return df
    return labels
The resulting dataframe looks like this in the joined case; without joining, it would contain only a "Label" column:
We can obtain the label plots now for any possible combination of look-ahead windows and limits, both as fixed percentage returns and as multiples of volatility. For example, for the SPY close prices, 5 days into the future and 3 standard deviations:
For each date, we have a label (0, 1, or -1) indicating the future state for that point: which side of a trade to take or, if 0, not to trade at all.
The model can also generate asymmetric labels, for example, for a 15-period look-ahead and a gain of 5% with a volatility-controlled loss:
We can now generate, for a single instrument at a time, a grid of labels by look-ahead and return limits. We can potentially cover the complete range of return and time limits to obtain all possible trade conditions and look for predictability. In the process we will find the historically best holding period and limits, which will inevitably be overfitted. In a future post we will continue this line with a model that uses information entropy to select the most promising (in a broad sense) labeling parameters.
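A grid of this kind could be sketched as below. To keep the example self-contained it uses a simplified, hypothetical labeler (label_fixed, symmetric fixed limits only, first touch wins) and synthetic random prices, so the numbers it prints say nothing about any real instrument.

```python
import numpy as np
import pandas as pd

def label_fixed(prices: pd.Series, t: int, limit: float) -> pd.Series:
    """First-touch labels with symmetric fixed limits (hypothetical helper)."""
    r = prices.pct_change().to_numpy()
    out = np.full(len(prices), np.nan)           # trailing points stay unlabeled
    for i in range(len(prices) - t):
        path = np.cumsum(r[i + 1:i + t + 1])     # next t returns after point i
        ups = np.flatnonzero(path >= limit)
        downs = np.flatnonzero(path <= -limit)
        first_up = ups[0] if ups.size else t
        first_down = downs[0] if downs.size else t
        out[i] = np.sign(first_down - first_up)  # 1, -1, or 0 when neither is hit
    return pd.Series(out, index=prices.index, name="Label")

# Synthetic prices (assumption: random walk with 1% daily moves)
rng = np.random.default_rng(1)
prices = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.01, 500)))

rows = []
for t in (5, 10, 15):
    for limit in (0.02, 0.05):
        share = label_fixed(prices, t, limit).value_counts(normalize=True)
        rows.append({"t": t, "limit": limit,
                     "up": share.get(1.0, 0.0),
                     "down": share.get(-1.0, 0.0),
                     "flat": share.get(0.0, 0.0)})
grid = pd.DataFrame(rows)   # one row per (look-ahead, limit) combination
print(grid)
```

Each row reports the share of points labeled 1, -1, and 0 for one combination, which is a convenient first look at how balanced the labels are before any search for the best parameters.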
Code is also here. The work is still in progress, and the code may not exactly match this post; always consider the GitHub code the latest and best.
Information in ostirion.net does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. If you require quantitative model development, deployment, verification, or validation, do not hesitate and contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.
The functions used in this post, as part of a notebook: