
Entropy-Based Label Selection

We now have a tool that flexibly obtains labels for our hypothetical trading positions, all of them inside a certain space of future times and with various possible return limits, both up and down. Inside this space we can take two separate approaches: either we have external requirements for selecting the future windows and limits, or we look for a labeling set that suits other requirements. For the first approach, we take whatever time windows we want to invest over, together with our return expectations and stop-loss limits (or standard deviations), and feed those into a machine learning model. Most if not all of these requirements for gains, limits, and timeframes may be selected externally to the model, by fitting previous models or by using common knowledge. This is probably a source of model bias, as we are forcing a timeframe, a gain target, or a loss limit.


In this post, we propose a method for label selection based on the distribution of the labels themselves, as opposed to an a priori set of times and limits. This will enable us to train and develop our models for different data distribution scenarios using model performance (model accuracy) as the central requirement.


The first approach we propose here is entropy-based label selection. We will cover as much of the space of possible windows and return limits as we can and look for sensible entropy values: the local extrema of entropy within a range.


We will first define the space of future windows to explore and a dictionary to hold the resulting labels for easy inspection:

import numpy as np

# 20 integer look-ahead windows, from 2 to 120 days into the future:
future_space = np.linspace(2, 120, 20, dtype=int)
labels = {}

We will compute the entropy of the label distribution for a given set of triple barrier labels. This entropy, the information entropy described by Shannon in "A Mathematical Theory of Communication," gives us a view of how uncertain the distribution is and points us towards the model that can better predict the dataset, for both sparse and balanced labels. For this task, we will use the entropy function from scipy.stats.
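As a quick sense of scale: a perfectly balanced three-label distribution has the maximum possible entropy, ln 3 ≈ 1.099 nats, while a heavily skewed one drops towards zero:

from scipy.stats import entropy

# Balanced labels: maximum uncertainty, entropy = ln(3) ≈ 1.0986 nats.
print(entropy([1/3, 1/3, 1/3]))
# Skewed labels: one label dominates, entropy falls towards zero.
print(entropy([0.96, 0.02, 0.02]))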


To obtain a full set of parameterless labels, we will apply our future space to a volatility-based (default) barrier generation, so we expect 20 entropy points for future windows of 2 to 120 days. We generate the labels for each window and collect their normalized counts into a dictionary:

# Volatility-based (default) triple barrier labels for each future window:
for fut in future_space:
    labels[fut] = fractio.triple_barrier_labels(df, fut)

# Normalized frequency of each label, per future window:
counts = {}
for fut in labels.keys():
    s = labels[fut].squeeze()
    counts[fut] = s.value_counts(normalize=True)

Now we compute the entropies for each of the sets of label distributions:

from scipy.stats import entropy

# Shannon entropy (in nats) of each label distribution:
entropies = {}
for f, c in counts.items():
    entropies[f] = entropy(c)

Computing the triple barriers is quite computationally intensive. After a few minutes, we can plot the shape of the results for each future window in the future space:

import pandas as pd

# pd.Series already takes the dictionary keys as its index:
s = pd.Series(entropies)
p = s.to_frame('Entropy')
rh.plot_df(p, y_label='Entropy')

Here we jump from a series to a dataframe for the sole purpose of using our own research helper, included in the test algorithm below. The shape of the entropy curve for triple barrier labels using a standard deviation of 2.5 is this:
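For readers without our helper at hand, a minimal stand-in for rh.plot_df could look like this; a sketch only, assuming the helper simply draws each dataframe column as a line (the real one ships with the test algorithm):

import matplotlib.pyplot as plt
import pandas as pd

# Minimal sketch of a plot_df-style helper; the actual rh.plot_df in the
# repository may style the plot differently.
def plot_df(df: pd.DataFrame, y_label: str = '') -> None:
    ax = df.plot(figsize=(10, 5))
    ax.set_ylabel(y_label)
    ax.set_xlabel('Days into the future')
    plt.show()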

Triple Barrier Label entropy as a function of look-ahead days.

This plot corresponds to the following distribution of labels, days into the future as columns and labels in the index:


For shorter future frames, no limit hit is prevalent. Label 0, the prevailing label in the short term, means no significant movement, so the entropy is very low. Predict 0 always, and you win in terms of simple prediction accuracy.
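To put a number on that claim, the accuracy of the always-predict-0 strategy is simply the largest normalized count, which we already have in the counts dictionary from above; for the shortest window in our space:

# Majority-class baseline: always predicting the most frequent label
# yields an accuracy equal to its relative frequency. 2 is the shortest
# window in future_space.
baseline_accuracy = counts[2].max()
print(f'Always-predict-majority accuracy at 2 days: {baseline_accuracy:.2%}')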


For longer holding periods, around 80 days and up, entropy becomes low again, and the positive bias of stock returns comes into play. Chances are very high that you will hit the upper volatility limit before the lower one and re-enter a bull position. This leads to a twitching buy-and-hold strategy that ignores the 7% of cases in which you hit the stop-loss barrier and take a hit to equity. We could simulate this strategy and surely be as profitable, and as risky, as the stock market itself: hold SPY and wait. This makes for a very unexciting and uninformed strategy.


The interesting part is the 14-day window, where the labels are difficult to predict because they are evenly distributed. As the entropy of the labels is at its maximum, a classification model trained on this data will, in theory, have the hardest time predicting the label, and this is therefore where the maximum potential for beating the market lies. Historically speaking, and for SPY in this case, it is difficult to predict the direction of a 2.5-standard-deviation return move 14 days forward. Instead of setting an arbitrary holding period for our positions, we can use this maximum entropy point. We can do the same for all possible combinations of expected returns and allowable stop losses, discovering the most evenly distributed labels, and with them the most "difficult" timeframes to trade, for each case.


This whole process can be subsumed into a single function that calls our previously defined functions:

def get_entropic_labels(df: pd.DataFrame,
                        side: str = 'max',
                        future_space: np.ndarray = np.linspace(2, 90, 40, dtype=int),
                        tbl_settings: dict = None) -> pd.DataFrame:
    '''
    Compute the series of triple barrier labels for a price series that
    results in the maximum or minimum entropy for the label distribution.

    Args:
        df (pd.DataFrame): Dataframe with the price series in a single column.
        side (str): 'max' or 'min' to select maximum or minimum entropy.
            'min' entropy may not result in usable data.
        future_space (np.ndarray): Space of future windows to analyze.
        tbl_settings (dict): Dictionary with settings for the
            triple_barrier_labels function.

    Returns:
        pd.DataFrame: Dataframe with the labels at the selected
        entropy extremum.
    '''
    if side not in ['max', 'min']:
        raise ValueError("Side must be 'max' or 'min'.")
    # Avoid a mutable default argument:
    tbl_settings = tbl_settings or {}

    # Labels for each future window:
    labels = {}
    for f in future_space:
        labels[f] = triple_barrier_labels(df, f, **tbl_settings)

    # Normalized label counts:
    counts = {}
    for f in labels.keys():
        s = labels[f].squeeze()
        counts[f] = s.value_counts(normalize=True)

    # Entropy of each label distribution:
    entropies = {}
    for f, count in counts.items():
        entropies[f] = entropy(count)

    # Future windows at the entropy extrema:
    max_e = max(entropies, key=entropies.get)
    min_e = min(entropies, key=entropies.get)

    t = max_e if side == 'max' else min_e
    e_labels = labels[t]
    e_labels.columns = ['t_delta=' + str(t)]
    return e_labels

Feeding this function the historic daily close prices for SPY from 2010 to 2020, with a target of winning 5% and a stop loss at 2.5%, we obtain a maximum entropy triple barrier label distribution at a 35-day prediction window, with this distribution plot:


The same data with a 2% win target, tolerating just a 1% loss, yields 8 days into the future as the maximum entropy time frame:

Labels for a given set of returns and stop-loss resulting in a maximum entropy future of 8 periods.
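For reference, the two runs above correspond to calls along these lines. The keyword names inside tbl_settings are hypothetical stand-ins for the profit-target and stop-loss arguments of triple_barrier_labels, and spy_close is assumed to hold the SPY daily close series; check the repository for the actual parameter names:

# 'upper_pct' and 'lower_pct' are hypothetical setting names; the real
# triple_barrier_labels keywords may differ. spy_close is an assumed
# dataframe of SPY daily closes, 2010-2020.
labels_5_25 = get_entropic_labels(spy_close, side='max',
                                  tbl_settings={'upper_pct': 0.05,
                                                'lower_pct': -0.025})
labels_2_1 = get_entropic_labels(spy_close, side='max',
                                 tbl_settings={'upper_pct': 0.02,
                                               'lower_pct': -0.01})
print(labels_5_25.columns[0])  # 't_delta=35' in our run
print(labels_2_1.columns[0])   # 't_delta=8' in our run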

With this tool in our box, in our next post we will feed this data into simple, unsophisticated machine learning models to gain insight into how difficult the prediction becomes at this highest difficulty level, using only market data as features in a purely technical model.


Information on ostirion.net does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. If you require quantitative model development, deployment, verification, or validation, do not hesitate to contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.


The model and examples used in this post:

Functions are also being added to this repository.

