
Pattern Recognition in Sector ETFs: Industry Return Predictability - Machine Learning Approach

Our last publication on pattern recognition using sector ETFs did not yield outstanding results. The dependencies among these ETFs do not manifest themselves easily and did not allow for a simple prediction; at least, we could not obtain one. Further research has led us to this publication:

This publication concludes that: "Controlling for post-selection inference and multiple testing, in-sample results provide extensive evidence of industry return predictability, pointing to the existence of industry-related information frictions in the equity market." We are looking for exactly that kind of predictability, one that describes the wheel of the economy as a cyclical pattern.

The paper uses the database on Dr. French's web page. We will keep using ETFs that track (or attempt to track) sector returns and try to replicate and reconcile the research paper's predictive results with our own previous work. We will check whether the information across sector returns persists robustly, independently of the "exact" underlying assets employed. We will use this notebook published by Alpha Architect in 2018 as a guideline; we will simplify it by not replicating the research paper's exact results. We will try to reproduce the strategy in a practical manner, discarding some scientific rigor in the process. We will proceed slowly and publish the notebook at each stage; we will start with the data set-up and with obtaining the LASSO regression values.

The initial modules we will use are:

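# QuantBook is provided by the QuantConnect research environment
self = QuantBook()
from datetime import datetime  # needed for the start and end dates below
import numpy as np
import pandas as pd
from IPython.display import display
from sklearn.linear_model import LassoLarsIC, LinearRegression, Ridge
from sklearn.metrics import r2_score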

We will import the ridge regression module for future comparison purposes. After comparing the ordinary least squares regressions with the LASSO candidates and with all candidates, we can run a cross-validation ridge regression to understand LASSO feature selection's effects better, including cross-validation effects.

These are the ETFs we are going to use, with a "common" name to align them more easily with the sector risk categories described by Dr. French's industry portfolios; of course, these do not match one to one:

sector_ETF = ['XLK', 'XLY', 'XLC', 'XLB', 'XLV', 'XLP',
              'XLI', 'XLU', 'XLF', 'XLE', 'XHB']

names = {'XLK':'Tech', 'XLY':'Disc', 'XLC':'Commns',
         'XLB':'Mats', 'XLV':'Health', 'XLP':'Staples',
         'XLI':'Ind', 'XLU': 'Util', 'XLF': 'Fin',
         'XLE': 'Energy', 'XHB': 'Home'}

Energy Select Sector SPDR Fund (XLE)
Materials Select Sector SPDR ETF (XLB)
Industrial Select Sector SPDR Fund (XLI)
Consumer Discretionary Select Sector SPDR Fund (XLY)
SPDR S&P Homebuilders ETF (XHB)
Consumer Staples Select Sector SPDR Fund (XLP)
Health Care Select Sector SPDR Fund (XLV)
Financial Select Sector SPDR Fund (XLF)
Technology Select Sector SPDR Fund (XLK)
Communication Services Select Sector SPDR Fund (XLC)
Utilities Select Sector SPDR Fund (XLU)

start = datetime(2000, 1, 1)
end = datetime(2015, 1, 1)

ETF_symbols = {etf: str(self.AddEquity(etf).Symbol.ID) for etf in sector_ETF}

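# Daily bars; the resolution argument is assumed from the "daily returns" described below
history = self.History(self.Securities.Keys, start, end, Resolution.Daily)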

The history runs from 2000 to 2015. Most of these ETFs started trading in 1998; with these date limits, we have a sufficient amount of data for our research, and we can reserve 2015 onwards for validation through backtesting. From this history, we will compute the daily returns and resample them at a monthly frequency. We will keep just the first three characters of each symbol identifier for easy recognition of the ETF:

returns = history['close'].unstack(level=0).pct_change()
returns.columns = list(pd.Series(returns.columns).map(lambda x: names[x[:3]]))
monthly_returns = returns.resample('M').sum()
features = monthly_returns.columns
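Summing the daily simple returns of a month is a convenient approximation of the monthly return; the exact figure compounds them. The following is a minimal sketch on synthetic data that compares the two. The dates, the return values and the single "Tech" series are made up for illustration, and the grouping by `to_period("M")` is an assumption used here in place of `resample`:

```python
import numpy as np
import pandas as pd

# Synthetic daily simple returns for one hypothetical sector (illustration only)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2014-01-01", "2014-03-31")
daily = pd.Series(rng.normal(0.0005, 0.01, len(dates)), index=dates, name="Tech")

# Calendar month of each daily observation
months = daily.index.to_period("M")

# Approach used in the post: add up the daily returns of each month
summed = daily.groupby(months).sum()

# Exact monthly simple return: compound the daily returns
compounded = (1 + daily).groupby(months).prod() - 1

comparison = pd.DataFrame({"summed": summed, "compounded": compounded})
comparison["difference"] = comparison["summed"] - comparison["compounded"]
print(comparison)
```

For daily returns of this size the two columns should differ only slightly, which is why the simpler sum is usually acceptable for exploratory work.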

The monthly returns look like this:

The returns are very volatile at the "head" of the series: the dotcom technology bubble was bursting, and volatility in technology was exceptionally high during this period. It is worth checking the monthly and yearly returns and volatilities of these instruments to confirm that the numbers make sense a priori.

For the monthly returns and volatility:

And for the yearly time frame:

These data frame views are obtained with the display functions below. Volatility can also be computed directly on yearly data instead of by annualizing the monthly volatilities; the difference is usually minimal:

display(pd.DataFrame(returns.resample('M').sum().mean(axis=0), columns = ['Monthly - Mean Returns']))
display(pd.DataFrame(returns.resample('M').sum().std(axis=0),  columns = ['Monthly - Volatility']))
display(pd.DataFrame(returns.resample('Y').sum().mean(axis=0), columns = ['Year - Mean Returns']))
display(pd.DataFrame(returns.resample('M').sum().std(axis=0),  columns = ['Yearly - Volatility'])*np.sqrt(12))
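The square-root-of-12 factor rests on the assumption that monthly returns are roughly uncorrelated, so that variances add up across the twelve months of a year. The following sketch checks this on synthetic data; the sample size, mean and volatility are arbitrary assumptions, not values taken from the ETF history:

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 hypothetical years of independent monthly returns (rows: years, columns: months)
monthly = rng.normal(loc=0.005, scale=0.04, size=(200, 12))

monthly_vol = monthly.std(ddof=1)            # volatility of all monthly returns
annualized_vol = monthly_vol * np.sqrt(12)   # square-root-of-time scaling

yearly = monthly.sum(axis=1)                 # yearly return as the sum of monthly returns
direct_yearly_vol = yearly.std(ddof=1)       # volatility computed directly on yearly data

print(f"annualized: {annualized_vol:.4f}, direct: {direct_yearly_vol:.4f}")
```

With real returns the two figures can drift apart when months are serially correlated, so the comparison is also a cheap sanity check on the data.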

The returns we have found will be the features for our machine learning models. The targets we are trying to predict are the returns one month into the future. We will initially limit our predictions to this 1-month-ahead horizon; the model can always be extended to additional future windows at additional computation costs.

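targets = []
for col in features:
    name = col + '_Future'
    # Next month's return of each sector is the prediction target
    monthly_returns[name] = monthly_returns[col].shift(-1)
    targets.append(name)
# The last month has no future return yet, so drop it
monthly_returns = monthly_returns.dropna()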
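To make the alignment between features and targets concrete, here is a small sketch with four months of made-up returns for a single hypothetical sector. It shows how `shift(-1)` places next month's return on the current row and why the final row cannot be used for fitting:

```python
import pandas as pd

# Four months of made-up returns for one hypothetical sector
idx = pd.period_range("2014-09", periods=4, freq="M")
toy = pd.DataFrame({"Tech": [0.01, -0.02, 0.03, 0.04]}, index=idx)

# shift(-1) pulls next month's return onto the current row
toy["Tech_Future"] = toy["Tech"].shift(-1)
print(toy)

# The last month has no future value yet, so it is dropped before fitting
train = toy.dropna()
print(train)
```

Each remaining row pairs the returns known at the end of a month with the return realized in the following month, which is the supervised-learning layout the regressions need.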

We bring a small innovation relative to the research publication by using cross-validation in our LASSO model selection step. There are drawbacks to using cross-validation to estimate the LASSO model parameters; we may end up with a model that leaves out useful features. We can compare the LASSO with cross-validation to the one without it, and even combine both in the future if needed. Our approach is to use time-series-split cross-validation in the LASSO fit and to record the selected model features for each target:

from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit
import warnings
tscv = TimeSeriesSplit()

lasso_cv = LassoCV(cv = tscv, max_iter = 10000)
lasso_cv_results = pd.DataFrame(0, index = targets, columns=features)
for target in targets:
    X = monthly_returns[features]
    y = monthly_returns[target]
    lasso_cv.fit(X, y)
    lasso_cv_results.loc[target] = np.round(lasso_cv.coef_,5)
    score = lasso_cv.score(X, y)
    print("Score for {} - {}".format(target, score))
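The reason for passing a `TimeSeriesSplit` to `LassoCV` instead of ordinary shuffled folds is that each fold must train only on data that precedes its test window. The sketch below prints the folds for twelve hypothetical monthly observations; `n_splits=3` is chosen only to keep the output short, whereas `TimeSeriesSplit()` with no arguments uses five splits in current scikit-learn versions:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical monthly observations, already in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds):
    # Every training index precedes every test index, so no future data leaks in
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold while the test window moves forward in time, which mimics how a model would be refitted as new months arrive.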

The intermediate data frame with the LASSO results is this, showing the relevant sectors that predict the target as a non-zero parameter:

We will "list-comprehend" this data frame into a dictionary for easy reading of the sector returns that can be used to predict the future of another sector according to the LASSO model:

cv_candidates = {}
for target in targets:
    comp = [feature for feature in lasso_cv_results.loc[target].index if lasso_cv_results.loc[target, feature] != 0]
    cv_candidates[target] = comp

The resulting dictionary contains the predictors for each sectorial ETF:

{'Home_Future': ['Home'],
 'Mats_Future': [],
 'Energy_Future': ['Energy', 'Fin', 'Tech'],
 'Fin_Future': [],
 'Ind_Future': [],
 'Tech_Future': [],
 'Staples_Future': [],
 'Util_Future': ['Home', 'Energy', 'Fin', 'Ind', 'Tech', 'Health'],
 'Health_Future': [],
 'Disc_Future': ['Tech']}

The dictionary leaves us with a very sparse model whose factors are of dubious usability. The Home sector is auto-regressive, and so is Energy, with financials and technology predicting it to a lesser degree. The Utilities sector has a large number of candidate predictors, but not itself. In any case, the comparison with the non-cross-validated LASSO model is this:

{'Home_Future': [],
 'Mats_Future': [],
 'Energy_Future': ['Home', 'Energy', 'Fin', 'Ind', 'Tech', 'Staples',
 'Fin_Future': ['Home', 'Energy', 'Fin', 'Tech', 'Health'],
 'Ind_Future': ['Energy', 'Fin', 'Tech', 'Health'],
 'Tech_Future': [],
 'Staples_Future': ['Mats', 'Energy', 'Fin', 'Tech', 'Util', 'Health'],
 'Util_Future': ['Home', 'Energy', 'Fin', 'Ind', 'Tech', 'Health'],
 'Health_Future': [],
 'Disc_Future': ['Energy', 'Tech']}
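The code behind the non-cross-validated selection is not shown in this post. The following is a minimal sketch of how such a selection could be produced, assuming it relies on the `LassoLarsIC` estimator imported at the top with the AIC criterion. That choice is an assumption, as are the synthetic data and the column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(1)
cols = ["Tech", "Fin", "Energy", "Util"]  # hypothetical sector names

# 180 synthetic "months" of returns; only Tech truly drives the target
X = pd.DataFrame(rng.normal(0, 0.05, size=(180, 4)), columns=cols)
y = 0.8 * X["Tech"] + rng.normal(0, 0.01, size=180)

# Pick the penalty by information criterion instead of cross-validation
lasso_ic = LassoLarsIC(criterion="aic")
lasso_ic.fit(X, y)

# Features with non-zero coefficients are the selected predictors
selected = [c for c, coef in zip(cols, lasso_ic.coef_) if coef != 0]
print(selected)
```

An information criterion tends to penalize less harshly than out-of-sample validation on noisy data, which may be one reason the non-cross-validated dictionary keeps more predictors than the cross-validated one.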

We will trust and use the cross-validated model for the time being; we can always come back to this LASSO model and investigate the resulting predictors in depth. With the predictors in place, it is just a matter of fitting the regression using them as features.

ols = LinearRegression()

for target in targets:
    candidate_features = cv_candidates[target]
    if len(candidate_features) < 1:
        print('No features for {}. Skipping.'.format(target))
        continue  # nothing to fit for this target
    X = monthly_returns[candidate_features]
    y = monthly_returns[target]
    ols.fit(X, y)
    y_pred = ols.predict(X)
    print("In-sample OLS R-squared: %.2f%%" % (100 * r2_score(y, y_pred)))

The R-squared values of these linear regressions are not easy to interpret. The LASSO-selected variables seem to do a generally poor job of predicting the future and, in general, the more feature variables there are, the higher the in-sample coefficient of determination:

In-sample OLS R-squared: 0.95%
No features for Mats_Future. Skipping.
In-sample OLS R-squared: 5.73%
No features for Fin_Future. Skipping.
No features for Ind_Future. Skipping.
No features for Tech_Future. Skipping.
No features for Staples_Future. Skipping.
In-sample OLS R-squared: 14.14%
No features for Health_Future. Skipping.
In-sample OLS R-squared: 3.32%

For comparison purposes, these are the same values when fitting the linear regression to all available features, not only those selected by the LASSO:

In-sample OLS R-squared: 3.78%
In-sample OLS R-squared: 7.63%
In-sample OLS R-squared: 10.22%
In-sample OLS R-squared: 7.56%
In-sample OLS R-squared: 9.46%
In-sample OLS R-squared: 4.83%
In-sample OLS R-squared: 9.79%
In-sample OLS R-squared: 14.54%
In-sample OLS R-squared: 7.23%
In-sample OLS R-squared: 8.31%

This may, at this point, indicate nothing more than overfitting to a multitude of noisy variables. We will abstain from further interpreting the resulting model and use this LASSO and OLS regression as a backbone for an initial backtest. In future publications, we will fit this model dynamically inside a backtest and check our returns by blindly following the generated target predictions. The research notebook with the initial steps is attached at the very end below.
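The overfitting concern can be illustrated with a short experiment: in-sample R-squared from ordinary least squares never decreases when a feature is added, even if the feature is pure noise. The sketch below uses synthetic, independent noise for both features and target; the sample size of 180 months and the ten features are assumptions chosen to resemble the set-up above, not the actual ETF data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
n_months, n_features = 180, 10

# Features and target are independent noise: there is nothing to predict
X = rng.normal(0, 0.05, size=(n_months, n_features))
y = rng.normal(0, 0.05, size=n_months)

r2_by_size = []
for k in range(1, n_features + 1):
    # Fit on the first k noise features and score on the same data
    ols = LinearRegression().fit(X[:, :k], y)
    r2_by_size.append(r2_score(y, ols.predict(X[:, :k])))

for k, r2 in enumerate(r2_by_size, start=1):
    print(f"{k:2d} features: in-sample R-squared = {100 * r2:.2f}%")
```

For pure noise, the expected in-sample R-squared with k features and n observations is about k / (n - 1), so single-digit percentages with ten features are not, by themselves, evidence of predictability. Out-of-sample checks, such as the planned backtest, are what can separate signal from this mechanical effect.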

The information here does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. If you require quantitative model development, deployment, verification, or validation, do not hesitate to contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.
