Why Pairs Trading Might Not Work: Rolling Cointegration Tests

Pairs trading is the dean of modern quantitative trading models: the eldest and the best known. It is the pinnacle of statistical arbitrage, as it is soundly based on fundamental sector-wide market reactions and the short-term mean reversion of prices. Many sources explain pairs trading; the Wikipedia article is presented here as a neutral explanation. There are also many non-neutral accounts, some arguing that it works and others that it does not. Other sources claim that it did work in the past, before the widespread use of computers, and no longer does. Some even go as far as suggesting that it has been the sole cause of financial disaster for several large investors.

There are also many sources that explain the algorithmic procedure behind pairs trading. Sources that explain the procedure using Python are here and here, shown only as examples. The strategy seems to be gaining popularity with the rise of simplified back-testing tools and the wide availability of automated trading tools. It is an appealing strategy due to the incontestability of its fundamental mechanism and the promise of removing market risk in a single stroke: two trades and your beta risk is gone.

We are taking a neutral stance, as usual, and are not going to condemn or recommend pairs trading. We are only going to comment on the first source of bias found in this type of strategy: selection bias in the choice of pair. Most examples that can be found on the internet start with a pair that has supposedly been selected on fundamental grounds. Usually the two instruments belong to the same industry, track the same underlying asset or share some other valid fundamental relation. The premise is that the relative price spread between the two instruments will mean-revert. As an illustration we are going to use the airline sector, whose members should share a common fate and should therefore produce candidates for pairs trading. Airlines are also a relatively high-risk sector, exposed to mega-bankruptcies and with high technical demands on their capital that could lead to large drops in individual companies. We are adding the JETS ETF into the mix to check for possible component-aggregator cointegration:

import researchhelpers as rh
import statsmodels.api as sm
import statsmodels.tsa.stattools as stats
import pandas as pd
from datetime import datetime
from tqdm.autonotebook import tqdm

# QuantBook, Resolution and the history API come from the QuantConnect research environment.
self = QuantBook()
airlines = ['ALK', 'AAL', 'DAL', 'LUV', 'UAL', 'JETS']
symbols = [self.AddEquity(ticker).Symbol for ticker in airlines]
start = datetime(2020,1,1)
end = datetime(2021,1,1)
# Daily closing prices, one column per symbol.
history = self.History(symbols, start, end, Resolution.Daily)['close'].unstack(level=0)

We are checking 2020 to 2021 because the COVID-19 crisis should have affected the whole industry in a similar way and may have amplified any cointegration in the period. Price-wise, the period looks like this:

These prices look cointegrated by eye; will they withstand a statistical test? For this period we will use the cointegration test from the statsmodels library. The test regresses one time series on the other and checks the residuals of that regression for stationarity. It tells us whether there is statistically significant evidence that a linear combination of the two series retains its mean and variance over time:

from statsmodels.tsa.stattools import coint
import itertools

pairs = list(itertools.combinations(history.columns, r=2))

def coint_values(df):
    # Run the Engle-Granger test on every pair of columns in df.
    pairs = tuple(itertools.combinations(list(df.columns), r=2))
    scores = {}
    p_values = {}
    crit_values = {}
    for pair in tqdm(pairs):
        print("Cointegration of {}-{}".format(pair[0], pair[1]))
        values = df[[pair[0]]].join(df[[pair[1]]]).dropna()
        result = coint(values[pair[0]], values[pair[1]])
        scores[pair] = result[0]  # test statistic
        p_values[pair] = result[1]  # approximate p-value
        # Critical values are ordered 1%, 5%, 10%; index 2 is the 10% level.
        crit_values[pair] = result[2][2]
    # The index argument and unstack() below are reconstructed: the source lines are truncated.
    index = pd.MultiIndex.from_tuples(list(pairs))
    scores = pd.DataFrame(list(scores.values()), index=index).unstack()
    scores.columns = [col[1] for col in scores.columns]
    p_values = pd.DataFrame(list(p_values.values()), index=index).unstack()
    p_values.columns = [col[1] for col in p_values.columns]
    crit_values = pd.DataFrame(list(crit_values.values()), index=index).unstack()
    crit_values.columns = [col[1] for col in crit_values.columns]
    return pairs, scores, p_values, crit_values

The results for our airlines and the ETF, as heat maps of cointegration scores and p-values, are these:

pairs, scores, p_values, crit_values = coint_values(history)
rh.plot_hm(scores, title = 'Cointegration Scores', x_rot=45)
rh.plot_hm(p_values, title = 'Cointegration p-values', x_rot=45)

The lowest p-value in the period was for United Airlines (UAL) against American Airlines (AAL). The rest of the airlines do not seem to form cointegrated pairs. Good, let's start trading the difference between these two now, the UAL-AAL pair. This is where most pairs trading strategies induce bias: past cointegration over a biased period does not indicate future cointegration. As we start trading, differences between these two companies will manifest and, without a more robust method, we can end up placing multiple losing trades that we thought were market neutral and bound to mean-revert.
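
Part of this bias is simple arithmetic: picking the lowest p-value out of many tested pairs makes a spurious hit likely. The sketch below is a rough illustration that assumes a 5% significance level and independent tests. Neither assumption comes from the analysis above, and the tests are in fact not independent, since pairs share tickers.

```python
from math import comb

n_tickers = 6   # ALK, AAL, DAL, LUV, UAL and JETS
alpha = 0.05    # assumed significance level

# Number of distinct pairs that can be formed from the tickers.
n_pairs = comb(n_tickers, 2)

# Simplifying assumption: the tests are independent (they are not, pairs share tickers).
p_any_false_positive = 1 - (1 - alpha) ** n_pairs
expected_false_positives = alpha * n_pairs

print("pairs tested:", n_pairs)
print("chance of at least one spurious hit:", round(p_any_false_positive, 3))
print("expected spurious hits:", round(expected_false_positives, 2))
```

Under these assumptions, 15 pairs give a chance of roughly 54% that at least one of them looks cointegrated purely by accident.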

We can increase the robustness of our belief in the cointegration of these two companies, or any others, by rolling the cointegration check back in time. How far back and how often are the companies cointegrated for various look-back windows? If we have a record of past behavior at different resolutions, we can be more confident in the cointegration of the pair.

We need two heavy helpers:

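def compute_moving_coint(df, window):
    # Test one pair (a two-column df) on consecutive, non-overlapping windows.
    scores = []
    p_values = []
    crit_values = []
    indexes = []
    for i in tqdm(range(int(len(df)/window))):
        start = i*window
        end = start + window
        sample = df.iloc[start:end]
        p, s, p_val, c_val = coint_values(sample)
        # The appends below are reconstructed: the source omits how results are collected.
        scores.append(s.iloc[0, 0])
        p_values.append(p_val.iloc[0, 0])
        crit_values.append(c_val.iloc[0, 0])
        indexes.append(sample.index[-1])  # label each window by its last date
    scores = pd.Series(scores, index=indexes)
    crit_values = pd.Series(crit_values, index=indexes)
    return scores, crit_values, indexes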

def compute_moving_set(df, window):
    pairs = tuple(itertools.combinations(list(df.columns), r=2)) 
    scores = {}
    crits = {}
    indexes = {}
    for pair in tqdm(pairs):
        values = df[list(pair)].dropna()
        scores[pair], crits[pair], indexes[pair] = compute_moving_coint(values, window)
    return pd.DataFrame(scores), pd.DataFrame(crits)

With these two functions we first compute moving-window cointegrations from the start of the series forward and then a set of cointegrations for multiple moving windows. For simplicity, the resolution (the step between windows) equals the window size. Running the full model on long, the longer price history:

agg_scores = {}
agg_crits = {}
windows = [22,66,66*2,66*4]
for window in tqdm(windows):
    agg_scores["Score_"+str(window)], agg_crits["Crit_"+str(window)] = compute_moving_set(long, window)
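
Because the step equals the window size, the number of test points shrinks quickly as the window grows. The arithmetic below is a rough sketch that assumes about 252 trading days per year over 4 years; the exact count depends on the data actually loaded.

```python
windows = [22, 66, 66 * 2, 66 * 4]
n_days = 252 * 4  # assumption: about 252 trading days per year over 4 years

# Complete windows per window size, mirroring int(len(df) / window).
counts = {window: n_days // window for window in windows}
# Trailing days that do not fill a complete window and are never tested.
dropped = {window: n_days - counts[window] * window for window in windows}

for window in windows:
    print("window", window, "-> test points", counts[window], ", untested trailing days", dropped[window])
```

Under that assumption the 264-day window yields only three points. Any days left over after the last complete window, which are the most recent ones, are not tested at all.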

After a long while, we have a set of dictionaries, keyed by moving window, holding the cointegration of all the pairs we defined. We can now look at how the cointegration of the UAL-AAL pair behaves over a longer period, using 4 years of data in this case:

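def get_short_name(name, string_length=19):
    # Trim the last string_length characters from a column label.
    return name[0:len(name)-string_length]
company_A = long.columns[0]
name_A = get_short_name(company_A)
company_B = long.columns[5]
name_B = get_short_name(company_B)
roll = windows[1]

# Rolling test statistic next to its critical value for the chosen pair.
rolling_threshold = agg_scores['Score_'+str(roll)][company_A][[company_B]].join(
    agg_crits['Crit_'+str(roll)][company_A][[company_B]], rsuffix='_Critical')
# The start of the original plotting call is missing from the source;
# pandas' built-in plot is used here as a stand-in.
ax = rolling_threshold.plot(
           title='Rolling Cointegration {} Days {}'.format(str(roll), name_A))
ax.set_ylabel('Cointegration Statistic')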

For a 66-day rolling window:

And 132-day rolling window:

Now the cointegration of UAL and AAL is more difficult to justify. If we imagine ourselves rolling back in time along the windows, we would pick it up as a pair only to quickly discard it.

The notebook below plots all possible combinations. It also allows changing the airline industry to any other set of tickers for analysis. To better read and run the research notebook, clone the (empty) algorithm in QuantConnect:

Depending on when we look, and how far back we look, the cointegration of UAL and AAL appears and vanishes. There seems to be very low persistence of statistically discoverable cointegration, at least when using the Engle-Granger test on the airline industry in this period. This is but one example of a typical procedure for accepting or rejecting pairs, and it seems that it would have great difficulty finding pairs that sustain their cointegration for long after it forms.

The strategy produced excess returns in the past, at least according to history. It is possible that the effects driving it are no longer present, that the recent rise in algorithmic trading has completely masked them, or that other, more powerful fundamental forces are now at play thanks to the higher granularity and availability of information. The effects may still be there, but exploiting them may require extensive pair seeking and monitoring at higher resolutions, and even if there are pairs to be found, their apparent lack of persistence could result in a high number of bad positions. It is difficult to justify the benefits of pairs trading strategies, not because their fundamental mechanism is flawed but because of the difficulty of finding reliable, sustained pairs.

The information here does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. If you are in need of quantitative model development, deployment, verification or validation, do not hesitate to contact us. We will also be glad to help you with your machine learning or artificial intelligence challenges.
