Pairs Trading in Python: Cointegration and Mean Reversion

Every directional strategy has the same hidden bet baked into it: that the market keeps going the way it has been going. Long-only momentum, trend following, even most mean-reversion systems all live or die by the overall drift of the market. When a year like 2022 arrives, correlations converge toward 1, and everything falls together, the “diversified” book turns out to have been one trade all along.

Pairs trading is the classic escape hatch. Instead of betting on where the market goes, you bet on the relationship between two securities that historically move together: when the gap between them stretches too far, you bet it snaps back. Done right, the position is approximately market-neutral. It can make money in a bull market, a bear market, or a flat one, because it depends on the spread between two assets, not on the level of either.

The catch is that “two things that move together” is not the same as “two things you can trade against each other.” Correlation is not enough, and trading on correlation alone is one of the most expensive beginner mistakes in statistical arbitrage. The tool that actually matters is cointegration.

By the end of this article you will have:

  • A clear intuition for why cointegration — not correlation — is what makes a pair tradeable.
  • A working statsmodels pipeline that tests a pair and builds a tradeable spread.
  • A market-neutral z-score backtest on real ETF data, with the look-ahead traps named out loud.

1. Why a single-asset strategy is a disguised market bet

Suppose you build a beautiful mean-reversion system on a single stock. It buys dips, sells rips, and the equity curve looks smooth. Then the company’s sector rotates out of favor for eighteen months and the stock grinds down the whole time. Your “mean” was a moving target, and every dip you bought was a falling knife.

The problem is that a single price series has no anchor. There is no law of physics that says a stock has to return to any particular level. Its “fair value” drifts with earnings, rates, sentiment, and a hundred other things you are not modeling.

A pair gives you an anchor. If two companies are in the same business — two soft-drink makers, two oil majors, two country ETFs driven by the same commodities — then whatever macro force pushes one up tends to push the other up too. The difference between them, the spread, has a much better claim to being mean-reverting than either price on its own. When one runs ahead of the other for no fundamental reason, you short the expensive leg, buy the cheap leg, and wait for the gap to close. Your exposure to the market as a whole roughly cancels out.

That cancellation is the whole point. It is also where the rigor has to come in, because not every correlated pair has a stable spread.


2. Cointegration vs correlation (the intuition)

Here is the distinction that trips up most people, with no heavy math.

Correlation measures whether two series move in the same direction day to day. Two random walks can be highly correlated over a window and then wander arbitrarily far apart forever. Correlation says nothing about whether they stay close.

Cointegration is stronger: it says that some linear combination of the two series is stationary — it has a stable mean and reverts to it. The two prices can each wander wherever they like, but they are tied together so that the spread between them keeps coming back.

The standard mental picture is the drunk and her dog. The drunk staggers home on a random walk; her dog wanders on its own random walk. Each path on its own is unpredictable and can go anywhere. But they are joined by a leash. The distance between them is bounded — it stretches and contracts but always pulls back. The two positions are non-stationary; the leash (the spread) is stationary. That leash is cointegration, and it is exactly what you trade.

Crucially, two series can be strongly correlated without being cointegrated (they drift apart for good), and — more rarely — cointegrated without being strongly correlated day to day. For pairs trading, cointegration is the property you need, and there is a formal test for it.
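
A small simulation makes the distinction concrete. The sketch below uses made-up series, not market data: pair A is two random walks whose daily changes are about 0.9 correlated, and pair B is a random walk plus a mean-reverting "leash" modeled as an AR(1) process. The seed, sample length, and coefficients are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 5000

# Pair A: daily changes share a common shock (correlation about 0.9),
# but nothing ties the price levels together.
common = rng.normal(size=n)
a1 = np.cumsum(np.sqrt(0.9) * common + np.sqrt(0.1) * rng.normal(size=n))
a2 = np.cumsum(np.sqrt(0.9) * common + np.sqrt(0.1) * rng.normal(size=n))

# Pair B: one random walk plus a stationary AR(1) "leash".
b1 = np.cumsum(rng.normal(size=n))
shocks = rng.normal(size=n)
leash = np.zeros(n)
for t in range(1, n):
    leash[t] = 0.9 * leash[t - 1] + shocks[t]
b2 = b1 + leash


def daily_corr(p, q):
    """Correlation of day-to-day changes, not of price levels."""
    return np.corrcoef(np.diff(p), np.diff(q))[0, 1]


corr_a, corr_b = daily_corr(a1, a2), daily_corr(b1, b2)
spread_a, spread_b = a1 - a2, b2 - b1

print(f"pair A: daily corr {corr_a:.2f}, spread std {spread_a.std():.1f}")
print(f"pair B: daily corr {corr_b:.2f}, spread std {spread_b.std():.1f}")
```

Under these assumptions pair A should show the higher day-to-day correlation, yet its spread is itself a random walk whose dispersion keeps growing with the sample length. Pair B's spread stays in a band set by the leash, which is the property a pairs trade relies on.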


3. Setting up the environment

pip install yfinance statsmodels pandas numpy matplotlib

Imports for the whole article:

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller

4. Getting data and choosing a candidate pair

A good candidate pair should have an economic reason to move together — that is what stops the relationship from being a coincidence that evaporates out of sample. A textbook example is EWA (the iShares Australia ETF) and EWC (iShares Canada). Both economies are commodity-heavy and rate-sensitive, so the two ETFs are pushed around by similar macro forces.

tickers = ["EWA", "EWC"]
prices = yf.download(tickers, start="2010-01-01", end="2025-01-01",
                     auto_adjust=True)["Close"].dropna()

ewa = prices["EWA"]
ewc = prices["EWC"]

Plot them on the same axis and you will see two lines that clearly travel together but are not glued — exactly the profile you want before running any statistical test.

prices.plot(figsize=(14, 6), title="EWA and EWC closing prices")
plt.ylabel("Price (USD)")
plt.show()

Choosing the pair on economic grounds first and testing it second matters more than it looks. If you skip the economic story and let an algorithm scan thousands of pairs for the lowest p-value, you walk straight into the data-snooping trap covered in section 8.


5. Testing for cointegration

statsmodels ships the Engle-Granger two-step test as a single function, statsmodels.tsa.stattools.coint. The null hypothesis is no cointegration; a small p-value lets you reject it.

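# Dependent series first: this matches the section 6 regression of EWC on EWA.
# Engle-Granger results can differ when the order of the two series is swapped.
score, pvalue, _ = coint(ewc, ewa)
print(f"Engle-Granger cointegration p-value: {pvalue:.4f}")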

A p-value below 0.05 is the usual threshold to treat the pair as cointegrated. Do not stop at a single number, though — the test is sensitive to the sample window. A pair that passes on 2010–2025 may fail on 2015–2020. Re-run the test on a few sub-periods before you trust it.

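For intuition, it helps to also run an augmented Dickey-Fuller test directly on the spread once you have built it (next section). Engle-Granger does essentially that under the hood: it regresses one leg on the other and runs an ADF-type test on the residuals, with critical values adjusted for the fact that the hedge ratio was estimated. Seeing the ADF p-value on the spread you actually trade makes the result concrete.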


6. Building the spread and the z-score signal

Cointegration tells you a stationary combination exists; ordinary least squares tells you the hedge ratio that defines it. Regress one leg on the other and the slope is how many units of EWA you hold against one unit of EWC.

X = sm.add_constant(ewa)
ols = sm.OLS(ewc, X).fit()
hedge_ratio = ols.params["EWA"]

spread = ewc - hedge_ratio * ewa

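Confirm the spread is stationary with an augmented Dickey-Fuller test. The null here is non-stationarity (a unit root), so again you want a small p-value. Treat it as a sanity check: because the hedge ratio was itself estimated from the same data, the plain ADF p-value tends to look somewhat better than the Engle-Granger one.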

adf_stat, adf_p, *_ = adfuller(spread)
print(f"ADF p-value on the spread: {adf_p:.4f}")

Now turn the spread into a tradeable signal. The raw spread has units of dollars and a level that means nothing on its own; what you care about is how stretched it is right now relative to its recent normal. That is a rolling z-score:

window = 30
spread_mean = spread.rolling(window).mean()
spread_std = spread.rolling(window).std()
zscore = (spread - spread_mean) / spread_std

zscore.plot(figsize=(14, 5), title="Spread z-score (EWC vs EWA)")
plt.axhline(2.0, color="r", ls="--")
plt.axhline(-2.0, color="g", ls="--")
plt.axhline(0.0, color="k", ls="-", lw=0.5)
plt.show()

When the z-score spikes above +2, the spread is unusually rich: short it (short one unit of EWC, long hedge_ratio units of EWA). When it drops below -2, the spread is unusually cheap: buy it (long EWC, short EWA in the same ratio). When it returns toward zero, close the position. Using a rolling mean and standard deviation rather than the full-sample values is deliberate: it keeps the signal backward-looking, which the next sections lean on hard.


7. Backtesting the pairs trading strategy in Python

The trading rules are a clean z-score band: enter at ±2, exit when the z-score comes back inside ±0.5.

entry_z, exit_z = 2.0, 0.5    # named *_z so the built-in exit() is not shadowed

longs  = zscore < -entry_z    # spread cheap  -> long the spread
shorts = zscore >  entry_z    # spread rich   -> short the spread
exits  = zscore.abs() < exit_z

position = pd.Series(np.nan, index=zscore.index)
position[longs]  = 1
position[shorts] = -1
position[exits]  = 0
position = position.ffill().fillna(0)
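
To see what the forward-fill does, here is the same logic run on a seven-bar toy z-score series. The numbers are invented for illustration; the point is that a position opened at a band breach is carried until the z-score comes back inside the exit band.

```python
import numpy as np
import pandas as pd

z = pd.Series([0.0, 2.5, 1.0, 0.3, -2.2, -1.0, 0.1])   # toy z-scores
entry_z, exit_z = 2.0, 0.5

pos = pd.Series(np.nan, index=z.index)
pos[z < -entry_z] = 1          # below -2: long the spread
pos[z > entry_z] = -1          # above +2: short the spread
pos[z.abs() < exit_z] = 0      # inside the exit band: flat
pos = pos.ffill().fillna(0)    # carry the last decision forward

print(pos.tolist())   # [0.0, -1.0, -1.0, 0.0, 1.0, 1.0, 0.0]
```

Bars 2 and 5 sit between the exit and entry bands, so they inherit the position from the bar before them.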

Now the part that separates an honest backtest from a marketing chart. The daily profit of holding the hedge-ratio-weighted spread is the position (set yesterday) times the change in the spread. Note that the two legs are weighted by the hedge ratio, which makes the book beta-hedged but not necessarily dollar-neutral. The shift(1) is non-negotiable: you can only act on a z-score after the bar that produced it has closed.

spread_pnl = spread.diff()                    # daily P&L of one spread unit, in price units
gross = ewc + abs(hedge_ratio) * ewa          # capital tied up in both legs
strategy_ret = position.shift(1) * spread_pnl / gross.shift(1)

equity = (1 + strategy_ret.fillna(0)).cumprod()
equity.plot(figsize=(14, 6),
            title="Market-neutral pairs trading equity curve")
plt.ylabel("Growth of 1 unit")
plt.show()

Dividing the spread PnL by the gross capital of the two legs turns the price-unit profit into a percentage return, so the metrics below are interpretable:

def sharpe(r):
    r = r.dropna()
    return np.sqrt(252) * r.mean() / r.std()

def max_dd(equity):
    peak = equity.cummax()
    return (equity / peak - 1).min()

print("Sharpe :", round(sharpe(strategy_ret), 2))
print("Max DD :", round(max_dd(equity), 3))

What you should expect from a real pair like this: a modest Sharpe, long flat stretches where the spread sits inside the band and you hold nothing, and — the selling point — an equity curve whose shape has very little to do with the S&P 500’s. That low correlation to the broad market is the entire reason to bother. A pairs strategy is not there to beat buy-and-hold on raw return; it is there to add a return stream that does not move with everything else you own.


8. The traps that quietly ruin pairs trades

Pairs trading looks deceptively simple, and the simple version hides several ways to fool yourself.

  • Look-ahead in the hedge ratio. The OLS above is fit on the entire sample, so your 2011 spread was defined using a beta computed with 2024 data. In production you must estimate the hedge ratio on a rolling or expanding window of past data only — see section 9. The same caution is why the z-score uses a rolling window rather than the full-sample mean and standard deviation.
  • Cointegration is not permanent. A pair can be cointegrated for a decade and then decouple — a merger, a regulatory change, a commodity shock, a constituent change in one of the ETFs. Re-test on a rolling window and be ready to retire a pair when the relationship breaks. A blown-up spread that never reverts is the pairs trader’s version of a falling knife.
  • Data snooping when scanning pairs. If you brute-force every pair in a 500-stock universe, that is 124,750 pairs to test. At a 5% threshold you would expect roughly 6,200 “significant” pairs by pure chance, even if none of them were truly cointegrated. Picking the lowest p-value out of that pile is overfitting one level up. Either start from an economic hypothesis (as we did) or apply a multiple-testing correction and validate out of sample, which is the same discipline a walk-forward optimization brings to parameter tuning.
  • Costs and the short leg. A pairs strategy trades both legs and flips often, so commissions roughly double and turnover is high. The short leg also carries borrow costs and can occasionally be hard to borrow. Subtract a realistic round-trip cost — cost * abs(position.diff()) — from the returns and watch how much of the edge survives. Marginal pairs frequently do not survive even 5 basis points.
  • In-sample band tuning. The ±2 / ±0.5 thresholds and the 30-day window are knobs. Tune them on the full history and you are curve-fitting. Choose them a priori or validate them out of sample.
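
The cost adjustment from the bullet on costs can be sketched as follows. The helper name net_of_costs and the 5 bps default are illustrative assumptions; a realistic figure depends on your broker, the spread you cross, and borrow fees on the short leg.

```python
import pandas as pd


def net_of_costs(gross_ret, position, cost=0.0005):
    """Subtract `cost` per unit of position change from the gross returns.

    cost=0.0005 (5 bps) is an assumed placeholder, not a measured figure.
    """
    # 1 for an entry or an exit, 2 for a direct flip from long to short
    turnover = position.diff().abs().fillna(0)
    return gross_ret - cost * turnover


position = pd.Series([0, 1, 1, 0, -1, 1, 0], dtype=float)
gross_ret = pd.Series(0.0, index=position.index)   # zero gross return isolates the cost drag

net = net_of_costs(gross_ret, position)
print(round(net.sum(), 6))   # six units of turnover at 5 bps each
```

With the backtest variables from section 7 the call would be net_of_costs(strategy_ret, position). Comparing the Sharpe ratio before and after is a quick way to see whether the edge is larger than the friction.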

A strategy whose failure modes you can list is one you can manage. A pairs backtest that looks flawless is usually one with the look-ahead left in.
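
To put numbers on the data-snooping point, the arithmetic is short enough to check directly. The sketch assumes each pair is tested once and that no pair is truly cointegrated; Bonferroni is used here only as the simplest example of a multiple-testing correction, not as the one the strategy requires.

```python
from math import comb

n_assets = 500
alpha = 0.05

n_pairs = comb(n_assets, 2)                    # unordered pairs
expected_false_positives = alpha * n_pairs     # if no pair is truly cointegrated
bonferroni_alpha = alpha / n_pairs             # per-test threshold after correction

print(n_pairs)                                 # 124750
print(round(expected_false_positives))
print(f"{bonferroni_alpha:.1e}")
```

A per-test threshold that small is very hard to clear with a few years of daily data, which is one more argument for starting from a short, economically motivated list of candidates.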


9. Beyond a static hedge ratio

The single biggest weakness above is the fixed, full-sample beta. The relationship between two assets drifts over time, and a static hedge ratio slowly goes stale. Two ways to fix it:

Rolling OLS. Re-estimate the hedge ratio on a trailing window so the spread is always defined by recent history only:

roll_window = 252  # one year
hedge_roll = (ewc.rolling(roll_window)
                 .cov(ewa)
              / ewa.rolling(roll_window).var())
spread_dynamic = ewc - hedge_roll * ewa

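This removes the look-ahead in the hedge ratio, provided the ratio applied to each bar's position is the one known at the prior close, and it adapts to a drifting relationship. The costs are a noisier hedge ratio and a warm-up period of roll_window bars with no estimate at all.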
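
One subtlety with a time-varying ratio: differencing spread_dynamic directly mixes price moves with changes in the hedge ratio itself, so it is not the P&L of any position you actually held. A minimal sketch of the bookkeeping, assuming you hold one unit of the dependent leg against yesterday's ratio over each bar (the helper name rolling_hedge_pnl is hypothetical):

```python
import numpy as np
import pandas as pd


def rolling_hedge_pnl(y, x, position, window=252):
    """P&L in price units of a spread whose hedge ratio is re-estimated daily."""
    beta = y.rolling(window).cov(x) / x.rolling(window).var()
    beta_held = beta.shift(1)                     # ratio known at the prior close
    unit_pnl = y.diff() - beta_held * x.diff()    # one unit of y against beta_held units of x
    return position.shift(1) * unit_pnl, beta


# Synthetic check: with an exact 2:1 relationship the hedge should be perfect.
rng = np.random.default_rng(1)
x = pd.Series(50 + rng.normal(0, 1, 300).cumsum())
y = 2.0 * x
position = pd.Series(1.0, index=x.index)

pnl, beta = rolling_hedge_pnl(y, x, position, window=50)
print(beta.dropna().round(6).unique(), pnl.dropna().abs().max())
```

On the synthetic series the estimated ratio sits at 2 and the hedged P&L is zero up to floating-point error, which is a useful sanity check before pointing the function at ewc and ewa.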

Kalman filter. A more elegant approach treats the hedge ratio as a hidden state that evolves smoothly and updates it one observation at a time. There is no window length to choose, and the estimate is typically less jittery than rolling OLS, although the process-noise and observation-noise variances become the tuning knobs instead. The pykalman library makes this a few lines; it is a natural follow-up topic in its own right.
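
For a feel of what the filter does, here is a hand-rolled NumPy sketch. It is not the pykalman API. It assumes the hedge ratio and intercept each follow a random walk, and the values of delta and obs_var are illustrative guesses that would need tuning on real data.

```python
import numpy as np


def kalman_hedge_ratio(y, x, delta=1e-4, obs_var=1e-3):
    """Track y_t = beta_t * x_t + alpha_t + noise with random-walk beta and alpha.

    delta sets how fast the coefficients may drift and obs_var is the assumed
    observation-noise variance; both are assumptions, not estimated values.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    state = np.zeros(2)                        # [beta, alpha], starting from zero
    P = np.zeros((2, 2))                       # covariance of the state estimate
    Vw = delta / (1.0 - delta) * np.eye(2)     # process-noise covariance
    betas, alphas = np.empty(n), np.empty(n)
    for t in range(n):
        H = np.array([x[t], 1.0])              # observation vector
        R = P + Vw                             # predict: uncertainty grows
        err = y[t] - H @ state                 # forecast error for this bar
        Q = H @ R @ H + obs_var                # forecast-error variance
        K = R @ H / Q                          # Kalman gain
        state = state + K * err                # update the estimate
        P = R - np.outer(K, H) @ R
        betas[t], alphas[t] = state
    return betas, alphas


# Synthetic check with a known relationship: y = 1.5 * x + 2 + noise.
rng = np.random.default_rng(0)
n = 1000
x = 50 + 10 * np.sin(np.linspace(0, 6 * np.pi, n))
y = 1.5 * x + 2.0 + rng.normal(0, 0.2, n)

betas, alphas = kalman_hedge_ratio(y, x, obs_var=0.04)
print(f"final beta estimate: {betas[-1]:.3f}")
```

Note that betas[t] already uses bar t's prices, so shift it by one bar before sizing a position with it, for the same reason the backtest lags the z-score.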

For baskets of three or more assets, the Engle-Granger test no longer applies cleanly — reach for the Johansen test (statsmodels.tsa.vector_ar.vecm.coint_johansen), which finds multiple cointegrating relationships at once.


10. Where to go next

A few directions to push this further:

  • Kalman-filter hedge ratios. Replace the static or rolling beta with a Kalman filter for a smoothly time-varying hedge ratio — the standard production upgrade for a pairs book.
  • Half-life of mean reversion. Fit an Ornstein-Uhlenbeck process to the spread to estimate how fast it reverts, then size your z-score window and holding period to match it instead of guessing 30 days.
  • Add a regime filter. Pairs relationships behave differently in calm versus stressed markets. Gate trades on the market state using a regime model — a clean pairing with the HMM market regimes approach.
  • Scale to a universe. Use vectorbt to scan and backtest hundreds of cointegrated pairs quickly, then borrow the falling-knife defenses from the Bollinger Bands and RSI mean-reversion article to manage the ones that decouple.

Conclusion

Pairs trading is the cleanest way to express a view that has nothing to do with where the market is headed. The strategy itself is a few lines of statsmodels — an OLS hedge ratio, a stationary spread, a z-score band. The hard part, and the part that decides whether the thing makes money out of sample, is the discipline around it: testing cointegration honestly, refusing to snoop thousands of pairs, lagging every signal, and accepting that even a good pair can decouple without warning. Get that discipline right and you have something rare in a retail toolkit — a return stream that genuinely zigs when the rest of your book zags.