In statistics, regression toward the mean (also called regression to the mean, reversion to the mean, and reversion to mediocrity) is the phenomenon where if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean.[2][3][4] Furthermore, when many random variables are sampled and the most extreme results are intentionally picked out, it refers to the fact that (in many cases) a second sampling of these picked-out variables will result in “less extreme” results, closer to the initial mean of all of the variables.
Mathematically, the strength of this “regression” effect is dependent on whether or not all of the random variables are drawn from the same distribution, or if there are genuine differences in the underlying distributions for each random variable. In the first case, the “regression” effect is statistically likely to occur, but in the second case, it may occur less strongly or not at all.
Regression toward the mean is thus a useful concept to consider when designing any scientific experiment, data analysis, or test, which intentionally selects the most extreme events – it indicates that follow-up checks may be useful in order to avoid jumping to false conclusions about these events; they may be genuine extreme events, a completely meaningless selection due to statistical noise, or a mix of the two cases.
Mathematically, a continuous mean-reverting time series can be represented by an Ornstein-Uhlenbeck stochastic differential equation of the following form:

dXt = θ(μ − Xt) dt + σ dWt

Where θ is the rate of reversion to the mean, μ is the long-run mean of the process, σ is the volatility of the process and, finally, Wt is a Wiener process. The equation implies that the expected change of the time series in the next period is proportional to the difference between the mean and the current value, with the addition of Gaussian noise.
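As an illustration, these dynamics can be simulated with a simple Euler-Maruyama discretisation (a sketch; the parameter values and the function name simulate_ou are illustrative, not from any particular library):

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, dt, n, seed=0):
    """Euler-Maruyama simulation of dX = theta * (mu - X) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        dw = rng.normal(0.0, np.sqrt(dt))  # Wiener increment over dt
        x[t] = x[t - 1] + theta * (mu - x[t - 1]) * dt + sigma * dw
    return x

# Start well above the mean; the drift term pulls the path back toward mu
path = simulate_ou(theta=2.0, mu=0.0, sigma=0.3, x0=2.0, dt=0.01, n=2000)
```

With theta = 2 and dt = 0.01 the pull toward the mean dominates after a few hundred steps, and the path then fluctuates around mu.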
Mean reversion can be visualised as the line of a linear regression fitted through the series: the series repeatedly crosses back over that fitted line.
A key concept in testing for mean reversion is that of stationarity:
In mathematics and statistics, a stationary process (also called a strict/strictly stationary process or strong/strongly stationary process) is a stochastic process whose statistical properties, such as mean and variance, do not change over time. More formally, the joint probability distribution of the process remains the same when shifted in time. This implies that the process is statistically consistent across different time periods. Because many statistical procedures in time series analysis assume stationarity, non-stationary data are frequently transformed to achieve stationarity before analysis.
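A minimal numerical illustration of the difference (assuming nothing beyond NumPy): white noise is stationary, while its cumulative sum, a random walk, is not, because the walk's variance grows with time.

```python
import numpy as np

rng = np.random.default_rng(42)
noise = rng.normal(size=4000)   # white noise: mean and variance stable over time
walk = np.cumsum(noise)         # random walk: spread grows with time

# The two halves of the stationary series have almost identical variance,
# while the random walk's overall spread dwarfs that of its increments.
var_first = np.var(noise[:2000])
var_second = np.var(noise[2000:])
```

Comparing np.var(walk) with np.var(noise) on the same draw makes the non-stationarity obvious: the walk's variance is orders of magnitude larger.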
The Augmented Dickey-Fuller (ADF) test provides a quick, confirmatory check of whether a time series is stationary or non-stationary. It is built on the observation that in a mean-reverting series the next move depends on the current level: if the value is above the mean the next move tends to be downward, and if it is below the mean the next move tends to be upward. The test checks whether this dependence on the level is statistically present; the null hypothesis is that it is not, i.e. that the series contains a unit root.
In the Python code below we will simply interpret the result using the p-value from the test. A p-value below a specified threshold (we are going to use 5%) lets us reject the null hypothesis of a unit root, suggesting the series is stationary; a p-value above the threshold means we fail to reject the null, suggesting the series is non-stationary.
import numpy as np
from statsmodels.regression.linear_model import OLS
from statsmodels.tsa.tsatools import lagmat, add_trend
from statsmodels.tsa.adfvalues import mackinnonp
def adf(ts):
    """
    Augmented Dickey-Fuller unit root test
    """
    # Make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # We use 1 as maximum lag in our calculations
    maxlag = 1
    # Calculate the discrete difference
    tsdiff = np.diff(ts)
    # Create a 2d array of lags, trim invalid observations on both sides
    tsdall = lagmat(tsdiff[:, None], maxlag, trim='both', original='in')
    # Number of usable observations after lagging
    nobs = tsdall.shape[0]
    # Replace the 0-lag difference column with the lagged level of the series
    tsdall[:, 0] = ts[-nobs - 1:-1]
    tsdshort = tsdiff[-nobs:]
    # Calculate the linear regression using an ordinary least squares model
    results = OLS(tsdshort, add_trend(tsdall[:, :maxlag + 1], 'c')).fit()
    adfstat = results.tvalues[0]
    # Get approx p-value from a precomputed table (from stattools)
    pvalue = mackinnonp(adfstat, 'c', N=1)
    return pvalue
This code can also be validated by referencing the function adfuller, included in the Python module statsmodels.
One can also test the stationarity by using the Hurst exponent. This measures the speed of diffusion of the series, which for a mean-reverting series should be slower than for a geometric random walk; the speed of diffusion is measured by the variance of the lagged differences.
We can estimate the Hurst exponent with the following code from Corrius (2018):
def hurst(ts):
    """
    Returns the Hurst exponent of the time series vector ts
    """
    # Make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # Helper variables used during calculations
    lagvec = []
    tau = []
    # Create the range of lag values
    lags = range(2, 100)
    # Step through the different lags
    for lag in lags:
        # Produce the vector of differences at this lag
        pdiff = np.subtract(ts[lag:], ts[:-lag])
        # Write the different lags into a vector
        lagvec.append(lag)
        # Store the square root of the standard deviation of the differences
        tau.append(np.sqrt(np.std(pdiff)))
    # Linear fit to the double-log graph
    m = np.polyfit(np.log10(np.asarray(lagvec)),
                   np.log10(np.asarray(tau).clip(min=0.0000000001)),
                   1)
    # Return the calculated Hurst exponent
    return m[0] * 2.0
H = 0.5 indicates a geometric random walk; H < 0.5 a mean-reverting series; and H > 0.5 a trending series. H is also an indicator of the degree of mean reversion or trendiness: as H decreases towards 0 the series is more strongly mean reverting, and as it increases towards 1 it is more strongly trending.
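These regimes can be sanity-checked numerically. The snippet below restates the hurst function from above so it is self-contained, then applies it to a simulated random walk (H near 0.5) and to white noise, which mean-reverts strongly around its constant mean (H near 0):

```python
import numpy as np

def hurst(ts):
    """Hurst exponent via the scaling of lagged differences (as above)."""
    ts = np.asarray(ts)
    lags = range(2, 100)
    tau = [np.sqrt(np.std(ts[lag:] - ts[:-lag])) for lag in lags]
    m = np.polyfit(np.log10(np.asarray(list(lags))),
                   np.log10(np.asarray(tau).clip(min=1e-10)),
                   1)
    return m[0] * 2.0

rng = np.random.default_rng(1)
h_walk = hurst(np.cumsum(rng.normal(size=5000)))   # random walk: H near 0.5
h_noise = hurst(rng.normal(size=5000))             # white noise: H near 0
```

A persistent (trending) series would, by the same logic, give a value above 0.5.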
To make sure it is not a random walk we can test the statistical significance of the H value with the Variance Ratio Test:
import numpy as np

def variance_ratio(ts, lag=2):
    """
    Returns the variance ratio test statistic
    """
    # Make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # Apply the formula to calculate the statistic
    n = len(ts)
    mu = sum(ts[1:n] - ts[:n - 1]) / n
    m = (n - lag + 1) * (1 - lag / n)
    b = sum(np.square(ts[1:n] - ts[:n - 1] - mu)) / (n - 1)
    t = sum(np.square(ts[lag:n] - ts[:n - lag] - lag * mu)) / m
    return t / (lag * b)

# Source: Corrius (2018)
The test compares the variance of the lag-period differences with lag times the variance of the one-period differences. For a random walk this ratio is close to one; a ratio well below one points to mean reversion, while a ratio well above one points to trending behaviour.
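A quick check of this interpretation (restating variance_ratio from above so the snippet is self-contained): for a random walk the statistic comes out near one, while for white noise, which mean-reverts around a constant level, it comes out well below one.

```python
import numpy as np

def variance_ratio(ts, lag=2):
    """Variance ratio statistic (as above)."""
    ts = np.asarray(ts)
    n = len(ts)
    mu = sum(ts[1:n] - ts[:n - 1]) / n
    m = (n - lag + 1) * (1 - lag / n)
    b = sum(np.square(ts[1:n] - ts[:n - 1] - mu)) / (n - 1)
    t = sum(np.square(ts[lag:n] - ts[:n - lag] - lag * mu)) / m
    return t / (lag * b)

rng = np.random.default_rng(3)
vr_walk = variance_ratio(np.cumsum(rng.normal(size=5000)))   # near 1
vr_noise = variance_ratio(rng.normal(size=5000))             # near 1/lag = 0.5
```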
So how long will it take for the time series to mean revert, i.e. to diffuse back to the mean? This is measured by the 'half-life' of the mean reversion.
import numpy as np

def half_life(ts):
    """
    Calculates the half life of a mean reversion
    """
    # Make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # delta = p(t) - p(t-1)
    delta_ts = np.diff(ts)
    # Calculate the vector of lagged values (lag = 1) plus a constant term
    lag_ts = np.vstack([ts[:-1], np.ones(len(ts) - 1)]).T
    # Regress the deltas on the lagged values; the slope is the
    # mean-reversion coefficient (negative for a reverting series)
    beta = np.linalg.lstsq(lag_ts, delta_ts, rcond=None)[0]
    # Compute and return the half life, -ln(2) divided by the slope
    return -np.log(2) / beta[0]

# Source: Corrius (2018)
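To see this in action (a sketch; the AR(1) coefficient is illustrative), we can apply the half-life calculation, regressing the one-period changes on the lagged level and taking -ln(2) divided by the slope, to a simulated AR(1) series with coefficient 0.9, whose theoretical half-life is ln(2)/(1 - 0.9), about 6.9 periods:

```python
import numpy as np

def half_life(ts):
    """Half life of mean reversion from the slope of delta vs lagged level."""
    ts = np.asarray(ts)
    delta_ts = np.diff(ts)
    lag_ts = np.vstack([ts[:-1], np.ones(len(ts) - 1)]).T
    beta = np.linalg.lstsq(lag_ts, delta_ts, rcond=None)[0]
    return -np.log(2) / beta[0]

rng = np.random.default_rng(7)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.9 * x[t - 1] + rng.normal()   # AR(1): true half-life ~ 6.9

hl = half_life(x)
```

The estimate should land close to the theoretical value, with sampling noise shrinking as the series gets longer.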
So we can see that we can understand mean reversion in programming, namely for fintech, through the following steps:
Test for stationarity using the Augmented Dickey-Fuller test (ADF test)
Confirm by estimating the Hurst exponent (H)
Check statistical significance with the variance ratio test
Test for the time to mean revert using the half-life calculation