Tag: statistical problems

  • Mean Reversion and Stationarity in Statistics

    In statistics, regression toward the mean (also called regression to the mean, reversion to the mean, and reversion to mediocrity) is the phenomenon where if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean.[2][3][4] Furthermore, when many random variables are sampled and the most extreme results are intentionally picked out, it refers to the fact that (in many cases) a second sampling of these picked-out variables will result in “less extreme” results, closer to the initial mean of all of the variables.

    Mathematically, the strength of this “regression” effect is dependent on whether or not all of the random variables are drawn from the same distribution, or if there are genuine differences in the underlying distributions for each random variable. In the first case, the “regression” effect is statistically likely to occur, but in the second case, it may occur less strongly or not at all.

    Regression toward the mean is thus a useful concept to consider when designing any scientific experiment, data analysis, or test, which intentionally selects the most extreme events – it indicates that follow-up checks may be useful in order to avoid jumping to false conclusions about these events; they may be genuine extreme events, a completely meaningless selection due to statistical noise, or a mix of the two cases.

    source: https://en.wikipedia.org/wiki/Regression_toward_the_mean
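    The selection effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: all variables are drawn from the same standard normal distribution, the two samplings are independent, and the sample size, the cut-off of 100 and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 10,000 random variables, all drawn from the same distribution (mean 0, std 1)
n = 10_000
first = rng.normal(loc=0.0, scale=1.0, size=n)
second = rng.normal(loc=0.0, scale=1.0, size=n)  # independent second sampling

# Intentionally pick out the 100 most extreme (largest) first-round results
top = np.argsort(first)[-100:]

first_mean = first[top].mean()    # far above the overall mean by construction
second_mean = second[top].mean()  # same variables, sampled again

print(f"selected group, first sample:  {first_mean:.2f}")
print(f"selected group, second sample: {second_mean:.2f}")
```

    Because every variable here shares one distribution, the second sampling of the selected group should land close to the overall mean of 0. If the variables had genuinely different underlying means, part of the first-round gap would persist.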

    How to test for mean reversion:

    Mathematically, a continuous mean-reverting time series can be represented by an Ornstein-Uhlenbeck stochastic differential equation in the following form:

    $$ dx_t = \theta (\mu - x_t)\,dt + \sigma\,dW_t $$

    where θ is the rate of reversion to the mean, μ is the mean value of the process, σ is the volatility (the scale of the random shocks, not the variance of the process itself) and, finally, W_t is a Wiener process.
    The given equation implies that the change of the time series in the next period is proportional to the difference between the mean and the current value, with the addition of Gaussian noise.

    source: https://medium.com/bluekiri/simple-stationarity-tests-on-time-series-ad227e2e6d48
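    To get a feel for such a process, one can simulate it. The sketch below assumes the usual Ornstein-Uhlenbeck form dx_t = θ(μ − x_t)dt + σ dW_t and discretises it with a simple Euler-Maruyama step. The function name simulate_ou and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, n_steps, dt=1.0, seed=0):
    """Euler-Maruyama discretisation of dx = theta*(mu - x)*dt + sigma*dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        # Wiener increment over a step of length dt: Gaussian with variance dt
        dw = rng.normal(0.0, np.sqrt(dt))
        # pull toward the mean plus a random shock
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dw
    return x

# Start well above the mean and watch the path drift back toward mu
path = simulate_ou(theta=0.1, mu=100.0, sigma=1.0, x0=120.0, n_steps=2000)
print("first values:", np.round(path[:3], 2))
print("mean of last 500 values:", round(path[-500:].mean(), 2))
```

    With these illustrative values the path starts 20 units above μ and the late part of the path should hover around μ, which is the pull toward the mean that the equation describes.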

    We can see mean reversion as the line of linear regression as in this plot:

    A key concept in testing for mean reversion is that of stationarity:

    In mathematics and statistics, a stationary process (also called a strict/strictly stationary process or strong/strongly stationary process) is a stochastic process whose statistical properties, such as mean and variance, do not change over time. More formally, the joint probability distribution of the process remains the same when shifted in time. This implies that the process is statistically consistent across different time periods. Because many statistical procedures in time series analysis assume stationarity, non-stationary data are frequently transformed to achieve stationarity before analysis.

    source: https://en.wikipedia.org/wiki/Stationary_process
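    As a rough illustration of what "statistical properties do not change over time" means in practice, the sketch below compares the variance of the first and second half of two synthetic series: Gaussian noise (stationary) and the random walk obtained by cumulatively summing that noise (non-stationary). The series, the seed and the helper half_variances are assumptions made for this example, and comparing halves is an informal check, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
noise = rng.normal(size=4000)  # stationary: i.i.d. Gaussian noise
walk = np.cumsum(noise)        # non-stationary: random walk built from that noise

def half_variances(series):
    """Variance of the first half and of the second half of a series."""
    first, second = np.array_split(series, 2)
    return first.var(), second.var()

noise_v1, noise_v2 = half_variances(noise)
walk_v1, walk_v2 = half_variances(walk)

print(f"noise: {noise_v1:.2f} vs {noise_v2:.2f}")
print(f"walk:  {walk_v1:.2f} vs {walk_v2:.2f}")
```

    For the noise series the two halves should report nearly the same variance, while the random walk typically wanders far enough that its halves look like different processes.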

    (source: https://www.youtube.com/watch?v=I3NjeRXIs5k, accessed 10/3/25)

    The Augmented Dickey-Fuller (ADF) test provides a quick check of whether your time series is stationary or non-stationary. It is based on the simple observation that, in a mean-reverting series, if the current level is higher than the mean, the next move tends to be downward, while if the level is lower than the mean, the next move tends to be upward.

    In the Python code below we simply interpret the result using the p-value from the test. The null hypothesis of the ADF test is that the series has a unit root (is non-stationary). A p-value below a specified threshold (we are going to use 5%) suggests we reject the null hypothesis and treat the series as stationary; a p-value above the threshold means we cannot reject the null hypothesis, so the series may be non-stationary.

    import numpy as np
    from statsmodels.regression.linear_model import OLS
    from statsmodels.tsa.tsatools import lagmat, add_trend
    from statsmodels.tsa.adfvalues import mackinnonp
    
    def adf(ts):
        """
        Augmented Dickey-Fuller unit root test
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
        
        # Get the dimension of the array
        nobs = ts.shape[0]
        
        # We use 1 as maximum lag in our calculations
        maxlag = 1
        
        # Calculate the discrete difference
        tsdiff = np.diff(ts)
        
        # Create a 2d array of lags, trim invalid observations on both sides
        tsdall = lagmat(tsdiff[:, None], maxlag, trim='both', original='in')
        # Get dimension of the array
        nobs = tsdall.shape[0] 
        
        # replace 0 xdiff with level of x
        tsdall[:, 0] = ts[-nobs - 1:-1]  
        tsdshort = tsdiff[-nobs:]
        
        # Calculate the linear regression using an ordinary least squares model    
        results = OLS(tsdshort, add_trend(tsdall[:, :maxlag + 1], 'c')).fit()
        adfstat = results.tvalues[0]
        
        # Get approx p-value from a precomputed table (from stattools)
        pvalue = mackinnonp(adfstat, 'c', N=1)
        return pvalue

    source: https://medium.com/bluekiri/simple-stationarity-tests-on-time-series-ad227e2e6d48

    This code can also be validated against the function adfuller, included in the Python module statsmodels.

    One can also test for stationarity using the Hurst exponent. This measures the speed of diffusion of the series, which should be slower for a mean-reverting series than for a geometric random walk. The speed of diffusion is measured by the variance of the differences between values a given lag apart.

    The Hurst exponent can be estimated with the following code from Corrius (2018):

    def hurst(ts):
        """
        Returns the Hurst Exponent of the time series vector ts
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
    
        # Helper variables used during calculations
        lagvec = []
        tau = []
        # Create the range of lag values
        lags = range(2, 100)
    
        #  Step through the different lags
        for lag in lags:
            #  produce value difference with lag
            pdiff = np.subtract(ts[lag:],ts[:-lag])
            #  Write the different lags into a vector
            lagvec.append(lag)
            #  Calculate the square root of the standard deviation of the differences
            tau.append(np.sqrt(np.std(pdiff)))
    
        #  linear fit to double-log graph
        m = np.polyfit(np.log10(np.asarray(lagvec)),
                       np.log10(np.asarray(tau).clip(min=0.0000000001)),
                       1)
        # return the calculated hurst exponent
        return m[0]*2.0

    source: https://medium.com/bluekiri/simple-stationarity-tests-on-time-series-ad227e2e6d48

    We interpret the results by the following rules:

    H = 0.5 indicates a geometric random walk; H < 0.5 indicates a mean-reverting series; and H > 0.5 indicates a trending series. H is also an indicator of the degree of mean reversion or trendiness: as H decreases towards 0, the series is more mean reverting, and as it increases towards 1, it is more trending.
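    These rules can be sanity-checked on synthetic data. The sketch below uses a condensed restatement of the hurst function (same calculation, written more compactly, with a hypothetical max_lag parameter) and applies it to Gaussian noise, which is strongly mean reverting, and to the random walk built from it. The series length and seed are arbitrary assumptions.

```python
import numpy as np

def hurst(ts, max_lag=100):
    """Condensed version of the Hurst estimate: 2 * slope of the log-log fit."""
    ts = np.asarray(ts, dtype=float)
    lags = np.arange(2, max_lag)
    # square root of the standard deviation of the lagged differences
    tau = [np.sqrt(np.std(ts[lag:] - ts[:-lag])) for lag in lags]
    slope = np.polyfit(np.log10(lags), np.log10(tau), 1)[0]
    return 2.0 * slope

rng = np.random.default_rng(seed=3)
steps = rng.normal(size=5000)

h_noise = hurst(steps)            # mean-reverting series, expect H near 0
h_walk = hurst(np.cumsum(steps))  # random walk, expect H near 0.5

print(f"H (noise):       {h_noise:.2f}")
print(f"H (random walk): {h_walk:.2f}")
```

    Note that the series must be longer than the largest lag (here 99) for the estimate to be meaningful.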

    To make sure it is not a random walk we can test the statistical significance of the H value with the Variance Ratio Test:

    import numpy as np
    
    def variance_ratio(ts, lag = 2):
        """
        Returns the variance ratio test result
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
        
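        # Apply the formula to calculate the test
        n = len(ts)
        # mean one-period change
        mu = np.sum(ts[1:n] - ts[:n-1]) / n
        m = (n - lag + 1) * (1 - lag / n)
        # variance of the one-period changes
        b = np.sum(np.square(ts[1:n] - ts[:n-1] - mu)) / (n - 1)
        # variance of the lag-period changes
        t = np.sum(np.square(ts[lag:n] - ts[:n-lag] - lag * mu)) / m
        return t / (lag * b)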
    
    #Source: Corrius (2018)

    The test compares the variance of the changes over lag periods with lag times the variance of the one-period changes. For a random walk, variance grows linearly with the horizon, so the ratio is close to one. A ratio well below one points to mean reversion, and a ratio well above one points to trending behavior.
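    As a quick check of how the ratio behaves, the sketch below restates the variance_ratio function in compact form and applies it to a synthetic random walk and to Gaussian noise. The data, seed and series length are assumptions chosen for illustration, and the comparison is informal: no significance level is computed.

```python
import numpy as np

def variance_ratio(ts, lag=2):
    """Variance of lag-period changes divided by lag * variance of 1-period changes."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts)
    mu = np.sum(ts[1:] - ts[:-1]) / n
    m = (n - lag + 1) * (1 - lag / n)
    b = np.sum(np.square(ts[1:] - ts[:-1] - mu)) / (n - 1)
    t = np.sum(np.square(ts[lag:] - ts[:-lag] - lag * mu)) / m
    return t / (lag * b)

rng = np.random.default_rng(seed=5)
steps = rng.normal(size=5000)

vr_walk = variance_ratio(np.cumsum(steps))  # random walk
vr_noise = variance_ratio(steps)            # strongly mean-reverting series

print(f"variance ratio (random walk): {vr_walk:.2f}")
print(f"variance ratio (noise):       {vr_noise:.2f}")
```

    The random walk should give a ratio near 1, while independent noise should give a ratio near 0.5 at lag 2, because its two-period changes are no more variable than its one-period changes.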

    So how long will it take for the time series to revert, that is, to diffuse back to the mean? This is captured by the ‘half-life’ of the mean reversion.

    import numpy as np
    
    def half_life(ts):  
        """ 
        Calculates the half life of a mean reversion
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
        
        # delta = p(t) - p(t-1)
        delta_ts = np.diff(ts)
        
        # build the regressors: lagged values p(t-1) plus a constant column
        lag_ts = np.vstack([ts[:-1], np.ones(len(ts[:-1]))]).T
       
        # calculate the slope of the deltas vs the lagged values 
        beta = np.linalg.lstsq(lag_ts, delta_ts, rcond=None)
        
        # the slope is negative for a mean-reverting series,
        # so negate it to return a positive half life
        return -np.log(2) / beta[0][0]
    
    #source: Corrius (2018)
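    One way to check a half-life estimate is to simulate a series whose reversion rate is known. The sketch below is a variant of the function above, written under the assumption that the change p(t) − p(t−1) is regressed on the lagged level p(t−1) and that the sign is flipped so the result is positive. The simulated series, θ = 0.05 and the seed are illustrative assumptions.

```python
import numpy as np

def half_life(ts):
    """Half-life from regressing one-period changes on the lagged level."""
    ts = np.asarray(ts, dtype=float)
    delta = np.diff(ts)
    # regressors: lagged level p(t-1) and a constant
    design = np.column_stack([ts[:-1], np.ones(len(ts) - 1)])
    slope, _intercept = np.linalg.lstsq(design, delta, rcond=None)[0]
    # slope is negative for a mean-reverting series
    return -np.log(2) / slope

rng = np.random.default_rng(seed=4)
theta, mu = 0.05, 10.0
x = np.empty(20_000)
x[0] = mu
for t in range(1, len(x)):
    # discrete mean reversion toward mu with Gaussian shocks
    x[t] = x[t - 1] + theta * (mu - x[t - 1]) + rng.normal(scale=0.5)

estimated = half_life(x)
expected = np.log(2) / theta

print(f"estimated half-life: {estimated:.1f} periods")
print(f"ln(2) / theta:       {expected:.1f} periods")
```

    The estimate should land close to ln(2)/θ, roughly 14 periods for this θ. On real price data the estimate is only as good as the assumption that the series actually mean reverts.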

    So we can approach mean reversion in programming, namely for fintech, through the following steps:

    1. Test for stationarity using the Augmented Dickey-Fuller test (ADF test)
    2. Confirm by estimating the Hurst exponent (H)
    3. Check the series against a random walk using the variance ratio test
    4. Estimate the time to mean revert using the half-life

  • Comparative Z-Scores for Stock Prices in Varying Sample Sizes

    A quick study of z-scores in varying samples of NVDA stock prices.

    I used Python to generate these plots.

    Animation of the NVDA z-score scatter plot. The x and y axes are in standard deviation units; anything above 3 or below -3 is conventionally considered an outlier in data science. The DeepSeek release was nearly an outlier event for NVDA's value. The animation shows the last 15, 30, 45, 60 and 93 days.

    Z-scores for the QQQ tech index ETF, for the same time periods as for NVDA above. In an index ETF such as QQQ one should see a smoother spectrum, as it is less susceptible to volatility.

    First, we see the price chart for the last 93 trading days. Afterwards, we take a look at the 93, 60, 45, 30 and 15 day sample windows, all going backward in time from Feb. 20th, 2025. One major perturbation that hit the stock was the release of DeepSeek, which temporarily had a negative impact on NVDA's value. The question is whether one can see a correlation with the action of the index ETF for NVDA's sector, whose portfolio includes NVDA. We examine the spread in the z-scores to see whether it is an indicator of up or down motion of NVDA stock prices in relation to the index.

    Comparing different plots of z-scores for the NVDA stock price from Oct 4th, 2024 to February 20, 2025.

    NVDA past 93 trading days: open price 124.92, close price 140.11 (gained 12.2%)

    QQQ ETF, whose portfolio includes NVDA, past 93 trading days: open price 487.32, close price 537.23 (gained 10.2%)

    Z-Scores:

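    The z-score is calculated as:

    $$ z = \frac{x - \mu}{\sigma} $$

    where x is a price in the sample window and μ and σ are the mean and standard deviation of the prices in that window.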
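    A minimal sketch of this calculation over trailing windows follows. It assumes the standard definition z = (x − mean) / std computed separately within each window and uses the population standard deviation. The closing prices are made-up placeholder values, not the NVDA data used in this post, and the function name zscores is hypothetical.

```python
import numpy as np

def zscores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical closing prices, oldest first (NOT the real NVDA series)
closes = np.array([124.9, 126.3, 131.0, 128.4, 135.2,
                   118.4, 121.0, 129.7, 138.9, 140.1])

# z-scores for trailing windows of different lengths, counted back from the last day
for window in (5, 10):
    z = zscores(closes[-window:])
    print(f"last {window} days:", np.round(z, 2))

z_all = zscores(closes)
```

    Because each window is standardised on its own mean and standard deviation, the same day can have a different z-score in the 15, 30, 45, 60 and 93 day views. It is also why the mean of the z-scores in any window is zero up to floating-point error.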

    93 Days:

    NVDA past 93 trading days zscores

    QQQ past 93 trading days

    I use the term "prices" as code for NVDA and "trends" for QQQ, the index ETF for tech stocks.
    
    Some Data for Z-scores:
    
    Shape of Z-score plots: 
    prices max/min, trends max/min:  1.7594241527219399 -2.9115388106292883 1.7722687913449942 -2.0281115555314564
    
    
    NVDA length and mean of positive and negative:
    prices positive len:  51 0.7123723838425315
    prices negative len:  43 -0.8449067808364951
    
    prices positive list:  [0.14019500216746938, 0.13021735775384247, 0.9441080663512041, 0.9270035330706985, 0.35257629040040717, 0.4737334011373164, 0.6348010895287374, 0.4894125566444479, 0.5934651341008516, 0.3205052904994612, 0.40246451246854575, 1.2149298432925333, 1.6810283751862907, 1.502856153514364, 1.1650416212243906, 1.5969310865571407, 1.30900477633531, 1.3788482872307024, 0.6975177115572552, 0.43667357902955695, 1.4144827315650879, 1.254840420947041, 1.3660198872703233, 0.6932415782371287, 0.16585180208822778, 0.2200161574764928, 0.4523527345366844, 1.147937087943885, 1.136534065756884, 0.7630850891325253, 0.2456729573972512, 0.3169418460660218, 0.03614242471106574, 0.36825544590753456, 0.44665122344318386, 0.40531526801529805, 0.057523091311697735, 0.17440406872848058, 1.052436777127734, 1.7594241527219399, 0.4352482012561788, 0.43097206793605647, 0.08888140232595665, 0.5335992676190859, 1.4230349982053405, 1.4444156648059727, 0.7887418890532837, 0.25137446849075173, 0.32977024602640104, 0.30553882387901676, 0.43097206793605647]
    
    prices negative list:  [-1.6826224330881445, -2.0281115555314564, -1.5515748349199896, -1.2967600607041379, -1.3338240278628077, -1.2828610730196344, -1.0088524586680414, -1.4489870686772424, -1.4450159293388134, -1.4225128064210513, -1.2093949952586989, -1.146518622400242, -1.1107783683543844, -1.6137893512220423, -1.3516941548857364, -1.1531371879642918, -1.1478423355130543, -0.8327986146643587, -1.082980392985381, -1.9122866581606137, -1.6753420109676906, -1.7693256419771755, -1.3589745770061903, -0.47407236109295114, -0.032614037970937836, -0.26823497205105246, -1.0704051184136898, -0.8420646064540289, -0.6143859510507724, -0.6335797911865096, -0.5137837544772378, -0.4601733734084476, -0.4072248488960653, -0.22719986555395358, -0.49260434467228414, -0.19874003362854603, -0.10012340672422786, -0.16630906236470946, -0.36751345551177483, -0.4753960742057596, -0.507165188913192]
    
    
    Trends (QQQ) length and mean for positive/negative:
    trends positive len:  53 0.7493882152106839
    trends negative len:  41 -0.9687213513699346
    
    trends positive list:  [0.05276545780528101, 0.09247685118957147, 0.07262115449743001, 0.011068494751777909, 0.1685903551761238, 0.2731636910880874, 0.6960900306307639, 0.6001208299520626, 0.9092078417931164, 0.6378466536671404, 0.520036186627085, 1.1368864971963693, 0.9105315549059249, 1.1772597471370638, 1.6829181562303424, 1.5260581523624006, 0.2466894288318963, 0.09446242085878412, 0.39163601468454706, 0.670277624930977, 1.1395339234219937, 1.1157070873914194, 0.6497600716824238, 0.18976976498108122, 0.3863411622333134, 0.7808076698505786, 0.16130993305566993, 0.16726664206331537, 0.2619121296292083, 0.022320056210664595, 0.595487834057233, 0.7980159403171032, 1.2421216896647371, 1.316911480538481, 1.117692657060632, 0.0971098470844085, 0.6001208299520626, 0.535258887424397, 0.6821910429462604, 0.631889944659495, 0.3552339040822852, 0.776174673955749, 0.9336965343800949, 1.1157070873914194, 0.6735869077129981, 1.0925421079172493, 1.009148181810243, 1.0296657350587888, 1.5326767179264504, 1.681594443117534, 1.7623409429989234, 1.7722687913449942, 1.6207036399282937]
    
    trends negative list:  [-1.6826224330881445, -2.0281115555314564, -1.5515748349199896, -1.2967600607041379, -1.3338240278628077, -1.2828610730196344, -1.0088524586680414, -1.4489870686772424, -1.4450159293388134, -1.4225128064210513, -1.2093949952586989, -1.146518622400242, -1.1107783683543844, -1.6137893512220423, -1.3516941548857364, -1.1531371879642918, -1.1478423355130543, -0.8327986146643587, -1.082980392985381, -1.9122866581606137, -1.6753420109676906, -1.7693256419771755, -1.3589745770061903, -0.47407236109295114, -0.032614037970937836, -0.26823497205105246, -1.0704051184136898, -0.8420646064540289, -0.6143859510507724, -0.6335797911865096, -0.5137837544772378, -0.4601733734084476, -0.4072248488960653, -0.22719986555395358, -0.49260434467228414, -0.19874003362854603, -0.10012340672422786, -0.16630906236470946, -0.36751345551177483, -0.4753960742057596, -0.507165188913192]
    
    zscore silos: 
    prices 0 to 1:  35 0.4056562939002893
    prices 1 to 2:  16 1.3722767337624457
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  31 -0.4417795985574785
    prices -1 to -2:  8 -1.4510292935080256
    prices -2 to -3:  4 -2.7127490308408726
    prices < -3:  0 nan
    
    trends 0 to 1:  35 -0.27555447552077394
    trends 1 to 2:  17 1.344595325715943
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  20 -0.41098618215387406
    trends -1 to -2:  21 -1.3951694395818528
    trends -2 to -3:  1 -2.0790028405616057
    trends < -3:  0 nan
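    The code that produced these "silo" counts is not shown in this post. A sketch of how such a breakdown could be computed is given below: it groups z-scores into one-standard-deviation bands and reports the count and mean of each band. The band edges, the half-open interval convention, the function name zscore_silos and the sample values are all assumptions made for illustration.

```python
import numpy as np

def zscore_silos(z):
    """Count and mean of z-scores in one-standard-deviation bands."""
    z = np.asarray(z, dtype=float)
    edges = [-np.inf, -3, -2, -1, 0, 1, 2, 3, np.inf]
    result = {}
    for low, high in zip(edges[:-1], edges[1:]):
        # half-open band: low <= z < high
        band = z[(z >= low) & (z < high)]
        mean = band.mean() if band.size else float("nan")
        result[(low, high)] = (int(band.size), mean)
    return result

# Small made-up sample of z-scores
sample = [0.14, 0.94, 1.68, -0.47, -1.68, -2.03]
silos = zscore_silos(sample)

for (low, high), (count, mean) in silos.items():
    print(f"{low} to {high}: {count} {mean}")
```

    Empty bands report nan for the mean, which matches the nan entries in the listings in this post.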

    60 DAYS

    NVDA past 60 days open: 146.67, close: 140.11

    QQQ past 60 days open: 504.98, close: 537.23

    "prices" is for NVDA, "trends" is for QQQ
    
    some data for 60 day plots:
    
    
    
    shape of plots: 
    prices max/min, trends max/min:  1.9145807927766163 -2.645347442651133 1.8695586413001115 -1.7542554651846929
    
    
    
    prices positive len:  34 0.6792199553321576
    prices negative len:  26 -0.8882107108189714
    
    prices positive list:  [1.5305282468571573, 0.8737427335456234, 0.04858635771143208, 0.1738208835547299, 0.35888968285649864, 0.4117664826570029, 0.6385801239065363, 1.3176295529235456, 1.3064975950708098, 0.9419259753936441, 0.43681338782566403, 0.5063881244052757, 0.23226366228160591, 0.556481934742594, 0.6330141449801685, 0.5926607977639948, 0.18634433613906048, 0.2531360832554902, 0.36723865124605237, 1.2243994059068675, 1.9145807927766163, 0.6218821871274288, 0.617707702932656, 0.03327991566391562, 0.07919924180646105, 0.2837489673505192, 0.7178953236072966, 1.5861880361208474, 1.6070604570947318, 0.9669728805623052, 0.4423793667520319, 0.5189115769896063, 0.49525616655253607, 0.617707702932656]
    
    prices negative list:  [-1.7542554651846929, -1.669273142479056, -1.5853399842512712, -1.2999672462767844, -1.720682201893578, -1.2548531737293467, -0.6725668885240624, -0.506798901024174, -0.11546055078710396, -0.5487654801380664, -0.7900733100429681, -0.318998459489499, -0.6389936252329415, -1.098527666530088, -1.203444114314825, -0.32739177531227037, -0.6841076977803849, -0.6746652174797553, -1.5223901155804267, -1.6934039254695472, -1.7437638204062227, -0.524634697147575, -0.9044322381283226, -0.7858766521315705, -0.0913297677966126, -0.3767025057710995]
    
    trends positive len:  34 0.7207852536147648
    trends negative len:  26 -0.9425653316500864
    
    trends positive list:  [0.1636172003202925, 0.01148835103241942, 0.5014481621871478, 0.07129072626972463, 0.8623607425666361, 0.5035464911428407, 0.926359775715327, 1.7279214367907145, 1.4792694555408878, 0.12269978568424661, 0.8665574004780336, 0.828787479275527, 0.09017568687097201, 0.29791025348475275, 0.004144199687494526, 0.32518852990878333, 1.0291778945443708, 1.147733480541123, 0.8319349727090662, 0.01148835103241942, 0.14158474628549397, 0.06184824596909498, 0.2905661021398278, 0.5402672478675009, 0.828787479275527, 0.12794560807347868, 0.7920667225508667, 0.6598719983420994, 0.6923960971553621, 1.4897611003193638, 1.7258231078350217, 1.8538211741324033, 1.8695586413001115, 1.6292999758730682]
    
    trends negative list:  [-1.7542554651846929, -1.669273142479056, -1.5853399842512712, -1.2999672462767844, -1.720682201893578, -1.2548531737293467, -0.6725668885240624, -0.506798901024174, -0.11546055078710396, -0.5487654801380664, -0.7900733100429681, -0.318998459489499, -0.6389936252329415, -1.098527666530088, -1.203444114314825, -0.32739177531227037, -0.6841076977803849, -0.6746652174797553, -1.5223901155804267, -1.6934039254695472, -1.7437638204062227, -0.524634697147575, -0.9044322381283226, -0.7858766521315705, -0.0913297677966126, -0.3767025057710995]
    
    zscore silos: 
    prices 0 to 1:  26 0.4857168466015017
    prices 1 to 2:  7 1.4753238481642317
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  20 -0.42879927835305054
    prices -1 to -2:  3 -1.5721141625126993
    prices -2 to -3:  4 -2.4158942235473466
    prices < -3:  0 nan
    
    trends 0 to 1:  23 -0.06220277723223607
    trends 1 to 2:  9 1.5519738719284009
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  18 -0.4898406138886327
    trends -1 to -2:  10 -1.5547354231811998
    trends -2 to -3:  0 nan
    trends < -3:  0 nan

    45 DAYS

    NVDA past 45 days open: 134.25, close: 140.11

    QQQ past 45 days open: 530.53, close: 537.23

    some data for 45 day plot:
    
    
    
    shape of plots: 
    prices max/min, trends max/min:  2.0056164520976707 -2.335406431884398 1.7445748897642444 -2.004426294606925
    
    
    zscore means of prices and trends:  4.46309655899313e-15 -1.0288066694859784e-14
    
    NVDA prices length and mean of positive/negative:
    prices positive len:  24 0.7357314773777098
    prices negative len:  21 -0.8408359741459445
    
    prices positive list:  [0.054341899730996124, 0.712714915702643, 0.7855730965445773, 0.7471569648279229, 0.36034625926711406, 0.4239315807291669, 2.943764882851058e-05, 0.5325565048935021, 1.3485681303231487, 2.0056164520976707, 0.7749755429675673, 0.7710014603761928, 0.2146298975832493, 0.2583448060884106, 0.45307485306593986, 0.8663794425692682, 1.6929886215759211, 1.7128590345328127, 1.1034997038548302, 0.13249885736143355, 0.6040899915383078, 0.6769481723802421, 0.6544283710290971, 0.7710014603761928]
    
    prices negative list:  [-0.7645562745724367, -1.014925110927081, -0.5261615999565064, -0.06787777341171075, -0.10162313831168997, -0.8581724481659111, -1.334961797397784, -1.443817813204148, -0.5348700812210075, -0.9049805349626544, -0.8951834935400782, -1.7747401012554962, -1.9521754070198687, -2.004426294606925, -0.739519390936971, -1.1335781681560086, -0.19088507127290136, -1.010570870294818, -0.18326515016646283, -0.28994404565668896, -0.048283690566570704, -0.1310142625794062, -0.5860324086500015, -0.0624349726213975]
    
    pos/neg length and mean:
    trends positive len:  21 0.8835238047359075
    trends negative len:  24 -0.7730833291439385
    
    trends positive list:  [0.7659593076650326, 1.5976192684256507, 1.3396305109645679, 0.7039113786554122, 0.6647232129651197, 0.11391177298491237, 0.14221433709456596, 0.8726382031552712, 0.9956455010164618, 0.6679888934393077, 0.10629185187847384, 0.3653691694976192, 0.6647232129651197, 0.6266236074328899, 0.4894650275168725, 0.5232103924168393, 1.3505161125452068, 1.5954421481095253, 1.7282464873932923, 1.7445748897642444, 1.4952946135676752]
    
    trends negative list:  [-0.7645562745724367, -1.014925110927081, -0.5261615999565064, -0.06787777341171075, -0.10162313831168997, -0.8581724481659111, -1.334961797397784, -1.443817813204148, -0.5348700812210075, -0.9049805349626544, -0.8951834935400782, -1.7747401012554962, -1.9521754070198687, -2.004426294606925, -0.739519390936971, -1.1335781681560086, -0.19088507127290136, -1.010570870294818, -0.18326515016646283, -0.28994404565668896, -0.048283690566570704, -0.1310142625794062, -0.5860324086500015, -0.0624349726213975]
    
    zscore silos: 
    prices 0 to 1:  19 0.5159193883045129
    prices 1 to 2:  4 1.4639440396810512
    prices 2 to 3:  1 2.005079787550808
    prices 3>:  0 nan
    prices 0 to -1:  14 -0.38097024812529595
    prices -1 to -2:  4 -1.454678571075471
    prices -2 to -3:  3 -2.1703421886682137
    prices < -3:  0 nan
    
    trends 0 to 1:  13 0.18729980332171914
    trends 1 to 2:  8 1.5002054504426652
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  16 -0.42194589789701936
    trends -1 to -2:  7 -1.3777869923741477
    trends -2 to -3:  1 -2.0051154449596136
    trends < -3:  0 nan

    30 DAYS

    NVDA past 30 days, during which it took a major dive with the release of DeepSeek: open: 140.14, close: 140.11

    QQQ past 30 days open: 515.18, close: 537.23

    Some Data for 30 day plot
    
    shape of zscore plot
    prices max/min, trends max/min:  1.7395624418627935 -2.0102455727613866 1.635662150597431 -1.980244163451975
    
    beats_condition, max, min:  False True False
    avg zscores list:  [-0.04900609322786642, -0.043237954955050006, -1.4069222487518587, -1.9069027691119136, -2.1376722944324356, -0.3679613081523311, -1.0756478274354047, 0.34156933192037153, 1.0456770404126563, 2.5158374261919283, 2.6528832614306106, 1.7724244766420925, -2.8159600623456074, -0.7210526618453241, -1.4730439252605447, -1.1233948714886564, -1.765168780240289, -2.622450849103493, -1.7105258315300362, -0.7023407103306647, 0.0587929352193316, -0.500214456769451, 0.6220642834292512, 0.39529360882055975, 0.2241537498518348, 1.531306979740637, 2.2043609778904045, 2.3999373023930133, 2.3948264891602498, 2.2623747818778503]
    zscore differential avg:  -4.4704980458239636e-15
    zscore means of prices and trends:  -7.919590908992784e-16 -3.796962744218036e-15
    
    trend count and mean of pos or neg:
    prices positive len:  17 0.7172287353811013
    prices negative len:  13 -0.9379145001137494
    
    prices positive list:  [0.8708241976370331, 0.8671431033818425, 0.35178990765469526, 0.022945487524039902, 0.3922819444618296, 0.06466455574957113, 0.5726555629663304, 0.9554893655064959, 1.7211569705868235, 1.7395624418627935, 1.1751279893997306, 0.06466455574957113, 0.2757139597140209, 0.7125371446636966, 0.780023872675586, 0.7591643385628187, 0.8671431033818425]
    
    prices negative list:  [-0.9198302908648995, -0.9103810583368925, -1.758712156406554, -1.9298482566359534, -1.980244163451975, -0.7602432526141607, -1.1403123831849757, -0.23108623104595882, -1.0216720192222633, -0.22373682796862931, -0.326628471051326, -0.09354740202724439, -0.1733409211526078, -0.6122052763421065, -0.10719629345658256]
    
    trend count and mean of pos or neg:
    trends positive len:  15 0.8125990002508011
    trends negative len:  15 -0.8125990002508086
    
    trends positive list:  [0.09018767490616038, 0.7946804556051048, 0.9133208195678171, 0.5972964872423618, 0.055540488970154546, 0.30542019359958456, 0.5941467430663635, 0.5573997276796802, 0.42511047228762966, 0.45765782877296995, 1.2555930200266159, 1.4918238332267078, 1.6199134297174271, 1.635662150597431, 1.395231678496008]
    
    trends negative list:  [-0.9198302908648995, -0.9103810583368925, -1.758712156406554, -1.9298482566359534, -1.980244163451975, -0.7602432526141607, -1.1403123831849757, -0.23108623104595882, -1.0216720192222633, -0.22373682796862931, -0.326628471051326, -0.09354740202724439, -0.1733409211526078, -0.6122052763421065, -0.10719629345658256]
    
    zscore silos: 
    prices 0 to 1:  14 0.5195964415593288
    prices 1 to 2:  3 1.5885045560705986
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  7 -0.38925651381809206
    prices -1 to -2:  5 -1.4606024370377964
    prices -2 to -3:  1 -2.0120560681267605
    prices < -3:  0 nan
    
    trends 0 to 1:  11 0.23112599712881077
    trends 1 to 2:  5 1.4620954964259174
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  9 -0.4261916020115731
    trends -1 to -2:  4 -1.522227807745561
    trends -2 to -3:  1 -2.0472318717466145
    trends < -3:  0 nan

    15 DAYS

    NVDA past 15 days, during which the price regained momentum and climbed back up. open: 124.65, min on day 81 at 116.64, close: 140.11

    QQQ past 15 days open: 523.05, close: 537.23.

    Some Data for 15 day plot
    
    shape of z-score: 
    prices max/min, trends max/min:  1.2978590301422221 -1.7892515585278195 1.4875489718437718 -1.7015455028399435
    
    
    prices positive len:  8 0.7961212804164766
    prices negative len:  7 -0.9098528919045411
    prices positive list:  [0.4368908744960763, 0.3355230641218386, 0.11698986253581116, 0.6633228665008721, 1.1319844313480085, 1.2043900101867528, 1.1820101040002302, 1.2978590301422221]
    
    prices negative list:  [-0.9657152036835966, -1.0789198650922653, -1.7015455028399435, -0.7542012310515969, -0.3996918966402359, -0.9850791589245527, -0.04220349219180732, -0.22988490452723273, -0.18370931895265175]
    
    trends positive len:  6 1.056825095650698
    trends negative len:  9 -0.7045500637670981
    trends positive list:  [0.009930233456925725, 0.9483372951340528, 1.2834826743044578, 1.465205946565748, 1.4875489718437718, 1.1464454525992316]
    
    trends negative list:  [-0.9657152036835966, -1.0789198650922653, -1.7015455028399435, -0.7542012310515969, -0.3996918966402359, -0.9850791589245527, -0.04220349219180732, -0.22988490452723273, -0.18370931895265175]
    
    zscore silos: 
    prices 0 to 1:  5 0.3387271345868971
    prices 1 to 2:  4 1.13156977020328
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  3 -0.4156379102954067
    prices -1 to -2:  3 -1.6576670076204636
    prices -2 to -3:  0 nan
    prices < -3:  0 nan
    
    trends 0 to 1:  1 0.584379796953797
    trends 1 to 2:  4 1.3465306598267066
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  7 -0.3399223000040687
    trends -1 to -2:  3 -1.315505544516218
    trends -2 to -3:  0 nan
    trends < -3:  0 nan

    The above charts and data are generated in the following code snippets.

    This code snippet loads data from the Alpaca API into pandas DataFrames.

    ############### INIT CEILLI CLASSES ####################     
    
    import numpy as np
    import pandas as pd
    
    from classes.stock_list import StockList
    from classes.config import Config
    from classes.alpaca import Alpaca
    from classes.utilities import Utilities
    from classes.market_beat import MarketBeat
    from classes.profit_loss import ProfitLoss
    from classes.plots import Plots
    
    util = Utilities(pd.DataFrame())
    conf = Config(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    mb = MarketBeat(pd.DataFrame(), api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    alpa = Alpaca(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    stocks = StockList()
    plots = Plots(pd.DataFrame())
    
    ############## SETTINGS ###################
    # CONSTANTS: see settings.toml for conflicts; set here to override the constants in settings.toml
    ALGO_VERSION = conf.algo_version
    BASE_CURRENCY = conf.base_currency
    
    ############# LOGGING #################
    import logging
    logging.basicConfig(
        filename="logs/charts_"+ALGO_VERSION+".log",
        level=logging.INFO,
        format="%(asctime)s:%(levelname)s:%(message)s"
        )
    
    alpa = Alpaca(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    
    ############################### CONFIGS ###################################################
    # API Credentials alpaca4 edge 
    API_KEY = conf.api_key
    API_SECRET = conf.api_secret
    API_BASE_URL = conf.api_base_url
    SECRET_KEY = API_SECRET
    
    #CONSTANTS
    TIMEZONE_OFFSET = -4.0 #set in config file, this is deprecated, i think
    
    if DEBUG:
        PROCESS_ROWS = 0  #set to low number for debugging, otherwise 1000
    else:
        PROCESS_ROWS = 1000
    
    
    
    ########################### DRIVER #######################################
    date = DATE
    
    from datetime import date, datetime, timedelta, timezone
    
    N_DAYS_AGO = 500
    YESTERDAY = 1
    #today = datetime.now()
    today = date.today()    
    n_days_ago = today - timedelta(days=N_DAYS_AGO)
    one_day_ago = today - timedelta(days=YESTERDAY)
    
    today = date.today()
    timezone_offset = -4 # US Eastern Daylight Time (EDT) is UTC-4; Eastern Standard Time (EST) is UTC-5
    tzinfo = timezone(timedelta(hours=timezone_offset))
    now = datetime.now(tzinfo)
    back_time = now - timedelta(minutes=15)
    date = back_time.strftime("%Y-%m-%d %H:%M:%S")
    start_time = now - timedelta(minutes=45)
    start = start_time.strftime("%Y-%m-%d %H:%M:%S")
    end = date
    
    beg_date = str(n_days_ago) + ' 00:00:00'
    end_date = str(one_day_ago) + ' 23:59:00'
    
    
    
    
    
    if MODE == 'SCREENER' or MODE == 'HISTORICAL':
        try:
    
            #STOCK_LIST = stocks.TECH_AL
            STOCK_LIST = ['NVDA', 'MSFT']
            
    
            
            STOCK_SET = set(STOCK_LIST) #remove duplicates from list
            STOCK_LIST = list(STOCK_SET)
            STOCK_LIST = sorted(STOCK_LIST)
            symbol_list = STOCK_LIST
    
            
            index_symbol = stocks.stock_index(ALGO_VERSION)
            
    
            cnt = 0
            for symbol in symbol_list:
                print(ALGO_VERSION)
                print(symbol)
                print(index_symbol)
                
    
                hundred_dates = alpa.get_calendar(str(n_days_ago), str(one_day_ago))
    
    
                #get prices for symbol in trading list
                symbol_price_data = alpa.stockbars_by_symbol_by_day(symbol, beg_date, end_date)
                symbol_price_data = symbol_price_data.reset_index(level=("symbol", "timestamp"))
                prices_data = symbol_price_data
                #get prices for trend index for symbol above
                index_price_data = alpa.stockbars_by_symbol_by_day(index_symbol, beg_date, end_date)
                index_price_data = index_price_data.reset_index(level=("symbol", "timestamp")) #alpaca dataframe return has an index of symbol, timestamp format
                column_names = index_price_data.columns
                trends_data = index_price_data
                symbol_prices = symbol_price_data
    
    
                column_names = prices_data.columns
                print(column_names)
                prices_data = symbol_prices[['timestamp', 'symbol', 'open', 'close', 'vwap']].copy()
                trends_data = trends_data[['timestamp', 'symbol', 'open', 'close', 'vwap']].copy()
                prices_data.rename(columns = {'timestamp':'date'}, inplace = True)
                trends_data.rename(columns = {'timestamp':'date'}, inplace = True)
    
                #prices_data = prices_data.reset_index()
                #trends_data = trends_data.reset_index()
                date_stamp = prices_data.iloc[0]['date']
        
    
    
                print()
                print()
                print("Statistical Analysis: ")
    
    
                prices_arr = np.array(prices_data['close'])
    
                from scipy.stats import skew, kurtosis 
                # Calculate the skewness 
                print("Symbol Prices skew: ")
                print(skew(prices_data['close'], axis=0, bias=True))
                print("Index skew: ")
                print(skew(trends_data['close'], axis=0, bias=True))
    
                # Calculate the kurtosis 
                print("Symbol Prices kurtosis: ")
                print(kurtosis(prices_data['close'], axis=0, bias=True))
                print("Index kurtosis: ")
                print(kurtosis(trends_data['close'], axis=0, bias=True))
    
                print()
                print("Covariance between the two: ")
                cov_matrix = np.stack((prices_data['close'], trends_data['close']), axis = 0) 
                print(np.cov(cov_matrix))
    
                print()
                print("Correlation between the two: ")
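                # np.corrcoef returns the Pearson correlation matrix; np.correlate would compute a cross-correlation (a sliding dot product) instead
                correlations = np.corrcoef(prices_data['close'], trends_data['close'])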
                print(correlations)
    
                print()
                print()
                
                print("Mean of the Symbol: ")
                data_mean = np.mean(prices_data['close'])
                data_max = max(prices_data['close'])
                data_min = min(prices_data['close'])
                print("mean is: " + str(data_mean))
                print("max/min is: " + str(data_max), str(data_min))
                
                print()
                print()
    
                print("Variance of the Symbol: ")      
                m = sum(prices_data['close']) / len(prices_data['close'])
                std_dev = np.std(prices_data['close'])
                print("std dev: "+str(std_dev))
    
                import scipy.stats as scipy
                zscore_list = scipy.zscore(prices_data['close'])
                print("symbol z-scores list: ")
                print(zscore_list)
                trends_zscore_list = scipy.zscore(trends_data['close'])
                print("trends z-scores list: ")
                print(trends_zscore_list)
    
                
                import statistics
                # Calculate the variance from a sample of data
                data_variance = statistics.variance(prices_data['close'])
                print("variance result: "+str(data_variance))
    
                print()
                print()
    
                print("Market Beat Metrics: ")
                #prices = prices_data.iloc[:lookback_period]
                vars, vibe_check = mb.compare_rates(trends_data, prices_data)
                vars15, vibe_check15 = mb.compare_rates(trends_data[-15:], prices_data[-15:])
                vars30, vibe_check30 = mb.compare_rates(trends_data[-30:], prices_data[-30:])
                vars45, vibe_check45 = mb.compare_rates(trends_data[-45:], prices_data[-45:])
                vars60, vibe_check60 = mb.compare_rates(trends_data[-60:], prices_data[-60:])
    
                print(vars)
    
    

    The z-scores are put into silos based on standard deviation in a Market Beat Class function. The following snippet from that function appends each z-score to a list for its silo, or bin. I included the logic here because it can be useful to be able to sort these values into bins:

                 
                    if current_idx_z >= 0:
                        trends_positive.append(current_idx_z)
                    else:
                        trends_negative.append(current_idx_z)
    
                    if current_price_z >= 0:
                        prices_positive.append(current_price_z)
                    else:
                        prices_negative.append(current_price_z)
    
    
    
                    # One-standard-deviation bins; zero counts as positive, matching the >= 0 split above
                    if current_price_z >= 0 and current_price_z < 1:
                          prices_0to1.append(current_price_z)
                    elif current_price_z >= 1 and current_price_z < 2:
                          prices_1to2.append(current_price_z)
                    elif current_price_z >= 2 and current_price_z < 3:
                          prices_2to3.append(current_price_z)
                    elif current_price_z >= 3:
                          prices_3up.append(current_price_z)
                    elif current_price_z < 0 and current_price_z > -1:
                          prices_0toneg1.append(current_price_z)
                    elif current_price_z <= -1 and current_price_z > -2:
                          prices_neg1toneg2.append(current_price_z)
                    elif current_price_z <= -2 and current_price_z > -3:
                          prices_neg2toneg3.append(current_price_z)
                    elif current_price_z <= -3:
                          prices_neg3.append(current_price_z)
    
    
                    # Same bins for the index z-scores
                    if current_idx_z >= 0 and current_idx_z < 1:
                          trends_0to1.append(current_idx_z)
                    elif current_idx_z >= 1 and current_idx_z < 2:
                          trends_1to2.append(current_idx_z)
                    elif current_idx_z >= 2 and current_idx_z < 3:
                          trends_2to3.append(current_idx_z)
                    elif current_idx_z >= 3:
                          trends_3up.append(current_idx_z)
                    elif current_idx_z < 0 and current_idx_z > -1:
                          trends_0toneg1.append(current_idx_z)
                    elif current_idx_z <= -1 and current_idx_z > -2:
                          trends_neg1toneg2.append(current_idx_z)
                    elif current_idx_z <= -2 and current_idx_z > -3:
                          trends_neg2toneg3.append(current_idx_z)
                    elif current_idx_z <= -3:
                          trends_neg3.append(current_idx_z)
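
    The same kind of binning can also be sketched with pandas instead of a chain of if/elif tests. This is only an illustration, not the Market Beat Class code: the function name `bin_zscores` and the bin labels are made up for this example, and here every bin is closed on its left edge, so values that fall exactly on a negative boundary land one bin higher than in the snippet above.

    ```python
    import numpy as np
    import pandas as pd

    def bin_zscores(zscores):
        """Group z-scores into one-standard-deviation bins (hypothetical helper)."""
        edges = [-np.inf, -3, -2, -1, 0, 1, 2, 3, np.inf]
        labels = ["neg3", "neg2toneg3", "neg1toneg2", "0toneg1",
                  "0to1", "1to2", "2to3", "3up"]
        z = pd.Series(zscores, dtype=float)
        # right=False makes every bin closed on the left: [0, 1), [1, 2), ...
        binned = pd.cut(z, bins=edges, labels=labels, right=False)
        # Build one list per bin, keeping empty bins as empty lists
        return {label: z[binned == label].tolist() for label in labels}

    bins = bin_zscores([0.4, 1.0, -0.2, -1.5, 3.2, -3.5])
    print(bins)
    ```

    Keeping the edges in one list makes it harder to mistype a single boundary, which is easy to do in a long if/elif chain.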
                                  

    The graphing part is handled in a Plots Class that is called by this code:

    
                path_15_index = 'plots/stats/zscores/scatter/'+str(today)+'_'+index_symbol+'_15.png'
                print(path_15_index)
                isFile = os.path.isfile(path_15_index)
                if isFile == False:
                    symbol_zscores_plot = plots.zscores_scatter_by_day(today, index_symbol, trends_data['zscores'][-15:], '15')
                else:
                    print(index_symbol + ' zscores scatter plot file exists for this date')
                

    Then, in the Plots Class, I generate the plots:
    
    
    
        def zscores_scatter_by_day(self, plot_date, symbol, data, periodicity='all'):
                plot_date = str(plot_date)
                zscores = data.reset_index(drop = True)
        
                #zscores = zscores.tolist()
                print(zscores)
                # PLOTTING
                import matplotlib.pyplot as plt
                zscores_set = set(zscores) #remove duplicates from list
                zscores_list = list(zscores_set)
                zscores_list = sorted(zscores_list)
                print("zscores sorted and unique: ", zscores_list)
                import seaborn as sns
                sns.displot(zscores_list, color="maroon")
                plt.xlabel("zscore", labelpad=14)
                plt.ylabel("probability of occurrence", labelpad=14)
                plt.title("Percent Ratio Z-scores distribution " + plot_date, y=1.015, fontsize=10)
                #plt.show()
                plt.savefig('plots/stats/zscores/'+symbol+'_'+str(plot_date)+'_'+periodicity+'.png',bbox_inches='tight')
                plt.clf()
    
                import matplotlib.pyplot as plt2
    
                x_cnt = 0
                # https://matplotlib.org/stable/gallery/color/named_colors.html
                for i in zscores:
                    # reset on every point so a value outside the ranges below
                    # does not inherit the previous point's color
                    color = 'grey'
                    if -1 < i < 0:
                        color = 'orange'
                    elif -2 < i <= -1:
                        color = 'indianred'
                    elif -3 < i <= -2:
                        color = 'firebrick'
                    elif -4 < i <= -3:
                        color = 'maroon'

                    elif 0 <= i < 1:
                        color = 'yellow'
                    elif 1 <= i < 2:
                        color = 'green'
                    elif 2 <= i < 3:
                        color = 'forestgreen'
                    elif 3 <= i < 4:
                        color = 'darkgreen'
                    elif 4 <= i < 5:
                        color = 'darkolivegreen'
                    elif i >= 5:
                        color = 'black'

                    # x and y are both the z-score, so the points fall along a colored diagonal
                    plt2.scatter(i, zscores[x_cnt], c=color)
                    x_cnt += 1
                # depict first scatted plot
                #plt.scatter(x, y, c='blue')
                print('plots/stats/zscores/scatter/'+str(plot_date)+'_'+symbol+'_'+periodicity+'.png')
                plt2.savefig('plots/stats/zscores/scatter/'+str(plot_date)+'_'+symbol+'_'+periodicity+'.png',bbox_inches='tight')
                plt2.clf()
                # depict illustration
                #plt.show()
    
    

    This function writes the plots into a directory for safekeeping and reference as needed. The first part of the function generates the z-score distribution charts (drawn with seaborn’s displot), and the second part generates the rainbow spectrum scatter charts of z-scores. You have to pass in a pandas Series of z-scores to be plotted (the calling code above passes a DataFrame column), plus the other arguments, which are self-explanatory. You’ll need to include these libraries in your own code for this function to work.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
    from plotly.offline import iplot, init_notebook_mode
    
    import seaborn as sns
    
        def zscores_scatter_by_day(self, plot_date, symbol, data, periodicity='all'):
    ...
    ...
    ...

    I hope this can provide some insights to others on how to plot z-scores and work with pandas.

  • Cautionary Tales of Statistical Methods

    Cautionary Tales of Statistical Methods

        There are some important things to understand about Statistical Methods in Artificial Intelligence.  Statistics is, at bottom, a mathematical approximation of real things in the real world, so it will always be an oversimplification of them.  A mathematically based machine such as the computer, however, must formulate its response (whatever an Algorithm returns) using mathematical methods, which are not necessarily related to natural mechanics.  A later section goes over the Machine Bias that develops from these statistical methods.  We should not treat the results of computation as some sort of objective truth; they are shaped by mathematical equations and the limitations of approximation.

        Another complicating factor, and a source of noise and error-filled results when algorithms are not properly set up, is the use of Probability in AI.  Ian Goodfellow and his co-authors describe the use of probability in AI:


    Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty as well as axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.


    Machine learning must always deal with uncertain quantities and sometimes stochastic (nondeterministic) quantities. Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s.  (Goodfellow, et al, 2016, Ch. 3)

        Goodfellow et al go on to describe three possible sources of uncertainty: inherent stochasticity (non-determinism) in the system being modeled, incomplete observability, and incomplete modeling.  In the first case, randomness has an effect.  The second case arises in an open world: if we do not have complete data, as we would when every piece of a chess game on its limited board is known, then we have incomplete observability.  For the third case, the problem of outliers is a good example: AI algorithms may ignore data that lies at the periphery of the other data points, while too much specificity in the rules can lead to overfitting and break the model.
        Probability extends logic to deal with uncertainty. Logic provides a set of formal rules, such as Peircean Logic, for determining which propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Peircean Logic adds a third option, true & false, which raises certain questions about non-Boolean methods of validation checking. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.  Russell & Norvig, the authors of what is considered the textbook on AI, give no examples of Peircean Logic beyond a mention of C.S. Peirce in historical summaries. So if developers use Peircean Logic, what form of validation do they use to check that a result is accurate? Since Peircean Logic is used in defense sector computer engineering for automation and control, this is an important question.
        Another source of noise or error in computational algorithms that rely on Artificial Intelligence is the problem of numerical computation.  Goodfellow et al explain:

        The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error. One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic will usually change these infinite values into not-a-number values. One example of a function that must be stabilized against underflow and overflow is the softmax function.
    (Goodfellow et al, 2016, Ch. 4)
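
    The overflow problem in the softmax function mentioned in the quote can be shown in a few lines. This is a minimal sketch using NumPy, not code from the book: the function names are made up for this example, and subtracting the largest input before exponentiating is the commonly used stabilization trick.

    ```python
    import numpy as np

    def softmax_naive(x):
        """Direct translation of the formula; overflows for large inputs."""
        e = np.exp(x)
        return e / e.sum()

    def softmax_stable(x):
        """Shift by the maximum so the largest exponent is exp(0) = 1."""
        x = np.asarray(x, dtype=float)
        e = np.exp(x - x.max())
        return e / e.sum()

    logits = np.array([1000.0, 1001.0, 1002.0])

    # exp(1000) overflows to inf, and inf / inf becomes nan
    with np.errstate(over="ignore", invalid="ignore"):
        naive = softmax_naive(logits)

    stable = softmax_stable(logits)
    print("naive: ", naive)
    print("stable:", stable)
    ```

    Both functions are the same on paper, because shifting every input by a constant does not change the softmax result; only the second one survives large inputs in floating point.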

    Yet, as many programmers of algorithms can tell you, there is often a binary choice between 0 and 1, and if we always assume a hard division between 0 and 1 then rounding error may come into play.  As one can see, a machine that runs at unbelievably fast rates and usually produces correct results can still, given these and other complicating factors, generate large errors when it is not accurate and precise, and those errors can compound over time.  One journalist writing about the use of AI in major corporations makes a great point:

    Mathematicians say that it’s impossible to make a “perfect decision” because of systems of complexity and because the future is always in flux, right down to a molecular level. It would be impossible to predict every single possible outcome, and with an unknowable number of variables, there is no way to build a model that could weigh all possible answers. Decades ago, when the frontiers of AI involved beating a human player at checkers, the decision variables were straightforward. Today, asking an AI to weigh in on a medical diagnosis or to predict the next financial market crash involves data and decisions that are orders of magnitude more complex. So instead, our systems are built for optimization. Implicit in optimizing is unpredictability—to make choices that deviate from our own human thinking. (Webb, 2019)

    These are just some of the factors that can lead to Machine Bias in AI Models. Since the results of an AI algorithm depend on the correctness of the Model, a bad model can lead to a completely destructive algorithm. There are also other sources of Machine Bias, which we shall now cover.

    Machine Bias
       
        In a much-reported incident, it was discovered that Image Recognition software was identifying Africans as gorillas rather than as humans[1].  This raised great concerns about the bias of the Machine Intelligence algorithms employed in Image Recognition. Although it is true that a hard-right political activist, Robert Mercer, was involved in the development of Image Recognition software while at IBM, it was discovered that the problem lay in the AI’s inability to deal properly with dark colors.  This is just one example of Machine Bias in an AI system; once again, the assumption that a machine is neutral and objective does not always hold. For instance, problems are also reported in automated policing systems that rely on AI, which in the US seem to target African-Americans.  Similar results were demonstrated in automated judicial processes where AI is employed to decide court cases.  There is also the episode in which Microsoft released a chat bot AI that was quickly skewed into voicing far-right neo-Nazi rhetoric through ‘data poisoning’ attacks (see below), and Google Search algorithms that also skewed toward the far right [2]. In automated Intelligence systems, never mind that the culture in Intelligence is generally biased toward the Right, there would necessarily be existing biases that could slant the AI system into an even more biased depiction of actual events in the world. Just ask yourself: how would a system that is investigating ISIS, or other Islamist Jihadist groups, view Muslims in general? Would it be biased? A paper on the dangers of Lethal Autonomous Weapons Systems (LAWS) points out some of the dangers of machine bias:

    However, as ‘intelligent’ software and machines need to be ‘fed’ by a huge amount of data in order to ‘learn’ (a trait that we deem ‘intelligent’), there exists the risk that they learn human prejudices from biased data. And so-called machine biases constitute a danger for AI-controlled or autonomous systems that some experts regard as far more acute than LAWS. Based on the data a bot is fed in order to learn, it could learn, e.g., to discriminate against people of color or minorities, or gain a strict political attitude. (Shurber, 2018, pg. 17-8)

    Bias is not limited to Intelligence or the Social Sciences.  It can even be encountered when autonomous systems respond to each other in Financial Markets: when autonomous agents (or semiotic agents) are left to their own devices they capitalize, leading to economic bias:

    Researchers at the University of Bologna in Italy created two simple reinforcement-learning-based pricing algorithms and set them loose in a controlled environment. They discovered that the two completely autonomous algorithms learned to respond to one another’s behavior and quickly pulled the price of goods above where it would have been had either operated alone.
    (Calvano et al, 2018)


        We may assume that, since in many cases we can look at the code of an AI algorithm, we can understand from its results what is happening inside an AI system.  However, this is not always true; here we encounter what is known as the ‘blackbox’ problem.  In Software Engineering we have the terms white box and black box testing. A black box tester does not have access to the code or inner workings of the software under test and has to find bugs despite this blindness, which can actually be of more value: the tester knows the software’s intended purpose but is not biased by the code and can only go on its effects.  In AI the blackbox comes into effect in a more complex way:

    That inability to observe how AI is optimizing and making its decisions is what’s known as the “black box problem.” Right now, AI systems built by the Big Nine might offer open-source code, but they all function like proprietary black boxes. While they can describe the process, allowing others to observe it in real time is opaque. With all those simulated neurons and layers, exactly what happened and in which order can’t be easily reverse-engineered. (Webb, 2019)

        As is pointed out, even the developers who code the algorithm cannot fully account for, or even properly anticipate, the results of an AI algorithm, because of the blackbox problem.  Now imagine developing an Automated Surveillance system with this problem, and bias, in mind: the dangers of such a system are not hard to realize.

    Security of AI

        Data poisoning in machine learning refers to the intentional or unintentional injection of malicious or incorrect data into the training dataset of a machine learning model with the goal of manipulating its behavior. This can happen in various ways, for example:

    An attacker might add instances to the training data that are specifically designed to cause the model to make incorrect predictions. An attacker might also manipulate existing instances in the training data to cause the model to make incorrect predictions.

    A researcher might unknowingly use biased or contaminated data in their experiments, resulting in a poisoned model. Data poisoning can have severe consequences, as it can cause a machine learning model to make incorrect or biased predictions, undermining its usefulness and potentially causing harm. The effects of data poisoning can be especially severe in sensitive applications such as medical diagnosis, self-driving cars, and financial fraud detection, where incorrect predictions can have serious consequences.

        To prevent data poisoning, it is important to ensure that the training data is cleaned, preprocessed and validated. One way to do this is by using techniques such as data sanitization and data validation. Additionally, it’s important to monitor the performance of the model during training and testing to detect any signs of data poisoning.
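
    As a rough illustration of what data sanitization can look like, the sketch below drops training points that lie unusually far from the center of their own class. This is one simple heuristic among many, not a complete defense, and everything in it is an assumption made for the example: the function name `filter_far_from_center`, the use of the per-class median as the center, and the 95% cutoff.

    ```python
    import numpy as np

    def filter_far_from_center(X, y, quantile=0.95):
        """Return a boolean mask of rows to keep (hypothetical helper).

        For each class, points farther from the class median than the given
        quantile of distances are treated as suspicious and dropped.
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        keep = np.zeros(len(y), dtype=bool)
        for label in np.unique(y):
            idx = np.where(y == label)[0]
            # The median is less sensitive to a few injected points than the mean
            center = np.median(X[idx], axis=0)
            dist = np.linalg.norm(X[idx] - center, axis=1)
            threshold = np.quantile(dist, quantile)
            keep[idx[dist <= threshold]] = True
        return keep

    # Two clean clusters plus one injected point labelled as class 0
    rng = np.random.default_rng(0)
    clean_0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
    clean_1 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
    poisoned = np.array([[50.0, 50.0]])
    X = np.vstack([clean_0, clean_1, poisoned])
    y = np.array([0] * 50 + [1] * 50 + [0])

    mask = filter_far_from_center(X, y)
    print("kept", int(mask.sum()), "of", len(y), "points")
    print("poisoned point kept?", bool(mask[-1]))
    ```

    Note the trade-off: a fixed cutoff also discards some legitimate points at the edge of each class, and an attacker who places points close to the class center will not be caught by this check.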

        Machine learning also poses certain cyber security concerns that need to be addressed. Some of the main cyber security concerns with machine learning include:

    Data privacy and security: Machine learning models are trained on large amounts of data, which can include sensitive information such as personal information, financial data, and medical records. This data needs to be protected from unauthorized access and misuse to prevent data breaches and protect individuals’ privacy.

    Adversarial attacks: Machine learning models are vulnerable to adversarial attacks, in which an attacker attempts to manipulate the input data to cause the model to make incorrect predictions. This can be done by introducing small, carefully crafted changes to the input data (such as adding noise to an image), which can cause the model to make incorrect predictions without being detected.

    Model inversion attacks: Machine learning models can be used to reverse-engineer sensitive information from the model’s internal representations. This can be done by feeding the model input data and observing the internal representations, which can reveal sensitive information such as personal information, medical records, and financial data.

    Poisoning attacks: Machine learning models are also vulnerable to poisoning attacks, in which an attacker injects malicious data into the training data in an attempt to manipulate the model’s behavior. This can cause the model to make incorrect predictions and can be difficult to detect.

    Privacy issues with federated learning: Federated learning is a machine learning technique in which a model is trained across multiple devices that each keep their own data locally, rather than the data being centrally stored. This can improve data privacy, but it also raises concerns about the security of the data on individual devices and the potential for data breaches.

    To mitigate these cyber security concerns, it’s important to implement robust security measures such as data encryption, secure communication protocols, and access controls. Additionally, it’s important to monitor the performance of the model during training and testing to detect any signs of attacks, and to use techniques such as adversarial training to make the model more robust to attacks. It should also be pointed out that ML is increasingly being used by crackers or ‘hackers’ to infiltrate systems in an automated way. 
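
    The adversarial attacks described in the list above can be illustrated on a very small model. The sketch below applies the fast gradient sign method, a well-known attack that is not discussed in this article, to a hand-written logistic regression. The weights, the input and the step size `eps` are arbitrary values chosen for the example.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fgsm_perturb(x, y, w, b, eps):
        """Move x by eps in the direction that increases the logistic loss."""
        p = sigmoid(w @ x + b)
        # Gradient of the cross-entropy loss with respect to the input x
        grad = (p - y) * w
        return x + eps * np.sign(grad)

    # A fixed, already "trained" model (values are illustrative only)
    w = np.array([2.0, -1.0])
    b = 0.0

    x = np.array([0.3, 0.1])   # original input, true label 1
    y = 1
    x_adv = fgsm_perturb(x, y, w, b, eps=0.3)

    p_clean = sigmoid(w @ x + b)
    p_adv = sigmoid(w @ x_adv + b)
    print("clean input:    ", x, "-> P(class 1) =", round(float(p_clean), 3))
    print("perturbed input:", x_adv, "-> P(class 1) =", round(float(p_adv), 3))
    ```

    No feature moves by more than `eps`, yet the predicted class changes. Adversarial training, mentioned above, amounts to adding such perturbed inputs with their correct labels back into the training set.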

        As mentioned above, attacks on Machine Learning algorithms can be accomplished through data poisoning.  Researchers define data poisoning as follows:

    Machine learning systems trained on user-provided data are susceptible to data poisoning attacks, whereby malicious users inject false training data with the aim of corrupting the learned model. While recent work has proposed a number of attacks and defenses, little is understood about the worst-case loss of a defense in the face of a determined attacker. (Steinhardt et al, 2018)

    In a certain sense one can see a parallel between data poisoning and Information Operations, or Thought Injection attacks. Both resemble the hacking technique of SQL Injection, where a hacker injects SQL code via http requests to corrupt or poison a database, dataset or collection.  The study of data poisoning began around 2012, some time after ML was used in Defense contracting; whether there are mechanisms to protect against data poisoning in those systems is unknown due to the classified nature of such work.  There are countermeasures being developed against such attacks.  At the 2018 IEEE Symposium on Security and Privacy, Matthew Jagielski provided one such methodology. In the below diagram we can see the flow of the attack vector and the results.  Again, validating results becomes a key criterion in developing resilient and accurate algorithms, so developers will have to be vigilant with their Test and Validate cycles as AI takes a larger and larger place in our society.

    Algorithmic Complexity

        Aside from Machine Bias, another area of concern in algorithm engineering is that of Algorithmic Complexity (See Graph below).  One complicating factor in algorithmic complexity is the degree to which the components of a system are visible and connected to the other elements within the program.  As such, a compartmentalized, non-visible architecture, as used in covert operations and computation, will by necessity of its invisibility to each component create a complex algorithmic environment, leading to unforeseen programming outputs and possibly circular and contradictory logic.  This is the opposite of what is foreseen in the Cybernetics of Stafford Beer, where each node in the algorithmic management system is visible.  In this sense Beer’s algorithms, which are being developed by sustainability advocates for such things as democratic finance, are the opposite of a covert system, and their transparency gives them a technical advantage.  Beer’s system would resemble the nodes on the right in the chart below, whereas a covert system would resemble the graph on the left below.