Category: Statistics

theoretical writing on the question of statistics

  • AI LLM Text Extraction of Financial Data (EPS)

    One of the projects I am working on is a fintech application, Céillí. One of the data points it presents to the stock trading community is Earnings Per Share (EPS). It’s a key financial metric that represents the portion of a company’s profit allocated to each outstanding share of common stock. EPS is calculated by dividing the company’s net income (after preferred dividends, if any) by the average number of outstanding common shares during the period. It’s commonly used to gauge a company’s profitability and is a vital input for valuation metrics like the price-to-earnings (P/E) ratio.

    Below is an example Python function that calculates the Earnings Per Share (EPS) for a commodity-producing company (or any company); the resulting EPS value can then be used to compute the Price-to-Earnings (P/E) ratio. You need the net income and outstanding shares data to calculate it:

    
    def calculate_eps(net_income: float, num_shares: float) -> float:
        """
        Calculate Earnings Per Share (EPS).
    
        Args:
            net_income (float): The company's net income.
            num_shares (float): The average number of outstanding shares.
    
        Returns:
            float: EPS value.
        """
        if num_shares == 0:
            raise ValueError("Number of shares cannot be zero")
        return net_income / num_shares
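
    As a minimal sketch of the next step, the P/E ratio can be derived from an EPS value and a share price. The function name calculate_pe and all figures below are hypothetical and chosen only for illustration; rejecting zero or negative EPS is a design assumption, since P/E is usually reported as not meaningful in that case.

```python
def calculate_pe(price: float, eps: float) -> float:
    """Price-to-Earnings ratio: share price divided by earnings per share."""
    if eps <= 0:
        # Assumption: treat P/E as undefined for zero or negative earnings
        raise ValueError("P/E is not meaningful when EPS is zero or negative")
    return price / eps


# Hypothetical figures, for illustration only
net_income = 50_000_000.0   # net income after preferred dividends
avg_shares = 25_000_000.0   # average outstanding common shares
share_price = 30.00         # latest closing price

eps = net_income / avg_shares         # earnings per share
pe = calculate_pe(share_price, eps)   # price-to-earnings ratio
print(f"EPS = {eps:.2f}, P/E = {pe:.1f}")
```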

    Getting EPS is a cumbersome task: the data is not available through a free API, and the providers that do offer it charge a hefty fee that can eat into your returns. This information is usually contained in transcripts of earnings reports and in general market reports in the media.

    For my acquisition of EPS data I rely on spidering news headlines, which I also use for sentiment analysis. I noticed that much of the information needed for fundamental analysis, rather than technical analysis, is contained in the news headlines as well as in the transcripts of earnings calls, where companies present their productive value at quarterly and annual conference calls. Usually, one would rely on regular expressions (regex) to extract information from text. Now, with the advent of Large Language Models and Generative AI, this task can become a lot less cumbersome and more automated. One way to do this is to use Python and an LLM hosted on Hugging Face and developed by NuMind: https://huggingface.co/numind/NuExtract-1.5-tiny

    At this link you can find more code samples using Python. For my purposes, I used the code below to extract EPS from the text. Basically, you use a JSON template to structure the data, and the LLM uses it to place key data points from the text into that structure, which could then easily be pushed automatically to a database for any UI needs you may have.

    import json
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    
    
    def predict_NuExtract(model, tokenizer, text, schema, examples=["", "", ""]):
        # Parse and reformat the schema
        schema = json.dumps(json.loads(schema), indent=4)
        input_llm = "<|input|>\n" + schema + "\n"
        
        # Only add examples if they are non-empty valid JSON strings
        for ex in examples:
            if ex.strip():  # only process if not empty
                input_llm += json.dumps(json.loads(ex), indent=4) + "\n"
        
        # Add the text to extract data from
        input_llm += "### Text:\n" + text + "\n<|output|>\n"
        
        # Tokenize the prompt and generate the output
        input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to(device)
        output = tokenizer.decode(
            model.generate(**input_ids, use_cache=False)[0], skip_special_tokens=True
        )
        # Keep only the text between the output markers
        return output.split("<|output|>")[1].split("<|end-output|>")[0]
    
    model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-1.5-tiny", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-1.5-tiny", trust_remote_code=True)
    model.to(device)
    model.eval()
    
    text = ["Relmada Therapeutics Q4 2024 GAAP EPS $(0.62) Beats $(0.70) Estimate.",
    "Clearside Biomedical Q4 2024 GAAP EPS $(0.10), Inline, Sales $306.00K Beat $176.67K Estimate.",
    "Argan Q4 2024 GAAP EPS $2.22 Beats $1.15 Estimate, Sales $232.474M Beat $197.500M Estimate.",
    "Plus Therapeutics FY24 EPS $(1.95) Vs. $(4.24) YoY, Grant Revenue $5.8M Up From $4.9M YoY",
    "SeaStar Medical Holding Q4 EPS $(0.90) Misses $(0.89) Estimate, Sales $67.00K Miss $150.00K Estimate.",
    "Pulse Biosciences Q4 EPS $(0.31) Down From $(0.21) YoY.",
    "CalAmp FY 2024 GAAP EPS $(11.04), Inline.",
    "VirTra Q4 2024 GAAP EPS $(0.08) Misses $0.04 Estimate, Sales $5.40M Miss $7.45M Estimate.",
    "Better Choice Q4 EPS $(0.50), Sales $7.2M Up 26% From YoY."]
    
    schema = """{
        "company": "",
        "period": "",
        "eps_data": {
            "eps_type": "",
            "actual_eps": "",
            "eps_estimate": "",
            "eps_result": ""
        },
        "sales_data": {
            "actual_sales": "",
            "sales_estimate": "",
            "sales_result": ""
        }
    }"""
    
    for i in text:
        prediction = predict_NuExtract(model, tokenizer, i, schema) 
        print(prediction)   
    
    
    
    '''
    Output: 
    
    {
        "company": "Relmada Therapeutics",
        "period": "Q4 2024",
        "eps_data": {
            "eps_type": "GAAP",
            "actual_eps": "0.62",
            "eps_estimate": "0.70",
            "eps_result": "$(0.62)"
        },
        "sales_data": {
            "actual_sales": "",
            "sales_estimate": "",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "Clearside Biomedical",
        "period": "Q4 2024",
        "eps_data": {
            "eps_type": "GAAP",
            "actual_eps": "0.10",
            "eps_estimate": "176.67K",
            "eps_result": ""
        },
        "sales_data": {
            "actual_sales": "$306.00K",
            "sales_estimate": "$176.67K",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "Argan",
        "period": "Q4 2024",
        "eps_data": {
            "eps_type": "GAAP",
            "actual_eps": "$2.22",
            "eps_estimate": "$1.15",
            "eps_result": "$232.474M"
        },
        "sales_data": {
            "actual_sales": "$232.474M",
            "sales_estimate": "$197.500M",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "Plus Therapeutics",
        "period": "FY24",
        "eps_data": {
            "eps_type": "EPS",
            "actual_eps": "1.95",
            "eps_estimate": "4.24",
            "eps_result": ""
        },
        "sales_data": {
            "actual_sales": "5.8M",
            "sales_estimate": "",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "SeaStar Medical Holding",
        "period": "Q4",
        "eps_data": {
            "eps_type": "",
            "actual_eps": "0.90",
            "eps_estimate": "0.89",
            "eps_result": "Misses"
        },
        "sales_data": {
            "actual_sales": "$67.00K",
            "sales_estimate": "$150.00K",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "Pulse Biosciences",
        "period": "Q4",
        "eps_data": {
            "eps_type": "EPS",
            "actual_eps": "0.31",
            "eps_estimate": "0.21",
            "eps_result": "Down From"
        },
        "sales_data": {
            "actual_sales": "",
            "sales_estimate": "",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "CalAmp",
        "period": "FY 2024",
        "eps_data": {
            "eps_type": "GAAP",
            "actual_eps": "11.04",
            "eps_estimate": "",
            "eps_result": ""
        },
        "sales_data": {
            "actual_sales": "",
            "sales_estimate": "",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "VirTra",
        "period": "Q4 2024",
        "eps_data": {
            "eps_type": "GAAP",
            "actual_eps": "0.08",
            "eps_estimate": "0.04",
            "eps_result": "Misses"
        },
        "sales_data": {
            "actual_sales": "$5.40M",
            "sales_estimate": "$7.45M",
            "sales_result": ""
        }
    }
    
    Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
    
    {
        "company": "Better Choice",
        "period": "Q4",
        "eps_data": {
            "eps_type": "EPS",
            "actual_eps": "0.50",
            "eps_estimate": "",
            "eps_result": ""
        },
        "sales_data": {
            "actual_sales": "$7.2M",
            "sales_estimate": "",
            "sales_result": "Up 26% From YoY"
        }
    }
    
    
    '''

    As one can see, results can vary: the model drops the parentheses that mark negative EPS values, and in the Clearside Biomedical and Argan headlines it places sales figures in EPS fields. One could introduce further steps into this basic process to refine the output, or even use case-based models that are specific to different patterns of headlines, as there are set ways that financial data is presented in the market news. Other financial ratios are also reported in the news, so the same ability can help with collecting EBITDA, P/E, etc.

    OpenAI’s ChatGPT had this to add to the EPS discussion, which I found interesting:
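
    One possible refinement step is a plain regex cross-check run on the raw headline next to the model output. The sketch below is an assumption about how such a check could look, not part of the pipeline above: the helper names parse_money and eps_from_headline are hypothetical, the pattern only covers the "EPS $x Beats/Misses $y Estimate" headline style shown in the sample list, and parentheses are read as negative values (accounting notation).

```python
import re

_SCALE = {"K": 1e3, "M": 1e6, "B": 1e9}

# Matches e.g. "EPS $(0.62) Beats $(0.70) Estimate" or "EPS $(11.04), Inline"
_EPS_PATTERN = re.compile(
    r"EPS\s+(\$\(?[\d.]+\)?)"
    r"(?:\s+(?:Beats|Misses)\s+(\$\(?[\d.]+\)?)\s+Estimate)?"
)


def parse_money(value):
    """Convert "$(0.62)", "$2.22" or "$306.00K" to a signed float, else None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([KMB])?", value.replace(",", ""))
    if match is None:
        return None
    number = float(match.group(1)) * _SCALE.get(match.group(2), 1.0)
    # Parentheses mark a negative value
    return -number if "(" in value else number


def eps_from_headline(headline):
    """Extract actual EPS, estimate and a computed beat/miss label."""
    match = _EPS_PATTERN.search(headline)
    if match is None:
        return None
    actual = parse_money(match.group(1))
    estimate = parse_money(match.group(2)) if match.group(2) else None
    if estimate is None:
        result = None
    elif actual > estimate:
        result = "beat"
    elif actual < estimate:
        result = "miss"
    else:
        result = "inline"
    return {"actual_eps": actual, "eps_estimate": estimate, "eps_result": result}


headlines = [
    "Relmada Therapeutics Q4 2024 GAAP EPS $(0.62) Beats $(0.70) Estimate.",
    "VirTra Q4 2024 GAAP EPS $(0.08) Misses $0.04 Estimate, Sales $5.40M Miss $7.45M Estimate.",
    "CalAmp FY 2024 GAAP EPS $(11.04), Inline.",
]
parsed = [eps_from_headline(h) for h in headlines]
for row in parsed:
    print(row)
```

    Comparing the signed values from such a check with the LLM's fields would flag records where the sign was lost or where a sales figure landed in an EPS field.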

    me: is there an optimal eps value or is it relative to each stock? what constitutes a miss?

    ChatGPT said:

    EPS (Earnings Per Share) isn’t a one-size-fits-all metric—it’s relative to the company, its industry, historical performance, and market expectations. There isn’t an “optimal” EPS value that applies across the board. Instead, investors look at trends, growth rates, and how the current EPS compares to analysts’ forecasts or the company’s own guidance.

    What Constitutes a “Miss”

    A “miss” typically occurs when a company’s reported EPS is below the consensus estimate or its internal guidance. Analysts and investors use these estimates as benchmarks, so if the actual EPS falls short, it’s seen as underperformance—even if the EPS might be strong in absolute terms. The severity of the miss can affect investor sentiment and, consequently, the stock’s price.

    Key Considerations

    • Relative Performance:
      For one company, an EPS of, say, $2.00 might be excellent if the industry average is $1.00, while another company might be expected to have an EPS of $5.00 based on its size and market position.
    • Growth Trends:
      Investors often consider not just the current EPS but also its trend over time. Consistent growth can be more important than a single high or low number.
    • Market Expectations:
      EPS is closely watched around earnings announcements. A miss relative to consensus estimates can trigger a negative reaction, even if the company is profitable overall.

    In summary, the “optimal” EPS is relative and context-dependent, and a miss is typically defined by falling short of pre-determined expectations rather than a fixed numeric benchmark.

    me: if i have the eps for a company how do i get the P/E, what is the price i use?

    ChatGPT said:

    The Price/Earnings (P/E) ratio is calculated by dividing the current market price of a company’s stock by its earnings per share (EPS). So if you have the EPS, the price you use is typically the current share price—often the last traded or closing price.

    For example, if a company’s EPS is $2.00 and its current share price is $30.00, then the P/E ratio is calculated as:

    $$\text{P/E} = \frac{\text{Current Share Price}}{\text{EPS}} = \frac{30.00}{2.00} = 15$$

    There are a couple of variations depending on your analysis:

    • Trailing P/E: Uses the current share price and the EPS from the past 12 months.
    • Forward P/E: Uses the current share price and forecasted EPS.

    In most standard cases, you’d use the latest closing price from the market to compute the P/E ratio.

    me: how do you get the trailing p/e, do you use daily bars? and how do you get the forecasted EPS?

    ChatGPT said:

    Trailing P/E is calculated by dividing the current market price (usually the last closing price) by the earnings per share over the trailing twelve months (TTM EPS). Here’s how it typically works:

    • Trailing P/E:
      • Price: You take the most recent closing price of the stock. While daily bars provide the closing price, you don’t calculate the EPS from daily bars. Instead, you use the EPS figure from the company’s quarterly reports and annualize it (i.e., sum up the last four quarters of reported EPS).
      • EPS: The earnings data comes from the company’s financial reports. For example, if you have EPS for Q1, Q2, Q3, and Q4, you add those together to get the TTM EPS.
      • Calculation: $$\text{Trailing P/E} = \frac{\text{Current Stock Price}}{\text{TTM EPS}}$$
    • Forecasted (Forward) EPS:
      • Source: Forecasted EPS isn’t derived directly from historical price data or daily bars. Instead, it’s typically provided by analysts’ consensus estimates or company guidance. Financial data providers (like Bloomberg, FactSet, or Reuters) compile these forecasts.
      • Usage: You take the current stock price and divide it by the forecasted EPS (often for the next 12 months) to get the forward P/E ratio.

    In summary, daily bars are mainly used to get the current price (closing price) for the calculation. The earnings figure for trailing P/E comes from reported quarterly earnings, and forecasted EPS comes from analysts’ estimates or management’s guidance.
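
    The trailing and forward calculations described above can be sketched in a few lines. All numbers are hypothetical, and the function names are my own; the forward EPS would in practice come from analyst consensus or company guidance rather than from price data.

```python
def ttm_eps(quarterly_eps):
    """Trailing-twelve-month EPS: the sum of the last four reported quarters."""
    if len(quarterly_eps) < 4:
        raise ValueError("need at least four quarters of EPS")
    return sum(quarterly_eps[-4:])


def trailing_pe(close_price, quarterly_eps):
    """Latest closing price divided by TTM EPS."""
    return close_price / ttm_eps(quarterly_eps)


def forward_pe(close_price, forecast_eps):
    """Latest closing price divided by forecasted EPS (e.g. next 12 months)."""
    return close_price / forecast_eps


# Hypothetical inputs, for illustration only
last_close = 30.00                      # most recent closing price from daily bars
reported = [0.40, 0.45, 0.50, 0.55, 0.50]  # quarterly EPS, oldest first
consensus_forecast = 2.50               # analysts' forecasted EPS

t_pe = trailing_pe(last_close, reported)
f_pe = forward_pe(last_close, consensus_forecast)
print(f"TTM EPS = {ttm_eps(reported):.2f}, trailing P/E = {t_pe:.1f}, forward P/E = {f_pe:.1f}")
```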

    Here is a development version. Although rough in visual style as a prototype, it gives you a picture of how you can put this all together for algorithmic insights into the market.

  • Mean Reversion and Stationarity in Statistics

    In statistics, regression toward the mean (also called regression to the mean, reversion to the mean, and reversion to mediocrity) is the phenomenon where if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean. Furthermore, when many random variables are sampled and the most extreme results are intentionally picked out, it refers to the fact that (in many cases) a second sampling of these picked-out variables will result in “less extreme” results, closer to the initial mean of all of the variables.

    Mathematically, the strength of this “regression” effect is dependent on whether or not all of the random variables are drawn from the same distribution, or if there are genuine differences in the underlying distributions for each random variable. In the first case, the “regression” effect is statistically likely to occur, but in the second case, it may occur less strongly or not at all.

    Regression toward the mean is thus a useful concept to consider when designing any scientific experiment, data analysis, or test, which intentionally selects the most extreme events – it indicates that follow-up checks may be useful in order to avoid jumping to false conclusions about these events; they may be genuine extreme events, a completely meaningless selection due to statistical noise, or a mix of the two cases.

    source: https://en.wikipedia.org/wiki/Regression_toward_the_mean

    How to test for mean reversion:

    Mathematically, a continuous mean-reverting time series can be represented by an Ornstein-Uhlenbeck stochastic differential equation in the following form:

    $$dx_t = \theta(\mu - x_t)\,dt + \sigma\,dW_t$$

    Where θ is the rate of reversion to the mean, μ is the mean value of the process, σ is the volatility of the process (the scale of the random shocks) and, finally, Wt is a Wiener process.
    The given equation implies that the change of the time series in the next period is proportional to the difference between the mean and the current value, with the addition of Gaussian noise.

    source: https://medium.com/bluekiri/simple-stationarity-tests-on-time-series-ad227e2e6d48
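
    To get a feel for such a process, it can be simulated with a simple Euler discretisation: each step moves the value toward μ by a fraction θ·dt of the current gap and adds Gaussian noise. This is a minimal sketch; the function name simulate_ou and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def simulate_ou(theta, mu, sigma, x0, n_steps, dt=1.0, seed=0):
    """Euler simulation of a mean-reverting (Ornstein-Uhlenbeck) path."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        # Wiener increment over a step of length dt
        dw = rng.normal(0.0, np.sqrt(dt))
        # Pull toward the mean plus a random shock
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * dw
    return x


# Hypothetical parameters: start above the mean and watch the path revert
path = simulate_ou(theta=0.1, mu=100.0, sigma=1.0, x0=110.0, n_steps=2000)
print("start:", path[0])
print("average of the last 1000 values:", path[-1000:].mean())
```

    A series generated this way is a convenient test input for the stationarity checks that follow, since its true mean and reversion rate are known.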

    We can see mean reversion as the line of linear regression as in this plot:

    A key concept in testing for mean reversion is that of stationarity:

    In mathematics and statistics, a stationary process (also called a strict/strictly stationary process or strong/strongly stationary process) is a stochastic process whose statistical properties, such as mean and variance, do not change over time. More formally, the joint probability distribution of the process remains the same when shifted in time. This implies that the process is statistically consistent across different time periods. Because many statistical procedures in time series analysis assume stationarity, non-stationary data are frequently transformed to achieve stationarity before analysis.

    source: https://en.wikipedia.org/wiki/Stationary_process

    (source: https://www.youtube.com/watch?v=I3NjeRXIs5k, accessed 10/3/25)

    The Augmented Dickey-Fuller test provides a quick check and confirmatory evidence that your time series is stationary or non-stationary. The ADF test is based on the simple observation that if the value level is higher than the mean, the next move will be downward while if the value is lower than the mean, the next move will be upward.

    In the Python code below we simply interpret the result using the p-value from the test. The null hypothesis of the ADF test is that the series has a unit root, i.e. that it is non-stationary. A p-value below a specified threshold (we are going to use 5%) suggests we reject the null hypothesis and treat the series as stationary; a p-value above the threshold means we cannot reject the null hypothesis, so the series may be non-stationary.

    import numpy as np
    from statsmodels.regression.linear_model import OLS
    from statsmodels.tsa.tsatools import lagmat, add_trend
    from statsmodels.tsa.adfvalues import mackinnonp
    
    def adf(ts):
        """
        Augmented Dickey-Fuller unit root test
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
        
        # Get the dimension of the array
        nobs = ts.shape[0]
        
        # We use 1 as maximum lag in our calculations
        maxlag = 1
        
        # Calculate the discrete difference
        tsdiff = np.diff(ts)
        
        # Create a 2d array of lags, trim invalid observations on both sides
        tsdall = lagmat(tsdiff[:, None], maxlag, trim='both', original='in')
        # Get dimension of the array
        nobs = tsdall.shape[0] 
        
        # replace 0 xdiff with level of x
        tsdall[:, 0] = ts[-nobs - 1:-1]  
        tsdshort = tsdiff[-nobs:]
        
        # Calculate the linear regression using an ordinary least squares model    
        results = OLS(tsdshort, add_trend(tsdall[:, :maxlag + 1], 'c')).fit()
        adfstat = results.tvalues[0]
        
        # Get approx p-value from a precomputed table (from stattools)
        pvalue = mackinnonp(adfstat, 'c', N=1)
        return pvalue

    source: https://medium.com/bluekiri/simple-stationarity-tests-on-time-series-ad227e2e6d48

    This code can also be validated against the function adfuller, included in the Python module statsmodels.

    One can also test stationarity with the Hurst exponent. It measures the speed of diffusion of the series, which for a mean-reverting series should be slower than in a geometric random walk. The speed of diffusion is measured by the variance of the lagged differences.

    In code, we can estimate the Hurst exponent with the following function from Corrius (2018):

    import numpy as np

    def hurst(ts):
        """
        Returns the Hurst Exponent of the time series vector ts
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
    
        # Helper variables used during calculations
        lagvec = []
        tau = []
        # Create the range of lag values
        lags = range(2, 100)
    
        #  Step through the different lags
        for lag in lags:
            #  produce value difference with lag
            pdiff = np.subtract(ts[lag:],ts[:-lag])
            #  Write the different lags into a vector
            lagvec.append(lag)
            #  Calculate the square root of the standard deviation of the difference vector
            tau.append(np.sqrt(np.std(pdiff)))
    
        #  linear fit to double-log graph
        m = np.polyfit(np.log10(np.asarray(lagvec)),
                       np.log10(np.asarray(tau).clip(min=0.0000000001)),
                       1)
        # return the calculated hurst exponent
        return m[0]*2.0

    source: https://medium.com/bluekiri/simple-stationarity-tests-on-time-series-ad227e2e6d48

    We interpret the results by the following rules:

    H = 0.5 indicates a geometric random walk; for a mean-reverting series, H < 0.5; and, finally, for a trending series, H > 0.5. H is also an indicator of the degree of mean reversion or trendiness: as H decreases towards 0, the series is more mean reverting, and as it increases towards 1, it is more trending.
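
    These rules can be checked on synthetic data. The sketch below uses a compact variant of the estimator (my own assumption, not Corrius's code): it fits the standard deviation of the lagged differences directly against the lag on a log-log scale, so the slope itself is H. White noise stands in for a strongly mean-reverting series.

```python
import numpy as np


def hurst_exponent(ts, max_lag=100):
    """Estimate H from how the spread of lagged differences grows with the lag."""
    ts = np.asarray(ts, dtype=float)
    lags = np.arange(2, max_lag)
    # Standard deviation of the lagged differences scales roughly like lag**H
    spread = np.array([np.std(ts[lag:] - ts[:-lag]) for lag in lags])
    slope, _intercept = np.polyfit(np.log(lags), np.log(spread), 1)
    return slope


rng = np.random.default_rng(7)
steps = rng.normal(size=10_000)

h_random_walk = hurst_exponent(np.cumsum(steps))  # expected near 0.5
h_white_noise = hurst_exponent(steps)             # expected near 0

print(f"random walk: H = {h_random_walk:.2f}")
print(f"white noise: H = {h_white_noise:.2f}")
```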

    To make sure it is not a random walk we can test the statistical significance of the H value with the Variance Ratio Test:

    import numpy as np
    
    def variance_ratio(ts, lag = 2):
        """
        Returns the variance ratio test result
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
        
        # Apply the formula to calculate the test
        n = len(ts)
        mu = sum(ts[1:n] - ts[:n-1]) / n
        m = (n - lag + 1) * (1 - lag / n)
        b = sum(np.square(ts[1:n] - ts[:n-1] - mu)) / (n - 1)
        t = sum(np.square(ts[lag:n] - ts[:n-lag] - lag * mu)) / m
        return t / (lag * b)
    
    #Source: Corrius (2018)

    The test divides the variance of the lag-period differences by lag times the variance of the one-period differences. For a random walk the variance grows linearly with the lag, so the ratio is close to one. A ratio well below one points to mean reversion, and a ratio well above one points to trending behaviour.
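
    As a quick illustration on synthetic data, the same formula can be applied to a random walk and to white noise. This is a sketch under the assumption that white noise is an acceptable stand-in for a mean-reverting series; the function is re-stated with NumPy sums so the block runs on its own.

```python
import numpy as np


def variance_ratio(ts, lag=2):
    """Variance of lag-period differences over lag times the one-period variance."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts)
    one_step = ts[1:] - ts[:-1]
    mu = one_step.sum() / n                       # average one-period change
    m = (n - lag + 1) * (1 - lag / n)             # scaling for overlapping windows
    b = np.square(one_step - mu).sum() / (n - 1)  # one-period variance
    t = np.square(ts[lag:] - ts[:-lag] - lag * mu).sum() / m
    return t / (lag * b)


rng = np.random.default_rng(3)
steps = rng.normal(size=10_000)

vr_random_walk = variance_ratio(np.cumsum(steps))  # expected close to 1
vr_white_noise = variance_ratio(steps)             # expected close to 0.5

print(f"random walk: VR = {vr_random_walk:.3f}")
print(f"white noise: VR = {vr_white_noise:.3f}")
```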

    So how long will it take for the time series to mean revert, to diffuse back to the mean? This is seen in measuring the ‘half-life’ of the mean reversion.

    import numpy as np
    
    def half_life(ts):  
        """ 
        Calculates the half life of a mean reversion
        """
        # make sure we are working with an array, convert if necessary
        ts = np.asarray(ts)
        
        # delta = p(t) - p(t-1)
        delta_ts = np.diff(ts)
        
        # lagged values p(t-1) with a column of ones for the intercept
        lag_ts = np.vstack([ts[:-1], np.ones(len(ts) - 1)]).T
        
        # calculate the slope of the deltas vs the lagged values 
        beta = np.linalg.lstsq(lag_ts, delta_ts, rcond=None)
        
        # the slope is negative for a mean-reverting series,
        # so the half life is -ln(2) / slope
        return -np.log(2) / beta[0][0]
    
    #source: adapted from Corrius (2018)
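
    A way to sanity-check a half-life estimate is to run it on a series whose reversion speed is known. The sketch below assumes an AR(1) process x(t) = φ·x(t-1) + noise, regresses the one-period change on the lagged level, and uses the convention half-life = -ln(2)/β, where β is the (negative) regression slope. The names and the value φ = 0.9 are illustrative assumptions.

```python
import numpy as np


def half_life_sketch(ts):
    """Half-life from the slope of delta x(t) regressed on x(t-1)."""
    ts = np.asarray(ts, dtype=float)
    delta = np.diff(ts)
    # Lagged level plus a column of ones for the intercept
    design = np.column_stack([ts[:-1], np.ones(len(ts) - 1)])
    (beta, _intercept), *_ = np.linalg.lstsq(design, delta, rcond=None)
    return -np.log(2) / beta


rng = np.random.default_rng(1)
phi = 0.9                      # each period keeps 90% of the previous gap
x = np.zeros(20_000)
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.normal()

estimated = half_life_sketch(x)
expected = -np.log(2) / (phi - 1)   # slope implied by the AR(1) coefficient
print(f"estimated half-life: {estimated:.2f} periods, implied by phi: {expected:.2f}")
```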

    So we can see that we can approach mean reversion in programming, namely for fintech, through the following steps:

    1. Test for stationarity using the Augmented Dickey-Fuller test (ADF test)
    2. Confirm by estimating the Hurst exponent (H)
    3. Test against a random walk using the variance ratio test
    4. Estimate the time to mean revert using the half-life

  • Comparative Z-Scores for Stock Prices in Varying Sample Sizes

    A quick study of z-scores in varying samples of NVDA stock prices.

    I used python to generate these plots.

    Animation of the NVDA z-score scatter plot. The x and y axes are in standard deviation units; anything above 3 or below -3 is considered an outlier in data science. The DeepSeek release was nearly an outlier effect on NVDA value. The image shows the last 15 days, 30 days, 45 days, 60 days and 93 days.

    Z-scores for the QQQ tech index ETF, for the same time periods as above for NVDA. In an index ETF such as QQQ one should see a smoother spectrum, as it is less susceptible to volatility.

    First, we see the price chart for the last 93 trading days. Afterwards, we take a look at the 93, 60, 45, 30 and 15 day sample windows, all going backward in time from Feb. 20th, 2025. One major perturbing wave that hit the stock was the release of DeepSeek, which temporarily had a negative impact on NVDA value. The question is whether one can see a correlation with the action of the index ETF for the sector NVDA is in, and whose portfolio it is part of. I examine the spread in the z-scores to see whether it is an indicator of up or down motion of NVDA stock prices in relation to the index.

    Comparing different plots of z-scores for the price of NVDA stock from Oct 4th, 2024 to February 20, 2025.

    NVDA past 93 trading days. Open price: 124.92, close price: 140.11 (gained 12.2%)

    QQQ ETF, whose portfolio includes NVDA, past 93 trading days. Open price: 487.32, close price: 537.23 (gained 10.2%)

    Z-Scores:

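    The z-score is calculated as:

    $$z = \frac{x - \mu}{\sigma}$$

    where x is a price in the sample window, μ is the mean of the window and σ is its standard deviation.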
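
    A minimal sketch of how z-scores per sample window and the "silo" counts might be computed. The closing prices are made up, the window sizes are shortened to fit the toy data, and the use of the population standard deviation (NumPy's default) is an assumption; the original plots may have used a different convention.

```python
import numpy as np


def zscores(prices):
    """Standardise a window of prices: subtract the mean, divide by the std."""
    prices = np.asarray(prices, dtype=float)
    return (prices - prices.mean()) / prices.std()


def silo_counts(z):
    """Count z-scores falling into one-standard-deviation wide bands."""
    edges = [-np.inf, -3, -2, -1, 0, 1, 2, 3, np.inf]
    counts = {}
    for low, high in zip(edges[:-1], edges[1:]):
        counts[f"{low} to {high}"] = int(np.sum((z >= low) & (z < high)))
    return counts


# Hypothetical closing prices, oldest first
closes = [131.2, 133.5, 129.8, 118.4, 124.9, 128.7, 135.3, 138.9, 139.4, 140.1]

for window in (5, 10):
    z = zscores(closes[-window:])      # look back `window` days from the last bar
    print(f"last {window} days: max z = {z.max():.2f}, min z = {z.min():.2f}")
    print(silo_counts(z))

check = zscores([1, 2, 3, 4, 5])
```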

    93 Days:

    NVDA past 93 trading days zscores

    QQQ past 93 trading days

    I use the terms "prices" as code for NVDA and "trends" for QQQ, the index ETF for tech stocks. 
    
    Some Data for Z-scores:
    
    Shape of Z-score plots: 
    prices max/min, trends max/min:  1.7594241527219399 -2.9115388106292883 1.7722687913449942 -2.0281115555314564
    
    
    NVDA length and mean of positive and negative:
    prices positive len:  51 0.7123723838425315
    prices negative len:  43 -0.8449067808364951
    
    prices positive list:  [0.14019500216746938, 0.13021735775384247, 0.9441080663512041, 0.9270035330706985, 0.35257629040040717, 0.4737334011373164, 0.6348010895287374, 0.4894125566444479, 0.5934651341008516, 0.3205052904994612, 0.40246451246854575, 1.2149298432925333, 1.6810283751862907, 1.502856153514364, 1.1650416212243906, 1.5969310865571407, 1.30900477633531, 1.3788482872307024, 0.6975177115572552, 0.43667357902955695, 1.4144827315650879, 1.254840420947041, 1.3660198872703233, 0.6932415782371287, 0.16585180208822778, 0.2200161574764928, 0.4523527345366844, 1.147937087943885, 1.136534065756884, 0.7630850891325253, 0.2456729573972512, 0.3169418460660218, 0.03614242471106574, 0.36825544590753456, 0.44665122344318386, 0.40531526801529805, 0.057523091311697735, 0.17440406872848058, 1.052436777127734, 1.7594241527219399, 0.4352482012561788, 0.43097206793605647, 0.08888140232595665, 0.5335992676190859, 1.4230349982053405, 1.4444156648059727, 0.7887418890532837, 0.25137446849075173, 0.32977024602640104, 0.30553882387901676, 0.43097206793605647]
    
    prices negative list:  [-1.6826224330881445, -2.0281115555314564, -1.5515748349199896, -1.2967600607041379, -1.3338240278628077, -1.2828610730196344, -1.0088524586680414, -1.4489870686772424, -1.4450159293388134, -1.4225128064210513, -1.2093949952586989, -1.146518622400242, -1.1107783683543844, -1.6137893512220423, -1.3516941548857364, -1.1531371879642918, -1.1478423355130543, -0.8327986146643587, -1.082980392985381, -1.9122866581606137, -1.6753420109676906, -1.7693256419771755, -1.3589745770061903, -0.47407236109295114, -0.032614037970937836, -0.26823497205105246, -1.0704051184136898, -0.8420646064540289, -0.6143859510507724, -0.6335797911865096, -0.5137837544772378, -0.4601733734084476, -0.4072248488960653, -0.22719986555395358, -0.49260434467228414, -0.19874003362854603, -0.10012340672422786, -0.16630906236470946, -0.36751345551177483, -0.4753960742057596, -0.507165188913192]
    
    
    Trends (QQQ) length and mean for positive/negative:
    trends positive len:  53 0.7493882152106839
    trends negative len:  41 -0.9687213513699346
    
    trends positive list:  [0.05276545780528101, 0.09247685118957147, 0.07262115449743001, 0.011068494751777909, 0.1685903551761238, 0.2731636910880874, 0.6960900306307639, 0.6001208299520626, 0.9092078417931164, 0.6378466536671404, 0.520036186627085, 1.1368864971963693, 0.9105315549059249, 1.1772597471370638, 1.6829181562303424, 1.5260581523624006, 0.2466894288318963, 0.09446242085878412, 0.39163601468454706, 0.670277624930977, 1.1395339234219937, 1.1157070873914194, 0.6497600716824238, 0.18976976498108122, 0.3863411622333134, 0.7808076698505786, 0.16130993305566993, 0.16726664206331537, 0.2619121296292083, 0.022320056210664595, 0.595487834057233, 0.7980159403171032, 1.2421216896647371, 1.316911480538481, 1.117692657060632, 0.0971098470844085, 0.6001208299520626, 0.535258887424397, 0.6821910429462604, 0.631889944659495, 0.3552339040822852, 0.776174673955749, 0.9336965343800949, 1.1157070873914194, 0.6735869077129981, 1.0925421079172493, 1.009148181810243, 1.0296657350587888, 1.5326767179264504, 1.681594443117534, 1.7623409429989234, 1.7722687913449942, 1.6207036399282937]
    
    trends negative list:  [-1.6826224330881445, -2.0281115555314564, -1.5515748349199896, -1.2967600607041379, -1.3338240278628077, -1.2828610730196344, -1.0088524586680414, -1.4489870686772424, -1.4450159293388134, -1.4225128064210513, -1.2093949952586989, -1.146518622400242, -1.1107783683543844, -1.6137893512220423, -1.3516941548857364, -1.1531371879642918, -1.1478423355130543, -0.8327986146643587, -1.082980392985381, -1.9122866581606137, -1.6753420109676906, -1.7693256419771755, -1.3589745770061903, -0.47407236109295114, -0.032614037970937836, -0.26823497205105246, -1.0704051184136898, -0.8420646064540289, -0.6143859510507724, -0.6335797911865096, -0.5137837544772378, -0.4601733734084476, -0.4072248488960653, -0.22719986555395358, -0.49260434467228414, -0.19874003362854603, -0.10012340672422786, -0.16630906236470946, -0.36751345551177483, -0.4753960742057596, -0.507165188913192]
    
    zscore silos: 
    prices 0 to 1:  35 0.4056562939002893
    prices 1 to 2:  16 1.3722767337624457
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  31 -0.4417795985574785
    prices -1 to -2:  8 -1.4510292935080256
    prices -2 to -3:  4 -2.7127490308408726
    prices < -3:  0 nan
    
    trends 0 to 1:  35 -0.27555447552077394
    trends 1 to 2:  17 1.344595325715943
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  20 -0.41098618215387406
    trends -1 to -2:  21 -1.3951694395818528
    trends -2 to -3:  1 -2.0790028405616057
    trends < -3:  0 nan

    60 DAYS

    NVDA past 60 days open: 146.67, close: 140.11

    QQQ past 60 days open: 504.98, close: 537.23

    "prices" is for NVDA, "trends" is for QQQ
    
    some data for 60 day plots:
    
    
    
    shape of plots: 
    prices max/min, trends max/min:  1.9145807927766163 -2.645347442651133 1.8695586413001115 -1.7542554651846929
    
    
    
    prices positive len:  34 0.6792199553321576
    prices negative len:  26 -0.8882107108189714
    
    prices positive list:  [1.5305282468571573, 0.8737427335456234, 0.04858635771143208, 0.1738208835547299, 0.35888968285649864, 0.4117664826570029, 0.6385801239065363, 1.3176295529235456, 1.3064975950708098, 0.9419259753936441, 0.43681338782566403, 0.5063881244052757, 0.23226366228160591, 0.556481934742594, 0.6330141449801685, 0.5926607977639948, 0.18634433613906048, 0.2531360832554902, 0.36723865124605237, 1.2243994059068675, 1.9145807927766163, 0.6218821871274288, 0.617707702932656, 0.03327991566391562, 0.07919924180646105, 0.2837489673505192, 0.7178953236072966, 1.5861880361208474, 1.6070604570947318, 0.9669728805623052, 0.4423793667520319, 0.5189115769896063, 0.49525616655253607, 0.617707702932656]
    
    prices negative list:  [-1.7542554651846929, -1.669273142479056, -1.5853399842512712, -1.2999672462767844, -1.720682201893578, -1.2548531737293467, -0.6725668885240624, -0.506798901024174, -0.11546055078710396, -0.5487654801380664, -0.7900733100429681, -0.318998459489499, -0.6389936252329415, -1.098527666530088, -1.203444114314825, -0.32739177531227037, -0.6841076977803849, -0.6746652174797553, -1.5223901155804267, -1.6934039254695472, -1.7437638204062227, -0.524634697147575, -0.9044322381283226, -0.7858766521315705, -0.0913297677966126, -0.3767025057710995]
    
    trends positive len:  34 0.7207852536147648
    trends negative len:  26 -0.9425653316500864
    
    trends positive list:  [0.1636172003202925, 0.01148835103241942, 0.5014481621871478, 0.07129072626972463, 0.8623607425666361, 0.5035464911428407, 0.926359775715327, 1.7279214367907145, 1.4792694555408878, 0.12269978568424661, 0.8665574004780336, 0.828787479275527, 0.09017568687097201, 0.29791025348475275, 0.004144199687494526, 0.32518852990878333, 1.0291778945443708, 1.147733480541123, 0.8319349727090662, 0.01148835103241942, 0.14158474628549397, 0.06184824596909498, 0.2905661021398278, 0.5402672478675009, 0.828787479275527, 0.12794560807347868, 0.7920667225508667, 0.6598719983420994, 0.6923960971553621, 1.4897611003193638, 1.7258231078350217, 1.8538211741324033, 1.8695586413001115, 1.6292999758730682]
    
    trends negative list:  [-1.7542554651846929, -1.669273142479056, -1.5853399842512712, -1.2999672462767844, -1.720682201893578, -1.2548531737293467, -0.6725668885240624, -0.506798901024174, -0.11546055078710396, -0.5487654801380664, -0.7900733100429681, -0.318998459489499, -0.6389936252329415, -1.098527666530088, -1.203444114314825, -0.32739177531227037, -0.6841076977803849, -0.6746652174797553, -1.5223901155804267, -1.6934039254695472, -1.7437638204062227, -0.524634697147575, -0.9044322381283226, -0.7858766521315705, -0.0913297677966126, -0.3767025057710995]
    
    zscore silos: 
    prices 0 to 1:  26 0.4857168466015017
    prices 1 to 2:  7 1.4753238481642317
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  20 -0.42879927835305054
    prices -1 to -2:  3 -1.5721141625126993
    prices -2 to 3:  4 -2.4158942235473466
    prices <3:  0 nan
    
    trends 0 to 1:  23 -0.06220277723223607
    trends 1 to 2:  9 1.5519738719284009
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  18 -0.4898406138886327
    trends -1 to -2:  10 -1.5547354231811998
    trends -2 to 3:  0 nan
    trends <3:  0 nan

    45 DAYS

    NVDA past 45 days open: 134.25, close: 140.11

    QQQ past 45 days open: 530.53, close: 537.23

    some data for 45 day plot:
    
    
    
    shape of plots: 
    prices max/min, trends max/min:  2.0056164520976707 -2.335406431884398 1.7445748897642444 -2.004426294606925
    
    
    zscore means of prices and trends:  4.46309655899313e-15 -1.0288066694859784e-14
    
    NVDA prices length and mean of positive/negative:
    prices positive len:  24 0.7357314773777098
    prices negative len:  21 -0.8408359741459445
    
    prices positive list:  [0.054341899730996124, 0.712714915702643, 0.7855730965445773, 0.7471569648279229, 0.36034625926711406, 0.4239315807291669, 2.943764882851058e-05, 0.5325565048935021, 1.3485681303231487, 2.0056164520976707, 0.7749755429675673, 0.7710014603761928, 0.2146298975832493, 0.2583448060884106, 0.45307485306593986, 0.8663794425692682, 1.6929886215759211, 1.7128590345328127, 1.1034997038548302, 0.13249885736143355, 0.6040899915383078, 0.6769481723802421, 0.6544283710290971, 0.7710014603761928]
    
    prices negative list:  [-0.7645562745724367, -1.014925110927081, -0.5261615999565064, -0.06787777341171075, -0.10162313831168997, -0.8581724481659111, -1.334961797397784, -1.443817813204148, -0.5348700812210075, -0.9049805349626544, -0.8951834935400782, -1.7747401012554962, -1.9521754070198687, -2.004426294606925, -0.739519390936971, -1.1335781681560086, -0.19088507127290136, -1.010570870294818, -0.18326515016646283, -0.28994404565668896, -0.048283690566570704, -0.1310142625794062, -0.5860324086500015, -0.0624349726213975]
    
    pos/neg length and mean:
    trends positive len:  21 0.8835238047359075
    trends negative len:  24 -0.7730833291439385
    
    trends positive list:  [0.7659593076650326, 1.5976192684256507, 1.3396305109645679, 0.7039113786554122, 0.6647232129651197, 0.11391177298491237, 0.14221433709456596, 0.8726382031552712, 0.9956455010164618, 0.6679888934393077, 0.10629185187847384, 0.3653691694976192, 0.6647232129651197, 0.6266236074328899, 0.4894650275168725, 0.5232103924168393, 1.3505161125452068, 1.5954421481095253, 1.7282464873932923, 1.7445748897642444, 1.4952946135676752]
    
    trends negative list:  [-0.7645562745724367, -1.014925110927081, -0.5261615999565064, -0.06787777341171075, -0.10162313831168997, -0.8581724481659111, -1.334961797397784, -1.443817813204148, -0.5348700812210075, -0.9049805349626544, -0.8951834935400782, -1.7747401012554962, -1.9521754070198687, -2.004426294606925, -0.739519390936971, -1.1335781681560086, -0.19088507127290136, -1.010570870294818, -0.18326515016646283, -0.28994404565668896, -0.048283690566570704, -0.1310142625794062, -0.5860324086500015, -0.0624349726213975]
    
    zscore silos: 
    prices 0 to 1:  19 0.5159193883045129
    prices 1 to 2:  4 1.4639440396810512
    prices 2 to 3:  1 2.005079787550808
    prices 3>:  0 nan
    prices 0 to -1:  14 -0.38097024812529595
    prices -1 to -2:  4 -1.454678571075471
    prices -2 to 3:  3 -2.1703421886682137
    prices <3:  0 nan
    
    trends 0 to 1:  13 0.18729980332171914
    trends 1 to 2:  8 1.5002054504426652
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  16 -0.42194589789701936
    trends -1 to -2:  7 -1.3777869923741477
    trends -2 to 3:  1 -2.0051154449596136
    trends <3:  0 nan

    30 DAYS

    NVDA past 30 days, during which it took a major dive with the release of DeepSeek. open: 140.14, close: 140.11

    QQQ past 30 days open: 515.18, close: 537.23

    Some Data for 30 day plot
    
    shape of zscore plot
    prices max/min, trends max/min:  1.7395624418627935 -2.0102455727613866 1.635662150597431 -1.980244163451975
    
    beats_condition, max, min:  False True False
    avg zscores list:  [-0.04900609322786642, -0.043237954955050006, -1.4069222487518587, -1.9069027691119136, -2.1376722944324356, -0.3679613081523311, -1.0756478274354047, 0.34156933192037153, 1.0456770404126563, 2.5158374261919283, 2.6528832614306106, 1.7724244766420925, -2.8159600623456074, -0.7210526618453241, -1.4730439252605447, -1.1233948714886564, -1.765168780240289, -2.622450849103493, -1.7105258315300362, -0.7023407103306647, 0.0587929352193316, -0.500214456769451, 0.6220642834292512, 0.39529360882055975, 0.2241537498518348, 1.531306979740637, 2.2043609778904045, 2.3999373023930133, 2.3948264891602498, 2.2623747818778503]
    zscore differential avg:  -4.4704980458239636e-15
    zscore means of prices and trends:  -7.919590908992784e-16 -3.796962744218036e-15
    
    trend count and mean of pos or neg:
    prices positive len:  17 0.7172287353811013
    prices negative len:  13 -0.9379145001137494
    
    prices positive list:  [0.8708241976370331, 0.8671431033818425, 0.35178990765469526, 0.022945487524039902, 0.3922819444618296, 0.06466455574957113, 0.5726555629663304, 0.9554893655064959, 1.7211569705868235, 1.7395624418627935, 1.1751279893997306, 0.06466455574957113, 0.2757139597140209, 0.7125371446636966, 0.780023872675586, 0.7591643385628187, 0.8671431033818425]
    
    prices negative list:  [-0.9198302908648995, -0.9103810583368925, -1.758712156406554, -1.9298482566359534, -1.980244163451975, -0.7602432526141607, -1.1403123831849757, -0.23108623104595882, -1.0216720192222633, -0.22373682796862931, -0.326628471051326, -0.09354740202724439, -0.1733409211526078, -0.6122052763421065, -0.10719629345658256]
    
    trend count and mean of pos or neg:
    trends positive len:  15 0.8125990002508011
    trends negative len:  15 -0.8125990002508086
    
    trends positive list:  [0.09018767490616038, 0.7946804556051048, 0.9133208195678171, 0.5972964872423618, 0.055540488970154546, 0.30542019359958456, 0.5941467430663635, 0.5573997276796802, 0.42511047228762966, 0.45765782877296995, 1.2555930200266159, 1.4918238332267078, 1.6199134297174271, 1.635662150597431, 1.395231678496008]
    
    trends negative list:  [-0.9198302908648995, -0.9103810583368925, -1.758712156406554, -1.9298482566359534, -1.980244163451975, -0.7602432526141607, -1.1403123831849757, -0.23108623104595882, -1.0216720192222633, -0.22373682796862931, -0.326628471051326, -0.09354740202724439, -0.1733409211526078, -0.6122052763421065, -0.10719629345658256]
    
    zscore silos: 
    prices 0 to 1:  14 0.5195964415593288
    prices 1 to 2:  3 1.5885045560705986
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  7 -0.38925651381809206
    prices -1 to -2:  5 -1.4606024370377964
    prices -2 to 3:  1 -2.0120560681267605
    prices <3:  0 nan
    
    trends 0 to 1:  11 0.23112599712881077
    trends 1 to 2:  5 1.4620954964259174
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  9 -0.4261916020115731
    trends -1 to -2:  4 -1.522227807745561
    trends -2 to 3:  1 -2.0472318717466145
    trends <3:  0 nan

    15 DAYS

    NVDA past 15 days, during which the price regained momentum and climbed back up. open: 124.65, min on day 81 at 116.64, close: 140.11

    QQQ past 15 days open: 523.05, close: 537.23.

    Some Data for 15 day plot
    
    shape of z-score: 
    prices max/min, trends max/min:  1.2978590301422221 -1.7892515585278195 1.4875489718437718 -1.7015455028399435
    
    
    prices positive len:  8 0.7961212804164766
    prices negative len:  7 -0.9098528919045411
    prices positive list:  [0.4368908744960763, 0.3355230641218386, 0.11698986253581116, 0.6633228665008721, 1.1319844313480085, 1.2043900101867528, 1.1820101040002302, 1.2978590301422221]
    
    prices negative list:  [-0.9657152036835966, -1.0789198650922653, -1.7015455028399435, -0.7542012310515969, -0.3996918966402359, -0.9850791589245527, -0.04220349219180732, -0.22988490452723273, -0.18370931895265175]
    
    trends positive len:  6 1.056825095650698
    trends negative len:  9 -0.7045500637670981
    trends positive list:  [0.009930233456925725, 0.9483372951340528, 1.2834826743044578, 1.465205946565748, 1.4875489718437718, 1.1464454525992316]
    
    trends negative list:  [-0.9657152036835966, -1.0789198650922653, -1.7015455028399435, -0.7542012310515969, -0.3996918966402359, -0.9850791589245527, -0.04220349219180732, -0.22988490452723273, -0.18370931895265175]
    
    zscore silos: 
    prices 0 to 1:  5 0.3387271345868971
    prices 1 to 2:  4 1.13156977020328
    prices 2 to 3:  0 nan
    prices 3>:  0 nan
    prices 0 to -1:  3 -0.4156379102954067
    prices -1 to -2:  3 -1.6576670076204636
    prices -2 to 3:  0 nan
    prices <3:  0 nan
    
    trends 0 to 1:  1 0.584379796953797
    trends 1 to 2:  4 1.3465306598267066
    trends 2 to 3:  0 nan
    trends 3>:  0 nan
    trends 0 to -1:  7 -0.3399223000040687
    trends -1 to -2:  3 -1.315505544516218
    trends -2 to 3:  0 nan
    trends <3:  0 nan

    The above charts and data are generated in the following code snippets.

    This code snippet gets data into Pandas Dataframes from the Alpaca API.

    import numpy as np
    import pandas as pd
    
    ############### INIT CEILLI CLASSES ####################     
    
    from classes.stock_list import StockList
    from classes.config import Config
    from classes.alpaca import Alpaca
    from classes.utilities import Utilities
    from classes.market_beat import MarketBeat
    from classes.profit_loss import ProfitLoss
    from classes.plots import Plots
    
    util = Utilities(pd.DataFrame())
    conf = Config(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    mb = MarketBeat(pd.DataFrame(), api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    alpa = Alpaca(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    stocks = StockList()
    plots = Plots(pd.DataFrame())
    
    ############## SETTINGS ###################
    # CONSTANTS: see settings.toml for conflicts; values set here override the settings.toml constants
    ALGO_VERSION = conf.algo_version
    BASE_CURRENCY = conf.base_currency
    
    ############# LOGGING #################
    import logging
    logging.basicConfig(
        filename="logs/charts_"+ALGO_VERSION+".log",
        level=logging.INFO,
        format="%(asctime)s:%(levelname)s:%(message)s"
        )
    
    alpa = Alpaca(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
    
    ############################### CONFIGS ###################################################
    # API Credentials alpaca4 edge 
    API_KEY = conf.api_key
    API_SECRET = conf.api_secret
    API_BASE_URL = conf.api_base_url
    SECRET_KEY = API_SECRET
    
    #CONSTANTS
    TIMEZONE_OFFSET = -4.0 #set in config file, this is deprecated, i think
    
    if DEBUG:
        PROCESS_ROWS = 0  #set to low number for debugging, otherwise 1000
    else:
        PROCESS_ROWS = 1000
    
    
    
    ########################### DRIVER #######################################
    date = DATE
    
    from datetime import date, datetime, timedelta, timezone
    
    N_DAYS_AGO = 500
    YESTERDAY = 1
    #today = datetime.now()
    today = date.today()    
    n_days_ago = today - timedelta(days=N_DAYS_AGO)
    one_day_ago = today - timedelta(days=YESTERDAY)
    
    today = date.today()
    timezone_offset = -4 # US Eastern daylight time (EDT) is UTC-4; standard time (EST) is UTC-5
    tzinfo = timezone(timedelta(hours=timezone_offset))
    now = datetime.now(tzinfo)
    back_time = now - timedelta(minutes=15)
    date = back_time.strftime("%Y-%m-%d %H:%M:%S")
    start_time = now - timedelta(minutes=45)
    start = start_time.strftime("%Y-%m-%d %H:%M:%S")
    end = date
    
    beg_date = str(n_days_ago) + ' 00:00:00'
    end_date = str(one_day_ago) + ' 23:59:00'
    
    
    
    
    
    if MODE == 'SCREENER' or MODE == 'HISTORICAL':
        try:
    
            #STOCK_LIST = stocks.TECH_AL
            STOCK_LIST = ['NVDA', 'MSFT']
            
    
            
            STOCK_SET = set(STOCK_LIST) #remove duplicates from list
            STOCK_LIST = list(STOCK_SET)
            STOCK_LIST = sorted(STOCK_LIST)
            symbol_list = STOCK_LIST
    
            
            index_symbol = stocks.stock_index(ALGO_VERSION)
            
    
            cnt = 0
            for symbol in symbol_list:
                print(ALGO_VERSION)
                print(symbol)
                print(index_symbol)
                
    
                hundred_dates = alpa.get_calendar(str(n_days_ago), str(one_day_ago))
    
    
                #get prices for symbol in trading list
                symbol_price_data = alpa.stockbars_by_symbol_by_day(symbol, beg_date, end_date)
                symbol_price_data = symbol_price_data.reset_index(level=("symbol", "timestamp"))
                prices_data = symbol_price_data
                #get prices for trend index for symbol above
                index_price_data = alpa.stockbars_by_symbol_by_day(index_symbol, beg_date, end_date)
                index_price_data = index_price_data.reset_index(level=("symbol", "timestamp")) #alpaca dataframe return has an index of symbol, timestamp format
                column_names = index_price_data.columns
                trends_data = index_price_data
                symbol_prices = symbol_price_data
    
    
                column_names = prices_data.columns
                print(column_names)
                prices_data = symbol_prices[['timestamp', 'symbol', 'open', 'close', 'vwap']].copy()
                trends_data = trends_data[['timestamp', 'symbol', 'open', 'close', 'vwap']].copy()
                prices_data.rename(columns = {'timestamp':'date'}, inplace = True)
                trends_data.rename(columns = {'timestamp':'date'}, inplace = True)
    
                #prices_data = prices_data.reset_index()
                #trends_data = trends_data.reset_index()
                date_stamp = prices_data.iloc[0]['date']
        
    
    
                print()
                print()
                print("Statistical Analysis: ")
    
    
                prices_arr = np.array(prices_data['close'])
    
                from scipy.stats import skew, kurtosis 
                # Calculate the skewness 
                print("Symbol Prices skew: ")
                print(skew(prices_data['close'], axis=0, bias=True))
                print("Index skew: ")
                print(skew(trends_data['close'], axis=0, bias=True))
    
                # Calculate the kurtosis 
                print("Symbol Prices kurtosis: ")
                print(kurtosis(prices_data['close'], axis=0, bias=True))
                print("Index kurtosis: ")
                print(kurtosis(trends_data['close'], axis=0, bias=True))
    
                print()
                print("Covariance between the two: ")
                cov_matrix = np.stack((prices_data['close'], trends_data['close']), axis = 0) 
                print(np.cov(cov_matrix))
    
                print()
                print("Correlation between the two: ")
                correlations = np.corrcoef(prices_data['close'], trends_data['close'])  # 2x2 Pearson correlation matrix
                print(correlations)
    
                print()
                print()
                
                print("Mean of the Symbol: ")
                data_mean = np.mean(prices_data['close'])
                data_max = max(prices_data['close'])
                data_min = min(prices_data['close'])
                print("mean is: " + str(data_mean))
                print("max/min is: "+str(max(prices_data['close'])), str(min(prices_data['close'])))
                
                print()
                print()
    
                print("Variance of the Symbol: ")      
                m = sum(prices_data['close']) / len(prices_data['close'])
                std_dev = np.std(prices_data['close'])
                print("std dev: "+str(std_dev))
    
                import scipy.stats as scipy
                zscore_list = scipy.zscore(prices_data['close'])
                print("symbol z-scores list: ")
                print(zscore_list)
                trends_zscore_list = scipy.zscore(trends_data['close'])
                print("trends z-scores list: ")
                print(trends_zscore_list)
    
                
                import statistics
                # Calculate the variance from a sample of data
                data_variance = statistics.variance(prices_data['close'])
                print("variance result: "+str(data_variance))
    
                print()
                print()
    
                print("Market Beat Metrics: ")
                #prices = prices_data.iloc[:lookback_period]
                vars, vibe_check = mb.compare_rates(trends_data, prices_data)
                vars15, vibe_check15 = mb.compare_rates(trends_data[-15:], prices_data[-15:])
                vars30, vibe_check30 = mb.compare_rates(trends_data[-30:], prices_data[-30:])
                vars45, vibe_check45 = mb.compare_rates(trends_data[-45:], prices_data[-45:])
                vars60, vibe_check60 = mb.compare_rates(trends_data[-60:], prices_data[-60:])
    
                print(vars)
    
    

    The z-scores are put into silos based on standard deviation in a Market Beat class function. The snippet below, taken from that function, appends each z-score to a list for its silo, or bin. I included the logic here because being able to sort values into bins like this can be useful:

                 
                    if current_idx_z >= 0:
                        trends_positive.append(current_idx_z)
                    else:
                        trends_negative.append(current_idx_z)
    
                    if current_price_z >= 0:
                        prices_positive.append(current_price_z)
                    else:
                        prices_negative.append(current_price_z)
    
    
    
                    if current_price_z > 0 and current_price_z < 1:
                          prices_0to1.append(current_price_z)
                    elif current_price_z >= 1 and current_price_z < 2:
                          prices_1to2.append(current_price_z)
                    elif current_price_z >= 2 and current_price_z < 3:
                          prices_2to3.append(current_price_z)
                    elif current_price_z >= 3:
                          prices_3up.append(current_price_z)
                    elif current_price_z <= 0 and current_price_z > -1:
                          prices_0toneg1.append(current_price_z)
                    elif current_price_z <= -1 and current_price_z > -2:
                          prices_neg1toneg2.append(current_price_z)
                    elif current_price_z <= -2 and current_price_z > -3:
                          prices_neg2toneg3.append(current_price_z)
                    elif current_price_z <= -3:
                          prices_neg3.append(current_price_z)
    
    
                    if current_idx_z >= 0 and current_idx_z < 1:
                          trends_0to1.append(current_idx_z)  # bin the index z-score, not the price z-score
                    elif current_idx_z >= 1 and current_idx_z < 2:
                          trends_1to2.append(current_idx_z)
                    elif current_idx_z >= 2 and current_idx_z < 3:
                          trends_2to3.append(current_idx_z)
                    elif current_idx_z >= 3:
                          trends_3up.append(current_idx_z)
                    elif current_idx_z < 0 and current_idx_z > -1:
                          trends_0toneg1.append(current_idx_z)
                    elif current_idx_z <= -1 and current_idx_z > -2:
                          trends_neg1toneg2.append(current_idx_z)
                    elif current_idx_z <= -2 and current_idx_z > -3:
                          trends_neg2toneg3.append(current_idx_z)
                    elif current_idx_z <= -3:
                          trends_neg3.append(current_idx_z)
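
    The same idea can be written more compactly with a sorted list of bin edges, which avoids long `if`/`elif` chains where a sign is easy to get wrong. This is only a sketch using the standard library: the names `EDGES`, `LABELS` and `silo_zscores` and the sample values are hypothetical, and the bins here are left-closed (a z-score of exactly 1 lands in "1 to 2").

```python
from bisect import bisect_right
from statistics import fmean

# Bin edges in standard deviations; each silo is left-closed, e.g. [0, 1).
EDGES = [-3, -2, -1, 0, 1, 2, 3]
LABELS = ["< -3", "-3 to -2", "-2 to -1", "-1 to 0",
          "0 to 1", "1 to 2", "2 to 3", ">= 3"]

def silo_zscores(zscores):
    """Group z-scores into one-standard-deviation silos keyed by label."""
    silos = {label: [] for label in LABELS}
    for z in zscores:
        # bisect_right counts how many edges are <= z, which is the silo index.
        silos[LABELS[bisect_right(EDGES, z)]].append(z)
    return silos

# Hypothetical z-scores, not taken from the NVDA or QQQ runs above.
sample = [0.44, 0.34, 1.13, 1.30, -0.97, -1.08, -1.70, -0.04, 2.10, -3.20]

for label, values in silo_zscores(sample).items():
    mean = fmean(values) if values else float("nan")
    print(f"{label}: {len(values)} {mean}")
```

    Each printed line gives the silo, how many z-scores fell into it, and their mean, the same three pieces of information shown in the "zscore silos" listings earlier.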
                                  

    The graphing part is handled in a Plots Class that is called by this code:

    
                path_15_index = 'plots/stats/zscores/scatter/'+str(today)+'_'+index_symbol+'_15.png'
                print(path_15_index)
                isFile = os.path.isfile(path_15_index)
                if not isFile:
                    symbol_zscores_plot = plots.zscores_scatter_by_day(today, index_symbol, trends_data['zscores'][-15:], '15')
                else:
                    print(index_symbol + ' zscores scatter plot file exists for this date')
                

    Then, in the Plots class, I generate the plots:
    
    
    
        def zscores_scatter_by_day(self, plot_date, symbol, data, periodicity='all'):
                plot_date = str(plot_date)
                zscores = data.reset_index(drop = True)
        
                #zscores = zscores.tolist()
                print(zscores)
                # PLOTTING
                import matplotlib.pyplot as plt
                zscores_set = set(zscores) #remove duplicates from list
                zscores_list = list(zscores_set)
                zscores_list = sorted(zscores_list)
                print("zscores sorted and unique: ", zscores_list)
                import seaborn as sns
                sns.displot(zscores_list, color="maroon")
                plt.xlabel("zscore", labelpad=14)
                plt.ylabel("probability of occurrence", labelpad=14)
                plt.title("Percent Ratio Z-scores distribution " + plot_date, y=1.015, fontsize=10)
                #plt.show()
                plt.savefig('plots/stats/zscores/'+symbol+'_'+str(plot_date)+'_'+periodicity+'.png',bbox_inches='tight')
                plt.clf()
    
                import matplotlib.pyplot as plt2
    
                x_cnt = 0
                color = 'grey'
                # https://matplotlib.org/stable/gallery/color/named_colors.html
                for i in zscores:
                    color = 'grey'  # default for values outside the bands below
                    if i < 0 and i > -1:
                        color = 'orange'
                    elif i <= -1 and i > -2:
                        color = 'indianred'
                    elif i <= -2 and i > -3:
                        color = 'firebrick'
                    elif i <= -3 and i > -4:
                        color = 'maroon'
    
                    elif i > 0 and i < 1:
                        color = 'yellow'
                    elif i >= 1 and i < 2:
                        color = 'green'
                    elif i >= 2 and i < 3:
                        color = 'forestgreen'
                    elif i >= 3 and i < 4:
                        color = 'darkgreen'
                    elif i >= 4 and i < 5:
                        color = 'darkolivegreen'
                    elif i >= 5:
                        color = 'black'
    
                    print(zscores)
    
                    plt2.scatter(i, zscores[x_cnt], c=color) 
                    x_cnt += 1
                # depict first scatter plot
                #plt.scatter(x, y, c='blue')
                print('plots/stats/zscores/scatter/'+str(plot_date)+'_'+symbol+'_'+periodicity+'.png')
                plt2.savefig('plots/stats/zscores/scatter/'+str(plot_date)+'_'+symbol+'_'+periodicity+'.png',bbox_inches='tight')
                plt2.clf()
                # depict illustration
                #plt.show()
    
    

    This function writes the plots to a directory for safekeeping and later reference. The first part of the function generates the z-score bar charts, and the second part generates the rainbow-spectrum scatter charts of z-scores. You have to pass in the z-scores to be plotted (a column of a pandas DataFrame), plus the other arguments, which are self-explanatory. You’ll need to include these libraries in your own code for this function to work.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
    from plotly.offline import iplot, init_notebook_mode
    
    import seaborn as sns
    
        def zscores_scatter_by_day(self, plot_date, symbol, data, periodicity='all'):
    ...
    ...
    ...
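
    The color bands used in the scatter loop can also be expressed as a small lookup, which is easier to test on its own before any plotting happens. This is a sketch rather than part of the Plots class: the function name `zscore_color` is hypothetical, and the grey fallback for zero, NaN and z-scores below -4 is an assumption.

```python
import math

# Band colors copied from the scatter loop above, one per standard deviation.
NEGATIVE_COLORS = ["orange", "indianred", "firebrick", "maroon"]
POSITIVE_COLORS = ["yellow", "green", "forestgreen", "darkgreen", "darkolivegreen"]

def zscore_color(z: float) -> str:
    """Return the named matplotlib color for a z-score's band."""
    if math.isnan(z) or z == 0:
        return "grey"  # assumed fallback: no band applies
    band = math.floor(abs(z))  # 0 for |z| in [0, 1), 1 for [1, 2), ...
    if z > 0:
        # Positive z-scores beyond the listed bands are drawn in black.
        return POSITIVE_COLORS[band] if band < len(POSITIVE_COLORS) else "black"
    # Negative z-scores beyond the listed bands fall back to grey (assumption).
    return NEGATIVE_COLORS[band] if band < len(NEGATIVE_COLORS) else "grey"

for z in (0.4, 1.3, -0.9, -2.7, 6.0):
    print(z, zscore_color(z))
```

    The returned string can be passed straight to the `c` argument of `plt.scatter`, as the loop in the function does with its `color` variable.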

    I hope this can provide some insights to others on how to plot z-scores and work with pandas.

  • Statistical Awareness for Data Science

    One of the most fundamental requirements for working with machines is an understanding of statistics, which is the basis of most contemporary methods in computer science. Yet it can be an impenetrable subject: piercing through its fuzzy layers of maths takes real commitment, which is enough to turn the typical person off the subject. This is not an attempt to make the subject easier to understand, but an abbreviated attempt to get to the meat of the matter.

    Statistics usually begins with a discussion of probability and its two concepts: statistical probability and inductive probability. The first is based on quantitative understanding and the second on qualitative understanding. The first is the one we deal with in ML and software engineering. Statistical probability rests on two main laws: addition and multiplication.

    A. The Law of Addition

    “The law of addition of probabilities states that if A and B are mutually exclusive events, that is if they cannot both occur together, then the probability that either A or B will occur is equal to the sum of their separate probabilities: in symbols: P(A or B) = P(A) + P(B)

    This follows from the fact, that if A and B are mutually exclusive, the number of times on which either A or B has occurred is the number of times on which A has occurred plus the number of times on which B has occurred; the same must therefore be true of the corresponding proportions and so, as the number of observations increases, of their limiting values or probabilities. This law can be extended to any number of events, provided they are all mutually exclusive.” (Bulmer, 1979, 12-3)

    One way to understand this as a coder is to actually code these formulas in a theoretical example. The code in this work is written in Python; see my book PlayAI: Machine Learning in Video Game Design for an introduction to setting up a Python development environment if you are not familiar with the language. We will use DataFrames to work with statistical data, which is what you will use in the real world for machine learning, although you can also use the R language.
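
    Before the DataFrame example, the mutually exclusive case quoted above can be checked in a few lines of plain Python. The die-roll events here are an illustrative assumption, not taken from Bulmer; exact fractions are used so the two sides can be compared without rounding.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
outcomes = range(1, 7)

def prob(event):
    """Probability of an event (a set of faces) under equally likely outcomes."""
    return Fraction(sum(1 for face in outcomes if face in event), len(outcomes))

a = {1, 2}   # event A: roll a 1 or a 2
b = {6}      # event B: roll a 6

assert not (a & b)  # A and B share no outcomes, so they are mutually exclusive
print(prob(a | b), prob(a) + prob(b))  # 1/2 1/2
```

    If the two events overlapped, the plain sum would count the shared outcomes twice, which is what the general rule in the next example corrects for.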

    Python example:

    In this example we are working with the second rule of addition, the general form that also covers events that are not mutually exclusive:

    # P(A or B) = P(A) + P(B) – P(A and B)

    In a small survey, 32 people responded to the question “Is a hotdog a sandwich?”.

    • 50% of the respondents were female and the rest were male.
    • 11 people responded with ‘yes’ and 21 responded with ‘no’.
    • Of the female participants who took the survey, 5 responded with ‘yes’.

    Below is a DataFrame representing all the responses to the survey:

    import pandas as pd
    
    # creating a Dataframe named 'survey' with 'gender' and 'response' columns
    survey = pd.DataFrame([{'gender': 'female', 'response': 'no'}, {'gender': 'male', 'response': 'yes'}, {'gender': 'female', 'response': 'no'},{'gender': 'female', 'response': 'yes'}, {'gender': 'female', 'response': 'no'}, {'gender': 'female', 'response': 'no'},
    {'gender': 'male', 'response': 'no'}, {'gender': 'female', 'response': 'no'}, {'gender': 'male', 'response': 'no'},{'gender': 'male', 'response': 'yes'}, {'gender': 'male', 'response': 'yes'}, {'gender': 'male', 'response': 'no'},{'gender': 'female', 'response': 'no'}, {'gender': 'female', 'response': 'no'}, {'gender': 'male', 'response': 'no'},{'gender': 'female', 'response': 'yes'}, {'gender': 'male', 'response': 'yes'}, {'gender': 'female', 'response': 'no'},
    {'gender': 'male', 'response': 'yes'}, {'gender': 'male', 'response': 'yes'}, {'gender': 'female', 'response': 'no'},{'gender': 'male', 'response': 'no'}, {'gender': 'female', 'response': 'no'}, {'gender': 'female', 'response': 'yes'},{'gender': 'female', 'response': 'yes'}, {'gender': 'male', 'response': 'no'}, {'gender': 'male', 'response': 'no'},{'gender': 'male', 'response': 'no'}, {'gender': 'female', 'response': 'no'}, {'gender': 'male', 'response': 'no'},{'gender': 'male', 'response': 'no'}, {'gender': 'female', 'response': 'yes'}])
    survey
    
    prob_female_or_yes = len(survey[(survey['gender'] == 'female') | (survey['response'] == 'yes')]) / len(survey)
    prob_female_or_yes
    
    P(female) = 0.5
    P(yes) = 11⁄32 = 0.3438
    P(female and yes)
    = P(yes | female) * P(female)
    = 5⁄16 * 0.5 = 0.1563
    
    P(female or yes)
    = 0.5 + 0.34375 - 0.15625
    = 0.6875
    
    

    (source: https://discovery.cs.illinois.edu/guides/Probability/addition-rule/)

    The probability that a respondent is female or answered ‘yes’ is 0.6875.
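
    To see that the formula and a direct count agree, each term can be computed separately. The sketch below rebuilds the survey from its counts rather than row by row (the 6 male ‘yes’ answers follow from 11 total minus 5 female); the variable names are my own.

```python
import pandas as pd

# Rebuild the survey from its counts: 16 women (5 yes), 16 men (6 yes).
rows = ([("female", "yes")] * 5 + [("female", "no")] * 11
        + [("male", "yes")] * 6 + [("male", "no")] * 10)
survey = pd.DataFrame(rows, columns=["gender", "response"])

is_female = survey["gender"] == "female"
is_yes = survey["response"] == "yes"

# The mean of a boolean column is the proportion of True values.
p_female = is_female.mean()             # P(female)
p_yes = is_yes.mean()                   # P(yes)
p_both = (is_female & is_yes).mean()    # P(female and yes)
p_either = (is_female | is_yes).mean()  # P(female or yes), counted directly

print(p_female + p_yes - p_both, p_either)
```

    Both printed values should match the hand calculation: subtracting P(female and yes) removes the five women who said ‘yes’ and would otherwise be counted twice.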

    B. The Law of Multiplication

    “The law of multiplication of probabilities states that if A and B are two events, then the probability that both A and B will occur is equal to the probability that A will occur multiplied by the conditional probability that B will occur given that A has occurred, or in symbols P(A and B) = P(A) x P(B | A)”

    Caroline and Victor are feeling adventurous, and want to travel twice this year. However, they are feeling indecisive, and put all of the places they are considering on slips of paper in a hat. Caroline randomly selects 2 slips of paper with destinations from the hat without replacement. What is the probability that both destinations are in Europe?

    Hand Calculations

    The question asks us to find the probability of both events occurring, so we know we need to use the multiplication rule. However, since Caroline is selecting from the hat without replacement, we know the events are dependent. Therefore, we need to use the formula P(A and B) = P(A) * P(B|A) where:

    • P(A) is the probability Caroline chooses a destination in Europe on her first selection. This is the number of destinations in Europe (7) divided by the total number of destinations (21): 7⁄21, or approximately 0.3333 (33.33%).
    • P(B) is the probability Caroline chooses a destination in Europe on her second selection.
    • P(B|A) is the probability Caroline chooses a destination in Europe on her second selection given she chose Europe on her first selection. This is the number of European destinations left after one Europe slip has been drawn (6) divided by the number of slips left (20): 6⁄20, or 0.3 (30%).

    Therefore:

    • P(A and B) = P(A) * P(B|A) = 7⁄21 * 6⁄20 = 0.1 or 10%
    • The probability that Caroline chooses 2 slips of paper with destinations in Europe from a hat without replacement is 10% (0.1).


    import pandas as pd
    # Creating a DataFrame named 'destinations' with 'city', 'country', and 'continent' columns.
    destinations = pd.DataFrame([
        {'city': 'Bali', 'country': 'Indonesia', 'continent': 'Asia'},
        {'city': 'New Orleans', 'country': 'USA', 'continent': 'North America'},
        {'city': 'Dublin', 'country': 'Ireland', 'continent': 'Europe'},
        {'city': 'Sydney', 'country': 'Australia', 'continent': 'Australia'},
        {'city': 'Paris', 'country': 'France', 'continent': 'Europe'},
        {'city': 'Cape Town', 'country': 'South Africa', 'continent': 'Africa'},
        {'city': 'New York City', 'country': 'USA', 'continent': 'North America'},
        {'city': 'Rome', 'country': 'Italy', 'continent': 'Europe'},
        {'city': 'London', 'country': 'England', 'continent': 'Europe'},
        {'city': 'Tokyo', 'country': 'Japan', 'continent': 'Asia'},
        {'city': 'Barcelona', 'country': 'Spain', 'continent': 'Europe'},
        {'city': 'Rio de Janeiro', 'country': 'Brazil', 'continent': 'South America'},
        {'city': 'Agra', 'country': 'India', 'continent': 'Asia'},
        {'city': 'Cairo', 'country': 'Egypt', 'continent': 'Africa'},
        {'city': 'Berlin', 'country': 'Germany', 'continent': 'Europe'},
        {'city': 'Toronto', 'country': 'Canada', 'continent': 'North America'},
        {'city': 'Seoul', 'country': 'South Korea', 'continent': 'Asia'},
        {'city': 'Lima', 'country': 'Peru', 'continent': 'South America'},
        {'city': 'Copenhagen', 'country': 'Denmark', 'continent': 'Europe'},
        {'city': 'Beijing', 'country': 'China', 'continent': 'Asia'},
        {'city': 'Chicago', 'country': 'USA', 'continent': 'North America'}
    ])
    destinations
    
    # probability of selecting Europe with the 1st slip: P(A)
    prob_europe1 = len(destinations[destinations['continent'] == 'Europe']) / len(destinations)
    
    # probability of selecting Europe with the 2nd slip, given the 1st slip was Europe: P(B|A)
    prob_europe2 = (len(destinations[destinations['continent'] == 'Europe']) - 1) / (len(destinations) - 1)
    
    # answer
    prob_2_europe_slips = prob_europe1 * prob_europe2
    prob_2_europe_slips
    

    The probability is 0.1.
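
    The same answer can be reached by counting instead of multiplying conditional probabilities. The sketch below is a cross-check, not part of the cited guide: it assumes 21 slips of which 7 are European (the figures from the hand calculation) and compares the sequential form P(A) × P(B|A) with the counting form "ways to draw 2 European slips divided by ways to draw any 2 slips".

```python
from fractions import Fraction
from math import comb

n_total = 21   # slips in the hat (from the hand calculation)
n_europe = 7   # slips naming a European destination

# Sequential form: P(A) * P(B | A), drawing without replacement
p_sequential = Fraction(n_europe, n_total) * Fraction(n_europe - 1, n_total - 1)

# Counting form: C(7, 2) favourable pairs out of C(21, 2) possible pairs
p_counting = Fraction(comb(n_europe, 2), comb(n_total, 2))

print(p_sequential, p_counting, float(p_counting))  # 1/10 1/10 0.1
```

    Both forms agree because choosing an unordered pair of slips is the same experiment as drawing two slips in sequence without replacement.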

    (source: https://discovery.cs.illinois.edu/guides/Probability/multiplication-rule/ )

    3. Random Variables and Probability Distributions

    In machine learning we solve for X from features Y. X is referred to as a random variable: a numerical variable which takes different values with different probabilities.

    A simple example is flipping two coins:

    X = number of heads. [X is a random variable or function]

    Here, the sample space S = {HH, HT, TH, TT}

    The sample space is simply the set of all possible outcomes. X assigns a number to each outcome (here 0, 1, or 2 heads), so it is a numerical variable which takes different values with different probabilities.
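
    To make the mapping from outcomes to values of X concrete, here is a minimal sketch that enumerates the sample space and tabulates the distribution of X. It assumes two fair coins, so each of the four outcomes has probability 1/4; the helper name `number_of_heads` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two coin flips: HH, HT, TH, TT (fair coins assumed).
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

def number_of_heads(outcome: str) -> int:
    """The random variable X: maps an outcome such as 'HT' to its number of heads."""
    return outcome.count("H")

# Count how many equally likely outcomes lead to each value of X.
counts = Counter(number_of_heads(outcome) for outcome in sample_space)
distribution = {x: Fraction(n, len(sample_space)) for x, n in sorted(counts.items())}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p}")
```

    X takes the value 1 twice as often as 0 or 2 because two different outcomes (HT and TH) both contain exactly one head.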

    Two basic tools for understanding a sample space and the behaviour of X are the frequency distribution and the probability distribution.

    Frequency distribution: how often each value occurs across all observed results.

    (Source: https://www.geeksforgeeks.org/frequency-distribution )

    Probability distribution: the probability assigned to each possible value, or range of values, of the random variable.

    (Source: https://datasciencedojo.com/blog/probability-distributions-in-data-science/ )

    Cumulative probability function F(x): the probability that X is less than or equal to some particular value x:

    F(x) = P(X ≤ x)

    For a discrete random variable, the cumulative probability function can be calculated by summing the probabilities of all values less than or equal to x:

    F(x) = ∑_{u ≤ x} P(X = u)

    For a discrete random variable, F(x) is a step function. A continuous random variable is based on measurements rather than fixed values: the measurement falls somewhere in a range, and the graph of F(x) looks continuous.

    For example, suppose a die is thrown (X = outcome of the die).

    Here, the sample space S = {1, 2, 3, 4, 5, 6}.
    The output of the function will be:

    • P(X = 1) = 1/6
    • P(X = 2) = 1/6
    • P(X = 3) = 1/6
    • P(X = 4) = 1/6
    • P(X = 5) = 1/6
    • P(X = 6) = 1/6

    This also satisfies the condition ∑_{i=1}^{6} P(X = i) = 1, since:
    P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) = 6 × 1/6 = 1
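
    The die example also shows why the cumulative probability function of a discrete variable is a step function. The sketch below assumes a fair six-sided die; `pmf` and `cdf` are illustrative names, and `cdf` simply sums the probabilities of every face less than or equal to x.

```python
from fractions import Fraction

faces = range(1, 7)
pmf = {x: Fraction(1, 6) for x in faces}  # fair die assumed

def cdf(x: float) -> Fraction:
    """F(x) = P(X <= x): sum the pmf over every face u with u <= x."""
    return sum((p for u, p in pmf.items() if u <= x), Fraction(0))

# The probabilities of all outcomes must add up to 1.
assert sum(pmf.values()) == 1

for x in (0, 1, 3, 3.5, 6):
    print(f"F({x}) = {cdf(x)}")
```

    F(3) and F(3.5) are both 1/2: the function stays flat between faces and jumps by 1/6 at each integer from 1 to 6.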

    Probability density function f(x) [not F(x)]: the area under the density function between any two points, x1 and x2, that is to say the integral of the function between them, represents the probability that the random variable will lie between these two values:

    P(x1 < X ≤ x2) = ∫_{x1}^{x2} f(x) dx

    If dx is a very small increment in x, so small that the density function is practically constant between x and x+dx, then the probability that X will lie in this small interval is very nearly f(x)dx, which is the area of a rectangle with height f(x) and width dx. f(x) may therefore be thought of as representing the probability density at x.

    A continuous probability distribution can also be represented by its cumulative probability function F(x), which, as in the discrete case, specifies the probability that X is less than or equal to x, and which is the limiting form of the cumulative frequency diagram showing the proportion of observations up to a given value.
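
    A small numerical experiment can illustrate both statements: that the area under f(x) between two points is a probability, and that f(x)dx approximates the probability of a very small interval. The exponential distribution with rate 1 is used here purely as an illustrative choice (it is not taken from the text above) because its density and cumulative function have simple closed forms; the `integrate` helper is a plain midpoint rule.

```python
import math

def f(x: float) -> float:
    """Density of an exponential distribution with rate 1 (illustrative choice)."""
    return math.exp(-x) if x >= 0 else 0.0

def F(x: float) -> float:
    """Cumulative probability function of the same distribution."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def integrate(func, a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of func from a to b."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width

# Area under the density between x1 and x2 versus F(x2) - F(x1).
x1, x2 = 0.5, 2.0
area = integrate(f, x1, x2)
exact = F(x2) - F(x1)

# Probability of a tiny interval versus the rectangle f(x) * dx.
x, dx = 1.0, 1e-4
small_interval = F(x + dx) - F(x)
rectangle = f(x) * dx

print(area, exact)
print(small_interval, rectangle)
```

    The two numbers on each printed line should agree to several decimal places, and the agreement on the second line improves as dx shrinks.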

    Frequency distribution and probability distribution are two fundamental concepts in statistics that describe how data points are spread across different values or ranges.

    Frequency Distribution

    A frequency distribution is a summary of how often each value or range of values occurs in a dataset. It is typically represented in a table or a graph, showing the frequency (count) of each unique value or interval.

    Example of Frequency Distribution

    Consider a dataset representing the number of pets owned by 20 households:

    • Data: 1, 2, 2, 3, 1, 0, 2, 3, 1, 4, 0, 2, 1, 3, 0, 2, 1, 4, 2, 3

    You can create a frequency distribution table:

    | Number of Pets (x) | Frequency (f) |
    |---|---|
    | 0 | 3 |
    | 1 | 5 |
    | 2 | 6 |
    | 3 | 4 |
    | 4 | 2 |

    This table shows how many households have 0, 1, 2, 3, or 4 pets.

    Probability Distribution

    A probability distribution describes the likelihood of each possible value of a random variable. It assigns a probability to each value, where the sum of all probabilities equals 1. Probability distributions can be discrete or continuous.

    Example of Probability Distribution

    Using the same dataset of pet ownership, we can convert the frequency distribution into a probability distribution:

    1. Calculate the total number of households: N = 20.
    2. Calculate the probability for each number of pets:

    | Number of Pets (x) | Frequency (f) | Probability (P) = f/N |
    |---|---|---|
    | 0 | 3 | 3/20 = 0.15 |
    | 1 | 5 | 5/20 = 0.25 |
    | 2 | 6 | 6/20 = 0.30 |
    | 3 | 4 | 4/20 = 0.20 |
    | 4 | 2 | 2/20 = 0.10 |

    This table shows the probability of randomly selecting a household with a certain number of pets. The probabilities add up to 1:

    0.15 + 0.25 + 0.30 + 0.20 + 0.10 = 1.00
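
    The conversion from counts to probabilities described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the list `pets` is the dataset given earlier, and the other names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Number of pets owned by each of the 20 households.
pets = [1, 2, 2, 3, 1, 0, 2, 3, 1, 4, 0, 2, 1, 3, 0, 2, 1, 4, 2, 3]

frequency = Counter(pets)   # frequency distribution: value -> count
n = len(pets)               # total number of households, N

# Probability distribution: divide each count by N.
probability = {x: Fraction(f, n) for x, f in sorted(frequency.items())}

print("x  f  P = f/N")
for x, p in probability.items():
    print(f"{x}  {frequency[x]}  {float(p):.2f}")

# A valid probability distribution must sum to 1.
assert sum(probability.values()) == 1
```

    Counting with `Counter` rather than by hand is a useful safeguard: if the frequencies do not add up to N, the probabilities will not add up to 1.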

    Summary

    • Frequency Distribution: Counts occurrences of each value in a dataset.
    • Probability Distribution: Assigns probabilities to each value, summing to 1.

    Both distributions are essential for understanding data and making statistical inferences.

    (Source: https://www.quora.com/What-is-frequency-distribution-and-probability-distribution-with-examples )

    Distributions are an important topic in ML and data science. In the next section we will take a look at the many different distributions.



    Works Cited:

    Bulmer, M. G. (1979). Principles of Statistics. Dover Publications.