One of the projects I am working on is a fintech application, Céillí. One of the data points presented to the stock trading community is Earnings Per Share (EPS). It’s a key financial metric that represents the portion of a company’s profit allocated to each outstanding share of common stock. EPS is calculated by dividing the company’s net income (after preferred dividends, if any) by the average number of outstanding common shares during the period. It’s commonly used to gauge a company’s profitability and is a vital input for valuation metrics like the price-to-earnings (P/E) ratio.
Below is an example Python script that calculates the Earnings Per Share (EPS) for a commodity-producing company (or any company) and then uses that EPS value to compute the Price-to-Earnings (P/E) ratio. You need the net income and outstanding shares data to calculate this:
def calculate_eps(net_income: float, num_shares: float) -> float:
    """
    Calculate Earnings Per Share (EPS).

    Args:
        net_income (float): The company's net income.
        num_shares (float): The average number of outstanding shares.

    Returns:
        float: EPS value.
    """
    if num_shares == 0:
        raise ValueError("Number of shares cannot be zero")
    return net_income / num_shares
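A companion helper for the P/E ratio might look like the following sketch; calculate_pe is an illustrative name of my own, not part of any existing codebase:

def calculate_pe(share_price: float, eps: float) -> float:
    """
    Calculate the Price-to-Earnings (P/E) ratio.

    Args:
        share_price (float): The current market price per share.
        eps (float): Earnings per share.

    Returns:
        float: P/E ratio (a negative EPS yields a negative P/E,
        which is usually reported as N/A).
    """
    if eps == 0:
        raise ValueError("EPS cannot be zero")
    return share_price / eps

# Example: net income of $10M, 5M average shares, current price $30.00
eps = calculate_eps(10_000_000, 5_000_000)   # 2.0
pe = calculate_pe(30.00, eps)                # 15.0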
Getting EPS data is a cumbersome task: it is not exposed in any freely available API, and the providers that do have the data charge a hefty fee which can eat into your returns. This information is usually contained in transcripts of earnings reports and in general market reports in the media.
For my acquisition of EPS data I rely on spidering news headlines, which I also use for sentiment analysis purposes. I noticed that much of the information needed for fundamental analysis (rather than technical analysis) is contained in news headlines, as well as in the transcripts of the quarterly and annual earnings calls where companies present their productive value. Traditionally, one would rely on regular expressions to extract information from text. Now, with the advent of Large Language Models and generative AI, this task can become a lot less cumbersome and more automated. One way to do this is to use Python with an LLM hosted by Hugging Face and developed by NuMind: https://huggingface.co/numind/NuExtract-1.5-tiny
At that link you can find more Python code samples. For my purposes, I used the code below to extract EPS from text. The basic idea is that you define a JSON template to structure the data, and the LLM places the key data points from the text into that template, which of course could easily be pushed automatically to a database for any UI needs you may have.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use Apple's Metal backend if available, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

def predict_NuExtract(model, tokenizer, text, schema, examples=["", "", ""]):
    # Parse and reformat the schema
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n" + schema + "\n"
    # Only add examples if they are non-empty valid JSON strings
    for ex in examples:
        if ex.strip():  # only process if not empty
            input_llm += json.dumps(json.loads(ex), indent=4) + "\n"
    # Add the text to extract data from
    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    # Tokenize and generate output
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to(device)
    output = tokenizer.decode(
        model.generate(**input_ids, use_cache=False)[0], skip_special_tokens=True)
    return output.split("<|output|>")[1].split("<|end-output|>")[0]

model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-1.5-tiny", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-1.5-tiny", trust_remote_code=True)
model.to(device)
model.eval()
text = ["Relmada Therapeutics Q4 2024 GAAP EPS $(0.62) Beats $(0.70) Estimate.",
"Clearside Biomedical Q4 2024 GAAP EPS $(0.10), Inline, Sales $306.00K Beat $176.67K Estimate.",
"Argan Q4 2024 GAAP EPS $2.22 Beats $1.15 Estimate, Sales $232.474M Beat $197.500M Estimate.",
"Plus Therapeutics FY24 EPS $(1.95) Vs. $(4.24) YoY, Grant Revenue $5.8M Up From $4.9M YoY",
"SeaStar Medical Holding Q4 EPS $(0.90) Misses $(0.89) Estimate, Sales $67.00K Miss $150.00K Estimate.",
"Pulse Biosciences Q4 EPS $(0.31) Down From $(0.21) YoY.",
"CalAmp FY 2024 GAAP EPS $(11.04), Inline.",
"VirTra Q4 2024 GAAP EPS $(0.08) Misses $0.04 Estimate, Sales $5.40M Miss $7.45M Estimate.",
"Better Choice Q4 EPS $(0.50), Sales $7.2M Up 26% From YoY."]
schema = """{
"company": "",
"period": "",
"eps_data": {
"eps_type": "",
"actual_eps": "",
"eps_estimate": "",
"eps_result": ""
},
"sales_data": {
"actual_sales": "",
"sales_estimate": "",
"sales_result": ""
}
}"""
for i in text:
    prediction = predict_NuExtract(model, tokenizer, i, schema)
    print(prediction)
'''
Output:

{
    "company": "Relmada Therapeutics",
    "period": "Q4 2024",
    "eps_data": {
        "eps_type": "GAAP",
        "actual_eps": "0.62",
        "eps_estimate": "0.70",
        "eps_result": "$(0.62)"
    },
    "sales_data": {
        "actual_sales": "",
        "sales_estimate": "",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "Clearside Biomedical",
    "period": "Q4 2024",
    "eps_data": {
        "eps_type": "GAAP",
        "actual_eps": "0.10",
        "eps_estimate": "176.67K",
        "eps_result": ""
    },
    "sales_data": {
        "actual_sales": "$306.00K",
        "sales_estimate": "$176.67K",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "Argan",
    "period": "Q4 2024",
    "eps_data": {
        "eps_type": "GAAP",
        "actual_eps": "$2.22",
        "eps_estimate": "$1.15",
        "eps_result": "$232.474M"
    },
    "sales_data": {
        "actual_sales": "$232.474M",
        "sales_estimate": "$197.500M",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "Plus Therapeutics",
    "period": "FY24",
    "eps_data": {
        "eps_type": "EPS",
        "actual_eps": "1.95",
        "eps_estimate": "4.24",
        "eps_result": ""
    },
    "sales_data": {
        "actual_sales": "5.8M",
        "sales_estimate": "",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "SeaStar Medical Holding",
    "period": "Q4",
    "eps_data": {
        "eps_type": "",
        "actual_eps": "0.90",
        "eps_estimate": "0.89",
        "eps_result": "Misses"
    },
    "sales_data": {
        "actual_sales": "$67.00K",
        "sales_estimate": "$150.00K",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "Pulse Biosciences",
    "period": "Q4",
    "eps_data": {
        "eps_type": "EPS",
        "actual_eps": "0.31",
        "eps_estimate": "0.21",
        "eps_result": "Down From"
    },
    "sales_data": {
        "actual_sales": "",
        "sales_estimate": "",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "CalAmp",
    "period": "FY 2024",
    "eps_data": {
        "eps_type": "GAAP",
        "actual_eps": "11.04",
        "eps_estimate": "",
        "eps_result": ""
    },
    "sales_data": {
        "actual_sales": "",
        "sales_estimate": "",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "VirTra",
    "period": "Q4 2024",
    "eps_data": {
        "eps_type": "GAAP",
        "actual_eps": "0.08",
        "eps_estimate": "0.04",
        "eps_result": "Misses"
    },
    "sales_data": {
        "actual_sales": "$5.40M",
        "sales_estimate": "$7.45M",
        "sales_result": ""
    }
}
Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.
{
    "company": "Better Choice",
    "period": "Q4",
    "eps_data": {
        "eps_type": "EPS",
        "actual_eps": "0.50",
        "eps_estimate": "",
        "eps_result": ""
    },
    "sales_data": {
        "actual_sales": "$7.2M",
        "sales_estimate": "",
        "sales_result": "Up 26% From YoY"
    }
}
'''
As one can see, results can vary. One could introduce additional steps into this basic process to further refine the models, or even use case-based models specific to different headline patterns, since financial data is presented in market news in fairly set ways. Other financial metrics, such as EBITDA and P/E, are also reported in the news, so this capability helps with collecting those as well.
OpenAI’s ChatGPT had this to add to the EPS discussion, which I found interesting:
me: is there an optimal eps value or is it relative to each stock? what constitutes a miss?
ChatGPT said:
EPS (Earnings Per Share) isn’t a one-size-fits-all metric—it’s relative to the company, its industry, historical performance, and market expectations. There isn’t an “optimal” EPS value that applies across the board. Instead, investors look at trends, growth rates, and how the current EPS compares to analysts’ forecasts or the company’s own guidance.
What Constitutes a “Miss”
A “miss” typically occurs when a company’s reported EPS is below the consensus estimate or its internal guidance. Analysts and investors use these estimates as benchmarks, so if the actual EPS falls short, it’s seen as underperformance—even if the EPS might be strong in absolute terms. The severity of the miss can affect investor sentiment and, consequently, the stock’s price.
Key Considerations
Relative Performance: For one company, an EPS of, say, $2.00 might be excellent if the industry average is $1.00, while another company might be expected to have an EPS of $5.00 based on its size and market position.
Growth Trends: Investors often consider not just the current EPS but also its trend over time. Consistent growth can be more important than a single high or low number.
Market Expectations: EPS is closely watched around earnings announcements. A miss relative to consensus estimates can trigger a negative reaction, even if the company is profitable overall.
In summary, the “optimal” EPS is relative and context-dependent, and a miss is typically defined by falling short of pre-determined expectations rather than a fixed numeric benchmark.
me: if i have the eps for a company how do i get the P/E, what is the price i use?
ChatGPT said:
The Price/Earnings (P/E) ratio is calculated by dividing the current market price of a company’s stock by its earnings per share (EPS). So if you have the EPS, the price you use is typically the current share price—often the last traded or closing price.
For example, if a company’s EPS is $2.00 and its current share price is $30.00, then the P/E ratio is calculated as: P/E = Current Share Price / EPS = 30.00 / 2.00 = 15
There are a couple of variations depending on your analysis:
Trailing P/E: Uses the current share price and the EPS from the past 12 months.
Forward P/E: Uses the current share price and forecasted EPS.
In most standard cases, you’d use the latest closing price from the market to compute the P/E ratio.
me: how do you get the trailing p/e, do you use daily bars? and how do you get the forecasted EPS?
ChatGPT said:
Trailing P/E is calculated by dividing the current market price (usually the last closing price) by the earnings per share over the trailing twelve months (TTM EPS). Here’s how it typically works:
Trailing P/E:
Price: You take the most recent closing price of the stock. While daily bars provide the closing price, you don’t calculate the EPS from daily bars. Instead, you use the EPS figure from the company’s quarterly reports and annualize it (i.e., sum up the last four quarters of reported EPS).
EPS: The earnings data comes from the company’s financial reports. For example, if you have EPS for Q1, Q2, Q3, and Q4, you add those together to get the TTM EPS.
Forward P/E:
Source: Forecasted EPS isn’t derived directly from historical price data or daily bars. Instead, it’s typically provided by analysts’ consensus estimates or company guidance. Financial data providers (like Bloomberg, FactSet, or Reuters) compile these forecasts.
Usage: You take the current stock price and divide it by the forecasted EPS (often for the next 12 months) to get the forward P/E ratio.
In summary, daily bars are mainly used to get the current price (closing price) for the calculation. The earnings figure for trailing P/E comes from reported quarterly earnings, and forecasted EPS comes from analysts’ estimates or management’s guidance.
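To make those mechanics concrete, here is a minimal sketch; the quarterly EPS figures, closing price, and consensus forecast are all hypothetical:

# Trailing P/E from four quarters of reported EPS (hypothetical numbers)
quarterly_eps = [0.52, 0.48, 0.61, 0.59]
ttm_eps = sum(quarterly_eps)               # trailing-twelve-month EPS = 2.20
closing_price = 33.00                      # latest daily close
trailing_pe = closing_price / ttm_eps      # 15.0

# Forward P/E uses an analyst consensus or guidance figure instead
forecast_eps = 2.40
forward_pe = closing_price / forecast_eps  # 13.75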
Here is a development version which, although rough in visual style as a prototype, gives you a picture of how you can put this all together for algorithmic insights into the market.
In statistics, regression toward the mean (also called regression to the mean, reversion to the mean, and reversion to mediocrity) is the phenomenon where if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean. Furthermore, when many random variables are sampled and the most extreme results are intentionally picked out, it refers to the fact that (in many cases) a second sampling of these picked-out variables will result in “less extreme” results, closer to the initial mean of all of the variables.
Mathematically, the strength of this “regression” effect is dependent on whether or not all of the random variables are drawn from the same distribution, or if there are genuine differences in the underlying distributions for each random variable. In the first case, the “regression” effect is statistically likely to occur, but in the second case, it may occur less strongly or not at all.
Regression toward the mean is thus a useful concept to consider when designing any scientific experiment, data analysis, or test, which intentionally selects the most extreme events – it indicates that follow-up checks may be useful in order to avoid jumping to false conclusions about these events; they may be genuine extreme events, a completely meaningless selection due to statistical noise, or a mix of the two cases.
Mathematically, a continuous mean-reverting time series can be represented by an Ornstein-Uhlenbeck stochastic differential equation of the following form:

dXt = θ(μ − Xt) dt + σ dWt

where θ is the rate of reversion to the mean, μ is the mean value of the process, σ is the volatility of the process and, finally, Wt is a Wiener process. The equation implies that the change of the time series in the next period is proportional to the difference between the mean and the current value, with the addition of Gaussian noise.
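To build intuition, the process can be simulated with a simple Euler-Maruyama discretization; this is a sketch with illustrative parameter values, not production code:

import numpy as np

# Euler-Maruyama simulation of an Ornstein-Uhlenbeck process
np.random.seed(42)
theta, mu, sigma = 1.5, 100.0, 2.0    # reversion rate, long-run mean, volatility (illustrative)
dt, n_steps = 1.0 / 252, 252          # daily steps over one trading year
x = np.empty(n_steps)
x[0] = 90.0                           # start away from the mean
for t in range(1, n_steps):
    dW = np.random.normal(0.0, np.sqrt(dt))             # Wiener increment
    x[t] = x[t-1] + theta * (mu - x[t-1]) * dt + sigma * dW
# x drifts back toward mu = 100 and then fluctuates around it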
We can picture mean reversion as the line of linear regression through the series, as in the accompanying plot.
A key concept in testing for mean reversion is that of stationarity:
In mathematics and statistics, a stationary process (also called a strict/strictly stationary process or strong/strongly stationary process) is a stochastic process whose statistical properties, such as mean and variance, do not change over time. More formally, the joint probability distribution of the process remains the same when shifted in time. This implies that the process is statistically consistent across different time periods. Because many statistical procedures in time series analysis assume stationarity, non-stationary data are frequently transformed to achieve stationarity before analysis.
The Augmented Dickey-Fuller (ADF) test provides a quick check and confirmatory evidence of whether your time series is stationary or non-stationary. The intuition behind it is simple: in a mean-reverting series, if the current level is higher than the mean, the next move is likely downward, while if it is lower than the mean, the next move is likely upward.
In the Python code below we will simply interpret the result using the p-value from the test. The null hypothesis of the ADF test is that the series has a unit root, i.e. is non-stationary. A p-value below a specified threshold (we are going to use 5%) means we reject the null hypothesis and treat the series as stationary; a p-value above the threshold means we cannot reject it, suggesting the series is non-stationary.
import numpy as np
from statsmodels.regression.linear_model import OLS
from statsmodels.tsa.tsatools import lagmat, add_trend
from statsmodels.tsa.adfvalues import mackinnonp

def adf(ts):
    """
    Augmented Dickey-Fuller unit root test
    """
    # make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # Get the dimension of the array
    nobs = ts.shape[0]
    # We use 1 as maximum lag in our calculations
    maxlag = 1
    # Calculate the discrete difference
    tsdiff = np.diff(ts)
    # Create a 2d array of lags, trim invalid observations on both sides
    tsdall = lagmat(tsdiff[:, None], maxlag, trim='both', original='in')
    # Get dimension of the array
    nobs = tsdall.shape[0]
    # replace 0 xdiff with level of x
    tsdall[:, 0] = ts[-nobs - 1:-1]
    tsdshort = tsdiff[-nobs:]
    # Calculate the linear regression using an ordinary least squares model
    results = OLS(tsdshort, add_trend(tsdall[:, :maxlag + 1], 'c')).fit()
    adfstat = results.tvalues[0]
    # Get approx p-value from a precomputed table (from stattools)
    pvalue = mackinnonp(adfstat, 'c', N=1)
    return pvalue
This code can also be validated against the adfuller function included in the Python module statsmodels.
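For example, a quick cross-check might look like this sketch; matching maxlag and the constant-only regression keeps the two implementations comparable:

# Cross-check the hand-rolled adf() against statsmodels' reference implementation
import numpy as np
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)
ts = np.cumsum(np.random.normal(size=500))   # a random walk: should NOT look stationary

print("custom adf p-value:  ", adf(ts))
print("statsmodels p-value: ", adfuller(ts, maxlag=1, regression='c', autolag=None)[1])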
One can also test stationarity using the Hurst exponent. This measures the speed of diffusion, which in a mean-reverting series should be slower than in a geometric random walk; the speed of diffusion is measured by its variance. We can estimate the Hurst exponent with the following code from Corrius (2018):
def hurst(ts):
    """
    Returns the Hurst Exponent of the time series vector ts
    """
    # make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # Helper variables used during calculations
    lagvec = []
    tau = []
    # Create the range of lag values
    lags = range(2, 100)
    # Step through the different lags
    for lag in lags:
        # produce value difference with lag
        pdiff = np.subtract(ts[lag:], ts[:-lag])
        # Write the different lags into a vector
        lagvec.append(lag)
        # Calculate the variance of the difference vector
        tau.append(np.sqrt(np.std(pdiff)))
    # linear fit to double-log graph
    m = np.polyfit(np.log10(np.asarray(lagvec)),
                   np.log10(np.asarray(tau).clip(min=0.0000000001)),
                   1)
    # return the calculated hurst exponent
    return m[0] * 2.0
H = 0.5 indicates a geometric random walk; for a mean-reverting series H < 0.5, and for a trending series H > 0.5. H is also an indicator of the degree of mean reversion or trendiness: as H decreases toward 0 the series is more strongly mean reverting, and as it increases toward 1 it is more strongly trending.
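As a sanity check, one can run hurst on synthetic series whose character is known in advance (a sketch; the values vary slightly with the seed):

import numpy as np

np.random.seed(1)
random_walk = np.cumsum(np.random.normal(size=1000))   # H should come out near 0.5
white_noise = np.random.normal(size=1000)              # strongly mean reverting: H near 0

print("random walk H: ", round(hurst(random_walk), 3))
print("white noise H: ", round(hurst(white_noise), 3))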
To make sure it is not a random walk we can test the statistical significance of the H value with the Variance Ratio Test:
import numpy as np

def variance_ratio(ts, lag=2):
    """
    Returns the variance ratio test result
    """
    # make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # Apply the formula to calculate the test
    n = len(ts)
    mu = sum(ts[1:n] - ts[:n-1]) / n
    m = (n - lag + 1) * (1 - lag / n)
    b = sum(np.square(ts[1:n] - ts[:n-1] - mu)) / (n - 1)
    t = sum(np.square(ts[lag:n] - ts[:n-lag] - lag * mu)) / m
    return t / (lag * b)

# Source: Corrius (2018)
The test involves dividing the variance of group one by the variance of group two. If this ratio is close to one the conclusion drawn is that the variance of each group is the same. If the ratio is far from one the conclusion drawn is that the variances are not the same.
So how long will it take for the time series to mean revert, that is, to diffuse back to the mean? This is measured by the ‘half-life’ of the mean reversion.
import numpy as np

def half_life(ts):
    """
    Calculates the half life of a mean reversion
    """
    # make sure we are working with an array, convert if necessary
    ts = np.asarray(ts)
    # delta = p(t) - p(t-1)
    delta_ts = np.diff(ts)
    # calculate the vector of lagged values. lag = 1
    lag_ts = np.vstack([ts[1:], np.ones(len(ts[1:]))]).T
    # calculate the slope of the deltas vs the lagged values
    beta = np.linalg.lstsq(lag_ts, delta_ts, rcond=None)
    # compute and return half life
    return (np.log(2) / beta[0])[0]

# source: Corrius (2018)
So we can understand mean reversion in programming, namely for fintech, through the following steps (a combined sketch follows this list):
1. Test for stationarity using the Augmented Dickey-Fuller test (ADF test)
2. Confirm by testing the Hurst exponent (H)
3. Test the variance ratio (F-ratio test)
4. Test the time to mean revert using the half-life test
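Putting the four checks together on one series might look like the following sketch; it assumes the adf, hurst, variance_ratio and half_life functions defined above and a NumPy-compatible sequence of closing prices:

import numpy as np

def mean_reversion_report(closes):
    """Run the four mean-reversion checks on a series of closing prices."""
    closes = np.asarray(closes)
    print("ADF p-value:     ", adf(closes))             # < 0.05 suggests stationarity
    print("Hurst exponent:  ", hurst(closes))           # < 0.5 suggests mean reversion
    print("Variance ratio:  ", variance_ratio(closes))  # far from 1 suggests not a random walk
    print("Half-life (bars):", half_life(closes))       # how fast the series reverts halfway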
A quick study of z-scores in varying samples of NVDA stock prices.
I used python to generate these plots.
Animation of the NVDA z-score scatter plot. The x and y axes are in standard deviation units; anything above 3 or below -3 is considered an outlier in data science. The DeepSeek release had a nearly outlier-sized effect on NVDA’s value. The image shows the last 15 days, 30 days, 45 days, 60 days and 93 days.
Z-scores for QQQ, the tech-index ETF, over the same time periods as for NVDA above. In an index ETF such as QQQ one should see a smoother spectrum, as it is less susceptible to volatility.
First, we see the price chart for the last 93 trading days. Then we take a look at the 93, 60, 45, 30 and 15 day sample windows, all going backward in time from Feb. 20th, 2025. One major perturbative wave that hit the stock was the release of DeepSeek, which temporarily had a negative impact on NVDA’s value. The question is whether one can see a correlation with the action of the index ETF for the sector NVDA is in and is a part of. We examine the spread in the z-scores to see whether it can serve as an indicator of up or down motion in NVDA stock prices relative to the index.
Comparing different plots of z-scores for NVDA stock prices from Oct. 4th, 2024 to February 20th, 2025.
NVDA, past 93 trading days: open price 124.92, close price 140.11 (gained 13%)
QQQ ETF (of which NVDA is a portfolio member), past 93 trading days: open price 487.32, close price 537.23 (gained 10.2%)
Z-Scores:
The z-score is calculated as z = (x - μ) / σ, where x is the observation, μ is the sample mean and σ is the sample standard deviation.
93 Days:
NVDA past 93 trading days zscores
QQQ past 93 trading days
I use the terms "prices" as code for NVDA and "trends" for QQQ, the index ETF for tech stocks.
Some Data for Z-scores:
Shape of Z-score plots:
prices max/min: 1.7594241527219399 / -2.9115388106292883, trends max/min: 1.7722687913449942 / -2.0281115555314564
NVDA length and mean of positive and negative:
prices positive len: 51 0.7123723838425315
prices negative len: 43 -0.8449067808364951
prices positive list: [0.14019500216746938, 0.13021735775384247, 0.9441080663512041, 0.9270035330706985, 0.35257629040040717, 0.4737334011373164, 0.6348010895287374, 0.4894125566444479, 0.5934651341008516, 0.3205052904994612, 0.40246451246854575, 1.2149298432925333, 1.6810283751862907, 1.502856153514364, 1.1650416212243906, 1.5969310865571407, 1.30900477633531, 1.3788482872307024, 0.6975177115572552, 0.43667357902955695, 1.4144827315650879, 1.254840420947041, 1.3660198872703233, 0.6932415782371287, 0.16585180208822778, 0.2200161574764928, 0.4523527345366844, 1.147937087943885, 1.136534065756884, 0.7630850891325253, 0.2456729573972512, 0.3169418460660218, 0.03614242471106574, 0.36825544590753456, 0.44665122344318386, 0.40531526801529805, 0.057523091311697735, 0.17440406872848058, 1.052436777127734, 1.7594241527219399, 0.4352482012561788, 0.43097206793605647, 0.08888140232595665, 0.5335992676190859, 1.4230349982053405, 1.4444156648059727, 0.7887418890532837, 0.25137446849075173, 0.32977024602640104, 0.30553882387901676, 0.43097206793605647]
prices negative list: [-1.6826224330881445, -2.0281115555314564, -1.5515748349199896, -1.2967600607041379, -1.3338240278628077, -1.2828610730196344, -1.0088524586680414, -1.4489870686772424, -1.4450159293388134, -1.4225128064210513, -1.2093949952586989, -1.146518622400242, -1.1107783683543844, -1.6137893512220423, -1.3516941548857364, -1.1531371879642918, -1.1478423355130543, -0.8327986146643587, -1.082980392985381, -1.9122866581606137, -1.6753420109676906, -1.7693256419771755, -1.3589745770061903, -0.47407236109295114, -0.032614037970937836, -0.26823497205105246, -1.0704051184136898, -0.8420646064540289, -0.6143859510507724, -0.6335797911865096, -0.5137837544772378, -0.4601733734084476, -0.4072248488960653, -0.22719986555395358, -0.49260434467228414, -0.19874003362854603, -0.10012340672422786, -0.16630906236470946, -0.36751345551177483, -0.4753960742057596, -0.507165188913192]
Trends (QQQ) length and mean for positive/negative:
trends positive len: 53 0.7493882152106839
trends negative len: 41 -0.9687213513699346
trends positive list: [0.05276545780528101, 0.09247685118957147, 0.07262115449743001, 0.011068494751777909, 0.1685903551761238, 0.2731636910880874, 0.6960900306307639, 0.6001208299520626, 0.9092078417931164, 0.6378466536671404, 0.520036186627085, 1.1368864971963693, 0.9105315549059249, 1.1772597471370638, 1.6829181562303424, 1.5260581523624006, 0.2466894288318963, 0.09446242085878412, 0.39163601468454706, 0.670277624930977, 1.1395339234219937, 1.1157070873914194, 0.6497600716824238, 0.18976976498108122, 0.3863411622333134, 0.7808076698505786, 0.16130993305566993, 0.16726664206331537, 0.2619121296292083, 0.022320056210664595, 0.595487834057233, 0.7980159403171032, 1.2421216896647371, 1.316911480538481, 1.117692657060632, 0.0971098470844085, 0.6001208299520626, 0.535258887424397, 0.6821910429462604, 0.631889944659495, 0.3552339040822852, 0.776174673955749, 0.9336965343800949, 1.1157070873914194, 0.6735869077129981, 1.0925421079172493, 1.009148181810243, 1.0296657350587888, 1.5326767179264504, 1.681594443117534, 1.7623409429989234, 1.7722687913449942, 1.6207036399282937]
trends negative list: [-1.6826224330881445, -2.0281115555314564, -1.5515748349199896, -1.2967600607041379, -1.3338240278628077, -1.2828610730196344, -1.0088524586680414, -1.4489870686772424, -1.4450159293388134, -1.4225128064210513, -1.2093949952586989, -1.146518622400242, -1.1107783683543844, -1.6137893512220423, -1.3516941548857364, -1.1531371879642918, -1.1478423355130543, -0.8327986146643587, -1.082980392985381, -1.9122866581606137, -1.6753420109676906, -1.7693256419771755, -1.3589745770061903, -0.47407236109295114, -0.032614037970937836, -0.26823497205105246, -1.0704051184136898, -0.8420646064540289, -0.6143859510507724, -0.6335797911865096, -0.5137837544772378, -0.4601733734084476, -0.4072248488960653, -0.22719986555395358, -0.49260434467228414, -0.19874003362854603, -0.10012340672422786, -0.16630906236470946, -0.36751345551177483, -0.4753960742057596, -0.507165188913192]
zscore silos:
prices 0 to 1: 35 0.4056562939002893
prices 1 to 2: 16 1.3722767337624457
prices 2 to 3: 0 nan
prices > 3: 0 nan
prices 0 to -1: 31 -0.4417795985574785
prices -1 to -2: 8 -1.4510292935080256
prices -2 to -3: 4 -2.7127490308408726
prices < -3: 0 nan
trends 0 to 1: 35 -0.27555447552077394
trends 1 to 2: 17 1.344595325715943
trends 2 to 3: 0 nan
trends > 3: 0 nan
trends 0 to -1: 20 -0.41098618215387406
trends -1 to -2: 21 -1.3951694395818528
trends -2 to -3: 1 -2.0790028405616057
trends < -3: 0 nan
60 DAYS
NVDA past 60 days open: 146.67, close: 140.11
QQQ past 60 days open: 504.98, close: 537.23
"prices" is for NVDA, "trends" is for QQQ
some data for 60 day plots:
shape of plots:
prices max/min: 1.9145807927766163 / -2.645347442651133, trends max/min: 1.8695586413001115 / -1.7542554651846929
prices positive len: 34 0.6792199553321576
prices negative len: 26 -0.8882107108189714
prices positive list: [1.5305282468571573, 0.8737427335456234, 0.04858635771143208, 0.1738208835547299, 0.35888968285649864, 0.4117664826570029, 0.6385801239065363, 1.3176295529235456, 1.3064975950708098, 0.9419259753936441, 0.43681338782566403, 0.5063881244052757, 0.23226366228160591, 0.556481934742594, 0.6330141449801685, 0.5926607977639948, 0.18634433613906048, 0.2531360832554902, 0.36723865124605237, 1.2243994059068675, 1.9145807927766163, 0.6218821871274288, 0.617707702932656, 0.03327991566391562, 0.07919924180646105, 0.2837489673505192, 0.7178953236072966, 1.5861880361208474, 1.6070604570947318, 0.9669728805623052, 0.4423793667520319, 0.5189115769896063, 0.49525616655253607, 0.617707702932656]
prices negative list: [-1.7542554651846929, -1.669273142479056, -1.5853399842512712, -1.2999672462767844, -1.720682201893578, -1.2548531737293467, -0.6725668885240624, -0.506798901024174, -0.11546055078710396, -0.5487654801380664, -0.7900733100429681, -0.318998459489499, -0.6389936252329415, -1.098527666530088, -1.203444114314825, -0.32739177531227037, -0.6841076977803849, -0.6746652174797553, -1.5223901155804267, -1.6934039254695472, -1.7437638204062227, -0.524634697147575, -0.9044322381283226, -0.7858766521315705, -0.0913297677966126, -0.3767025057710995]
trends positive len: 34 0.7207852536147648
trends negative len: 26 -0.9425653316500864
trends positive list: [0.1636172003202925, 0.01148835103241942, 0.5014481621871478, 0.07129072626972463, 0.8623607425666361, 0.5035464911428407, 0.926359775715327, 1.7279214367907145, 1.4792694555408878, 0.12269978568424661, 0.8665574004780336, 0.828787479275527, 0.09017568687097201, 0.29791025348475275, 0.004144199687494526, 0.32518852990878333, 1.0291778945443708, 1.147733480541123, 0.8319349727090662, 0.01148835103241942, 0.14158474628549397, 0.06184824596909498, 0.2905661021398278, 0.5402672478675009, 0.828787479275527, 0.12794560807347868, 0.7920667225508667, 0.6598719983420994, 0.6923960971553621, 1.4897611003193638, 1.7258231078350217, 1.8538211741324033, 1.8695586413001115, 1.6292999758730682]
trends negative list: [-1.7542554651846929, -1.669273142479056, -1.5853399842512712, -1.2999672462767844, -1.720682201893578, -1.2548531737293467, -0.6725668885240624, -0.506798901024174, -0.11546055078710396, -0.5487654801380664, -0.7900733100429681, -0.318998459489499, -0.6389936252329415, -1.098527666530088, -1.203444114314825, -0.32739177531227037, -0.6841076977803849, -0.6746652174797553, -1.5223901155804267, -1.6934039254695472, -1.7437638204062227, -0.524634697147575, -0.9044322381283226, -0.7858766521315705, -0.0913297677966126, -0.3767025057710995]
zscore silos:
prices 0 to 1: 26 0.4857168466015017
prices 1 to 2: 7 1.4753238481642317
prices 2 to 3: 0 nan
prices > 3: 0 nan
prices 0 to -1: 20 -0.42879927835305054
prices -1 to -2: 3 -1.5721141625126993
prices -2 to -3: 4 -2.4158942235473466
prices < -3: 0 nan
trends 0 to 1: 23 -0.06220277723223607
trends 1 to 2: 9 1.5519738719284009
trends 2 to 3: 0 nan
trends > 3: 0 nan
trends 0 to -1: 18 -0.4898406138886327
trends -1 to -2: 10 -1.5547354231811998
trends -2 to -3: 0 nan
trends < -3: 0 nan
45 DAYS
NVDA past 45 days open: 134.25, close: 140.11
QQQ past 45 days open: 530.53, close: 537.23
some data for 45 day plot:
shape of plots:
prices max/min: 2.0056164520976707 / -2.335406431884398, trends max/min: 1.7445748897642444 / -2.004426294606925
zscore means of prices and trends: 4.46309655899313e-15 -1.0288066694859784e-14
NVDA prices length and mean of positive/negative:
prices positive len: 24 0.7357314773777098
prices negative len: 21 -0.8408359741459445
prices positive list: [0.054341899730996124, 0.712714915702643, 0.7855730965445773, 0.7471569648279229, 0.36034625926711406, 0.4239315807291669, 2.943764882851058e-05, 0.5325565048935021, 1.3485681303231487, 2.0056164520976707, 0.7749755429675673, 0.7710014603761928, 0.2146298975832493, 0.2583448060884106, 0.45307485306593986, 0.8663794425692682, 1.6929886215759211, 1.7128590345328127, 1.1034997038548302, 0.13249885736143355, 0.6040899915383078, 0.6769481723802421, 0.6544283710290971, 0.7710014603761928]
prices negative list: [-0.7645562745724367, -1.014925110927081, -0.5261615999565064, -0.06787777341171075, -0.10162313831168997, -0.8581724481659111, -1.334961797397784, -1.443817813204148, -0.5348700812210075, -0.9049805349626544, -0.8951834935400782, -1.7747401012554962, -1.9521754070198687, -2.004426294606925, -0.739519390936971, -1.1335781681560086, -0.19088507127290136, -1.010570870294818, -0.18326515016646283, -0.28994404565668896, -0.048283690566570704, -0.1310142625794062, -0.5860324086500015, -0.0624349726213975]
pos/neg length and mean:
trends positive len: 21 0.8835238047359075
trends negative len: 24 -0.7730833291439385
trends positive list: [0.7659593076650326, 1.5976192684256507, 1.3396305109645679, 0.7039113786554122, 0.6647232129651197, 0.11391177298491237, 0.14221433709456596, 0.8726382031552712, 0.9956455010164618, 0.6679888934393077, 0.10629185187847384, 0.3653691694976192, 0.6647232129651197, 0.6266236074328899, 0.4894650275168725, 0.5232103924168393, 1.3505161125452068, 1.5954421481095253, 1.7282464873932923, 1.7445748897642444, 1.4952946135676752]
trends negative list: [-0.7645562745724367, -1.014925110927081, -0.5261615999565064, -0.06787777341171075, -0.10162313831168997, -0.8581724481659111, -1.334961797397784, -1.443817813204148, -0.5348700812210075, -0.9049805349626544, -0.8951834935400782, -1.7747401012554962, -1.9521754070198687, -2.004426294606925, -0.739519390936971, -1.1335781681560086, -0.19088507127290136, -1.010570870294818, -0.18326515016646283, -0.28994404565668896, -0.048283690566570704, -0.1310142625794062, -0.5860324086500015, -0.0624349726213975]
zscore silos:
prices 0 to 1: 19 0.5159193883045129
prices 1 to 2: 4 1.4639440396810512
prices 2 to 3: 1 2.005079787550808
prices > 3: 0 nan
prices 0 to -1: 14 -0.38097024812529595
prices -1 to -2: 4 -1.454678571075471
prices -2 to -3: 3 -2.1703421886682137
prices < -3: 0 nan
trends 0 to 1: 13 0.18729980332171914
trends 1 to 2: 8 1.5002054504426652
trends 2 to 3: 0 nan
trends > 3: 0 nan
trends 0 to -1: 16 -0.42194589789701936
trends -1 to -2: 7 -1.3777869923741477
trends -2 to -3: 1 -2.0051154449596136
trends < -3: 0 nan
30 DAYS
NVDA past 30 days, during which it took a major dive on the release of DeepSeek. open: 140.14, close: 140.11
QQQ past 30 days open: 515.18, close: 537.23
Some Data for 30 day plot
shape of zscore plot
prices max/min: 1.7395624418627935 / -2.0102455727613866, trends max/min: 1.635662150597431 / -1.980244163451975
beats_condition, max, min: False True False
avg zscores list: [-0.04900609322786642, -0.043237954955050006, -1.4069222487518587, -1.9069027691119136, -2.1376722944324356, -0.3679613081523311, -1.0756478274354047, 0.34156933192037153, 1.0456770404126563, 2.5158374261919283, 2.6528832614306106, 1.7724244766420925, -2.8159600623456074, -0.7210526618453241, -1.4730439252605447, -1.1233948714886564, -1.765168780240289, -2.622450849103493, -1.7105258315300362, -0.7023407103306647, 0.0587929352193316, -0.500214456769451, 0.6220642834292512, 0.39529360882055975, 0.2241537498518348, 1.531306979740637, 2.2043609778904045, 2.3999373023930133, 2.3948264891602498, 2.2623747818778503]
zscore differential avg: -4.4704980458239636e-15
zscore means of prices and trends: -7.919590908992784e-16 -3.796962744218036e-15
trend count and mean of pos or neg:
prices positive len: 17 0.7172287353811013
prices negative len: 13 -0.9379145001137494
prices positive list: [0.8708241976370331, 0.8671431033818425, 0.35178990765469526, 0.022945487524039902, 0.3922819444618296, 0.06466455574957113, 0.5726555629663304, 0.9554893655064959, 1.7211569705868235, 1.7395624418627935, 1.1751279893997306, 0.06466455574957113, 0.2757139597140209, 0.7125371446636966, 0.780023872675586, 0.7591643385628187, 0.8671431033818425]
prices negative list: [-0.9198302908648995, -0.9103810583368925, -1.758712156406554, -1.9298482566359534, -1.980244163451975, -0.7602432526141607, -1.1403123831849757, -0.23108623104595882, -1.0216720192222633, -0.22373682796862931, -0.326628471051326, -0.09354740202724439, -0.1733409211526078, -0.6122052763421065, -0.10719629345658256]
trend count and mean of pos or neg:
trends positive len: 15 0.8125990002508011
trends negative len: 15 -0.8125990002508086
trends positive list: [0.09018767490616038, 0.7946804556051048, 0.9133208195678171, 0.5972964872423618, 0.055540488970154546, 0.30542019359958456, 0.5941467430663635, 0.5573997276796802, 0.42511047228762966, 0.45765782877296995, 1.2555930200266159, 1.4918238332267078, 1.6199134297174271, 1.635662150597431, 1.395231678496008]
trends negative list: [-0.9198302908648995, -0.9103810583368925, -1.758712156406554, -1.9298482566359534, -1.980244163451975, -0.7602432526141607, -1.1403123831849757, -0.23108623104595882, -1.0216720192222633, -0.22373682796862931, -0.326628471051326, -0.09354740202724439, -0.1733409211526078, -0.6122052763421065, -0.10719629345658256]
zscore silos:
prices 0 to 1: 14 0.5195964415593288
prices 1 to 2: 3 1.5885045560705986
prices 2 to 3: 0 nan
prices > 3: 0 nan
prices 0 to -1: 7 -0.38925651381809206
prices -1 to -2: 5 -1.4606024370377964
prices -2 to -3: 1 -2.0120560681267605
prices < -3: 0 nan
trends 0 to 1: 11 0.23112599712881077
trends 1 to 2: 5 1.4620954964259174
trends 2 to 3: 0 nan
trends > 3: 0 nan
trends 0 to -1: 9 -0.4261916020115731
trends -1 to -2: 4 -1.522227807745561
trends -2 to -3: 1 -2.0472318717466145
trends < -3: 0 nan
15 DAYS
NVDA past 15 days, during which the price regained momentum and climbed back up. open: 124.65, min on day 81 at 116.64, close: 140.11
QQQ past 15 days open: 523.05, close: 537.23.
Some Data for 15 day plot
shape of z-score:
prices max/min: 1.2978590301422221 / -1.7892515585278195, trends max/min: 1.4875489718437718 / -1.7015455028399435
prices positive len: 8 0.7961212804164766
prices negative len: 7 -0.9098528919045411
prices positive list: [0.4368908744960763, 0.3355230641218386, 0.11698986253581116, 0.6633228665008721, 1.1319844313480085, 1.2043900101867528, 1.1820101040002302, 1.2978590301422221]
prices negative list: [-0.9657152036835966, -1.0789198650922653, -1.7015455028399435, -0.7542012310515969, -0.3996918966402359, -0.9850791589245527, -0.04220349219180732, -0.22988490452723273, -0.18370931895265175]
trends positive len: 6 1.056825095650698
trends negative len: 9 -0.7045500637670981
trends positive list: [0.009930233456925725, 0.9483372951340528, 1.2834826743044578, 1.465205946565748, 1.4875489718437718, 1.1464454525992316]
trends negative list: [-0.9657152036835966, -1.0789198650922653, -1.7015455028399435, -0.7542012310515969, -0.3996918966402359, -0.9850791589245527, -0.04220349219180732, -0.22988490452723273, -0.18370931895265175]
zscore silos:
prices 0 to 1: 5 0.3387271345868971
prices 1 to 2: 4 1.13156977020328
prices 2 to 3: 0 nan
prices > 3: 0 nan
prices 0 to -1: 3 -0.4156379102954067
prices -1 to -2: 3 -1.6576670076204636
prices -2 to -3: 0 nan
prices < -3: 0 nan
trends 0 to 1: 1 0.584379796953797
trends 1 to 2: 4 1.3465306598267066
trends 2 to 3: 0 nan
trends > 3: 0 nan
trends 0 to -1: 7 -0.3399223000040687
trends -1 to -2: 3 -1.315505544516218
trends -2 to -3: 0 nan
trends < -3: 0 nan
The above charts and data are generated in the following code snippets.
This code snippet gets data into Pandas Dataframes from the Alpaca API.
############### INIT CEILLI CLASSES ####################
import pandas as pd

from classes.stock_list import StockList
from classes.config import Config
from classes.alpaca import Alpaca
from classes.utilities import Utilities
from classes.market_beat import MarketBeat
from classes.profit_loss import ProfitLoss
from classes.plots import Plots

util = Utilities(pd.DataFrame())
conf = Config(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
mb = MarketBeat(pd.DataFrame(), api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
alpa = Alpaca(api_key=api_key, api_secret=api_secret, api_base_url=api_base_url, algo_version=ALGO_VERSION)
stocks = StockList()
plots = Plots(pd.DataFrame())

############## SETTINGS ###################
# CONSTANTS, see settings.toml for conflicts; set here to override the settings.toml constants
ALGO_VERSION = conf.algo_version
BASE_CURRENCY = conf.base_currency

############# LOGGING #################
import logging
logging.basicConfig(
    filename="logs/charts_" + ALGO_VERSION + ".log",
    level=logging.INFO,
    format="%(asctime)s:%(levelname)s:%(message)s"
)

############################### CONFIGS ###################################################
# API Credentials alpaca4 edge
API_KEY = conf.api_key
API_SECRET = conf.api_secret
API_BASE_URL = conf.api_base_url
SECRET_KEY = API_SECRET

# CONSTANTS
TIMEZONE_OFFSET = -4.0  # set in config file; this is deprecated, I think
if DEBUG:
    PROCESS_ROWS = 0  # set to a low number for debugging, otherwise 1000
else:
    PROCESS_ROWS = 1000
########################### DRIVER #######################################
from datetime import date, datetime, timezone, timedelta

N_DAYS_AGO = 500
YESTERDAY = 1
today = date.today()
n_days_ago = today - timedelta(days=N_DAYS_AGO)
one_day_ago = today - timedelta(days=YESTERDAY)

timezone_offset = -4  # US Eastern (EDT) is -4, that is 4 hours behind GMT
tzinfo = timezone(timedelta(hours=timezone_offset))
now = datetime.now(tzinfo)
back_time = now - timedelta(minutes=15)
date = back_time.strftime("%Y-%m-%d %H:%M:%S")
start_time = now - timedelta(minutes=45)
start = start_time.strftime("%Y-%m-%d %H:%M:%S")
end = date
beg_date = str(n_days_ago) + ' 00:00:00'
end_date = str(one_day_ago) + ' 23:59:00'
if MODE == 'SCREENER' or MODE == 'HISTORICAL':
    try:
        #STOCK_LIST = stocks.TECH_AL
        STOCK_LIST = ['NVDA', 'MSFT']
        STOCK_SET = set(STOCK_LIST)  # remove duplicates from list
        STOCK_LIST = list(STOCK_SET)
        STOCK_LIST = sorted(STOCK_LIST)
        symbol_list = STOCK_LIST
        index_symbol = stocks.stock_index(ALGO_VERSION)
        cnt = 0
        for symbol in symbol_list:
            print(ALGO_VERSION)
            print(symbol)
            print(index_symbol)
            hundred_dates = alpa.get_calendar(str(n_days_ago), str(one_day_ago))
            # get prices for symbol in trading list
            symbol_price_data = alpa.stockbars_by_symbol_by_day(symbol, beg_date, end_date)
            symbol_price_data = symbol_price_data.reset_index(level=("symbol", "timestamp"))
            prices_data = symbol_price_data
            # get prices for the trend index for the symbol above
            index_price_data = alpa.stockbars_by_symbol_by_day(index_symbol, beg_date, end_date)
            index_price_data = index_price_data.reset_index(level=("symbol", "timestamp"))  # Alpaca dataframes are indexed by (symbol, timestamp)
            column_names = index_price_data.columns
            trends_data = index_price_data
            symbol_prices = symbol_price_data
            column_names = prices_data.columns
            print(column_names)
            prices_data = symbol_prices[['timestamp', 'symbol', 'open', 'close', 'vwap']].copy()
            trends_data = trends_data[['timestamp', 'symbol', 'open', 'close', 'vwap']].copy()
            prices_data.rename(columns={'timestamp': 'date'}, inplace=True)
            trends_data.rename(columns={'timestamp': 'date'}, inplace=True)
            #prices_data = prices_data.reset_index()
            #trends_data = trends_data.reset_index()
            date_stamp = prices_data.iloc[0]['date']
            print()
            print("Statistical Analysis: ")
            prices_arr = np.array(prices_data['close'])
            from scipy.stats import skew, kurtosis
            # Calculate the skewness
            print("Symbol Prices skew: ")
            print(skew(prices_data['close'], axis=0, bias=True))
            print("Index skew: ")
            print(skew(trends_data['close'], axis=0, bias=True))
            # Calculate the kurtosis
            print("Symbol Prices kurtosis: ")
            print(kurtosis(prices_data['close'], axis=0, bias=True))
            print("Index kurtosis: ")
            print(kurtosis(trends_data['close'], axis=0, bias=True))
            print()
            print("Covariance between the two: ")
            cov_matrix = np.stack((prices_data['close'], trends_data['close']), axis=0)
            print(np.cov(cov_matrix))
            print()
            print("Correlation between the two: ")
            correlations = np.correlate(prices_data['close'], trends_data['close'])
            print(correlations)
            print()
            print("Mean of the Symbol: ")
            data_mean = np.mean(prices_data['close'])
            data_max = max(prices_data['close'])
            data_min = min(prices_data['close'])
            print("mean is: " + str(data_mean))
            print("max/min is: " + str(data_max), str(data_min))
            print()
            print("Variance of the Symbol: ")
            m = sum(prices_data['close']) / len(prices_data['close'])
            std_dev = np.std(prices_data['close'])
            print("std dev: " + str(std_dev))
            from scipy.stats import zscore
            zscore_list = zscore(prices_data['close'])
            print("symbol z-scores list: ")
            print(zscore_list)
            trends_zscore_list = zscore(trends_data['close'])
            print("trends z-scores list: ")
            print(trends_zscore_list)
            import statistics
            # Calculate the variance from a sample of data
            data_variance = statistics.variance(prices_data['close'])
            print("variance result: " + str(data_variance))
            print()
            print("Market Beat Metrics: ")
            #prices = prices_data.iloc[:lookback_period]
            vars, vibe_check = mb.compare_rates(trends_data, prices_data)
            vars15, vibe_check15 = mb.compare_rates(trends_data[-15:], prices_data[-15:])
            vars30, vibe_check30 = mb.compare_rates(trends_data[-30:], prices_data[-30:])
            vars45, vibe_check45 = mb.compare_rates(trends_data[-45:], prices_data[-45:])
            vars60, vibe_check60 = mb.compare_rates(trends_data[-60:], prices_data[-60:])
            print(vars)
    except Exception as e:
        logging.error(e)  # excerpt: the full error handling is elided in the original
The z-scores are put into silos based on standard deviation in a Market Beat class function. A snippet from that follows, which appends each z-score value to a list based on its silo or bin; I include the logic here because being able to sort these into bins can be beneficial:
if current_idx_z >= 0:
    trends_positive.append(current_idx_z)
else:
    trends_negative.append(current_idx_z)

if current_price_z >= 0:
    prices_positive.append(current_price_z)
else:
    prices_negative.append(current_price_z)

if current_price_z > 0 and current_price_z < 1:
    prices_0to1.append(current_price_z)
elif current_price_z > 1 and current_price_z < 2:
    prices_1to2.append(current_price_z)
elif current_price_z > 2 and current_price_z < 3:
    prices_2to3.append(current_price_z)
elif current_price_z > 3 and current_price_z < 8:
    prices_3up.append(current_price_z)
elif current_price_z <= 0 and current_price_z > -1:
    prices_0toneg1.append(current_price_z)
elif current_price_z <= -1 and current_price_z > -2:
    prices_neg1toneg2.append(current_price_z)
elif current_price_z <= -2 and current_price_z > -3:
    prices_neg2toneg3.append(current_price_z)
elif current_price_z <= -3:
    prices_neg3.append(current_price_z)

if current_idx_z >= 0 and current_idx_z < 1:
    trends_0to1.append(current_idx_z)
elif current_idx_z >= 1 and current_idx_z < 2:
    trends_1to2.append(current_idx_z)
elif current_idx_z >= 2 and current_idx_z < 3:
    trends_2to3.append(current_idx_z)
elif current_idx_z >= 3:
    trends_3up.append(current_idx_z)
elif current_idx_z < 0 and current_idx_z > -1:
    trends_0toneg1.append(current_idx_z)
elif current_idx_z <= -1 and current_idx_z > -2:
    trends_neg1toneg2.append(current_idx_z)
elif current_idx_z <= -2 and current_idx_z > -3:
    trends_neg2toneg3.append(current_idx_z)
elif current_idx_z <= -3 and current_idx_z > -8:
    trends_neg3.append(current_idx_z)
The graphing part is handled in a Plots Class that is called by this code:
import os

path_15_index = 'plots/stats/zscores/scatter/' + str(today) + '_' + index_symbol + '_15.png'
print(path_15_index)
isFile = os.path.isfile(path_15_index)
if not isFile:
    symbol_zscores_plot = plots.zscores_scatter_by_day(today, index_symbol, trends_data['zscores'][-15:], '15')
else:
    print(index_symbol + ' zscores scatter plot file exists for this date')
Then, in the Plots class, I generate the plots:
def zscores_scatter_by_day(self, plot_date, symbol, data, periodicity='all'):
    plot_date = str(plot_date)
    zscores = data.reset_index(drop=True)
    #zscores = zscores.tolist()
    print(zscores)

    # PLOTTING
    import matplotlib.pyplot as plt
    zscores_set = set(zscores)  # remove duplicates from list
    zscores_list = list(zscores_set)
    zscores_list = sorted(zscores_list)
    print("zscores sorted and unique: ", zscores_list)
    import seaborn as sns
    sns.displot(zscores_list, color="maroon")
    plt.xlabel("zscore", labelpad=14)
    plt.ylabel("probability of occurrence", labelpad=14)
    plt.title("Percent Ratio Z-scores distribution " + plot_date, y=1.015, fontsize=10)
    #plt.show()
    plt.savefig('plots/stats/zscores/' + symbol + '_' + str(plot_date) + '_' + periodicity + '.png', bbox_inches='tight')
    plt.clf()

    import matplotlib.pyplot as plt2
    x_cnt = 0
    color = 'grey'
    # https://matplotlib.org/stable/gallery/color/named_colors.html
    for i in zscores:
        if i < 0 and i > -1:
            color = 'orange'
        elif i < -1 and i > -2:
            color = 'indianred'
        elif i < -2 and i > -3:
            color = 'firebrick'
        elif i < -3 and i > -4:
            color = 'maroon'
        elif i > 0 and i < 1:
            color = 'yellow'
        elif i > 1 and i < 2:
            color = 'green'
        elif i > 2 and i < 3:
            color = 'forestgreen'
        elif i > 3 and i < 4:
            color = 'darkgreen'
        elif i > 4 and i < 5:
            color = 'darkolivegreen'
        elif i > 5:
            color = 'black'
        plt2.scatter(i, zscores[x_cnt], c=color)
        x_cnt += 1
    # depict second scatter plot
    #plt.scatter(x, y, c='blue')
    print('plots/stats/zscores/scatter/' + str(plot_date) + '_' + symbol + '_' + periodicity + '.png')
    plt2.savefig('plots/stats/zscores/scatter/' + str(plot_date) + '_' + symbol + '_' + periodicity + '.png', bbox_inches='tight')
    plt2.clf()
    # depict illustration
    #plt.show()
This function outputs the plots into a directory for safekeeping and reference as needed. The first part of the function generates the z-score bar charts, and the second part generates the rainbow-spectrum charts of z-scores. You pass in a DataFrame of z-scores to be plotted, plus the other variables, which are easy to figure out for oneself. You’ll need to include these libraries in your own code for this function to work.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from plotly.offline import iplot, init_notebook_mode
import seaborn as sns
def zscores_scatter_by_day(self, plot_date, symbol, data, periodicity='all'):
...
...
...
I hope this can provide some insights to others on how to plot z-scores and work with pandas.
One of the most fundamental levels of working with machines is an understanding of statistics, which is the basis of most contemporary methods in computer science. Yet it can be an impenetrable subject, demanding a committed level of aspiration to pierce through the fuzzy layers of maths, which turns the typical person off. This is not an attempt to make the subject easier to understand; it is an abbreviated attempt to get to the meat of the matter at hand.
Statistics always begins with a discussion of probability and its two concepts: statistical probability and inductive probability. The first is based in quantitative understanding, the latter in qualitative understanding. The first is the one we deal with in ML and software engineering. Quantitative statistical probability deals with two main laws of probabilities: addition and multiplication.
A. The Law of Addition
“The law of addition of probabilities states that if A and B are mutually exclusive events, that is if they cannot both occur together, then the probability that either A or B will occur is equal to the sum of their separate probabilities: in symbols: P(A or B) = P(A) + P(B)
This follows from the fact, that if A and B are mutually exclusive, the number of times on which either A or B has occurred is the number of times on which A has occurred plus the number of times on which B has occurred; the same must therefore be true of the corresponding proportions and so, as the number of observations increases, of their limiting values or probabilities. This law can be extended to any number of events, provided they are all mutually exclusive.” (Bulmer, 1979, 12-3)
One way to understand this as a coder is to actually code these formulas in a theoretical example. The code in this work is based on Python; see my book PlayAI: Machine Learning in Video Game Design for an introduction to setting up a Python development environment if you are not familiar with Python. We will be using DataFrames to work with statistical data, which is what you will use in the real world to work with machine learning, although you could also use the R language.
Python example:
In this example we are working with the second rule of addition in stats:
# P(A or B) = P(A) + P(B) - P(A and B)
In a small survey, 32 people responded to the question “Is a hotdog a sandwich?”.
50% of the respondents were female and the rest were male.
11 people responded with ‘yes’ and 21 responded with ‘no’.
Of the female participants who took the survey, 5 responded with ‘yes’.
Below is a DataFrame representing all the responses to the survey:
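The layout below is my reconstruction consistent with the counts above (the column names are illustrative), along with the addition-rule calculation:

import pandas as pd

# 32 respondents: 16 female / 16 male; 11 'yes' overall, 5 of them from women
df = pd.DataFrame({
    'gender':   ['female'] * 16 + ['male'] * 16,
    'response': (['yes'] * 5 + ['no'] * 11) + (['yes'] * 6 + ['no'] * 10),
})

p_female = (df['gender'] == 'female').mean()   # P(A) = 16/32
p_yes = (df['response'] == 'yes').mean()       # P(B) = 11/32
p_both = ((df['gender'] == 'female') & (df['response'] == 'yes')).mean()  # P(A and B) = 5/32

# P(A or B) = P(A) + P(B) - P(A and B)
print(p_female + p_yes - p_both)   # 0.6875, i.e. 22 of the 32 respondents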
B. The Law of Multiplication
“The law of multiplication of probabilities states that if A and B are two events, then the probability that both A and B will occur is equal to the probability that A will occur multiplied by the conditional probability that B will occur given that A has occurred, or in symbols: P(A and B) = P(A) x P(B | A).” (Bulmer, 1979)
Caroline and Victor are feeling adventurous, and want to travel twice this year. However, they are feeling indecisive, and put all of the places they are considering on slips of paper in a hat. Caroline randomly selects 2 slips of paper with destinations from the hat without replacement. What is the probability that both destinations are in Europe?
Hand Calculations
The question asks us to find the probability of both events occurring, so we know we need to use the multiplication rule. However, since Caroline is selecting from the hat without replacement, we know the events are dependent. Therefore, we need to use the formula P(A and B) = P(A) * P(B|A) where:
P(A) is the probability Caroline chooses a destination in Europe on her first selection. This is the total number of destinations in Europe (7) divided by the total number of destinations (21): 7/21, or approximately 33.33% (0.3333).
P(B) is the probability Caroline chooses a destination in Europe on her second selection.
P(B|A) is the probability Caroline chooses a destination in Europe on her second selection given she chose Europe on her first selection. This is the number of destinations in Europe divided by the number of destinations remaining after choosing 1 Europe slip: 6/20, or 30% (0.3).
Therefore:
P(A and B) = P(A) * P(B|A) = 7/21 * 6/20 = 0.1 or 10%
The probability that Caroline chooses 2 slips of paper with destinations in Europe from a hat without replacement is 10% (0.1).
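This can be verified in a few lines of Python using exact fractions:

from fractions import Fraction

# Dependent events, selection without replacement:
# 7 of 21 destinations are in Europe on the first draw,
# 6 of the remaining 20 on the second draw.
p_a = Fraction(7, 21)
p_b_given_a = Fraction(6, 20)
p_a_and_b = p_a * p_b_given_a
print(p_a_and_b, float(p_a_and_b))   # 1/10 0.1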
In machine learning we are solving for X given features Y. X is referred to as a random variable: a numerical variable which takes different values with different probabilities.
A simple example of this process is flipping coins:
X = number of heads. [X is a random variable or function]
Here, the sample space S = {HH, HT, TH, TT}
where the sample space is merely the different combinations of features to give us X, a numerical variable which takes different values with different probabilities.
Two basic components to understanding sample space and finding X are frequency distributions and probability distributions.
Frequency distribution: the representation of different values across all results.
Cumulative probability function F(x): the probability that X is less than or equal to some particular value x.
The cumulative probability function can be calculated by summing the probabilities of all values less than or equal to x: F(x) = P(X ≤ x) = sum of P(X = xi) over all xi ≤ x.
For a discrete random variable, F(x) is a step function. A continuous random variable is based on measurements rather than fixed values; the measurement occurs in a range and its graph looks continuous.
For example, suppose a die is thrown (X = outcome of the die). Here the sample space S = {1, 2, 3, 4, 5, 6}, and each outcome has probability 1/6, so the cumulative probability function climbs in steps of 1/6:
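A short sketch of the resulting step function, using exact fractions:

from fractions import Fraction

# PMF and cumulative probability function F(x) for a fair die
outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}
running = Fraction(0)
for x in outcomes:
    running += pmf[x]
    print(x, running)   # prints 1 1/6, 2 1/3, 3 1/2, 4 2/3, 5 5/6, 6 1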
Probability density function, f(x) [not F(x)]: the area under the density function between any two points x1 and x2, that is to say the integral of the function between them, represents the probability that the random variable will lie between these two values: P(x1 ≤ X ≤ x2) = ∫ from x1 to x2 of f(x) dx.
If dx is a very small increment in x, so small that the density function is practically constant between x and x + dx, then the probability that X will lie in this small interval is very nearly f(x)dx, which is the area of a rectangle with height f(x) and width dx. f(x) may therefore be thought of as representing the probability density at x.
A continuous probability distribution can also be represented by its cumulative probability function F(x) which, as in the discrete case, specifies the probability that X is less than or equal to x, and which is the limiting form of the cumulative frequency diagram showing the proportion of observations up to a given value.
Frequency distribution and probability distribution are two fundamental concepts in statistics that describe how data points are spread across different values or ranges.
Frequency Distribution
A frequency distribution is a summary of how often each value or range of values occurs in a dataset. It is typically represented in a table or a graph, showing the frequency (count) of each unique value or interval.
Example of Frequency Distribution
Consider a dataset representing the number of pets owned by 20 households: a frequency table would show how many households have 0, 1, 2, 3, or 4 pets.
Probability Distribution
A probability distribution describes the likelihood of each possible value occurring in a random variable. It assigns a probability to each value, where the sum of all probabilities equals 1. Probability distributions can be discrete or continuous.
Example of Probability Distribution
Using the same dataset of pet ownership, we can convert the frequency distribution into a probability distribution:
Calculate the total number of households: N = 20.
Calculate the probability for each number of pets:
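A sketch of that conversion; the frequency counts below are hypothetical stand-ins that sum to N = 20:

import pandas as pd

# Hypothetical frequency distribution: pets owned -> number of households
freq = pd.Series({0: 4, 1: 7, 2: 5, 3: 3, 4: 1}, name='households')

# Divide each frequency by N to obtain the probability distribution
prob = freq / freq.sum()
print(prob)          # probability of each pet count
print(prob.sum())    # 1.0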