AI × ML Framework — BrainX Algo

§ II.

∑

The Foundation

Data

The unfair advantage in quantitative finance does not live in the model. It lives in the data — in having more of it, cleaner versions of it, faster access to it, and the right kind of it. Spending three months tuning XGBoost on a dataset full of survivorship bias and unaligned timestamps is a more popular activity than spending three months getting the data right, but the second project produces strategies that work and the first does not.

Sources

Source	What it provides	Notes
Binance / Coinbase / Bybit	OHLCV, trades, order book, funding rates, open interest	Free REST + websocket. Watch for outages and rate limits.
Kaiko / Tardis.dev	Tick-by-tick history, L2/L3 book snapshots, cross-venue	Paid. Essential for microstructure work.
Glassnode / CryptoQuant	On-chain — flows, NUPL, MVRV, miner positions	Paid for depth. Beware revision history.
Coinglass	Aggregate OI, liquidation maps, funding heatmaps	Useful as derivative-side cross-check.
FRED / Yahoo	DXY, SPX, VIX, US 10Y, gold — macro context	Free. Lower frequency.
Twitter / X · Reddit · Telegram	Sentiment, narrative shifts, influencer flow	API access expensive post-2023. See §VII.

Frequencies & Horizons

The choice of bar interval defines the strategy more than any single hyperparameter. A 1-minute model is a different beast — different features, different costs, different infrastructure — from a 4-hour model, even if the algorithm code is identical.

Bar	Typical use	Edge source	Cost sensitivity
Tick / 100ms	Market-making, latency arb	Microstructure, queue position	Extreme — colocate or don't bother
1m	Intraday momentum, mean reversion	Order flow, volatility clustering	High — fees dominate weak signals
5m–15m	Intraday swing	Mixed micro + macro features	Moderate
1h	Tactical positioning	Technicals, sentiment, funding	Low
4h–1d	Trend, regime allocation	On-chain, macro, narrative	Negligible

Information Bars (López de Prado)

Time bars are a calendar artefact, not a market property. Volume, dollar, and imbalance bars sample by activity rather than by clock and produce returns that are closer to i.i.d., which matters because most ML assumes it. A dollar bar that closes every $50M traded breathes with the market: bars arrive slowly on quiet Sundays and quickly during liquidation events, whereas time bars keep the same pace no matter how much is trading.
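A minimal sketch of dollar-bar construction, assuming trades arrive as (price, size) pairs in time order. The function name, the bar fields, and the tiny threshold are illustrative choices rather than anything prescribed by López de Prado's text.

```python
from typing import Iterable, List, Tuple


def dollar_bars(trades: Iterable[Tuple[float, float]], threshold: float) -> List[dict]:
    """Group (price, size) trades into bars that close once traded notional >= threshold."""
    bars, bucket, notional = [], [], 0.0
    for price, size in trades:
        bucket.append((price, size))
        notional += price * size
        if notional >= threshold:
            prices = [p for p, _ in bucket]
            bars.append({
                "open": prices[0], "high": max(prices), "low": min(prices),
                "close": prices[-1], "volume": sum(s for _, s in bucket),
                "dollars": notional, "n_trades": len(bucket),
            })
            bucket, notional = [], 0.0
    return bars  # a trailing partial bucket is left unclosed and dropped


# Toy tape: the threshold is hit after the 2nd and again after the 4th trade.
trades = [(100.0, 1.0), (101.0, 2.0), (99.0, 1.0), (102.0, 3.0), (103.0, 0.5)]
bars = dollar_bars(trades, threshold=300.0)
```

With a realistic threshold the number of bars per day rises and falls with traded notional, which is the property the paragraph above describes.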

Quality Problems You Will Have

Timestamps misaligned: Exchange clocks drift. UTC versus exchange-local time confusions kill more strategies than bad models.
Survivorship in token universes: If you're modelling alts as well as BTC, your basket of "top 50 coins" must be the top 50 as of that date, not today's top 50.
Lookahead via revised data: On-chain metrics get revised. The MVRV you query today for last Wednesday is not the MVRV that was available last Wednesday. Use point-in-time data or accept the bias.
Halt and outage gaps: Binance went down for an hour during the LUNA collapse. Your model needs to know that: either fill the gap, or flag it and refuse to trade.
Wash trading: Smaller venues print volume that did not happen. Confirm OHLCV against multiple exchanges or stick to top-tier venues.
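The point-in-time rule for revised data can be illustrated with an as-of lookup. This is a sketch under the assumption that you store each figure with the time it was published, not just the date it describes. The data below are hypothetical.

```python
from bisect import bisect_right


def as_of(published_at, values, query_time):
    """Return the latest value published at or before query_time, else None.

    published_at must be sorted ascending. It records when a figure became
    available, not the period the figure describes.
    """
    i = bisect_right(published_at, query_time)
    return values[i - 1] if i > 0 else None


# Hypothetical vintages of one metric, keyed by publication time (hours since start).
published_at = [10, 34, 58]
values = [2.10, 2.25, 2.18]

seen_at_bar_40 = as_of(published_at, values, 40)  # only the first two vintages exist yet
seen_at_bar_5 = as_of(published_at, values, 5)    # nothing has been published yet
```

In pandas, `merge_asof` with its default backward direction performs the same join across a whole frame.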

Storage

For anything beyond a notebook prototype: store raw data once, store engineered features in a parquet-based warehouse, and version both. pandas + parquet works to about a year of 1m bars. Beyond that, look at DuckDB, arctic, or ClickHouse. Pickled DataFrames are a footgun — they bind to library versions.

§ III.

Engineering Signal

Features

Feature engineering is where domain expertise enters the model. A textbook ML practitioner trying their hand at crypto will typically build 500 features, none of them informed by how markets actually move, then complain that random forests don't work. The features below are the ones that consistently survive the validation gauntlet in practitioner research — not because they are magic, but because they encode something real about market microstructure or human behaviour.

Returns & Volatility

Family	Examples	Why
Log returns	r₁, r₅, r₁₅, r₆₀, r₂₄₀ (multi-horizon)	Stationary, additive across time
Realised vol	Std of r over rolling windows; Parkinson, Garman-Klass, Yang-Zhang	Clusters; predicts itself; gates position size
Vol-of-vol	Std of realised vol	Regime indicator
Skew & kurtosis	Rolling moments of returns	Tail behaviour; pre-crash signature
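Two of the range-based estimators named in the table can be sketched as follows. These are the standard Parkinson and Garman-Klass formulas applied per bar without annualisation. The function names and list-based inputs are illustrative.

```python
import math


def parkinson_vol(highs, lows):
    """Per-bar volatility from the high-low range (Parkinson estimator)."""
    n = len(highs)
    s = sum(math.log(h / l) ** 2 for h, l in zip(highs, lows))
    return math.sqrt(s / (4.0 * math.log(2.0) * n))


def garman_klass_vol(opens, highs, lows, closes):
    """Per-bar volatility using range plus open-to-close move (Garman-Klass)."""
    n = len(opens)
    k = 2.0 * math.log(2.0) - 1.0
    s = sum(
        0.5 * math.log(h / l) ** 2 - k * math.log(c / o) ** 2
        for o, h, l, c in zip(opens, highs, lows, closes)
    )
    return math.sqrt(s / n)


# Toy OHLC bars (hypothetical prices).
opens = [100.0, 101.0, 100.5]
highs = [102.0, 101.5, 103.0]
lows = [99.5, 100.0, 100.0]
closes = [101.0, 100.5, 102.5]

pk = parkinson_vol(highs, lows)
gk = garman_klass_vol(opens, highs, lows, closes)
```

Both use intrabar information that close-to-close standard deviation throws away, which is why they need fewer bars for the same estimation error.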

Momentum & Mean Reversion

The classical technicals (RSI, MACD, Bollinger Z-score, Donchian width) are not magic, but they are compressions of price action that often correlate with whatever real edge exists. Use them as inputs to a model, not as standalone signals. The model decides when each one is informative.

Microstructure & Order Flow

Cumulative Volume Delta (CVD): Signed volume, i.e. buys minus sells classified by trade side. Divergence from price often precedes reversals.
Trade size distribution: Whale prints versus retail. The 95th percentile of trade notional is a different feature from the median.
Order book imbalance: (Bid depth − Ask depth) / total at top N levels. Short-horizon directional signal.
Book slope: How quickly liquidity thins moving away from mid. Predicts slippage and breakout fragility.
Funding rate: Perpetual swap funding. Persistently positive funding = crowded longs = fade candidate at extremes.
Open interest delta: Rising OI with rising price = new longs. Rising OI with falling price = new shorts. Different conviction profiles.
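Two of the features above reduce to a few lines. This sketch assumes book sides arrive as (price, size) lists with the best price first, and that each trade carries an aggressor side. Both are assumptions about your feed, not a particular exchange's schema.

```python
def book_imbalance(bids, asks, levels=5):
    """(bid depth - ask depth) / total depth over the top `levels` of each side."""
    bid_depth = sum(size for _, size in bids[:levels])
    ask_depth = sum(size for _, size in asks[:levels])
    total = bid_depth + ask_depth
    return (bid_depth - ask_depth) / total if total > 0 else 0.0


def cumulative_volume_delta(trades):
    """Running sum of signed volume; trades are (size, aggressor_side) pairs."""
    cvd, running = [], 0.0
    for size, side in trades:
        running += size if side == "buy" else -size
        cvd.append(running)
    return cvd


bids = [(100.0, 3.0), (99.5, 1.0)]
asks = [(100.5, 1.0), (101.0, 1.0)]
imbalance = book_imbalance(bids, asks, levels=2)

cvd = cumulative_volume_delta([(2.0, "buy"), (1.0, "sell"), (0.5, "sell")])
```

The imbalance is bounded in [-1, 1], which makes it convenient as a model input without further scaling.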

Cross-Asset & Regime

BTC does not trade in isolation. Useful inputs from outside the BTC tape:

Feature	What it captures
BTC/ETH spread	Crypto-internal risk appetite
BTC/SPX 30d corr	Risk-asset vs uncorrelated-asset regime
DXY change	Dollar strength — historically inverse BTC
VIX change	Equity vol regime; BTC sometimes follows, sometimes leads
US 10Y yield delta	Real-rate sensitivity
Stablecoin supply growth	Liquidity entering crypto

Fractional Differentiation

The standard fix for non-stationary price series is to take first differences (returns). The cost is that all memory is destroyed: long-horizon levels become invisible. Fractional differentiation, popularised by López de Prado, differences the series by a non-integer order d between 0 and 1, choosing the smallest d that still passes a stationarity test (commonly ADF) so that as much memory as possible is preserved. For BTC, fractional orders around 0.3–0.5 typically work.
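A sketch of the fixed-width-window variant, assuming a plain list of log prices. The weight recursion is the binomial expansion of the difference operator. The truncation threshold and function names are illustrative, and the stationarity test used to pick d is left out.

```python
def fracdiff_weights(d, threshold=1e-4, max_len=1000):
    """Weights w_k of (1 - B)^d, truncated once |w_k| falls below threshold."""
    w = [1.0]
    k = 1
    while k < max_len:
        nxt = -w[-1] * (d - k + 1) / k
        if abs(nxt) < threshold:
            break
        w.append(nxt)
        k += 1
    return w


def fracdiff(series, d, threshold=1e-4):
    """Apply the truncated weights as a rolling dot product (newest value first)."""
    w = fracdiff_weights(d, threshold)
    width = len(w)
    return [
        sum(w[k] * series[t - k] for k in range(width))
        for t in range(width - 1, len(series))
    ]


# d = 1 collapses to ordinary first differences, a useful sanity check.
first_diff = fracdiff([1.0, 3.0, 6.0], d=1.0)

# A coarse threshold keeps this toy window short (4 weights for d = 0.4).
log_prices = [10.00, 10.02, 10.01, 10.05, 10.04, 10.08]
fd = fracdiff(log_prices, d=0.4, threshold=0.05)
```

In practice d is chosen by sweeping values upward and keeping the smallest one whose output passes the stationarity test.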

Labelling — The Triple Barrier Method

Most beginners label as y = sign(r_{t+k}), the sign of the return k bars ahead, which produces noisy targets dominated by drift. The triple-barrier method instead labels each observation by which of three events fires first: a profit-taking horizontal barrier, a stop-loss horizontal barrier, or a vertical time barrier. Labels become trading-decision-aligned, and the path matters, not just the endpoint.

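# Triple-barrier sketch (Lopez de Prado, AFML ch.3)
# price:  pandas Series of prices indexed by timestamp
# events: DataFrame indexed by entry time; column 't1' is the vertical barrier
# vol:    Series of per-event volatility that scales both horizontal barriers
def apply_triple_barrier(price, events, pt_mult, sl_mult, vol):
    out = events[['t1']].copy()
    # A missing vertical barrier means "hold until the end of the data"
    for loc, t1 in events['t1'].fillna(price.index[-1]).items():
        path = price.loc[loc:t1] / price.loc[loc] - 1   # return path from entry
        # Earliest timestamp at which each horizontal barrier is touched (NaT if never)
        out.loc[loc, 'sl'] = path[path < -sl_mult * vol[loc]].index.min()
        out.loc[loc, 'pt'] = path[path >  pt_mult * vol[loc]].index.min()
    return out  # first of {pt, sl, t1} per row defines the label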

Sample Weighting

Adjacent labels share overlapping information (the future-looking window of bar t contains bars t+1 … t+k). Training on them naively over-weights clustered observations. The fix is to weight each sample by the inverse of how many other samples its information overlaps with — again, López de Prado has the canonical recipe.
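The overlap weighting can be sketched as average uniqueness: count how many labels are live at each bar, then give each label the mean of 1/count over its own window. The representation of a label as an inclusive (start, end) bar range is an assumption made for this example.

```python
def average_uniqueness(spans, n_bars):
    """Weight per label = mean over its window of 1 / (labels live at that bar)."""
    concurrency = [0] * n_bars
    for start, end in spans:
        for t in range(start, end + 1):
            concurrency[t] += 1
    weights = []
    for start, end in spans:
        inv = [1.0 / concurrency[t] for t in range(start, end + 1)]
        weights.append(sum(inv) / len(inv))
    return weights


# Labels 0 and 1 overlap on bars 1-2; label 2 stands alone.
spans = [(0, 2), (1, 3), (5, 6)]
weights = average_uniqueness(spans, n_bars=7)
```

The resulting values can be passed as sample weights to most learners, so an isolated label counts fully while heavily overlapped ones are discounted.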

§ IV.

⊕

Trees, Linear, Kernels

Classical ML

Before reaching for transformers, exhaust the classical toolkit. Tree ensembles and well-regularised linear models are the workhorses of practitioner quant finance for reasons that have nothing to do with fashion: they are fast to train, robust to noisy features, interpretable, and difficult to overfit catastrophically. A gradient-boosted decision tree on 80 well-engineered features will beat a poorly-tuned LSTM in a walk-forward eight times out of ten.

The Roster

Model	Best at	Watch out for
Logistic regression (L1/L2)	Baseline classification, feature selection via L1	Linear decision boundary; needs interactions encoded
Ridge / Lasso / ElasticNet	Return regression baseline	Same — linear, but interpretable
Random Forest	Honest baseline, low tuning	Conservative; underfits sharp signals
Extra Trees	Faster RF variant, less overfit-prone	Slightly noisier predictions
XGBoost / LightGBM / CatBoost	The default for tabular financial ML	Overfits if max_depth too high, samples weighted wrong
SVM / SVR (RBF)	Small datasets, smooth decision boundaries	Scales badly past ~50k samples; sensitive to feature scaling
Hidden Markov Model	Regime detection (2–4 hidden states)	Latent states often uninterpretable; assumes Markov property
Gaussian Mixture	Soft regime clustering	Number of components is a hyperparameter you cannot validate cleanly

The Honest Default

Start with LightGBM. Engineer 30–100 features. Predict either (a) a triple-barrier class label, or (b) the volatility-normalised forward return. Use purged time-series cross-validation. Inspect feature importance with SHAP values rather than the built-in split or gain importance, which is unreliable when features are correlated. Retrain weekly or monthly. If LightGBM cannot find a stable edge, neither can a transformer.

import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.02,
    'num_leaves': 31,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'min_data_in_leaf': 200,   # crucial for noisy targets
    'lambda_l2': 0.1,
}
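# Walk-forward, NEVER plain k-fold. Purge + embargo around test windows.
# TimeSeriesSplit(n_splits=5, gap=k) leaves a k-sample gap before each test
# window; purging overlapping labels still has to be done by hand.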

Failure mode — Boosting on weak labels

If your label is sign(r_{t+1}) on 1m bars, your signal-to-noise ratio is brutally bad and a deep tree will memorise the noise. Either move to a longer horizon where signal exists, or use triple-barrier labelling with barriers scaled to volatility. The number-one diagnostic: train AUC ≈ 0.99 and validation AUC ≈ 0.51 means you are fitting noise. Cap max_depth at 6 or below and raise min_data_in_leaf.

§ V.

⌬

Sequences, Attention, Hierarchy

Deep Learning

Deep learning earns its place when the input has structure that hand-crafted features cannot easily express: long sequences, irregular spacing, multi-modal inputs, or non-linear interactions across hundreds of features. For BTC at hourly and above, a well-engineered LightGBM usually matches a deep model. For high-frequency order-book prediction, sub-minute regime classification, and multi-asset cross-sectional work, deep models pull ahead.

The Architectures That Matter

Architecture	Strength	Use when
MLP	Universal approximator; baseline	Comparing against tree ensembles on tabular features
LSTM / GRU	Variable-length sequence memory	Sub-hour bars; needs lots of data
1D CNN	Local pattern detection across time	Order book snapshots, candlestick patterns
Temporal Convolutional Network	Long receptive field, parallel training	Replacing LSTM in most settings
Transformer (vanilla)	Long-range attention; scales	When you have millions of training sequences
Informer / Autoformer / PatchTST	Forecasting-specific transformers	Multi-horizon point or quantile forecasting
Temporal Fusion Transformer	Multi-horizon + interpretable attention	When you need to explain which inputs mattered
N-BEATS / N-HiTS	Pure MLP, decomposes trend & seasonality	Surprisingly strong on univariate forecasts
DeepAR	Probabilistic forecasts via autoregressive RNN	When you need distributions, not points

Inputs & Outputs

A deep model for BTC is rarely a single number in, single number out. The shape you will most often build:

# A multi-horizon, multi-input sequence model
input_features:  [batch, seq_len=240, n_features=48]   # 240 hours of 48 features
static_features: [batch, n_static=12]                  # asset metadata, regime flags

# Output options:
direction:   [batch, n_horizons=4]    # P(up) at +1h, +4h, +24h, +96h
quantiles:   [batch, n_horizons, 7]   # 7 quantile forecasts per horizon
return_mean: [batch, n_horizons]      # point forecast of returns
return_std:  [batch, n_horizons]      # predicted volatility

Training Discipline

Standardise features by training-window statistics only: Computing the mean/std on the full series is the most common lookahead bug.
Early stop on a held-out walk-forward window: Not on a random validation split, which lies for time series.
Use dropout, but on the input as well as hidden layers: Input dropout corrupts the feature mix the model sees, which approximates a regime-shift augmentation.
Mix-up across time: Linear interpolation of nearby training samples regularises sequence models more than people expect.
Loss matching the target: Use quantile loss for distributional forecasts; use focal loss when classes are imbalanced (which they are during ranging regimes).
Predict deltas, not levels: Always. Levels look impressive on a plot and are useless to trade.
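The first rule in the list is easy to get wrong, so here is a minimal sketch of it: fit the scaler on the training slice only, then reuse those statistics on everything that comes later. The helper names and the toy series are illustrative.

```python
import statistics


def fit_scaler(train_values):
    """Mean and std from the training window only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # constant feature: avoid dividing by zero
    return mean, std


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


series = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
split = 4  # first 4 points are the training window

mean, std = fit_scaler(series[:split])
train_z = transform(series[:split], mean, std)
test_z = transform(series[split:], mean, std)  # scaled with TRAIN statistics
```

The test values come out far from zero here, and that is the honest answer: the model really would see out-of-range inputs live. Scaling with full-series statistics would have hidden that.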

What a Modern Stack Looks Like

The current best practice for BTC deep-learning research, circa 2026:

Layer	Tool
Framework	PyTorch + PyTorch Lightning (or Keras 3)
Forecasting library	PyTorch Forecasting · Nixtla NeuralForecast · Darts
Hyperparameter search	Optuna with TPE sampler and a pruner (median or Hyperband)
Tracking	MLflow or Weights & Biases
Serving	TorchServe · ONNX Runtime · BentoML

Failure mode — The transformer that learned the bias

A high-capacity model trained on a year of bull-market data will achieve excellent in-sample loss by learning that price goes up. The walk-forward looks gorgeous until it doesn't. Mitigations: train across multiple market regimes, augment with synthetic bear/range periods, and use vol-normalised targets so the model cannot just memorise drift.

§ VI.

⊗

Decisions, Not Predictions

Reinforcement Learning

Supervised learning predicts the future. Reinforcement learning chooses actions. The distinction matters because in trading, the question is rarely "what will price do" and almost always "what should I do next, knowing what I currently hold and what it will cost me to change that." RL frames the problem natively as a sequential decision process and learns a policy end-to-end — including position sizing, holding periods, and tactical exits.

The MDP

State: Current features (price history, indicators, regime), current position, current unrealised P&L, time in trade, recent reward.
Action: Discrete: {-1, 0, +1} for short/flat/long. Or continuous: target leverage in [-L, +L]. Continuous spaces are richer but harder to learn.
Reward: The deeply contested choice. Naïve: P&L per step. Better: differential Sharpe, log-utility return, or risk-adjusted return with explicit cost penalty.
Transition: The market, which is non-stationary, partially observed, and not yours to define.

Algorithm Choices

Algorithm	Action space	Notes
DQN / Rainbow	Discrete	Sample-efficient. Sensitive to reward scale.
PPO	Both	The current workhorse. Robust, well-understood.
SAC	Continuous	Strong sample efficiency; entropy-regularised exploration.
A2C / A3C	Both	Simpler than PPO; often outperformed by it.
TD3	Continuous	Twin critics combat overestimation.
Decision Transformer	Sequence	Offline RL; conditions on desired return.

Reward Design Is Everything

The single largest determinant of whether an RL agent develops a sane trading policy is the reward function. A few patterns that consistently perform better than naïve P&L:

# Differential Sharpe Ratio (Moody & Saffell, 2001) - online Sharpe approximation
def differential_sharpe(r_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    delta_A = r_t - A_prev
    delta_B = r_t**2 - B_prev
    A = A_prev + eta * delta_A          # EMA of returns
    B = B_prev + eta * delta_B          # EMA of squared returns
    var_prev = B_prev - A_prev**2       # running variance before this step
    if var_prev <= eps:                 # warm-up (e.g. A = B = 0): DSR undefined
        return 0.0, A, B
    dsr = (B_prev * delta_A - 0.5 * A_prev * delta_B) / var_prev**1.5
    return dsr, A, B
# Reward = dsr - cost_penalty * |action_change|

Environments

Off-the-shelf options worth knowing: FinRL (the AI4Finance lab), gym-anytrading, TensorTrade. None will give you production-ready alpha out of the box — they exist to remove the boilerplate of building an OpenAI-Gym-compatible market environment so you can focus on the parts that matter (features, reward, action design).

Failure mode — The sim-to-real gap

An RL agent that trains in a frictionless simulator will discover policies that exploit the simulator: instant fills, no partial executions, perfect quote data. Move to live and the policy collapses. Cures: model transaction costs aggressively, simulate slippage as a function of order book depth, randomise execution latency in training, and validate on data the agent has never seen during reward shaping.
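One of the cures above, slippage as a function of order book depth, can be sketched by walking the visible book. This is a simplified model that ignores hidden liquidity, queue position, and the book refilling while you trade. The function names are illustrative.

```python
def market_buy_fill(asks, quantity):
    """Average fill price for a market buy that consumes ask levels in order.

    asks: list of (price, size) with the best (lowest) price first.
    Raises ValueError if the order is larger than visible depth.
    """
    remaining, cost = quantity, 0.0
    for price, size in asks:
        take = min(remaining, size)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            return cost / quantity
    raise ValueError("order exceeds visible depth")


def slippage_bps(asks, quantity):
    """Slippage versus best ask, in basis points."""
    best = asks[0][0]
    return (market_buy_fill(asks, quantity) / best - 1.0) * 1e4


asks = [(100.0, 1.0), (100.5, 2.0), (101.0, 5.0)]
small = slippage_bps(asks, 1.0)   # fits inside the top level
large = slippage_bps(asks, 2.0)   # eats into the second level
```

Charging the agent this cost inside the simulator, rather than a flat fee, makes large position changes visibly more expensive than small ones.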

§ VII.

The Crypto-Native Edge

Sentiment & On-Chain

Equities have decades of fundamental data: earnings, balance sheets, analyst coverage. Crypto does not. What it has instead is an unusually transparent ledger and an unusually loud social layer. Both are exploitable, and both are the feature families most often misused by retail-level practitioners, who treat raw on-chain charts as signals rather than as inputs to a model that has to decide when each one is informative.

On-Chain Signals That Matter

Metric	What it tells you	How it's typically used
MVRV ratio	Market value / realised value — aggregate cost basis position	Top/bottom regime indicator
NUPL	Net unrealised profit/loss across all holders	Same family as MVRV; sentiment composite
SOPR	Realised P&L on coins that moved that day	Above/below 1 = profit/loss-taking regime
Exchange net flow	Coins onto exchanges vs off	Inflow = supply pressure; outflow = HODL pressure
Miner balance / sells	Forced supply from miners	Capitulation indicator
Active addresses	Daily network usage	Adoption / engagement signal
Realised cap HODL waves	Coin age distribution	Long-term vs short-term holder regime
Stablecoin supply (USDT, USDC)	Available crypto-internal liquidity	Macro liquidity proxy

Derivatives Microstructure

Funding rate: Persistently positive funding indicates a crowded long book. Extremes (say, above 0.05% every 8 h, sustained) historically precede long squeezes.
Open interest: Rising OI confirms a trend; falling OI suggests a move is unwinding.
Perp-spot basis: The premium of perpetual swap vs spot. Wide premiums = bullish positioning; deep discounts = capitulation or arbitrage opportunity.
Liquidation maps: Cumulative leverage thresholds where forced selling/buying triggers. Price often hunts these zones.
Options skew & term structure: 25-delta put skew, vol surface curvature, calendar spreads. Hedging pressure leaves footprints.

Sentiment via LLMs (the 2024+ shift)

Before 2023, sentiment analysis in crypto meant FinBERT fine-tunes on tweet datasets and lexicon-based scoring. Both worked poorly: crypto Twitter speaks in irony, memes, and ticker abbreviations that drift weekly. The current state of the art is to use a frontier LLM (Claude, GPT-class, or a local Llama variant) as a feature extractor: feed it a window of posts and ask for structured outputs — bullish/bearish scores, conviction level, narrative tags, and influencer-weighted aggregates.

# Pseudo-pipeline. Run hourly. Cache aggressively.
for hour in recent_hours:
    posts   = fetch_posts(hour, sources=['twitter', 'reddit', 'telegram_pub'])
    posts   = filter_by_engagement(posts, min_views=500)
    chunks  = batch(posts, n=50)
    results = [llm.extract_sentiment(chunk) for chunk in chunks]
    features[hour] = aggregate(results, weights=influencer_score)

The catch: LLM inference is expensive at scale and rate-limited at the API. Most serious practitioners use a tiered system — a cheap classifier filters posts, an LLM scores the survivors, and aggregated scores are cached.

The Fear & Greed Index

The widely-cited Alternative.me index is a composite of volatility, momentum, social media, surveys, dominance, and Google Trends. It is useful as a feature, not as a signal. Buying when fear < 20 and selling when greed > 80 has worked as a contrarian heuristic over long horizons, but it suffers substantial drawdowns, and thresholds picked after seeing the full history embed severe lookahead if used naively in backtests.

Failure mode — On-chain as oracle

"Exchange outflows are massive — supply shock incoming." Posts like this are the on-chain analyst's version of cherry-picking. Outflows correlate with price moves, but the correlation is unstable across regimes, asymmetric in time, and contaminated by exchange wallet reclassifications. Treat every on-chain metric as one input feature to a model, never as a standalone signal.

§ VIII.

∇

Combining Imperfect Models

Ensembles & Meta-Learning

Every model in this framework is wrong in a different way. Tree ensembles miss long-range dependencies. LSTMs hallucinate trends. RL agents overfit to their reward function. Sentiment models lag price. The practical response is not to find the right model — it does not exist — but to combine several so that their idiosyncratic errors partially cancel and the parts each one gets right are weighted appropriately.

Ensemble Hierarchy

Bagging: Same algorithm, different data subsets. Random forest is bagged trees. Cheap, robust.
Boosting: Sequential, with each model fixing errors of the previous. XGBoost, LightGBM. Already an ensemble.
Stacking: Train K diverse models. Train a meta-learner on their out-of-fold predictions. The meta-learner decides who to trust when.
Blending: Stacking, but with a held-out blend set rather than out-of-fold predictions. Simpler, slightly less data-efficient.
Bayesian Model Averaging: Weight models by their posterior probability of being correct. Theoretically elegant; rarely beats stacking in practice.
Mixture of Experts: Learn a gating network that routes each input to the best specialist. Strong in regime-changing markets.
Online ensemble selection: Weight models by their recent walk-forward performance. Drops degraded models; promotes recovering ones.
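The last entry, online ensemble selection, can be sketched with exponential weighting of recent losses. This is one simple scheme among several. The temperature parameter, the model names, and the loss figures are all hypothetical.

```python
import math


def online_weights(recent_losses, temperature=1.0):
    """Softmax over negative mean recent loss: lower loss gets a higher weight."""
    means = {name: sum(v) / len(v) for name, v in recent_losses.items()}
    best = min(means.values())
    raw = {name: math.exp(-(m - best) / temperature) for name, m in means.items()}
    total = sum(raw.values())
    return {name: r / total for name, r in raw.items()}


def blend(predictions, weights):
    """Weighted average of per-model P(up)."""
    return sum(weights[name] * p for name, p in predictions.items())


# Hypothetical log-losses over the last two walk-forward windows.
recent_losses = {
    "lightgbm": [0.68, 0.69],
    "logistic": [0.69, 0.69],
    "tft": [0.70, 0.71],
}
weights = online_weights(recent_losses, temperature=0.01)
p_up = blend({"lightgbm": 0.60, "logistic": 0.55, "tft": 0.40}, weights)
```

A small temperature concentrates weight on the recent winner; a large one approaches equal weighting, which is often the harder baseline to beat.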

A Working Composition

The shape of a production-ready BTC ML stack often looks like:

Layer	Models	Output
Base — direction	LightGBM + Logistic + TFT	P(up) at multiple horizons
Base — volatility	GARCH-X + LightGBM on |r|	σ̂ at decision horizon
Base — regime	HMM + GMM on macro features	P(trend), P(range), P(crash)
Meta-learner	Ridge / shallow LightGBM	Final P(up), final σ̂
Sizer	Vol-target + fractional Kelly	Position size

The meta-learner gets the base predictions plus regime probabilities as inputs, so it learns to trust LightGBM during trends and the TFT during regime transitions. This is not magic — it is a small, well-validated linear model whose entire job is to apologise for each base learner's blind spots.

The best model is a portfolio of mediocre models, each wrong in an uncorrelated way. — Practitioner folklore

§ IX.

⟁

The Path of Honest Tests

Validation

This is the section that decides whether a research project will lose money in live. Validation is to ML trading what controlled trials are to medicine: tedious, expensive, often unwelcome, and the only thing standing between a plausible-sounding hypothesis and a costly mistake. Every method below exists because a previous generation of quants discovered, painfully, that a simpler method lied to them.

Why Random K-Fold Lies

Shuffling time-series data and splitting it into folds does two kinds of damage: it destroys the autocorrelation structure that you are trying to exploit, and it leaks information from the future into the past. A model evaluated by random k-fold on price data routinely shows an AUC of 0.85 when its real out-of-sample AUC is 0.52. The single most common cause of "my backtest looked amazing but live lost money" is random shuffling somewhere upstream.

The Validation Ladder

Method	What it controls for	Cost
Hold-out split	Most basic — train on first 70%, test on last 30%	One sample of validation error
Walk-forward (anchored)	Models trained on expanding window; tested on next slice	Multiple validation samples, time-respecting
Walk-forward (sliding)	Fixed-size training window — models forget old regimes	Tests for regime sensitivity
Purged k-fold	K-fold with a buffer ("purge") removing samples whose label horizon overlaps the test window	Closes the most common leakage
Purged + embargo	Adds an embargo period after each test fold to prevent train-side leakage of test information	López de Prado standard
Combinatorial Purged CV	All combinations of N test folds out of K; produces a distribution of paths	Most realistic for path-dependent strategies
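The purge and embargo rows of the ladder can be sketched as an index generator. It assumes that the label for sample i is built from bars i through i + horizon, so any training sample whose label window overlaps the test fold's label window is dropped. The function name and parameters are illustrative.

```python
def purged_kfold(n_samples, n_splits, horizon, embargo=0):
    """Yield (train_idx, test_idx) for contiguous, time-ordered test folds.

    Purge: drop training samples whose label window [i, i + horizon]
    overlaps the label windows of the test fold.
    Embargo: drop a further `embargo` samples after that.
    """
    fold = n_samples // n_splits
    for k in range(n_splits):
        a = k * fold
        b = n_samples if k == n_splits - 1 else a + fold
        lo = a - horizon                 # first excluded index before the fold
        hi = b - 1 + horizon + embargo   # last excluded index after the fold
        test = list(range(a, b))
        train = [i for i in range(n_samples) if i < lo or i > hi]
        yield train, test


splits = list(purged_kfold(n_samples=20, n_splits=4, horizon=2, embargo=1))
train_1, test_1 = splits[1]
```

For the second fold, samples 3-4 are purged before the test window and 10-12 after it, so neither side of the training set shares label information with the test set.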

Beyond Single Numbers

A single Sharpe ratio from a single backtest is a number with almost no statistical content. The real questions are:

How many strategies did you test before this one? If you tried 50 variants and reported the best, the Deflated Sharpe Ratio penalises that and often reveals it was noise.
What is the probability this strategy is overfit? The Probability of Backtest Overfitting (PBO) measures, across CPCV paths, how often the in-sample-best strategy ranks below the median out-of-sample.
Does the strategy survive bootstrap resampling? Block-bootstrap the trade returns. If the Sharpe distribution straddles zero, your edge is statistical noise.
Does the strategy survive a Monte Carlo permutation test? Shuffle the returns within blocks; recompute the strategy. If the original Sharpe is not in the top 5% of the permutation distribution, you have nothing.
Reality Check / SPA tests: White's Reality Check and Hansen's SPA correct for multiple-testing across competing strategies. Use when comparing model families.
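The bootstrap check in the list can be sketched with a moving-block bootstrap. The Sharpe here is per period and not annualised. The block length, the simulated returns, and the percentile interval are illustrative choices, not a recommendation.

```python
import random
import statistics


def sharpe(returns):
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def block_bootstrap_sharpes(returns, block_len, n_boot, seed=0):
    """Resample contiguous blocks (preserving short-range dependence) and
    recompute the Sharpe ratio on each resampled series."""
    rng = random.Random(seed)
    n = len(returns)
    n_blocks = -(-n // block_len)  # ceil(n / block_len)
    sharpes = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(n - block_len + 1)
            sample.extend(returns[start:start + block_len])
        sharpes.append(sharpe(sample[:n]))
    return sharpes


# Simulated per-trade returns stand in for a real backtest (hypothetical data).
gen = random.Random(1)
returns = [gen.gauss(0.001, 0.02) for _ in range(500)]

dist = sorted(block_bootstrap_sharpes(returns, block_len=20, n_boot=200))
lower, upper = dist[int(0.025 * len(dist))], dist[int(0.975 * len(dist))]
straddles_zero = lower <= 0.0 <= upper
```

If `straddles_zero` is true for a real strategy, the observed Sharpe is not distinguishable from zero at this sample size.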

Deflated Sharpe Ratio

The deflated Sharpe ratio (Bailey & López de Prado, 2014) adjusts the observed Sharpe for the number of trials you ran, the variance of the Sharpe ratios across those trials, the sample length, and the skewness and kurtosis of returns. It produces a probability that the observed Sharpe is genuinely above zero given everything you tried. For a researcher who tested 100 variants and reports a Sharpe of 1.8 on the best one, the DSR is often below 0.5, far short of the 0.95 conventionally required for significance, which means the result is more likely noise than edge.

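# DSR sketch (Bailey & Lopez de Prado, 2014)
# Inputs: observed per-period Sharpe SR_obs, n_trials (> 1), sample length T,
#         return skew gamma_3, kurtosis gamma_4 (3 for a normal),
#         sigma_SR = standard deviation of Sharpe ratios across the trials
from math import e, sqrt
from scipy.stats import norm

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def deflated_sharpe(SR_obs, n_trials, T, gamma_3, gamma_4, sigma_SR):
    # Expected maximum Sharpe across n_trials when the true Sharpe is zero
    SR_expected_max = 0.0 + sigma_SR * (
        (1 - GAMMA) * norm.ppf(1 - 1 / n_trials)
        + GAMMA * norm.ppf(1 - 1 / (n_trials * e))
    )
    denom = sqrt(1 - gamma_3 * SR_obs + (gamma_4 - 1) / 4 * SR_obs**2)
    return norm.cdf((SR_obs - SR_expected_max) * sqrt(T - 1) / denom)
# Interpretation: probability that true SR > 0 given the trials you ran.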

Walk-Forward in Practice

Train: on data from t₀ to t₀ + W
Embargo: discard the period (t₀+W, t₀+W+E) to prevent leakage
Test: on (t₀+W+E, t₀+W+E+H), the out-of-sample window
Advance: move t₀ forward by H (or by some smaller step for finer-grained metrics) and repeat
Aggregate: concatenate all test-window predictions; compute Sharpe, max drawdown, hit rate once on the union
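The steps above can be sketched as a window generator using the same W, E, and H. The ranges are half-open bar indices, and the function and parameter names are illustrative.

```python
def walk_forward_windows(n_bars, train_len, embargo, test_len, step=None):
    """Yield (train, test) half-open index ranges: train / embargo / test / advance.

    train_len = W, embargo = E, test_len = H. `step` defaults to H so that
    test windows tile the data without overlapping.
    """
    step = test_len if step is None else step
    t0 = 0
    while t0 + train_len + embargo + test_len <= n_bars:
        train = (t0, t0 + train_len)
        test_start = t0 + train_len + embargo
        test = (test_start, test_start + test_len)
        yield train, test
        t0 += step


windows = list(walk_forward_windows(n_bars=100, train_len=50, embargo=5, test_len=15))
```

Predictions from every test range are then concatenated before any metric is computed, as the final step describes.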

The non-obvious leakages

Even disciplined walk-forward leaks future information if you: (a) tune hyperparameters on the whole dataset before running walk-forward; (b) use feature standardisation statistics computed on all data; (c) select which features to include based on the full-sample correlation; (d) use a model architecture you only chose because the full-sample Sharpe looked good. The discipline is recursive — every choice you made about the strategy is a choice that needs to be re-validated.

§ X.

Surviving the Tails

Risk & Position Sizing

An average ML model with disciplined sizing makes money. A brilliant ML model with greedy sizing goes bust. The math is asymmetric: a 50% drawdown requires a 100% recovery, and a 90% drawdown requires a 900% recovery. Most quant blow-ups are not modelling failures — they are sizing failures.

Sizing Methods

Method	Idea	Trade-off
Fixed-fractional	Risk X% per trade based on stop distance	Simple; ignores model conviction
Volatility targeting	Scale exposure to hit constant ex-ante vol (say, 15% annual)	Stabilises return distribution; lags reality during regime shifts
ATR-based	Size such that 1 ATR move = N % of capital	Works for trend-following
Kelly criterion	f* = edge / variance — theoretically optimal log-growth	Brutally aggressive at the optimum; sensitive to edge estimation error
Fractional Kelly	Half- or quarter-Kelly	Industry standard. Massively reduces drawdown.
Risk parity / vol parity	Across multiple instruments — equalise risk contributions	Relevant when you trade BTC + ETH + alts
CVaR-constrained	Maximise expected return subject to bounded conditional tail loss	Honest about tail risk; requires distributional model

The Kelly Math

# Continuous Kelly for a strategy with edge mu and variance sigma^2:
f_star = mu / sigma**2

# With proportional cost c per unit traded:
f_star_costed = (mu - c) / sigma**2

# In practice: half-Kelly (f_star / 2) cuts expected log-growth by ~25%
# but cuts return variance by ~75%. Almost always worth it.
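The half-Kelly trade-off can be checked numerically under the continuous approximation g(f) = mu*f - 0.5*(sigma*f)^2. The sizing helper that follows is a sketch of one way to combine fractional Kelly with a volatility target. Taking the smaller of the two, and the leverage cap, are assumptions of this example rather than a standard rule. mu, sigma, and target_vol must share the same period.

```python
def kelly_fraction(mu, sigma, cost=0.0):
    return (mu - cost) / sigma ** 2


def growth_rate(f, mu, sigma):
    """Expected log-growth per period under the continuous approximation."""
    return mu * f - 0.5 * (sigma * f) ** 2


def position_size(mu, sigma, target_vol, kelly_scale=0.5, max_leverage=3.0):
    """Signed leverage: the smaller of fractional Kelly and the vol-target cap."""
    kelly = kelly_scale * kelly_fraction(mu, sigma)
    vol_cap = target_vol / sigma
    size = min(abs(kelly), vol_cap, max_leverage)
    return size if kelly >= 0 else -size


mu, sigma = 0.001, 0.03  # hypothetical daily edge and volatility
f_star = kelly_fraction(mu, sigma)

growth_ratio = growth_rate(f_star / 2, mu, sigma) / growth_rate(f_star, mu, sigma)
variance_ratio = ((f_star / 2) * sigma) ** 2 / (f_star * sigma) ** 2

size = position_size(mu, sigma, target_vol=0.01)
```

`growth_ratio` comes out at 0.75 and `variance_ratio` at 0.25 regardless of the inputs, which is where the 25% and 75% figures come from.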

Drawdown Control Mechanisms

Equity-curve filtering: If your strategy's recent rolling Sharpe drops below zero, reduce size or pause. The simplest "model" of model degradation.
Volatility regime cuts: Cap exposure when realised vol exceeds a threshold. In a crisis, correlations go to one, so diversification fails when you need it.
Time stops: Exit a position after N bars regardless of P&L. Prevents the slow-bleed scenario where a thesis is "still valid" while capital decays.
Maximum concurrent risk: If you run multiple sub-strategies, cap their summed gross exposure. Correlations spike in crashes.
Daily loss limits: Hard stop at X% intraday loss. Resume tomorrow. Removes the worst variance from your monthly distribution.

Risk management is the strategy. The model is decoration. — overheard at a Citadel desk

§ XI.

⚙

From Notebook to Live

Production

A strategy that lives in a notebook is not a strategy — it is a research artefact. Moving from research to live is where most projects discover that 80% of the work is the part the academic papers do not cover: the infrastructure that pulls features in real time, serves predictions with millisecond consistency, monitors for drift, and rolls back gracefully when something goes wrong.

The Production Stack

Component	Purpose	Common tools
Data ingest	Websocket → time-series store	CCXT, native exchange SDKs, websockets + asyncio
Feature store	Single source of truth, point-in-time correct	Feast, Tecton, or DuckDB + S3 if you're small
Model registry	Versioned models with metadata	MLflow, Weights & Biases Artifacts
Inference	Serve predictions with bounded latency	ONNX Runtime, TorchServe, BentoML, or a thin FastAPI
Order management	Translate signals into orders; manage state	Hummingbot, NautilusTrader, or homegrown
Execution	Smart order routing; slippage minimisation	TWAP/VWAP/POV slicers, iceberg orders
Monitoring	Drift detection, P&L attribution, alerting	Grafana + Prometheus, custom dashboards
Logging & audit	Every decision reproducible after the fact	Structured JSON logs, append-only event store

The Latency Budget

For an hourly-bar strategy, latency is mostly irrelevant — you can wait two seconds for a model to score. For a minute-bar strategy, you have perhaps 200ms total budget from bar close to order on the exchange. For sub-second strategies, you need colocation, kernel-bypass networking, and pre-compiled prediction pipelines. Most retail-scale BTC strategies operate at the 1m–1h timeframe where this is not a binding constraint.

Drift Detection

Models degrade. The question is whether you detect it before the loss is large enough to matter.

Feature drift: Test if today's feature distributions differ from training. Kolmogorov-Smirnov, PSI (Population Stability Index), Jensen-Shannon divergence.
Prediction drift: Distribution of model outputs over time. A model that suddenly predicts "up" 90% of the time when it used to be balanced is broken.
Performance drift: Rolling Sharpe, rolling hit rate. The lagging but ultimate measure.
Concept drift: The relationship between features and target has changed. Detect via rolling correlation of predicted vs realised, or by retraining on recent windows and comparing.
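Of the feature-drift tests named above, PSI is the simplest to sketch. This version uses equal-width bins over the training range and floors empty bins at a small epsilon. Both are common but not universal choices, and the data are synthetic.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


train = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
live = [0.5 + i / 200 for i in range(100)]   # shifted into the upper half

no_drift = psi(train, train)
drift = psi(train, live)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a material shift, though the right alarm level depends on the feature and should be calibrated on your own history.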

Deployment Discipline

Paper-trade first: Run live for at least one full market cycle (a month minimum) before risking capital. Compare paper P&L to backtest expectation.
Ramp capital slowly: Start at 5% of intended size. Double weekly if metrics hold. Full size only after 30+ trades in line with expectation.
Kill switches: Manual and automatic. An external dashboard that can flatten everything in one click. Daily loss limits that trigger automatic stand-down.
Audit every fill: Compare expected vs realised slippage. If realised consistently exceeds expected, your cost model is wrong and your strategy is smaller than you thought.
Retrain cadence: Defined in advance, not panicked. Weekly, monthly, or trigger-based on drift alarms. Pre-commit to the schedule.
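The automatic half of the kill switch can be sketched as a latching daily loss limit. This is a minimal sketch under stated assumptions: the class name, the 2% limit, and the equity figures are hypothetical, and a real system would also cancel open orders and flatten positions when the switch trips.

```python
from dataclasses import dataclass


@dataclass
class DailyLossKillSwitch:
    """Stand down for the rest of the day once the daily loss breaches a limit."""

    start_equity: float
    max_daily_loss_frac: float = 0.02  # hypothetical 2% daily limit
    tripped: bool = False

    def update(self, equity: float) -> bool:
        """Record current equity; return True if trading is still allowed."""
        loss_frac = (self.start_equity - equity) / self.start_equity
        if loss_frac >= self.max_daily_loss_frac:
            self.tripped = True  # latches: a later recovery does not re-enable trading
        return not self.tripped

    def reset_for_new_day(self, equity: float) -> None:
        """Re-arm the switch at the start of the next trading day."""
        self.start_equity = equity
        self.tripped = False


switch = DailyLossKillSwitch(start_equity=100_000.0)
allowed = [switch.update(e) for e in (99_500.0, 97_900.0, 99_900.0)]
print(allowed)  # [True, False, False]
```

The latch is the design point: once the limit is hit, the strategy stays flat until a deliberate reset, even if equity recovers intraday.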

§ XII.

⚖

Honest Skepticism

What Does Not Work

If you came here looking for confirmation that AI will predict BTC for you, this section will not provide it. It is, however, the section a serious practitioner has internalised before spending a year on the rest of the framework.

The base rate of failure is brutal

The most cited estimate from quant-fund-of-fund surveys is that fewer than one in twenty quant strategies that pass internal validation produce positive risk-adjusted returns over five years of live trading. For retail-built crypto ML strategies the rate is worse — closer to one in fifty. The reason is not that the techniques are bad. It is that the bar for "passes validation" is usually too low and the strategies degrade faster than they can be re-validated.

"My backtest has a 3.0 Sharpe" almost always means something else

A backtest Sharpe of 3.0 in cryptocurrency is overwhelmingly more likely to indicate (a) lookahead bias, (b) survivorship bias in the data, (c) unrealistic execution assumptions, (d) p-hacking from running 100+ variants, or (e) a calendar bug than a real edge. Real, surviving systematic crypto strategies run by professional teams typically realise Sharpe 0.7–1.5 net of costs. A 3.0 should make you skeptical of yourself, not proud.
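Point (d) can be made quantitative. Bailey & López de Prado (2014), listed under Papers, give an extreme-value approximation for the expected best Sharpe among N independent trials that have no true edge. The sketch below assumes independent trials and roughly normal Sharpe estimates; `sr_std = 1.0` (the standard deviation of Sharpe estimates across trials) is an illustrative value, not a measurement.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Approximate expected maximum Sharpe across n_trials zero-edge strategies.

    E[max] ~ sr_std * ((1 - g) * Z^-1(1 - 1/N) + g * Z^-1(1 - 1/(N * e)))
    where g is the Euler-Mascheroni constant and Z^-1 is the inverse normal CDF.
    """
    if n_trials < 2:
        raise ValueError("approximation needs at least two trials")
    z = NormalDist().inv_cdf
    return sr_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )


for n in (10, 100, 1000):
    print(n, round(expected_max_sharpe(n, sr_std=1.0), 2))
```

With 100 variants the expected best result is roughly 2.5 standard deviations above zero, so under these assumptions pure noise can get most of the way to a "3.0 Sharpe" on its own.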

Costs eat everything below a certain horizon

At Binance taker fee tiers, a round-trip costs ~8 bps. Add slippage of 1–5 bps and you need an edge per trade larger than 9–13 bps just to break even. Most 1-minute "predictive" signals have edges of 1–3 bps per trade. The math does not work, and no amount of model complexity changes the math.
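The break-even arithmetic is simple enough to encode and rerun whenever fee tiers change. The sketch assumes a 4 bps taker fee per side (to match the ~8 bps round-trip above) and treats the slippage figure as a round-trip total; both are assumptions, so substitute your own venue's numbers.

```python
def breakeven_edge_bps(taker_fee_bps_per_side: float, slippage_bps_round_trip: float) -> float:
    """Gross edge per trade (bps) needed to cover fees on both legs plus slippage."""
    return 2 * taker_fee_bps_per_side + slippage_bps_round_trip


def net_edge_bps(gross_edge_bps: float, taker_fee_bps_per_side: float,
                 slippage_bps_round_trip: float) -> float:
    """Edge left per trade after costs; negative means the signal loses money."""
    return gross_edge_bps - breakeven_edge_bps(taker_fee_bps_per_side, slippage_bps_round_trip)


# Best case and worst case slippage from the text, assumed 4 bps fee per side.
print(breakeven_edge_bps(4.0, 1.0), breakeven_edge_bps(4.0, 5.0))  # 9.0 13.0
# A 3 bps signal under the most favourable cost assumption.
print(net_edge_bps(3.0, 4.0, 1.0))  # -6.0
```

Even the strongest signal in the 1–3 bps range loses about 6 bps per trade under the most favourable cost assumption.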

Crypto's regime change problem is uniquely severe

An equities trader can argue that the 1990s S&P and the 2020s S&P share some structural commonalities — fundamentals exist, mean reversion exists, similar players. BTC in 2017 (retail-driven mania), 2019 (sideways recovery), 2021 (DeFi + institutional onramp), 2023 (post-FTX deleveraging), and 2025 (ETF-flow driven) are five almost unrelated markets. A model trained on any one of them will fail when the next arrives.

The alpha decay clock

Any edge you find that is genuine will be found by others. Capacity is finite. Crowded factors decay — and you will not know which side of the crowd you are on until the decay is well underway. Build the monitoring (§XI) that tells you when your edge is degrading, and have the discipline to retire degraded strategies rather than "wait it out."

The famous failures

LTCM had two Nobel laureates and blew up in 1998. Numerai (a serious crowd-sourced ML platform with significant resources) has been live for years and posts modest absolute returns. Renaissance Medallion is the cited counter-example, and it (a) is closed, (b) trades thousands of instruments not one, and (c) operates at frequencies you cannot replicate. There is no public crypto ML fund with a sustained track record meaningfully better than buy-and-hold over a full cycle. The absence is informative.

What This Means

None of the above means do not try. It means that the realistic ambition for an individual practitioner is not "build an oracle that prints money" but rather: build a system whose risk-adjusted returns marginally exceed a vol-targeted buy-and-hold of BTC, after honest costs, over a multi-year horizon, with drawdowns you can stomach. That is a real, hard, valuable engineering problem. It is not the problem most people say they are working on, but it is the one worth working on.
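The benchmark named above, a vol-targeted buy-and-hold, can be sketched in a few lines. The assumptions here are illustrative: a 40% annualised volatility target, no leverage (weight capped at 1.0), daily returns annualised over 365 days because crypto trades every day, and a trailing window supplied by the caller. Only returns known at decision time should be passed in.

```python
import math
from statistics import stdev


def vol_target_weight(daily_returns: list[float], target_annual_vol: float = 0.40,
                      max_weight: float = 1.0, periods_per_year: int = 365) -> float:
    """Fraction of capital to hold in BTC so trailing realised vol matches the target."""
    if len(daily_returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    realised = stdev(daily_returns) * math.sqrt(periods_per_year)
    if realised == 0:
        return max_weight
    return min(max_weight, target_annual_vol / realised)


# Synthetic trailing windows: a quiet month and a violent one.
calm = [0.01, -0.01] * 15
wild = [0.05, -0.05] * 15
print(vol_target_weight(calm), round(vol_target_weight(wild), 3))
```

The weight stays at the cap in quiet markets and shrinks as realised volatility rises. A strategy has to beat this rule net of costs before it has demonstrated anything.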

§ XIII.

📖

The Canon

Books That Matter

The reading order below moves from foundations to specialisation. None of them will make you money on its own. Together they build the worldview a serious practitioner needs.

Foundations

The Elements of Statistical Learning

Hastie, Tibshirani & Friedman · 2009

Still the reference for classical ML. Free PDF from Stanford. Read at least chapters 2, 3, 7, 9, 10, 15.

Pattern Recognition and Machine Learning

Christopher Bishop · 2006

Bayesian flavour. Strong on the probabilistic grounding most practitioners skip.

Deep Learning

Goodfellow, Bengio & Courville · 2016

Free online. Dated on architectures, definitive on fundamentals.

Reinforcement Learning: An Introduction

Sutton & Barto · 2nd ed. 2018

The canonical RL text. Read it before you build a trading agent.

Finance ML

Advances in Financial Machine Learning

Marcos López de Prado · 2018

The most important book in this field. Triple-barrier labelling, fractional differentiation, purged CV, CPCV, deflated Sharpe — all originate or are formalised here.

Machine Learning for Asset Managers

Marcos López de Prado · 2020

Shorter, more accessible. Covers covariance shrinkage, clustering for portfolio construction, signal-from-noise.

Active Portfolio Management

Grinold & Kahn · 2nd ed. 2000

Pre-ML, but the fundamental law of active management framework is permanent. Information ratio, transfer coefficient, breadth.

Quantitative Trading

Ernest Chan · 2009

Practical, opinionated. Useful counterweight to academic flavour of López de Prado.

Trading Evolved

Andreas Clenow · 2019

Python-heavy. Walks through implementation realities most academic books skip.

Algorithmic Trading

Ernest Chan · 2013

Mean reversion, momentum, regime switching. Examples are equities but the patterns transfer.

Microstructure & Execution

Market Microstructure in Practice

Lehalle & Laruelle · 2nd ed. 2018

How order books actually behave; how execution actually costs you.

Algorithmic and High-Frequency Trading

Cartea, Jaimungal & Penalva · 2015

The stochastic-control flavour of execution and market making.

Probability & Risk

The Concepts and Practice of Mathematical Finance

Mark Joshi · 2nd ed. 2008

Best single overview of the quantitative-finance worldview.

The Black Swan

Nassim Taleb · 2007

Read once. Internalise the lesson about tail risk. Then read his Incerto more selectively.

When Genius Failed

Roger Lowenstein · 2000

The LTCM story. The most expensive lesson in over-leveraged "I have a model" hubris ever written.

§ XIV.

∞

Tools & References

Resources

Papers — Start Here

Bailey & LdP 2014

The Deflated Sharpe Ratio. The single most important paper on backtest validity in this entire bibliography.

Bailey et al. 2017

The Probability of Backtest Overfitting. Estimates how often the configuration that ranks best in-sample underperforms the median out-of-sample, given all the variants you tested.

López de Prado 2018

Combinatorial Purged Cross-Validation. The validation method for path-dependent strategies.

Moody & Saffell 2001

Learning to Trade via Direct Reinforcement. Differential Sharpe ratio reward; foundational RL-for-trading.

Lim et al. 2021

Temporal Fusion Transformers. Multi-horizon, interpretable attention. A current SOTA for time-series forecasting.

Nie et al. 2022

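PatchTST. Patches + transformer for long-horizon time series. Strong on standard long-horizon forecasting benchmarks (weather, traffic, electricity); the paper's benchmarks are not financial, so evaluate on your own data.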

Oreshkin et al. 2019

N-BEATS. Pure MLP forecasting model; surprisingly strong univariate baseline.

Liu et al. 2022

FinRL. Open-source deep RL framework for quantitative finance.

Python Libraries

scikit-learn

Classical ML default. Use the pipeline + TimeSeriesSplit primitives.

LightGBM · XGBoost · CatBoost

The three gradient-boosting libraries. LightGBM is fastest, XGBoost most battle-tested, CatBoost handles categoricals best.

PyTorch · Lightning

Deep learning. Lightning removes the training-loop boilerplate.

PyTorch Forecasting

TFT, DeepAR, N-BEATS pre-implemented. Sane data interface.

Nixtla (NeuralForecast, StatsForecast)

Time-series forecasting library family. Includes N-BEATS, N-HiTS, PatchTST.

Darts

Unified API across statistical, ML, and deep models for time series.

Stable-Baselines3

RL algorithms (PPO, SAC, TD3, DQN) with the standard Gymnasium (formerly Gym) interface.

FinRL

RL environments and benchmarks for finance. Crypto envs included.

backtesting.py · vectorbt · NautilusTrader

Three backtest engines at three scales: simple to industrial. Match to your strategy complexity.

CCXT

Unified exchange API. The standard for crypto data ingest and execution.

Optuna

Hyperparameter optimisation. Use the TPE sampler and median pruner.

MLflow · Weights & Biases

Experiment tracking and model registry. Pick one. Use it from day one.

SHAP

Feature attribution. Gini importance misleads on correlated features; SHAP is more consistent, though it still splits credit among correlated features.

river · skforecast

Online learning and recursive forecasting libraries — when you need to update models incrementally.

Data Vendors

Binance · Coinbase APIs

Free OHLCV, trades, order book, funding. The default starting point.

Kaiko · Tardis.dev

Paid tick-level history. Essential beyond research-prototype.

Glassnode · CryptoQuant · IntoTheBlock

On-chain metric providers. Paid tiers needed for depth and point-in-time.

Coinglass

Aggregated derivatives data — OI, funding, liquidation maps. Free tier useful.

Deribit

Crypto options data. Vol surface, term structure, skew.

FRED

Macro time-series. DXY, yields, monetary aggregates. Free.

Communities & Ongoing

Quantitative Finance SE

quant.stackexchange.com. Higher signal than most subreddits.

arXiv q-fin

arxiv.org/list/q-fin. New papers daily. Most are wrong; a few are excellent.

SSRN

Working papers in finance. Better signal-to-noise than arXiv for empirical work.

Numerai forum

Crowd-sourced ML for equities, but the discussions on feature engineering, neutralisation, and ensembling transfer to crypto.

Hudson & Thames blog

Independent team behind mlfinlab, a commercial implementation of López de Prado's methods. Strong technical posts.

QuantStart · QuantPedia

Aggregations of academic and practitioner strategies. Useful as a survey starting point.

Contents

The Premise

The Four Failure Modes

Data

Sources

Frequencies & Horizons

Information Bars (López de Prado)

Quality Problems You Will Have

Storage

Features

Returns & Volatility

Momentum & Mean Reversion

Microstructure & Order Flow

Cross-Asset & Regime

Fractional Differentiation

Labelling — The Triple Barrier Method

Sample Weighting

Classical ML

The Roster

The Honest Default

Failure mode — Boosting on weak labels

Deep Learning

The Architectures That Matter

Inputs & Outputs

Training Discipline

What a Modern Stack Looks Like

Failure mode — The transformer that learned the bias

Reinforcement Learning

The MDP

Algorithm Choices

Reward Design Is Everything

Environments

Failure mode — The sim-to-real gap

Sentiment & On-Chain

On-Chain Signals That Matter

Derivatives Microstructure

Sentiment via LLMs (the 2024+ shift)

The Fear & Greed Index

Failure mode — On-chain as oracle

Ensembles & Meta-Learning

Ensemble Hierarchy

A Working Composition

Validation

Why Random K-Fold Lies

The Validation Ladder

Beyond Single Numbers

Deflated Sharpe Ratio

Walk-Forward in Practice

The non-obvious leakages

Risk & Position Sizing

Sizing Methods

The Kelly Math

Drawdown Control Mechanisms

Production

The Production Stack

The Latency Budget

Drift Detection

Deployment Discipline

What Does Not Work

The base rate of failure is brutal

"My backtest has a 3.0 Sharpe" almost always means something else

Costs eat everything below a certain horizon

Crypto's regime change problem is uniquely severe

The alpha decay clock

The famous failures

What This Means

Books That Matter

Foundations

Finance ML

Microstructure & Execution

Probability & Risk

Resources

Papers — Start Here

Python Libraries

Data Vendors

Communities & Ongoing