A practitioner's reference on machine-learning prediction for BTC/USDT — data, features, models, validation, sizing, and the production stack that holds it all together.
Every machine-learning trading framework begins by lying to itself about how hard the problem is. The lie usually sounds reasonable — markets have patterns, neural networks find patterns, therefore neural networks find market patterns. The honest version is that markets are adaptive, non-stationary, and adversarial, and any model good enough to extract a real edge is good enough to be arbitraged away by the next model that finds it.
What makes BTC/USDT specifically interesting — and specifically difficult — is the collision of three regimes. It trades like a tech equity during risk-on rallies, like gold during currency stress, and like nothing in particular during liquidation cascades. A model trained on 2017–2018 learns one BTC. A model trained on 2020–2021 learns a different one. A model trained on 2022–2024 learns a third. The honest question is not what predicts BTC but what predicts BTC during the regime we are about to enter, which is unknowable.
The framework below treats prediction not as an oracle problem but as a decision problem under uncertainty. We are not trying to forecast tomorrow's price; we are trying to assemble a system that, on average, sizes correctly, exits early, and survives the runs where it is wrong. The model is one component. Validation discipline, position sizing, execution latency, drawdown control, and the willingness to retire a degraded edge are each at least as important as the model itself.
The unfair advantage in quantitative finance does not live in the model. It lives in the data — in having more of it, cleaner versions of it, faster access to it, and the right kind of it. Spending three months tuning XGBoost on a dataset full of survivorship bias and unaligned timestamps is a more popular activity than spending three months getting the data right, but the second project produces strategies that work and the first does not.
| Source | What it provides | Notes |
|---|---|---|
| Binance / Coinbase / Bybit | OHLCV, trades, order book, funding rates, open interest | Free REST + websocket. Watch for outages and rate limits. |
| Kaiko / Tardis.dev | Tick-by-tick history, L2/L3 book snapshots, cross-venue | Paid. Essential for microstructure work. |
| Glassnode / CryptoQuant | On-chain — flows, NUPL, MVRV, miner positions | Paid for depth. Beware revision history. |
| Coinglass | Aggregate OI, liquidation maps, funding heatmaps | Useful as derivative-side cross-check. |
| FRED / Yahoo | DXY, SPX, VIX, US 10Y, gold — macro context | Free. Lower frequency. |
| Twitter / X · Reddit · Telegram | Sentiment, narrative shifts, influencer flow | API access expensive post-2023. See §VII. |
The choice of bar interval defines the strategy more than any single hyperparameter. A 1-minute model is a different beast — different features, different costs, different infrastructure — from a 4-hour model, even if the algorithm code is identical.
| Bar | Typical use | Edge source | Cost sensitivity |
|---|---|---|---|
| Tick / 100ms | Market-making, latency arb | Microstructure, queue position | Extreme — colocate or don't bother |
| 1m | Intraday momentum, mean reversion | Order flow, volatility clustering | High — fees dominate weak signals |
| 5m–15m | Intraday swing | Mixed micro + macro features | Moderate |
| 1h | Tactical positioning | Technicals, sentiment, funding | Low |
| 4h–1d | Trend, regime allocation | On-chain, macro, narrative | Negligible |
Time bars are a calendar artefact, not a market property. Volume, dollar, and imbalance bars sample by activity rather than by clock and produce returns that are closer to i.i.d., which matters because most ML assumes it. A dollar bar that closes every \$50M traded breathes with the market: it slows down on quiet Sundays and speeds up during liquidation events, whereas time bars oversample the quiet periods and undersample the active ones.
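A minimal sketch of dollar-bar construction from a trade tape, assuming trades arrive as parallel arrays of prices and sizes. The function name `dollar_bars` and the `threshold` argument are illustrative, not taken from any library.

```python
import numpy as np

def dollar_bars(prices, sizes, threshold):
    """Group trades into bars that close once traded notional reaches `threshold`.

    Returns a list of (open, high, low, close, notional) tuples.
    Trades left over after the last completed bar are dropped.
    """
    prices = np.asarray(prices, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    bars, start, running = [], 0, 0.0
    for i in range(len(prices)):
        running += prices[i] * sizes[i]          # notional traded so far in this bar
        if running >= threshold:
            window = prices[start:i + 1]
            bars.append((window[0], window.max(), window.min(), window[-1], running))
            start, running = i + 1, 0.0          # next bar starts on the next trade
    return bars

# Toy tape: five trades of size 1, bars close every 200 units of notional.
trades_px = [100.0, 101.0, 99.0, 102.0, 103.0]
trades_sz = [1.0, 1.0, 1.0, 1.0, 1.0]
bars = dollar_bars(trades_px, trades_sz, threshold=200.0)
```

On real data the threshold is usually set so that the average number of bars per day matches the time-bar frequency you would otherwise have used.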
For anything beyond a notebook prototype: store raw data once, store engineered features in a parquet-based warehouse, and version both. pandas + parquet works to about a year of 1m bars. Beyond that, look at DuckDB, arctic, or ClickHouse. Pickled DataFrames are a footgun — they bind to library versions.
Feature engineering is where domain expertise enters the model. A textbook ML practitioner trying their hand at crypto will typically build 500 features, none of them informed by how markets actually move, then complain that random forests don't work. The features below are the ones that consistently survive the validation gauntlet in practitioner research — not because they are magic, but because they encode something real about market microstructure or human behaviour.
| Family | Examples | Why |
|---|---|---|
| Log returns | r₁, r₅, r₁₅, r₆₀, r₂₄₀ (multi-horizon) | Stationary, additive across time |
| Realised vol | Std of r over rolling windows; Parkinson, Garman-Klass, Yang-Zhang | Clusters; predicts itself; gates position size |
| Vol-of-vol | Std of realised vol | Regime indicator |
| Skew & kurtosis | Rolling moments of returns | Tail behaviour; pre-crash signature |
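The range-based estimators named in the table can be sketched as follows. This is a minimal version using the textbook Parkinson and Garman-Klass formulas over one window of bars; the function names are illustrative, and in practice you would apply them over rolling windows.

```python
import numpy as np

def parkinson_vol(high, low):
    """Per-bar volatility from the high-low range (Parkinson, 1980)."""
    hl = np.log(np.asarray(high, dtype=float) / np.asarray(low, dtype=float))
    return float(np.sqrt(np.mean(hl ** 2) / (4.0 * np.log(2.0))))

def garman_klass_vol(open_, high, low, close):
    """Per-bar volatility from OHLC (Garman and Klass, 1980)."""
    o, h, l, c = (np.asarray(x, dtype=float) for x in (open_, high, low, close))
    hl = np.log(h / l)   # log high-low range
    co = np.log(c / o)   # log close-to-open move
    variance = np.mean(0.5 * hl ** 2 - (2.0 * np.log(2.0) - 1.0) * co ** 2)
    return float(np.sqrt(max(variance, 0.0)))  # clip tiny negative values from noise
```

Both return volatility per bar; multiply by the square root of the number of bars per year to annualise.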
The classical technicals — RSI, MACD, Bollinger Z-score, donchian width — are not magic, but they are compressions of price action that often correlate with whatever real edge exists. Use them as inputs to a model, not as standalone signals. The model decides when each one is informative.
BTC does not trade in isolation. Useful inputs from outside the BTC tape:
| Feature | What it captures |
|---|---|
| BTC/ETH spread | Crypto-internal risk appetite |
| BTC/SPX 30d corr | Risk-asset vs uncorrelated-asset regime |
| DXY change | Dollar strength; historically inversely correlated with BTC |
| VIX change | Equity vol regime; BTC sometimes follows, sometimes leads |
| US 10Y yield delta | Real-rate sensitivity |
| Stablecoin supply growth | Liquidity entering crypto |
The standard fix for non-stationary price series is to take first differences (returns). The cost is that almost all memory is destroyed: long-horizon levels become invisible. Fractional differentiation, popularised by López de Prado, differences the series by a non-integer order d between 0 and 1, chosen as the smallest value that still passes a stationarity test (typically ADF), so that as much memory as possible is preserved. For BTC, orders around 0.3–0.5 typically work.
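A minimal sketch of fixed-window fractional differencing, assuming the input is a series of log prices. The names `fracdiff_weights` and `fracdiff` and the default window length are illustrative. Selecting d with a stationarity test is not shown.

```python
import numpy as np

def fracdiff_weights(d, n_weights):
    """Binomial-series weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, n_weights):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def fracdiff(series, d, n_weights=50):
    """Fractionally difference `series` with a fixed-width window of `n_weights` lags.

    The first n_weights - 1 outputs are NaN because the window is incomplete.
    """
    x = np.asarray(series, dtype=float)
    w = fracdiff_weights(d, n_weights)
    out = np.full(len(x), np.nan)
    for t in range(n_weights - 1, len(x)):
        window = x[t - n_weights + 1:t + 1][::-1]  # most recent observation first
        out[t] = np.dot(w, window)
    return out
```

With d = 1 the weights collapse to [1, -1, 0, ...] and the result is the ordinary first difference. With d = 0 the series is returned unchanged.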
Most beginners label as y = sign(r_{t+k}), which produces noisy targets dominated by drift. The triple-barrier method instead labels each observation by which of three events fires first: a profit-taking horizontal barrier, a stop-loss horizontal barrier, or a vertical time barrier. Labels become trading-decision-aligned, and the path matters, not just the endpoint.
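```python
# Triple-barrier sketch (Lopez de Prado, AFML ch.3)
import pandas as pd

def apply_triple_barrier(price, events, pt_mult, sl_mult, vol):
    """price, vol: Series on a sorted DatetimeIndex; events: DataFrame with a 't1' column
    holding each observation's vertical (time) barrier."""
    out = events[['t1']].copy()
    # A missing vertical barrier means "hold until the end of the data".
    for loc, t1 in events['t1'].fillna(price.index[-1]).items():
        path = price.loc[loc:t1] / price.loc[loc] - 1  # return path from entry to the time barrier
        out.loc[loc, 'sl'] = path[path < -sl_mult * vol.loc[loc]].index.min()  # first stop-loss touch
        out.loc[loc, 'pt'] = path[path > pt_mult * vol.loc[loc]].index.min()   # first profit-take touch
    return out  # first of {pt, sl, t1} per row defines the label
```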
Adjacent labels share overlapping information (the future-looking window of bar t contains bars t+1 … t+k). Training on them naively over-weights clustered observations. The fix is to weight each sample by the inverse of how many other samples its information overlaps with — again, López de Prado has the canonical recipe.
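The weighting idea can be sketched as an average-uniqueness calculation. This is a simplified version of the recipe, assuming each label spans an inclusive range of integer bar positions; the function name and arguments are illustrative.

```python
import numpy as np

def average_uniqueness(starts, ends, n_bars):
    """Weight each label by the mean of 1 / concurrency over the bars it spans.

    starts[i], ends[i] are the first and last bar (inclusive) used by label i.
    """
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    concurrency = np.zeros(n_bars)
    for s, e in zip(starts, ends):
        concurrency[s:e + 1] += 1          # how many labels are live at each bar
    return np.array([np.mean(1.0 / concurrency[s:e + 1])
                     for s, e in zip(starts, ends)])

# Two overlapping labels and one isolated label.
weights = average_uniqueness(starts=[0, 1, 4], ends=[2, 3, 5], n_bars=6)
```

The resulting array can be passed as `sample_weight` to most scikit-learn-style estimators, so that isolated labels count more than heavily overlapped ones.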
Before reaching for transformers, exhaust the classical toolkit. Tree ensembles and well-regularised linear models are the workhorses of practitioner quant finance for reasons that have nothing to do with fashion: they are fast to train, robust to noisy features, interpretable, and difficult to overfit catastrophically. A gradient-boosted decision tree on 80 well-engineered features will beat a poorly-tuned LSTM in a walk-forward eight times out of ten.
| Model | Best at | Watch out for |
|---|---|---|
| Logistic regression (L1/L2) | Baseline classification, feature selection via L1 | Linear decision boundary; needs interactions encoded |
| Ridge / Lasso / ElasticNet | Return regression baseline | Same — linear, but interpretable |
| Random Forest | Honest baseline, low tuning | Conservative; underfits sharp signals |
| Extra Trees | Faster RF variant, less overfit-prone | Slightly noisier predictions |
| XGBoost / LightGBM / CatBoost | The default for tabular financial ML | Overfits if max_depth too high, samples weighted wrong |
| SVM / SVR (RBF) | Small datasets, smooth decision boundaries | Scales badly past ~50k samples; sensitive to feature scaling |
| Hidden Markov Model | Regime detection (2–4 hidden states) | Latent states often uninterpretable; assumes Markov property |
| Gaussian Mixture | Soft regime clustering | Number of components is a hyperparameter you cannot validate cleanly |
Start with LightGBM. Engineer 30–100 features. Predict either (a) the direction label produced by the triple-barrier method, or (b) the volatility-normalised return. Use purged time-series cross-validation. Inspect feature importance with SHAP values rather than the built-in split or gain importance, which is misleading when features are correlated. Retrain weekly or monthly. If LightGBM cannot find a stable edge, neither can a transformer.
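```python
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.02,
    'num_leaves': 31,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,          # bagging_fraction has no effect unless bagging_freq > 0
    'min_data_in_leaf': 200,    # crucial for noisy targets
    'lambda_l2': 0.1,
}

# Walk-forward, NEVER plain k-fold. Purge + embargo around test windows.
# `gap` drops samples between train and test; set it to at least the label
# horizon in bars (24 here is a placeholder, not a recommendation).
cv = TimeSeriesSplit(n_splits=5, gap=24)
# Each fold: lgb.train(params, lgb.Dataset(X[train_idx], label=y[train_idx]))
```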
If your label is sign(r_{t+1}) on 1m bars, your signal-to-noise ratio is brutally bad and a deep tree will memorise the noise. Either move to a longer horizon where signal exists, or use triple-barrier labelling at an interval scaled to volatility. The number-one diagnostic: train AUC ≈ 0.99 and validation AUC ≈ 0.51 means you are fitting noise. Cap max_depth at 6 or below and raise min_data_in_leaf.
Deep learning earns its place when the input has structure that hand-crafted features cannot easily express: long sequences, irregular spacing, multi-modal inputs, or non-linear interactions across hundreds of features. For BTC at hourly and above, a well-engineered LightGBM usually matches a deep model. For high-frequency order-book prediction, sub-minute regime classification, and multi-asset cross-sectional work, deep models pull ahead.
| Architecture | Strength | Use when |
|---|---|---|
| MLP | Universal approximator; baseline | Comparing against tree ensembles on tabular features |
| LSTM / GRU | Variable-length sequence memory | Sub-hour bars; needs lots of data |
| 1D CNN | Local pattern detection across time | Order book snapshots, candlestick patterns |
| Temporal Convolutional Network | Long receptive field, parallel training | Replacing LSTM in most settings |
| Transformer (vanilla) | Long-range attention; scales | When you have millions of training sequences |
| Informer / Autoformer / PatchTST | Forecasting-specific transformers | Multi-horizon point or quantile forecasting |
| Temporal Fusion Transformer | Multi-horizon + interpretable attention | When you need to explain which inputs mattered |
| N-BEATS / N-HiTS | Pure MLP, decomposes trend & seasonality | Surprisingly strong on univariate forecasts |
| DeepAR | Probabilistic forecasts via autoregressive RNN | When you need distributions, not points |
A deep model for BTC is rarely a single number in, single number out. The shape you will most often build:
```text
# A multi-horizon, multi-input sequence model
input_features:  [batch, seq_len=240, n_features=48]   # 240 hours of 48 features
static_features: [batch, n_static=12]                  # asset metadata, regime flags

# Output options:
direction:   [batch, n_horizons=4]    # P(up) at +1h, +4h, +24h, +96h
quantiles:   [batch, n_horizons, 7]   # 7 quantile forecasts per horizon
return_mean: [batch, n_horizons]      # point forecast of returns
return_std:  [batch, n_horizons]      # predicted volatility
```
The current best practice for BTC deep-learning research, circa 2026:
| Layer | Tool |
|---|---|
| Framework | PyTorch + PyTorch Lightning (or Keras 3) |
| Forecasting library | PyTorch Forecasting · Nixtla NeuralForecast · Darts |
| Hyperparameter search | Optuna (TPE sampler, median or Hyperband pruner) |
| Tracking | MLflow or Weights & Biases |
| Serving | TorchServe · ONNX Runtime · BentoML |
A high-capacity model trained on a year of bull-market data will achieve excellent in-sample loss by learning that price goes up. The walk-forward looks gorgeous until it doesn't. Mitigations: train across multiple market regimes, augment with synthetic bear/range periods, and use vol-normalised targets so the model cannot just memorise drift.
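One way to build a vol-normalised target is sketched below, assuming a series of closing prices. The forward log return over `horizon` bars is divided by a trailing volatility estimate that uses only returns ending at the current bar. The function name, `horizon`, and `vol_window` are illustrative choices.

```python
import numpy as np

def vol_normalised_target(close, horizon, vol_window):
    """Forward log return over `horizon` bars, scaled by trailing realised vol.

    Output is NaN where there is not enough history or not enough future data.
    """
    logp = np.log(np.asarray(close, dtype=float))
    r = np.diff(logp)                      # r[i] is the return from bar i to bar i + 1
    n = len(logp)
    target = np.full(n, np.nan)
    for t in range(vol_window, n - horizon):
        past = r[t - vol_window:t]         # last return ends at bar t, so no lookahead
        sigma = past.std(ddof=1) * np.sqrt(horizon)
        if sigma > 0:
            target[t] = (logp[t + horizon] - logp[t]) / sigma
    return target
```

Because the scale is removed, a quiet-regime move and a volatile-regime move of the same risk-adjusted size get the same target value.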
Supervised learning predicts the future. Reinforcement learning chooses actions. The distinction matters because in trading, the question is rarely "what will price do" and almost always "what should I do next, knowing what I currently hold and what it will cost me to change that." RL frames the problem natively as a sequential decision process and learns a policy end-to-end — including position sizing, holding periods, and tactical exits.
| Algorithm | Action space | Notes |
|---|---|---|
| DQN / Rainbow | Discrete | Sample-efficient. Sensitive to reward scale. |
| PPO | Both | The current workhorse. Robust, well-understood. |
| SAC | Continuous | Strong sample efficiency; entropy-regularised exploration. |
| A2C / A3C | Both | Simpler than PPO; often outperformed by it. |
| TD3 | Continuous | Twin critics combat overestimation. |
| Decision Transformer | Sequence | Offline RL; conditions on desired return. |
The single largest determinant of whether an RL agent develops a sane trading policy is the reward function. A few patterns that consistently perform better than naïve P&L:
```python
# Differential Sharpe Ratio (Moody & Saffell, 2001): online Sharpe approximation
def differential_sharpe(r_t, A_prev, B_prev, eta=0.01, eps=1e-12):
    delta_A = r_t - A_prev
    delta_B = r_t**2 - B_prev
    A = A_prev + eta * delta_A          # EMA of returns
    B = B_prev + eta * delta_B          # EMA of squared returns
    var_prev = B_prev - A_prev**2       # previous variance estimate
    if var_prev <= eps:                 # undefined at start-up (e.g. A = B = 0)
        return 0.0, A, B
    dsr = (B_prev * delta_A - 0.5 * A_prev * delta_B) / var_prev**1.5
    return dsr, A, B

# Reward = dsr - cost_penalty * |action_change|
```
Off-the-shelf options worth knowing: FinRL (the AI4Finance lab), gym-anytrading, TensorTrade. None will give you production-ready alpha out of the box — they exist to remove the boilerplate of building an OpenAI-Gym-compatible market environment so you can focus on the parts that matter (features, reward, action design).
An RL agent that trains in a frictionless simulator will discover policies that exploit the simulator: instant fills, no partial executions, perfect quote data. Move to live and the policy collapses. Cures: model transaction costs aggressively, simulate slippage as a function of order book depth, randomise execution latency in training, and validate on data the agent has never seen during reward shaping.
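Depth-dependent slippage can be sketched by walking a market order through the visible book. This assumes an ask ladder given as (price, size) pairs sorted best-first; the function name is illustrative, and a realistic simulator would also model queue position, hidden liquidity, and latency.

```python
def market_buy_fill(asks, quantity):
    """Fill a market buy against `asks` and return (average price, slippage in bps).

    Slippage is measured against the best ask at the time of the order.
    """
    remaining, cost = quantity, 0.0
    for price, size in asks:
        take = min(remaining, size)        # consume this level, fully or partially
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("order is larger than the visible depth")
    average_price = cost / quantity
    best_ask = asks[0][0]
    return average_price, (average_price / best_ask - 1.0) * 1e4

book = [(100.0, 1.0), (101.0, 1.0), (102.0, 5.0)]
avg_px, slip_bps = market_buy_fill(book, quantity=2.0)
```

Charging the agent this fill price, rather than the mid or the last trade, removes the incentive to learn order sizes the book could never absorb.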
Equities have decades of fundamental data: earnings, balance sheets, analyst coverage. Crypto does not. What it has instead is an unusually transparent ledger and an unusually loud social layer. Both are exploitable, and both are among the feature families most often misused by retail-level practitioners, who treat raw on-chain charts as signals rather than as inputs to a model that has to decide when each one is informative.
| Metric | What it tells you | How it's typically used |
|---|---|---|
| MVRV ratio | Market value / realised value — aggregate cost basis position | Top/bottom regime indicator |
| NUPL | Net unrealised profit/loss across all holders | Same family as MVRV; sentiment composite |
| SOPR | Realised P&L on coins that moved that day | Above/below 1 = profit/loss-taking regime |
| Exchange net flow | Coins onto exchanges vs off | Inflow = supply pressure; outflow = HODL pressure |
| Miner balance / sells | Forced supply from miners | Capitulation indicator |
| Active addresses | Daily network usage | Adoption / engagement signal |
| Realised cap HODL waves | Coin age distribution | Long-term vs short-term holder regime |
| Stablecoin supply (USDT, USDC) | Available crypto-internal liquidity | Macro liquidity proxy |
Before 2023, sentiment analysis in crypto meant FinBERT fine-tunes on tweet datasets and lexicon-based scoring. Both worked poorly: crypto Twitter speaks in irony, memes, and ticker abbreviations that drift weekly. The current state of the art is to use a frontier LLM (Claude, GPT-class, or a local Llama variant) as a feature extractor: feed it a window of posts and ask for structured outputs — bullish/bearish scores, conviction level, narrative tags, and influencer-weighted aggregates.
```python
# Pseudo-pipeline. Run hourly. Cache aggressively.
# fetch_posts, filter_by_engagement, batch, llm, aggregate and influencer_score
# stand in for your own implementations.
features = {}
for hour in recent_hours:
    posts = fetch_posts(hour, sources=['twitter', 'reddit', 'telegram_pub'])
    posts = filter_by_engagement(posts, min_views=500)
    chunks = batch(posts, n=50)
    results = [llm.extract_sentiment(chunk) for chunk in chunks]
    features[hour] = aggregate(results, weights=influencer_score)
```
The catch: LLM inference is expensive at scale and rate-limited at the API. Most serious practitioners use a tiered system — a cheap classifier filters posts, an LLM scores the survivors, and aggregated scores are cached.
The widely cited Alternative.me Fear & Greed Index is a composite of volatility, momentum, social media, surveys, dominance, and Google Trends. It is useful as a model feature, not as a standalone signal. Buying when the index is below 20 and selling when it is above 80 has worked as a contrarian heuristic over long horizons, but it has substantial drawdowns and embeds severe lookahead if used naively in backtests, for example when the thresholds are tuned on the full sample.
"Exchange outflows are massive — supply shock incoming." Posts like this are the on-chain analyst's version of cherry-picking. Outflows correlate with price moves, but the correlation is unstable across regimes, asymmetric in time, and contaminated by exchange wallet reclassifications. Treat every on-chain metric as one input feature to a model, never as a standalone signal.
Every model in this framework is wrong in a different way. Tree ensembles miss long-range dependencies. LSTMs hallucinate trends. RL agents overfit to their reward function. Sentiment models lag price. The practical response is not to find the right model — it does not exist — but to combine several so that their idiosyncratic errors partially cancel and the parts each one gets right are weighted appropriately.
The shape of a production-ready BTC ML stack often looks like:
| Layer | Models | Output |
|---|---|---|
| Base — direction | LightGBM + Logistic + TFT | P(up) at multiple horizons |
| Base — volatility | GARCH-X + LightGBM on \|r\| | σ̂ at decision horizon |
| Base — regime | HMM + GMM on macro features | P(trend), P(range), P(crash) |
| Meta-learner | Ridge / shallow LightGBM | Final P(up), final σ̂ |
| Sizer | Vol-target + fractional Kelly | Position size |
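The meta-learner gets the base predictions plus regime probabilities as inputs, so it can learn to trust LightGBM during trends and the TFT during regime transitions. A linear meta-learner can only do this if its inputs include prediction × regime-probability interaction terms; a shallow tree model picks up the interaction on its own. This is not magic: it is a small, well-validated model whose entire job is to apologise for each base learner's blind spots.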
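A minimal sketch of a regime-aware linear meta-learner, assuming base predictions and regime probabilities are already available as arrays. It multiplies every base prediction by every regime probability and fits closed-form ridge regression. The helper names and the synthetic data are illustrative only, and in practice the inputs must be out-of-fold predictions.

```python
import numpy as np

def interaction_design(base_preds, regime_probs):
    """Columns are (model 0 x regime 0), (model 0 x regime 1), (model 1 x regime 0), ..."""
    base_preds = np.asarray(base_preds, dtype=float)      # shape (n, n_models)
    regime_probs = np.asarray(regime_probs, dtype=float)  # shape (n, n_regimes)
    n = base_preds.shape[0]
    return (base_preds[:, :, None] * regime_probs[:, None, :]).reshape(n, -1)

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge weights without an intercept: (X'X + alpha I)^-1 X'y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Synthetic check: model 0 is right in the trend regime, model 1 otherwise.
rng = np.random.default_rng(0)
preds = rng.normal(size=(500, 2))
trend = rng.integers(0, 2, size=500).astype(float)
regimes = np.column_stack([trend, 1.0 - trend])
y = trend * preds[:, 0] + (1.0 - trend) * preds[:, 1]
w = fit_ridge(interaction_design(preds, regimes), y, alpha=1e-6)
```

On this toy data the fitted weights load on model 0 in the trend regime and on model 1 in the other regime, which is the behaviour the paragraph describes.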
This is the section that decides whether a research project will lose money in live. Validation is to ML trading what controlled trials are to medicine: tedious, expensive, often unwelcome, and the only thing standing between a plausible-sounding hypothesis and a costly mistake. Every method below exists because a previous generation of quants discovered, painfully, that a simpler method lied to them.
Shuffling time-series data and splitting into folds destroys the autocorrelation structure that you are trying to exploit, and it also leaks information from the future into the past. A model evaluated by random k-fold on price data routinely shows AUC of 0.85 when its real out-of-sample AUC is 0.52. The single most common cause of "my backtest looked amazing but live lost money" is random shuffling somewhere upstream.
| Method | How it works | What you get |
|---|---|---|
| Hold-out split | Most basic — train on first 70%, test on last 30% | One sample of validation error |
| Walk-forward (anchored) | Models trained on expanding window; tested on next slice | Multiple validation samples, time-respecting |
| Walk-forward (sliding) | Fixed-size training window — models forget old regimes | Tests for regime sensitivity |
| Purged k-fold | K-fold with a buffer ("purge") removing samples whose label horizon overlaps the test window | Closes the most common leakage |
| Purged + embargo | Adds an embargo period after each test fold to prevent train-side leakage of test information | López de Prado standard |
| Combinatorial Purged CV | All combinations of N test folds out of K; produces a distribution of paths | Most realistic for path-dependent strategies |
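The purge and embargo rows can be sketched as a small index generator. This assumes samples are ordered in time and that the label of sample i uses bars i through i + `label_horizon`. The function name and arguments are illustrative, and this is a simplified version rather than a reference implementation.

```python
import numpy as np

def purged_kfold(n_samples, n_splits, label_horizon, embargo):
    """Yield (train_idx, test_idx) with contiguous test folds, purged and embargoed."""
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1]
        # Purge: keep only earlier samples whose label window ends before the fold starts.
        before = indices[indices + label_horizon < start]
        # Test labels reach up to stop + label_horizon; skip that span plus the embargo.
        after = indices[indices > stop + label_horizon + embargo]
        yield np.concatenate([before, after]), test_idx

splits = list(purged_kfold(n_samples=20, n_splits=4, label_horizon=2, embargo=1))
train_idx, test_idx = splits[1]
```

For the second fold above, the test window is bars 5 to 9, training keeps bars 0 to 2 and 13 to 19, and everything in between is discarded.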
A single Sharpe ratio from a single backtest is a number with almost no statistical content. The real questions are:
The deflated Sharpe ratio (Bailey & López de Prado, 2014) adjusts the observed Sharpe for the number of trials you ran, the variance of Sharpe ratios across those trials, the sample length, and the skewness and kurtosis of returns. It produces a probability that the observed Sharpe reflects a genuinely positive Sharpe rather than the best of many lucky draws. For a researcher who tested 100 variants and reports a Sharpe of 1.8 on the best one, the DSR is often below 0.5, far short of the 0.95 conventionally required to call the result statistically significant.
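```python
# DSR sketch: Bailey & Lopez de Prado, 2014
import numpy as np
from scipy.stats import norm

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def deflated_sharpe(sr_obs, sigma_sr, n_trials, T, gamma_3, gamma_4):
    """sr_obs, sigma_sr: per-period (non-annualised) observed Sharpe and the std of
    Sharpe ratios across trials; n_trials > 1; T: sample length;
    gamma_3, gamma_4: return skewness and (non-excess) kurtosis."""
    # Expected maximum Sharpe across n_trials when the true Sharpe is zero
    sr_expected_max = 0.0 + sigma_sr * (
        (1 - EULER_GAMMA) * norm.ppf(1 - 1 / n_trials)
        + EULER_GAMMA * norm.ppf(1 - 1 / (n_trials * np.e))
    )
    denom = np.sqrt(1 - gamma_3 * sr_obs + (gamma_4 - 1) / 4 * sr_obs**2)
    return norm.cdf((sr_obs - sr_expected_max) * np.sqrt(T - 1) / denom)

# Interpretation: probability that true SR > 0 given the trials you ran.
```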
Even disciplined walk-forward leaks future information if you: (a) tune hyperparameters on the whole dataset before running walk-forward; (b) use feature standardisation statistics computed on all data; (c) select which features to include based on the full-sample correlation; (d) use a model architecture you only chose because the full-sample Sharpe looked good. The discipline is recursive — every choice you made about the strategy is a choice that needs to be re-validated.
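Leak (b) is the easiest to fix mechanically. A minimal sketch, assuming features are held in NumPy arrays: compute the scaling statistics on the training slice of each fold only, then apply them unchanged to the test slice. The function name is illustrative.

```python
import numpy as np

def standardise_with_train_stats(X_train, X_test):
    """Scale both sets using the mean and std of the training set only."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                    # constant features are left unscaled
    return (X_train - mean) / std, (X_test - mean) / std

train_scaled, test_scaled = standardise_with_train_stats([[1.0], [3.0]], [[5.0]])
```

The same pattern applies to hyperparameter tuning and feature selection: anything fitted to data has to be refitted inside each walk-forward step.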
An average ML model with disciplined sizing makes money. A brilliant ML model with greedy sizing goes bust. The math is asymmetric: a 50% drawdown requires a 100% recovery, and a 90% drawdown requires a 900% recovery. Most quant blow-ups are not modelling failures — they are sizing failures.
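The asymmetry follows from one line of arithmetic: after losing a fraction d of capital, the gain needed to get back to the previous peak is d / (1 - d). A small sketch, with an illustrative function name:

```python
def required_recovery(drawdown):
    """Fractional gain needed to return to the prior peak after a fractional drawdown."""
    if not 0.0 <= drawdown < 1.0:
        raise ValueError("drawdown must be in [0, 1)")
    return drawdown / (1.0 - drawdown)

half = required_recovery(0.5)   # 1.0, i.e. a 100% gain
deep = required_recovery(0.9)   # about 9.0, i.e. a 900% gain
```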
| Method | Idea | Trade-off |
|---|---|---|
| Fixed-fractional | Risk X% per trade based on stop distance | Simple; ignores model conviction |
| Volatility targeting | Scale exposure to hit constant ex-ante vol (say, 15% annual) | Stabilises return distribution; lags reality during regime shifts |
| ATR-based | Size such that 1 ATR move = N % of capital | Works for trend-following |
| Kelly criterion | f* = edge / variance — theoretically optimal log-growth | Brutally aggressive at the optimum; sensitive to edge estimation error |
| Fractional Kelly | Half- or quarter-Kelly | Industry standard. Massively reduces drawdown. |
| Risk parity / vol parity | Across multiple instruments — equalise risk contributions | Relevant when you trade BTC + ETH + alts |
| CVaR-constrained | Maximise expected return subject to bounded conditional tail loss | Honest about tail risk; requires distributional model |
```python
# Illustrative per-period inputs (placeholders, not estimates):
mu, sigma, c = 0.001, 0.02, 0.0002

# Continuous Kelly for a strategy with edge mu and variance sigma^2:
f_star = mu / sigma**2
# With proportional cost c per unit traded:
f_star_costed = (mu - c) / sigma**2
# In practice: half-Kelly (f_star / 2) cuts geometric growth by ~25%
# but reduces return variance by ~75% (volatility halves). Almost always worth it.
```
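One way to combine volatility targeting with fractional Kelly is to compute both exposures and take the smaller. The sketch below assumes a long-only strategy, per-period `mu` and `sigma`, and a hypothetical exposure cap. All names and defaults are illustrative.

```python
import math

def vol_target_exposure(forecast_vol_annual, target_vol_annual=0.15, max_exposure=1.0):
    """Exposure that brings forecast volatility to the target, capped at max_exposure."""
    if forecast_vol_annual <= 0:
        raise ValueError("forecast volatility must be positive")
    return min(target_vol_annual / forecast_vol_annual, max_exposure)

def fractional_kelly_exposure(mu, sigma, cost=0.0, fraction=0.5):
    """Fraction of the continuous Kelly bet (mu - cost) / sigma^2; zero if the edge is not positive."""
    edge = mu - cost
    if edge <= 0:
        return 0.0
    return fraction * edge / sigma ** 2

def combined_exposure(mu, sigma, periods_per_year, cost=0.0, fraction=0.5,
                      target_vol_annual=0.15, max_exposure=1.0):
    """Take the more conservative of the two sizing rules."""
    kelly = fractional_kelly_exposure(mu, sigma, cost, fraction)
    annual_vol = sigma * math.sqrt(periods_per_year)
    return min(kelly, vol_target_exposure(annual_vol, target_vol_annual, max_exposure))

exposure = combined_exposure(mu=0.001, sigma=0.02, periods_per_year=365, cost=0.0002)
```

Taking the minimum means an over-optimistic edge estimate cannot push exposure above what the volatility budget allows.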
A strategy that lives in a notebook is not a strategy — it is a research artefact. Moving from research to live is where most projects discover that 80% of the work is the part the academic papers do not cover: the infrastructure that pulls features in real time, serves predictions with millisecond consistency, monitors for drift, and rolls back gracefully when something goes wrong.
| Component | Purpose | Common tools |
|---|---|---|
| Data ingest | Websocket → time-series store | CCXT, native exchange SDKs, websockets + asyncio |
| Feature store | Single source of truth, point-in-time correct | Feast, Tecton, or DuckDB + S3 if you're small |
| Model registry | Versioned models with metadata | MLflow, Weights & Biases Artifacts |
| Inference | Serve predictions with bounded latency | ONNX Runtime, TorchServe, BentoML, or a thin FastAPI |
| Order management | Translate signals into orders; manage state | Hummingbot, NautilusTrader, or homegrown |
| Execution | Smart order routing; slippage minimisation | TWAP/VWAP/POV slicers, iceberg orders |
| Monitoring | Drift detection, P&L attribution, alerting | Grafana + Prometheus, custom dashboards |
| Logging & audit | Every decision reproducible after the fact | Structured JSON logs, append-only event store |
For an hourly-bar strategy, latency is mostly irrelevant — you can wait two seconds for a model to score. For a minute-bar strategy, you have perhaps 200ms total budget from bar close to order on the exchange. For sub-second strategies, you need colocation, kernel-bypass networking, and pre-compiled prediction pipelines. Most retail-scale BTC strategies operate at the 1m–1h timeframe where this is not a binding constraint.
Models degrade. The question is whether you detect it before the loss is large enough to matter.
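One common drift check is the population stability index (PSI) between a feature's training distribution and its recent live distribution. This is a minimal sketch that assumes quantile bins taken from the reference sample; the function name is illustrative, and the usual 0.1 and 0.25 alert levels are rules of thumb rather than statistical thresholds.

```python
import numpy as np

def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI between two samples, using quantile bins defined on `reference`."""
    reference = np.asarray(reference, dtype=float)
    live = np.asarray(live, dtype=float)
    # Interior bin edges at the reference quantiles; outer bins are open-ended.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    live_counts = np.bincount(np.searchsorted(interior, live), minlength=n_bins)
    ref_frac = np.clip(ref_counts / len(reference), eps, None)   # avoid log(0)
    live_frac = np.clip(live_counts / len(live), eps, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(size=5000)
psi_same = population_stability_index(train_feature, train_feature)
psi_shifted = population_stability_index(train_feature, train_feature + 2.0)
```

Running this per feature and on the model's output scores, on a schedule, gives an early warning that is independent of P&L.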
If you came here looking for confirmation that AI will predict BTC for you — this section won't. But it's the section a serious practitioner has internalised before they spend a year on the rest of the framework.
The most cited estimate from quant-fund-of-fund surveys is that fewer than one in twenty quant strategies that pass internal validation produce positive risk-adjusted returns over five years of live trading. For retail-built crypto ML strategies the rate is worse — closer to one in fifty. The reason is not that the techniques are bad. It is that the bar for "passes validation" is usually too low and the strategies degrade faster than they can be re-validated.
A backtest Sharpe of 3.0 in cryptocurrency is overwhelmingly more likely to indicate (a) lookahead bias, (b) survivor bias in the data, (c) unrealistic execution assumptions, (d) p-hacking from running 100+ variants, or (e) a calendar bug — than to indicate a real edge. Real, surviving systematic crypto strategies run by professional teams typically realise Sharpe 0.7–1.5 net of costs. A 3.0 should make you skeptical of yourself, not proud.
At Binance taker fee tiers, a round-trip costs ~8 bps. Add slippage at 1–5 bps and you need an edge per trade larger than 9–13 bps just to break even. Most 1-minute "predictive" signals have edges of 1–3 bps per trade. The math does not work, and no amount of model complexity changes the math.
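The break-even arithmetic can be written out explicitly. The sketch assumes the ~8 bps round-trip corresponds to a 4 bps taker fee per side and that the slippage figure covers the whole round trip. Both are assumptions, and the function names are illustrative.

```python
def breakeven_edge_bps(taker_fee_bps_per_side, slippage_bps_round_trip):
    """Gross edge per trade, in bps, needed to cover fees and slippage."""
    return 2 * taker_fee_bps_per_side + slippage_bps_round_trip

def net_edge_bps(gross_edge_bps, taker_fee_bps_per_side, slippage_bps_round_trip):
    """What is left of a gross edge after round-trip costs."""
    return gross_edge_bps - breakeven_edge_bps(taker_fee_bps_per_side, slippage_bps_round_trip)

low = breakeven_edge_bps(4, 1)     # best case: tight book
high = breakeven_edge_bps(4, 5)    # worst case in the stated slippage range
net = net_edge_bps(3, 4, 1)        # a 3 bps signal under the best-case costs
```

Even under the most favourable cost assumption, a 3 bps signal is deeply negative after costs.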
An equities trader can argue that the 1990s S&P and the 2020s S&P share some structural commonalities — fundamentals exist, mean reversion exists, similar players. BTC in 2017 (retail-driven mania), 2019 (sideways recovery), 2021 (DeFi + institutional onramp), 2023 (post-FTX deleveraging), and 2025 (ETF-flow driven) are five almost unrelated markets. A model trained on any one of them will fail when the next arrives.
Any edge you find that is genuine will be found by others. Capacity is finite. Crowded factors decay — and you will not know which side of the crowd you are on until the decay is well underway. Build the monitoring (§XI) that tells you when your edge is degrading, and have the discipline to retire degraded strategies rather than "wait it out."
LTCM had two Nobel laureates and blew up in 1998. Numerai (a serious crowd-sourced ML platform with significant resources) has been live for years and posts modest absolute returns. Renaissance Medallion is the cited counter-example, and it (a) is closed, (b) trades thousands of instruments not one, and (c) operates at frequencies you cannot replicate. There is no public crypto ML fund with a sustained track record meaningfully better than buy-and-hold over a full cycle. The absence is informative.
None of the above means do not try. It means that the realistic ambition for an individual practitioner is not "build an oracle that prints money" but rather: build a system whose risk-adjusted returns marginally exceed a vol-targeted buy-and-hold of BTC, after honest costs, over a multi-year horizon, with drawdowns you can stomach. That is a real, hard, valuable engineering problem. It is not the problem most people say they are working on, but it is the one worth working on.
The reading order below moves from foundations to specialisation. None of them will make you money on its own. Together they build the worldview a serious practitioner needs.