LSTM vs ARIMA: Who Wins in Crypto Forecasting?
The financial market is not a straight line. It is a system defined by Chaos and Non-linearity, and riddled with Stochastic noise.
For decades, Quants have attempted to tame this beast using Econometrics:
- ARIMA was used to capture linear trends, based on the assumption that the series is Stationary once it has been differenced.
- GARCH (Generalized Autoregressive Conditional Heteroskedasticity) emerged as a massive leap forward to handle Heteroskedasticity (changing variance) and Volatility Clustering—a specialty of Crypto markets.
However, both ARIMA and GARCH face a mathematical barrier they cannot overcome: The rigidity of distribution assumptions. GARCH assumes that residuals follow a Gaussian or Student-t distribution, and it only "sees" short-term volatility. But when faced with a macro event from 6 months ago (like the Bitcoin Halving) creating a butterfly effect for today's price, these statistical models are completely "blind."
We need a model unbound by fixed probability distributions, capable of "remembering" non-linear patterns over very long time horizons.
That is why Recurrent Neural Networks (RNN) and their pinnacle, Long Short-Term Memory (LSTM), were born. This article will bypass the surface-level code to dissect the mathematical essence (Calculus & Linear Algebra) inside this "Black Box," answering the question: Why can LSTM learn rules that both GARCH and ARIMA miss?
The Failure of RNN: The Tragedy of Chain Multiplication
To understand the greatness of LSTM, we must look at the "graveyard" of its predecessor: the RNN.
A basic RNN operates on a recursive formula, merging the old state into the new state:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$
Theoretically, $h_t$ contains information from the entire past. But when trained using Backpropagation Through Time (BPTT), RNNs encounter a fatal mathematical flaw.
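To make the recursion concrete, here is a minimal NumPy sketch of a vanilla RNN step. The sizes, the random weights, and the `prices` array are illustrative assumptions, not values from a trained model.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
hidden_size, input_size, steps = 4, 1, 5   # illustrative sizes
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)

prices = rng.normal(size=(steps, input_size))  # stand-in for scaled returns
h = np.zeros(hidden_size)
for x_t in prices:
    # The old state is folded into the new one and then overwritten
    h = rnn_step(h, x_t, W_h, W_x, b)
```

Note that there is only one state vector: whatever the network wants to remember has to survive being pushed through tanh at every step.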
The Vanishing Gradient Problem
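To update the weights $W$, we need to calculate the derivative of the Loss function ($L$) using the Chain Rule. This derivative depends on the product of partial derivatives across time steps:
$$\frac{\partial L_T}{\partial W} = \sum_{k=1}^{T} \frac{\partial L_T}{\partial h_T} \left( \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \right) \frac{\partial h_k}{\partial W}$$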
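Pay close attention to this deadly Product Term:
$$\prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=k+1}^{T} \mathrm{diag}\left(1 - h_t^2\right) W_h$$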
Since the activation function is Tanh (whose derivative always lies in the range $(0, 1]$), when you multiply hundreds of factors smaller than 1 consecutively (representing hundreds of past candles), the result approaches zero extremely fast, as long as the recurrent weights are not large enough to compensate.
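The decay can be observed numerically. The sketch below accumulates the product of per-step Jacobians for a small tanh RNN and tracks its spectral norm. The weight scale of 0.1 is an assumption chosen so that each factor shrinks the signal; larger weights can make the same product explode instead.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 100
# Assumption: small recurrent weights, so each Jacobian shrinks the signal
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=hidden_size)

h = np.zeros(hidden_size)
jac_product = np.eye(hidden_size)   # running product of dh_t/dh_{t-1}
norms = []
for t in range(steps):
    x_t = rng.normal()              # stand-in for one scaled candle
    h = np.tanh(W_h @ h + W_x * x_t)
    # dh_t/dh_{t-1} = diag(1 - h_t^2) @ W_h
    jac_t = np.diag(1.0 - h**2) @ W_h
    jac_product = jac_t @ jac_product
    norms.append(np.linalg.norm(jac_product, 2))
```

Plotting `norms` on a log scale should give a roughly straight downward line, which is the exponential decay described above.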
The Mathematical Essence: This means the error signal from the future cannot propagate back to the distant past. The neural network becomes "blind" to events that happened too long ago. This is why RNNs cannot capture Long-term dependencies.
LSTM: The "Cell" Architecture & A Complete Overhaul
If we consider an RNN as someone with short-term memory (like a goldfish), constantly "overwriting" old memories with new ones at every time step, then LSTM is an organized storage system. It possesses a dedicated hard drive (Cell State) and 3 intelligent "Gatekeepers" deciding what information flows in and out.
The greatest innovation by Hochreiter & Schmidhuber (1997) lies in separating the Long-term Memory ($C_t$) from the Short-term Memory ($h_t$).
A. The Cell State ($C_t$): The Information "Superhighway"
In an RNN, the hidden state ($h_t$) is continuously transformed via non-linear activation functions (tanh) at every step. This distorts the original information.
In contrast, in an LSTM, the Cell State ($C_t$) runs straight down the entire chain like a conveyor belt.
- Information on this belt interacts primarily via Linear Addition (+) and Pointwise Multiplication.
- Mathematical Significance: This allows the gradient to flow back largely unchanged, without being "squeezed" by an activation function at every step. This is the key to mitigating the Vanishing Gradient problem.
B. The "Gating" Mechanism: 3 Smart Filters
LSTM uses 3 gates to protect and update the Cell State. Each gate is a small neural network layer using the Sigmoid function ($\sigma$), whose output lies between 0 and 1:
- $0$: Close valve (Block information).
- $1$: Open valve (Let everything through).
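The valve behaviour is just a sigmoid followed by a pointwise multiplication. A minimal sketch, where the `memory` values and `gate_logits` are made-up numbers for illustration:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([0.8, -0.3, 0.5])          # hypothetical cell-state values
gate_logits = np.array([10.0, -10.0, 0.0])   # hypothetical pre-activations
gate = sigmoid(gate_logits)                   # roughly [1.0, 0.0, 0.5]
filtered = gate * memory                      # pointwise multiplication
```

The first entry passes almost untouched, the second is almost erased, and the third is halved. In a real LSTM the logits come from a learned linear layer, so the network decides per dimension how far each valve opens.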
a) Forget Gate ($f_t$): "Taking out the trash"
The first step is deciding what information from the past ($C_{t-1}$) is no longer valuable and needs to be deleted:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
Why is this needed? In finance, context changes constantly. If a Bullish Trend has been broken (Breakout), keeping the memory of "currently Uptrend" will introduce noise to the current prediction. The Forget Gate helps the model "Reset" the state when the context shifts.
b) Input Gate ($i_t$) & Candidate Memory ($\tilde{C}_t$): "Loading new data"
This step decides what new information ($x_t$) is worthy of being stored in long-term memory. It involves two parallel operations:
- Part 1 - The Filter: The Input Gate $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ decides the importance of the new information.
- Part 2 - The Content: Creating a "Candidate Memory" vector $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ containing pure new information (using Tanh to normalize between $-1$ and $1$).
Updating the Cell State ($C_t$): The Holy Addition
Here, old memories and new memories are blended:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
The Breakthrough: It is the ADDITION (+) sign in the middle of this formula that allows the Gradient to run backwards to the past without decaying (as explained in the derivative section).
c) Output Gate ($o_t$): "Extracting Information"
Finally, the model needs to calculate the hidden state ($h_t$) to make a prediction for the next step. Note that the Memory ($C_t$) contains a lot of information, but not all of it is necessary for the present moment.
- Output Gate ($o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$): Decides which part of the memory will be "published."
- Calculating $h_t = o_t \odot \tanh(C_t)$: Take the memory $C_t$, pass it through tanh to push values to $[-1, 1]$, then filter it through the Output Gate.
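Putting the three gates together, one LSTM step can be sketched in NumPy as follows. This assumes the common formulation in which a single stacked weight matrix `W` acts on the concatenation of the previous hidden state and the input. The sizes and random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to the four stacked pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * hidden:1 * hidden])        # forget gate
    i_t = sigmoid(z[1 * hidden:2 * hidden])        # input gate
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])    # candidate memory
    o_t = sigmoid(z[3 * hidden:4 * hidden])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde             # additive cell-state update
    h_t = o_t * np.tanh(c_t)                       # filtered short-term memory
    return h_t, c_t

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 1                     # illustrative sizes
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(10, input_size)):      # stand-in for 10 candles
    h, c = lstm_step(x_t, h, c, W, b)
```

Unlike the plain RNN, two vectors are carried forward: `c` is only scaled and added to, while `h` is the gated, squashed view of it used for prediction.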
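Why Does LSTM Solve Vanishing Gradients?
Let's look at the Cell State update formula again:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
When we perform Backpropagation to calculate the derivative of $C_t$ with respect to $C_{t-1}$, the direct path along the Cell State gives:
$$\frac{\partial C_t}{\partial C_{t-1}} \approx f_t$$
(The gates also depend on $C_{t-1}$ indirectly through $h_{t-1}$; those extra terms are omitted here.)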
The difference lies in the ADDITION (+) sign.
- In RNN: The derivative is a continuous MULTIPLICATION, so its value decays exponentially.
- In LSTM: The derivative is propagated through ADDITION.
If the Forget Gate ($f_t$) is open (approximately 1), the derivative $\partial C_t / \partial C_{t-1} \approx 1$. This creates a "Gradient Highway", allowing the error signal to run backwards from time step $t$ to an earlier step $t-k$ with very little degradation.
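The highway can be checked on a stripped-down scalar version of the cell-state recurrence. In this sketch the gate values are held fixed rather than computed from data, and `updates` is a random stand-in for the input-gate term. Both are simplifying assumptions that isolate the direct cell-state path.

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 50
updates = rng.normal(scale=0.1, size=steps)   # stands in for i_t * C~_t

def final_cell_state(c0, forget_gates):
    """Run C_t = f_t * C_{t-1} + u_t with the gate values held fixed."""
    c = c0
    for f_t, u_t in zip(forget_gates, updates):
        c = f_t * c + u_t
    return c

def backprop_through_cell(forget_gates):
    """Chain rule along the cell-state path: dC_T/dC_0 = product of f_t."""
    grad = 1.0
    for f_t in reversed(forget_gates):
        grad *= f_t
    return grad

open_gates = rng.uniform(0.95, 1.0, size=steps)    # gate mostly open
closing_gates = rng.uniform(0.4, 0.6, size=steps)  # gate half closed

grad_open = backprop_through_cell(open_gates)
grad_closing = backprop_through_cell(closing_gates)

# The recurrence is linear in C_0, so a difference quotient recovers the slope
slope_open = final_cell_state(1.0, open_gates) - final_cell_state(0.0, open_gates)
```

With mostly open gates a usable fraction of the gradient survives 50 steps, while half-closed gates shrink it to practically nothing. The highway exists only where the network has learned to keep the forget gate open.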
Mathematical Simulation
To visualize this better, let's run a small Dry Run simulation with a scenario: Predicting Bitcoin price at Day 50 based on the news "Tesla buys BTC" which occurred on Day 1.
Scenario 1: RNN
The gradient propagates back to the past via continuous multiplication with weights (e.g., 0.5). Going from Day 50 back to Day 1 takes 49 steps:
$$0.5^{49} \approx 1.8 \times 10^{-15}$$
Result: The number is effectively 0. The RNN at Day 50 considers the Tesla event as if it never happened. Learning fails.
Scenario 2: LSTM
Assume the Forget Gate is open ($f_t \approx 1$) because the model learned that this news is structurally important and needs to be retained:
$$1^{49} = 1$$
Result: The signal is 100% preserved. LSTM still "remembers" the old news and uses it to forecast current prices.
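The two scenarios reduce to a few lines of arithmetic. The sketch below counts 49 steps from Day 50 back to Day 1 and adds a third, slightly leaky gate value of 0.99 as an extra assumption, to show what happens when the gate is not perfectly open.

```python
steps = 49                      # Day 50 back to Day 1
rnn_factor = 0.5                # per-step shrinkage from the RNN scenario
lstm_forget_gate = 1.0          # forget gate fully open, as assumed above

rnn_signal = rnn_factor ** steps          # about 1.8e-15
lstm_signal = lstm_forget_gate ** steps   # exactly 1.0

# A slightly leaky gate still keeps most of the signal
leaky_signal = 0.99 ** steps              # about 0.61
```

In practice the forget gate is rarely exactly 1, so the leaky case is the more realistic picture: the signal fades, but many orders of magnitude more slowly than in the RNN.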
Scenario 3: The "Valve" Mechanism
Let's see how LSTM reacts flexibly to Noise (short-term FUD) vs. Structural News (Exchange Collapse). Assume market sentiment is currently Uptrend ($C_{t-1} > 0$).
Case A: Noise (Binance Maintenance)
- LSTM calculates: This news is irrelevant.
- Forget Gate $f_t \approx 1$ (Keep old memory).
- Input Gate $i_t \approx 0$ (Block new noise).
- Result: $C_t \approx C_{t-1}$. Still Uptrend.
Case B: Crash (FTX Bankruptcy - Reversal)
- LSTM calculates: This news changes market structure.
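- Forget Gate $f_t \approx 0$ (Flush out the old Uptrend memory).
- Input Gate $i_t \approx 1$ (Load all the bad news).
- Result: $C_t \approx \tilde{C}_t$, so the memory now holds the new bearish information (Downtrend).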
This ability to flexibly open/close valves based on Context is exactly what rigid linear statistical models cannot do.
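Both cases can be run through the cell-state update directly. The numbers below are hypothetical: sentiment is encoded as +1 for uptrend and -1 for downtrend, and the gate values are picked by hand to mimic what a trained model might output.

```python
def update_cell(c_prev, f_t, i_t, c_tilde):
    """Cell-state update: C_t = f_t * C_{t-1} + i_t * C~_t."""
    return f_t * c_prev + i_t * c_tilde

c_prev = 1.0   # hypothetical encoding: +1 = uptrend, -1 = downtrend

# Case A: noise -> keep old memory, block the new candidate
c_noise = update_cell(c_prev, f_t=0.99, i_t=0.01, c_tilde=-0.5)

# Case B: structural crash -> flush old memory, load the bearish candidate
c_crash = update_cell(c_prev, f_t=0.01, i_t=0.99, c_tilde=-1.0)
```

Here `c_noise` stays close to +1 while `c_crash` flips to roughly -1. The same formula produces both behaviours, and only the gate values differ.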
Conclusion
From a purely mathematical perspective, LSTM is not magic. It is a cleverly designed architecture engineered to keep gradients from vanishing over long data sequences.
By converting the problem from Product (Multiplication) to Sum (Addition), LSTM solves the Long-term Memory problem. However, remember that perfect math does not guarantee profit. LSTM handles past data well, but financial markets always contain stochasticity that no deterministic function can model with 100% accuracy.