The core idea
A recurrent neural network (RNN) is a neural network built to process sequences: text, audio, sensor readings, anything where order matters. Unlike a standard feed-forward network, which sees its whole input at once, an RNN reads its input one step at a time and maintains a hidden state, a vector that acts as a running summary of everything it has seen so far. At each step the network combines the current input with the previous hidden state to produce a new hidden state, and optionally an output. Because each new hidden state is fed back into the network at the next step, the network is called recurrent; the same set of weights is reused at every step.
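The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the tanh activation, the weight names (W_xh, W_hh, b_h), the layer sizes and the random inputs are all assumptions chosen for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: mix the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4  # illustrative sizes (assumption)

# One set of weights, created once and reused at every time step.
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # initial hidden state
sequence = rng.normal(size=(5, input_size))  # toy sequence of 5 time steps
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # new state depends on input and old state

print(h.shape)  # the running summary stays the same size however long the input is
```

Note that the loop body never changes: only the hidden state h is carried forward from one step to the next.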
How the hidden state carries memory
Imagine feeding the sentence “the cat sat” into an RNN one word at a time. After reading “the,” the hidden state holds a small amount of context. After “cat,” it updates to reflect both words. By the time it reaches “sat,” the hidden state in principle encodes the whole phrase. This looping structure lets earlier inputs influence later predictions — essential for tasks like predicting the next word, tagging parts of speech, or transcribing speech. The hidden state is the RNN’s only memory, so its size and how well it is updated determine how much context the network can hold.
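To make the example concrete, here is a small sketch that feeds the three words through a toy RNN and records the hidden state after each one. The three-word vocabulary, the one-hot encoding, the hidden size and the random, untrained weights are all assumptions made for illustration; a real model would learn its weights and use learned word embeddings.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}  # toy vocabulary (assumption)
hidden_size = 4                         # illustrative size (assumption)

rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.5, size=(hidden_size, len(vocab)))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))

def one_hot(word):
    """Encode a word as a vector with a single 1 at its vocabulary index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def encode(words):
    """Return the hidden state after each word, in reading order."""
    h = np.zeros(hidden_size)
    states = []
    for word in words:
        h = np.tanh(W_xh @ one_hot(word) + W_hh @ h)
        states.append(h)
    return states

states = encode(["the", "cat", "sat"])
reversed_states = encode(["sat", "cat", "the"])

for word, h in zip(["the", "cat", "sat"], states):
    print(word, np.round(h, 3))
```

Because each state is computed from the one before it, the final state for "the cat sat" differs from the final state for the same words in reverse order: the hidden state carries information about order, not just about which words appeared.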
The vanishing gradient problem
RNNs are trained with backpropagation through time, which unrolls the loop and propagates the error backwards across every step. The trouble is that this multiplies together one gradient factor per step. If those factors are consistently smaller than one in magnitude, the product shrinks toward zero (the vanishing gradient problem), and the network struggles to learn relationships that span many steps. If the factors are consistently larger than one, gradients can explode instead. Either way, vanilla RNNs struggle badly with long-range dependencies, often forgetting the start of a long sentence by the time they reach the end.
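The effect is easiest to see in a one-dimensional toy case. The sketch below assumes a scalar RNN, h_t = tanh(w * h_{t-1} + x_t), and multiplies the per-step factors of the chain rule to get the gradient of the final state with respect to the initial one. The function name, the weight values, the constant input and the optional linear variant (used only to show explosion, since tanh saturates) are assumptions for illustration.

```python
import math

def gradient_through_time(w, inputs, linear=False):
    """Return d h_T / d h_0 for a scalar RNN with recurrent weight w.

    With linear=False the update is h_t = tanh(w * h_{t-1} + x_t);
    with linear=True the tanh is dropped, so each step's factor is just w.
    """
    h, grad = 0.0, 1.0
    for x_t in inputs:
        pre = w * h + x_t
        if linear:
            h = pre
            grad *= w                  # factor for a linear step
        else:
            h = math.tanh(pre)
            grad *= w * (1.0 - h * h)  # factor for a tanh step (chain rule)
    return grad

inputs = [0.5] * 50  # constant toy input over 50 steps

vanishing = gradient_through_time(0.5, inputs)               # every factor is at most 0.5
exploding = gradient_through_time(1.5, inputs, linear=True)  # every factor is 1.5

print(f"w=0.5, tanh:   {vanishing:.3e}")
print(f"w=1.5, linear: {exploding:.3e}")
```

With w = 0.5 each factor has magnitude at most 0.5, so after 50 steps the gradient is no larger than 0.5 to the power 50, which is below 1e-15. With w = 1.5 and no squashing, the same product grows past 1e8.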
LSTMs and GRUs to the rescue
The fix came from adding gates. A Long Short-Term Memory (LSTM) unit keeps a separate memory cell alongside the hidden state and uses input, forget, and output gates to control what gets stored, discarded, and surfaced at each step. The Gated Recurrent Unit (GRU) is a simpler variant with just two gates, update and reset, and no separate memory cell. By learning when to hold onto information and when to let it go, these architectures keep gradients flowing more stably over long sequences. For much of the 2010s, LSTMs powered machine translation, speech recognition, and early language models.
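A single LSTM step might look like the following sketch. It follows the common textbook formulation; stacking the four gate weight blocks into one matrix W, the order of the blocks, the sizes and the random, untrained weights are assumptions made here for brevity, and library implementations may lay things out differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:H])          # input gate: how much new content to store
    f = sigmoid(z[H:2 * H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to surface
    g = np.tanh(z[3 * H:4 * H])  # candidate values to write into the cell
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4   # illustrative sizes (assumption)
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):  # toy sequence of 6 steps
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

The line that updates c is the important one: when the forget gate is close to one and the input gate is close to zero, the cell is copied forward almost unchanged, which gives gradients a path that is not repeatedly squashed.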
Why transformers took over
Despite their success, recurrent models share a fundamental limitation: they process their input strictly one step after another, which is hard to parallelise and still loses information over very long ranges. The 2017 transformer architecture sidestepped recurrence entirely, using attention to let every position interact with every other position at once. Because all positions are processed in parallel, transformers train much faster on GPUs and model long-range dependencies more directly. As a result, transformers replaced RNNs for most language tasks, though RNNs remain useful in low-resource, streaming, and embedded settings where their compact, step-by-step nature is an advantage.
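For contrast with the step-by-step loop of an RNN, here is a rough sketch of scaled dot-product self-attention, the core operation of a transformer. It is deliberately stripped down: a single head, no masking, no positional encoding, and random, untrained projection matrices. The names W_q, W_k and W_v and the sizes are assumptions for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every position scored against every other
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8  # illustrative sizes (assumption)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

There is no loop over time steps here: the whole sequence is handled with a few matrix multiplications, which is what makes the computation easy to parallelise. Row i of weights shows how strongly position i attends to every position, however far apart they are.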