Transformer
The neural network architecture introduced in 2017 that powers virtually all modern large language models, including GPT, Claude, and Gemini.
January 15, 2026
Where It Came From
In 2017, a team at Google published a paper titled "Attention Is All You Need," which introduced a new neural network design called the transformer. Within a few years, it had replaced earlier architectures for language tasks and become the foundation of nearly every major language model, including GPT, Claude, Gemini, BERT, and many others.
The Key Idea: Attention
The core mechanism in transformers is self-attention. When processing a word in a sentence (more precisely, a token, which may be a word or a piece of one), the model does not look only at neighbouring words; it weighs the relevance of every word in the input at once.
Consider the sentence "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? A human reasons through the full sentence. The attention mechanism gives transformers a similar ability, allowing them to capture long-range relationships in text.
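The weighting step described above can be sketched in a few lines of Python. This is a minimal illustration of scaled dot-product self-attention, not a full transformer layer. The vectors are made-up numbers rather than learned embeddings, the function names are hypothetical, and the same vectors are reused as queries, keys, and values, whereas a real model would derive each from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token against every token in the input, then normalise.
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        # The output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        )
    return outputs, weights

# Toy 2-dimensional vectors for three tokens (illustrative values only).
tokens = ["trophy", "suitcase", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Simplification: the same vectors serve as queries, keys, and values.
outputs, weights = self_attention(vectors, vectors, vectors)

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

In this toy setup, the row for "it" places more weight on "trophy" than on "suitcase", but only because the made-up vectors were chosen that way. In a trained model, such preferences emerge from learned parameters.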
Why Transformers Beat Earlier Architectures
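Before transformers, most language models used recurrent neural networks (RNNs), which processed text one word at a time, in order. Because each step had to wait for the previous one, they were slow to train, and because information about earlier words faded as the sequence went on, they were poor at capturing relationships between distant words.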
During training, transformers process all the words in a sequence in parallel, which makes them far faster to train on modern hardware, especially GPUs. This made it practical to scale models up dramatically and train them on enormous datasets.
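The difference between sequential and parallel processing can be illustrated with a small sketch. Both functions below are hypothetical toy stand-ins that work on single numbers rather than vectors, and the weights `w_in` and `w_rec` are arbitrary. The point is only the dependency structure: the recurrent loop needs each previous state before it can continue, while each row of attention scores depends on the inputs alone.

```python
import math

def rnn_states(inputs, w_in=0.5, w_rec=0.5):
    """Toy scalar recurrent network: each state needs the previous state."""
    state, states = 0.0, []
    for x in inputs:  # this loop cannot be reordered or split across workers
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states

def attention_scores_row(i, inputs):
    """Raw scores for position i; they depend only on the inputs."""
    return [inputs[i] * x for x in inputs]

inputs = [0.1, 0.4, -0.3, 0.8]

sequential = rnn_states(inputs)

# Rows can be computed in any order (here, reversed) with the same result,
# which is what allows hardware such as GPUs to compute them simultaneously.
positions = range(len(inputs))
rows_forward = [attention_scores_row(i, inputs) for i in positions]
rows_reversed = [attention_scores_row(i, inputs) for i in reversed(positions)]
rows_reversed.reverse()

print(rows_forward == rows_reversed)
```

This sketch runs on a single core either way. It shows only that the attention rows have no ordering constraint, which is the property parallel hardware exploits.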
What This Means in Practice
You don't need to implement a transformer to work with AI. But understanding that LLMs are built on this architecture helps explain why they:
- Handle long contexts better than older models
- Can be fine-tuned efficiently for specific tasks
- Are computationally expensive to run
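One contributor to the running cost, assuming standard full self-attention in which every token is compared with every token, is that the number of attention scores grows with the square of the input length. The helper below is a hypothetical illustration of that arithmetic. It ignores layers, attention heads, and every other part of the model.

```python
def attention_score_count(num_tokens):
    """Pairwise scores in one full self-attention pass.

    One score is computed per (query token, key token) pair.
    """
    return num_tokens * num_tokens

# Doubling the input length quadruples the number of scores.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
```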