Glossary

Transformer

The neural network architecture introduced in 2017 that powers virtually all modern large language models, including GPT, Claude, and Gemini.

January 15, 2026

Where It Came From

In 2017, a team at Google published a paper called "Attention is All You Need." They introduced a new neural network design called the transformer. Within a few years, it had replaced earlier architectures and become the foundation of nearly every major AI breakthrough — GPT, Claude, Gemini, BERT, and hundreds of others.

The Key Idea: Attention

The breakthrough in transformers is a mechanism called self-attention. When processing a word in a sentence, the model doesn't just look at neighbouring words — it weighs the relevance of every other word in the input simultaneously.

Consider the sentence "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? A human reasons through the full sentence. The attention mechanism gives transformers a similar ability, allowing them to capture long-range relationships in text.

Why Transformers Beat Earlier Architectures

Before transformers, language models used architectures called RNNs (Recurrent Neural Networks) that processed text word by word in sequence. This made them slow to train and poor at capturing relationships between distant words.

Transformers process all words in parallel, making them far faster to train on modern hardware — especially GPUs. This unlocked the ability to scale up dramatically and train on enormous datasets.

What This Means in Practice

You don't need to implement a transformer to work with AI. But understanding that LLMs are built on this architecture helps explain why they:

Handle long contexts better than older models
Can be fine-tuned efficiently for specific tasks
Are computationally expensive to run

The AI Family Tree: From Machine Learning to Agentic AI

8 min read

Read →

How Large Language Models Actually Work, No Math Required