What Is T5? Text-to-Text Transfer Transformer Explained

Treating every NLP task as text-in, text-out: Google's unified T5 model

The core idea

T5, short for Text-to-Text Transfer Transformer, is a model Google introduced in 2019 with a deceptively simple unifying idea: treat every natural-language task as converting one piece of text into another. Whether the task is translation, summarisation, sentiment classification, or question answering, the input is text and the output is text. This framing means a single model, a single training objective, and a single decoding procedure can handle a wide range of problems: only the words you feed in and read out change.

The text-to-text framing

In conventional NLP setups, different tasks need different output formats: a class label, a span, a number, a sentence. T5 abolishes that variety. To classify a review’s sentiment, you prepend a task prefix such as “sst2 sentence:” and the model outputs the literal word “positive” or “negative.” To translate, you prepend “translate English to German:” and it outputs German text. To summarise, you prepend “summarize:”. Because the output is always text, the same cross-entropy loss over output tokens trains the model on every task, and the task prefix tells the model which task to perform.
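
The prefix scheme described above can be sketched in a few lines of Python. This is only an illustration of the data format, not T5's actual preprocessing code: the prefixes are the ones quoted in the text, while the task keys, the `to_text_to_text` helper, and the sample sentences are hypothetical names chosen for this example.

```python
from typing import Dict

# Prefixes quoted in the article; the dictionary keys are hypothetical labels.
TASK_PREFIXES: Dict[str, str] = {
    "sentiment": "sst2 sentence: ",
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def to_text_to_text(task: str, text: str, target: str) -> Dict[str, str]:
    """Turn one labelled example into an (input text, target text) pair."""
    if task not in TASK_PREFIXES:
        raise KeyError(f"unknown task: {task}")
    return {"input": TASK_PREFIXES[task] + text, "target": target}


# Three different tasks end up in exactly the same shape: a string in, a string out.
examples = [
    to_text_to_text("sentiment", "a charming and often affecting journey", "positive"),
    to_text_to_text("translate_en_de", "That is good.", "Das ist gut."),
    to_text_to_text("summarize", "Heavy rain flooded several streets overnight.", "Rain floods streets."),
]

for example in examples:
    print(example["input"], "->", example["target"])
```

Because every example is a pair of strings, one training loop and one loss function can consume a mixture of all three tasks without any task-specific output layer.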

Encoder-decoder architecture

Unlike encoder-only BERT or decoder-only GPT, T5 uses the full encoder-decoder transformer from the original 2017 Transformer paper. The encoder reads the input text and builds a contextual representation of it, and the decoder generates the output text one token at a time while attending back to the encoder’s representation. This structure is a natural fit for the text-to-text view, since many tasks genuinely map a complete input sequence to a complete output sequence, which is exactly what an encoder-decoder is built to do.
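
The control flow of "encode once, then decode token by token" can be sketched as a greedy decoding loop. The sketch below uses toy stand-ins rather than a real transformer: `toy_encode` and `toy_decode_step` are hypothetical functions invented for this example, and the start and end markers are assumptions modelled on common T5 tooling. Only the shape of the loop is the point.

```python
from typing import Callable, List, Sequence

# Assumed marker strings for this sketch (start-of-decoding and end-of-sequence).
START, END = "<pad>", "</s>"


def greedy_decode(
    encode: Callable[[Sequence[str]], List[str]],
    decode_step: Callable[[List[str], List[str]], str],
    input_tokens: Sequence[str],
    max_len: int = 32,
) -> List[str]:
    """Run the encoder once, then ask the decoder for one token at a time."""
    memory = encode(input_tokens)      # encoder output, computed a single time
    generated = [START]                # decoder input so far
    for _ in range(max_len):
        token = decode_step(memory, generated)  # sees the memory and its own prefix
        if token == END:
            break
        generated.append(token)
    return generated[1:]               # drop the start marker


def toy_encode(tokens: Sequence[str]) -> List[str]:
    """Stand-in encoder: the 'representation' is just a copy of the tokens."""
    return list(tokens)


def toy_decode_step(memory: List[str], generated: List[str]) -> str:
    """Stand-in decoder: emit the upper-cased input token at the current position."""
    position = len(generated) - 1
    return memory[position].upper() if position < len(memory) else END


result = greedy_decode(toy_encode, toy_decode_step, ["guten", "tag"])
print(result)
```

In a real model, `decode_step` would be a stack of decoder layers with cross-attention over `memory`, and it would return the highest-scoring vocabulary token instead of applying a fixed rule.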

Span-corruption pre-training and C4

T5 is pre-trained with a span-corruption objective: random contiguous spans of the input are each replaced with a unique sentinel token, and the model must output the missing spans, each introduced by its sentinel. This is a text-to-text version of masked language modelling that suits the encoder-decoder design. The training data is C4, the Colossal Clean Crawled Corpus, a heavily filtered slice of Common Crawl with boilerplate and junk removed. Training on this large, clean corpus gave T5 broad language competence before any task-specific fine-tuning.
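
To make the objective concrete, here is a minimal sketch of how one corrupted input and its target could be built. It assumes whitespace tokenisation and hand-picked spans, whereas real pre-training samples spans randomly over subword tokens. The `<extra_id_N>` sentinel naming follows the convention used by common T5 tokenizers and is an assumption here, as is the `corrupt_spans` helper itself.

```python
from typing import List, Tuple


def sentinel(index: int) -> str:
    """Sentinel token name; the <extra_id_N> spelling is an assumed convention."""
    return f"<extra_id_{index}>"


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each half-open [start, end) span with a sentinel.

    Spans must be sorted and non-overlapping. Returns (corrupted input, target).
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for index, (start, end) in enumerate(spans):
        if not (cursor <= start < end <= len(tokens)):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel(index))        # the span collapses to one sentinel
        targets.append(sentinel(index))       # the target names the sentinel...
        targets.extend(tokens[start:end])     # ...then spells out what it hid
        cursor = end
    inputs.extend(tokens[cursor:])            # keep the text after the last span
    targets.append(sentinel(len(spans)))      # a final sentinel closes the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])

print(" ".join(corrupted))
print(" ".join(target))
```

The model is trained to produce the second string when given the first, so the target stays short: it contains only the dropped spans and their sentinels, not the whole sentence.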

Why T5 mattered

T5’s lasting contribution was conceptual as much as technical. By showing that one consistent text-to-text format could cover a broad range of NLP tasks, it made multi-task learning and transfer clean and systematic, and it foreshadowed the instruction-following style that defines today’s chat models, where you simply describe a task in words and read the answer back as text. T5 and its instruction-tuned descendant FLAN-T5 remain practical, efficient choices for summarisation, translation, and structured generation, and the text-to-text mindset is now common across modern language models.
