NeurIPS 2017 · 2017

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Google Brain · Google Research · University of Toronto

Abstract summary

Introduces the Transformer, a sequence-to-sequence architecture based entirely on attention mechanisms, dispensing with recurrence and convolutions. Reports state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation benchmarks with significantly less training time than prior recurrent encoder-decoder models. Because self-attention relates all positions of a sequence in a single step, training parallelizes across sequence elements, and the architecture scales effectively with model size and data.

Our summary in our own words — see the canonical source links below for the original abstract.

Why we cite this paper

The Transformer is the architectural foundation of every modern frontier LLM — GPT, Claude, Gemini, Mistral, Llama, Qwen, DeepSeek. Our engineering-acceleration tracking of AI infrastructure and agentic AI categories operates on a substrate that did not exist before this paper. We cite it as the foundational reference for the AI-native engineering surface our /signal corpus covers.

Key findings

1. Attention-only architectures match or exceed recurrent models on sequence-to-sequence tasks while training significantly faster.
2. Self-attention scales effectively with model size, enabling the parameter regimes (1B–1T+) that define modern LLMs.
3. Position encoding via learned or sinusoidal embeddings allows attention models to handle sequence order without recurrence.
4. Multi-head attention captures different relationship types in parallel, a design choice that proved central to LLM expressiveness.
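
To make finding 3 concrete, here is a minimal sketch of the sinusoidal variant of position encoding, using the formulas from the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name, the NumPy implementation, and the example sizes are our own assumptions for illustration, not the authors' code.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Even columns hold sin(pos / 10000^(2i / d_model)),
    odd columns hold cos(pos / 10000^(2i / d_model)).
    """
    if d_model % 2 != 0:
        # Assumption of this sketch: an even model dimension keeps the
        # sin/cos columns paired one-to-one.
        raise ValueError("this sketch assumes an even d_model")

    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # shape (1, d_model / 2), values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe


# Hypothetical usage: add the encodings to token embeddings of the same shape.
token_embeddings = np.zeros((10, 16))  # placeholder embeddings, illustration only
encoded = token_embeddings + sinusoidal_positional_encoding(10, 16)
```

Because the encoding is a fixed function of the position index, it adds no trainable parameters. The learned alternative mentioned in the finding would replace this function with a trainable lookup table indexed by position.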

Canonical sources

https://arxiv.org/abs/1706.03762
https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776
https://openalex.org/works/W2963403868

Related glossary terms

Context Window · Embedding Model · Foundation Model

Frequently Asked Questions

Why is this paper considered foundational?

Every modern frontier LLM (GPT, Claude, Gemini, Mistral, Llama, Qwen) uses the Transformer architecture introduced here. Without this paper, the AI infrastructure and agentic AI categories we track would not exist in their current form.

Where can I read the canonical version?

The paper is freely available on arXiv (arXiv:1706.03762) and was published at NeurIPS 2017. It is one of the most-cited ML papers ever published.

Who wrote Attention Is All You Need?

The eight authors were Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin — most working at Google Brain or Google Research at the time of publication.

What is the Transformer architecture?

A sequence-to-sequence neural network built entirely on self-attention, dispensing with recurrence and convolutions. Its parallelism and clean scaling behavior with model size and data made the modern LLM era possible.
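
As a rough illustration of what "built entirely on self-attention" means in practice, the sketch below implements the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, and runs several heads side by side. It is a simplified sketch under stated assumptions: the function names, the random weights, and the tiny dimensions are ours, the per-head projections are packed into single d_model × d_model matrices, and masking, dropout, and biases are omitted.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V over the last two axes."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ v, weights


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model) with num_heads heads."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Queries, keys, and values are all projections of the same input x,
    # which is what makes this *self*-attention.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # All heads and all positions are handled by batched matrix products;
    # there is no step-by-step loop over the sequence as in a recurrent model.
    heads, weights = scaled_dot_product_attention(q, k, v)

    # (num_heads, seq_len, d_head) -> (seq_len, d_model), then output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
output, attn_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

The output has the same shape as the input, so layers of this kind can be stacked. Each head produces its own seq_len × seq_len weight matrix, which is how different heads can attend to different relationships in the same sequence.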

Signed The Data Nerd · pseudonymous narrator · methodology over personality

Other research papers

Language Models are Few-Shot Learners

Training language models to follow instructions with human feedback

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

LoRA: Low-Rank Adaptation of Large Language Models

Constitutional AI: Harmlessness from AI Feedback

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
