NeurIPS 2017 · 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Google Brain · Google Research · University of Toronto
Introduces the Transformer architecture: a sequence-to-sequence model based entirely on attention mechanisms, dispensing with recurrence and convolutions. Demonstrates state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation benchmarks at a fraction of the training cost of the prior recurrent and convolutional encoder-decoder models. Because self-attention relates all positions in a sequence to one another directly, the model processes sequence elements in parallel rather than one step at a time, and it scales effectively with model size and data.
Our summary in our own words — see the canonical source links below for the original abstract.
The Transformer is the architectural foundation of the modern frontier LLMs: GPT, Claude, Gemini, Mistral, Llama, Qwen, DeepSeek. The AI infrastructure and agentic AI categories covered by our engineering-acceleration tracking rest on a foundation that did not exist before this paper. We cite it as the foundational reference for the AI-native engineering surface our /signal corpus covers.
The major frontier LLMs (GPT, Claude, Gemini, Mistral, Llama, Qwen, DeepSeek) all build on the Transformer architecture introduced here. Without this paper, the AI infrastructure and agentic AI categories we track would not exist in their current form.
The paper is freely available on arXiv (arXiv:1706.03762). It is one of the most-cited ML papers ever published. NeurIPS 2017 was the venue.
The eight authors were Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin — most working at Google Brain or Google Research at the time of publication.
A sequence-to-sequence neural network built entirely on self-attention, dispensing with recurrence and convolutions. Its parallelism and clean scaling behavior with model size and data made the modern LLM era possible.
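To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention, the core operation the paper defines as softmax(QK^T / sqrt(d_k)) V. It is an illustration under simplifying assumptions, not the paper's implementation: a single head, no masking, no learned projections, and plain Python lists instead of a tensor library. The function and variable names are our own.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    queries: n_q rows of length d_k
    keys:    n_k rows of length d_k
    values:  n_k rows of length d_v
    Returns n_q rows of length d_v.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs = []
    # Each query row is handled independently of the others, which is
    # why the whole computation can run in parallel across positions.
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)  # one weight per key, summing to 1
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    # Toy input: three positions, each with a 2-dimensional vector.
    # In self-attention, queries, keys, and values come from the same sequence.
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in scaled_dot_product_attention(x, x, x):
        print(row)
```

Each output row is a weighted average of the value rows, with weights set by how strongly that position's query matches every key. No position waits on the result for a previous position, in contrast to a recurrent model.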