NeurIPS 2022 · 2022

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin

OpenAI

Abstract summary

Introduces InstructGPT and the RLHF (Reinforcement Learning from Human Feedback) pipeline: (1) collect demonstrations from human labelers for supervised fine-tuning, (2) collect human preference comparisons over model outputs to train a reward model, (3) optimize the LM against the reward model via PPO. Reports that labelers prefer the resulting models' outputs to those of the raw GPT-3 baseline, with gains in helpfulness, truthfulness, and harmlessness, even at a fraction of the parameter count.
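For reference, stage (3) can be written as a single objective. The form below follows the objective as we read it in the paper, so treat the notation as a sketch and check it against the canonical source. Here φ denotes the policy parameters, r_θ is the reward model from stage (2), π^SFT is the supervised fine-tuned model from stage (1), β scales a KL penalty that keeps the policy close to the SFT model, and γ weights an optional pretraining term (γ = 0 corresponds to plain PPO).

```latex
\[
\text{objective}(\phi) =
\mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
\left[ r_\theta(x,y)
  - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
+ \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
\]
```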

This summary is in our own words; see the canonical source links below for the original abstract.

Why we cite this paper

RLHF is the alignment technique that turned raw foundation models into the instruction-tuned, helpful-by-default assistants exemplified by ChatGPT, Claude, and Gemini. Our engineering-acceleration tracking of frontier AI labs (Anthropic, OpenAI, etc.) and of the agentic AI categories treats this paper's pipeline as the alignment baseline.

Key findings

1. A 1.3B-parameter InstructGPT model is preferred over the 175B-parameter GPT-3 in human-preference evaluations after RLHF.
2. The three-stage pipeline (SFT → reward model → PPO) became the de facto alignment recipe for major frontier labs.
3. Helpfulness, truthfulness, and harmlessness can be improved simultaneously without major capability loss.
4. Modern alternatives report similar results while inheriting the framing: DPO and KTO drop the explicit reward-model step, and RLAIF replaces human preference labels with AI-generated ones.
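To make finding 4 concrete for DPO, here is a minimal sketch of the per-example DPO loss, which trains directly on preference pairs without a separate reward model. The formula comes from the DPO paper rather than from InstructGPT. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are summed sequence log-probabilities that a real implementation would compute from the policy and a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is the log-probability ratio between the policy and the
    frozen reference model for one response.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) is the same quantity as softplus(-z).
    return softplus(-z)
```

When the policy equals the reference model, both margins are zero and the loss is log 2. It falls as the policy shifts probability toward the chosen response relative to the rejected one.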

Canonical sources

https://arxiv.org/abs/2203.02155
https://www.semanticscholar.org/paper/d766bffc357127e0dc86dd69561d5aeb520d6f4c

Related glossary terms

RLHF (Reinforcement Learning from Human Feedback)
Fine-tuning
Foundation Model

Frequently Asked Questions

What is RLHF?

RLHF stands for Reinforcement Learning from Human Feedback, the training technique that aligns LLMs with human-preferred outputs after pretraining. See /define/rlhf for the full term definition.

Why is this paper considered foundational?

InstructGPT formalized the RLHF pipeline that the training of ChatGPT, Claude, and Gemini uses as its alignment baseline. The paper showed how to turn LLMs from raw text-prediction models into instruction-following assistants.

What are the three stages of the RLHF pipeline?

(1) supervised fine-tuning on human-written demonstrations, (2) training a reward model on human preference comparisons over model outputs, and (3) optimizing the language model against that reward model with PPO reinforcement learning.
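Stage (2) is the least self-explanatory of the three, so here is a minimal sketch of a pairwise preference loss of the kind used to train a reward model: for every pair of responses to the same prompt, the model is penalized by -log sigmoid(r_preferred - r_rejected). The function names are hypothetical, and the plain floats stand in for the scalar rewards that a real reward model (a language model with a scalar output head) would produce.

```python
import math
from itertools import combinations
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_rm_loss(rewards_best_to_worst: Sequence[float]) -> float:
    """Average pairwise preference loss for one prompt.

    `rewards_best_to_worst` holds the reward-model scores for K responses,
    ordered from the labeler's most preferred to least preferred.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always the response the labeler ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    return -sum(log_sigmoid(better - worse) for better, worse in pairs) / len(pairs)
```

If the reward model scores every response identically, the loss is log 2 (about 0.693). It drops toward zero as the scores agree with the labeler's ranking by a wider margin.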

Did a smaller InstructGPT model beat GPT-3?

Yes. The paper reports that human evaluators preferred outputs from a 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3 baseline, even though the smaller model has over 100× fewer parameters.
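The size gap can be checked from the two parameter counts quoted above. This is a quick arithmetic sketch; the variable names are our own.

```python
INSTRUCTGPT_PARAMS = 1.3e9  # 1.3B parameters
GPT3_PARAMS = 175e9         # 175B parameters

ratio = GPT3_PARAMS / INSTRUCTGPT_PARAMS
print(f"GPT-3 is about {ratio:.0f}x larger")  # prints "GPT-3 is about 135x larger"
```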

Signed The Data Nerd · pseudonymous narrator · methodology over personality

Other research papers

NeurIPS 2017 · 2017

Read our own methodology paper

Code-Side Sourcing methodology, replicable on the open dataset.

Read /methodology

Other research papers

Attention Is All You Need

Language Models are Few-Shot Learners

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

LoRA: Low-Rank Adaptation of Large Language Models

Constitutional AI: Harmlessness from AI Feedback

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
