NeurIPS 2022 · 2022
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin
OpenAI
Introduces InstructGPT and the RLHF (Reinforcement Learning from Human Feedback) pipeline: (1) collect demonstrations from human labelers for supervised fine-tuning, (2) collect human preference comparisons over model outputs to train a reward model, (3) optimize the LM against the reward model via PPO (Proximal Policy Optimization). Shows that labelers prefer the resulting models' outputs to those of the raw GPT-3 baseline, with gains in truthfulness and reductions in toxic output, even at a fraction of the parameter count.
Our summary in our own words — see the canonical source links below for the original abstract.
RLHF is the alignment technique that turned raw foundation models into the instruction-tuned, helpful-by-default assistants exemplified by ChatGPT, Claude, and Gemini. Our engineering-acceleration tracking of frontier AI labs (Anthropic, OpenAI, etc.) and of the agentic AI categories treats this paper's pipeline as the alignment baseline.
Reinforcement Learning from Human Feedback — the training technique that aligns LLMs to human-preferred outputs after pretraining. See /define/rlhf for the full term definition.
InstructGPT formalized the RLHF pipeline that serves as the alignment baseline for training ChatGPT, Claude, and Gemini. The paper showed how to turn LLMs from raw text-prediction models into instruction-following assistants.
The pipeline has three stages: (1) supervised fine-tuning on human-written demonstrations, (2) training a reward model on human preference comparisons over model outputs, and (3) optimizing the language model against that reward model with PPO (Proximal Policy Optimization) reinforcement learning.
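Stages (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, rewards and log-probabilities are passed in as plain floats instead of coming from neural networks, the KL-style penalty is applied to a whole sequence rather than per token, the pretraining-mix term of the paper's objective is left out, and the `beta` value used in the test is an arbitrary placeholder.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        # log(sigmoid(z)) = -log(1 + exp(-z)); exp(-z) cannot overflow here
        return -math.log1p(math.exp(-z))
    # For negative z, rewrite as z - log(1 + exp(z)) to avoid overflow
    return z - math.log1p(math.exp(z))


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Stage 2 (sketch): loss for one human comparison.

    The reward model is pushed to score the labeler-preferred output
    higher than the rejected one: -log sigmoid(r_chosen - r_rejected).
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


def kl_penalized_reward(
    reward: float, logp_policy: float, logp_sft: float, beta: float
) -> float:
    """Stage 3 (sketch): the signal the RL step maximizes.

    The reward-model score is reduced by beta * log(pi_RL / pi_SFT),
    which discourages the policy from drifting far from the
    supervised fine-tuned model. beta is a tunable coefficient.
    """
    return reward - beta * (logp_policy - logp_sft)
```

When both outputs receive the same reward, the comparison loss equals log 2 (the reward model is guessing); it shrinks as the preferred output's reward pulls ahead. The penalty term is why stage (3) does not simply exploit the reward model without limit: a sample that the policy finds far more likely than the fine-tuned model does has its reward reduced.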
Yes. The paper reports that outputs from a 1.3B-parameter InstructGPT model were preferred by human evaluators over those from the 175B-parameter GPT-3 baseline: better-rated outputs from a model with more than 100× fewer parameters.
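The size gap can be checked with quick arithmetic. The snippet below uses only the two parameter counts quoted above; the variable names are illustrative.

```python
# Parameter counts as quoted in the summary above
gpt3_params = 175e9         # 175B-parameter GPT-3 baseline
instructgpt_params = 1.3e9  # 1.3B-parameter InstructGPT model

# How many times larger the baseline is than the preferred model
ratio = gpt3_params / instructgpt_params
print(f"GPT-3 is about {ratio:.1f}x larger")  # prints "GPT-3 is about 134.6x larger"
```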