arXiv preprint · 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al.
Anthropic
Introduces Constitutional AI (CAI): an alignment approach in which an LLM critiques and revises its own outputs according to a written constitution of principles, and reinforcement learning from AI feedback (RLAIF) replaces human labeling of harmful outputs. Demonstrates that RLAIF can produce models that are both more helpful and more harmless than RLHF baselines, while scaling alignment without proportional human labeling effort.
Our summary in our own words — see the canonical source links below for the original abstract.
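The critique-and-revise step described in the summary can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `Model`, `critique_and_revise`, `toy_model`, and the prompt wording are all hypothetical names chosen for this example, and the loop walks the principles in order as a simplification.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion string.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the same model to critique its own response against one principle.
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def toy_model(text: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline (not a real model)."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique"
    return "draft response"


final = critique_and_revise(toy_model, "How do I pick a lock?", ["Be harmless."])
```

In a real pipeline the revised responses would become supervised fine-tuning targets; here `toy_model` only returns fixed strings so the control flow can be checked without a model.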
Constitutional AI is the foundation of Claude's training pipeline at Anthropic — the headline 'safety-first' frontier lab in our engineering-acceleration tracking. The RLAIF paradigm addresses RLHF's scaling bottleneck and has influenced subsequent alignment research across frontier labs.
Standard RLHF trains a reward model on human preference data; Constitutional AI trains it on AI-generated preferences judged against a written constitution. Both pipelines then fine-tune the model with reinforcement learning against that reward model, but CAI scales without proportional human-labeling effort.
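The difference between the two pipelines comes down to who supplies the preference label. The sketch below shows a labeler interface where an AI judge, prompted with a written principle, fills the slot a human annotator would otherwise fill. Everything here is an assumption made for illustration: `Labeler`, `make_ai_labeler`, `build_preference_pair`, `toy_judge`, the prompt wording, and the `chosen`/`rejected` record layout are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict

# Hypothetical interface: a labeler returns 0 if response_a is preferred, 1 otherwise.
# A human annotation tool and an AI judge can both sit behind this signature.
Labeler = Callable[[str, str, str], int]


def make_ai_labeler(judge: Callable[[str], str], principle: str) -> Labeler:
    """Wrap a judge model so it labels response pairs against one written principle."""
    def label(prompt: str, response_a: str, response_b: str) -> int:
        answer = judge(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"(A) {response_a}\n(B) {response_b}\n"
            "Which response better follows the principle? Answer A or B."
        )
        return 0 if answer.strip().upper().startswith("A") else 1
    return label


def build_preference_pair(
    prompt: str, response_a: str, response_b: str, labeler: Labeler
) -> Dict[str, str]:
    """Turn one labeled comparison into a reward-model training record."""
    choice = labeler(prompt, response_a, response_b)
    chosen, rejected = (
        (response_a, response_b) if choice == 0 else (response_b, response_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(text: str) -> str:
    """Stand-in judge (not a real model): rejects option A if it contains 'idiot'."""
    line_a = next(line for line in text.splitlines() if line.startswith("(A)"))
    return "B" if "idiot" in line_a else "A"


labeler = make_ai_labeler(toy_judge, "Choose the less harmful response.")
pair = build_preference_pair(
    "Review my essay.", "It's bad, idiot.", "Here are two fixes.", labeler
)
```

Because the reward-model training code only sees `chosen` and `rejected`, swapping a human labeler for an AI one leaves the rest of the pipeline unchanged, which is the scaling argument the comparison makes.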
The paper comes from Anthropic. Authors include Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, and Andy Jones (arXiv:2212.08073, 2022).
RLAIF stands for Reinforcement Learning from AI Feedback, the technique introduced in this paper: AI-generated preferences judged against a written constitution replace human preference labeling, allowing alignment to scale without proportional human effort.
Constitutional AI is the foundation of Anthropic's Claude training pipeline. The RLAIF approach has also influenced subsequent alignment research across other frontier labs.