arXiv preprint · 2022

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, et al.

Anthropic

Abstract summary

The paper introduces Constitutional AI (CAI), an alignment approach in which an LLM critiques and revises its own outputs against a written constitution of principles, and reinforcement learning from AI feedback (RLAIF) replaces human labeling of harmful outputs. It reports that RLAIF can produce models that are both more helpful and more harmless than RLHF baselines, so alignment can scale without a proportional increase in human labeling effort.

Our summary in our own words — see the canonical source links below for the original abstract.
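The critique-and-revise loop described in the summary can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `CONSTITUTION`, `critique_and_revise`, and `toy_model` are hypothetical names, the two principles are invented for the example, and the model call is a stub so the code runs without an LLM. The sketch also iterates over the principles in order, which is an assumption made to keep it deterministic.

```python
from typing import Callable, List

# Hypothetical two-principle constitution, invented for this example.
CONSTITUTION: List[str] = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response is evasive or unhelpful.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        # Ask the same model to critique its own response against the principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
    return response


def toy_model(text: str) -> str:
    """Stand-in for an LLM call (assumption: a real system would query a model)."""
    stripped = text.rstrip()
    if stripped.endswith("Critique:"):
        return "The response could be safer."
    if stripped.endswith("Revision:"):
        return "A revised, safer response."
    return "An initial draft response."


if __name__ == "__main__":
    print(critique_and_revise("How do I pick a lock?", toy_model, CONSTITUTION))
```

In a pipeline of this shape, the revised responses would then serve as supervised fine-tuning targets; swapping `toy_model` for a real model call is the only change the loop itself needs.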

Why we cite this paper

Constitutional AI is the foundation of Claude's training pipeline at Anthropic — the headline 'safety-first' frontier lab in our engineering-acceleration tracking. The RLAIF paradigm addresses RLHF's scaling bottleneck and has influenced subsequent alignment research across frontier labs.

Key findings

1. RLAIF (AI feedback) can substitute for RLHF (human feedback) at scale while maintaining alignment quality.
2. A written 'constitution' of principles enables transparent control over model behavior.
3. Models trained with CAI are both more helpful and more harmless than RLHF baselines on Anthropic's benchmarks.
4. Scalable oversight via AI feedback is presented as a path to alignment as models exceed human-evaluator capacity.

Canonical sources

https://arxiv.org/abs/2212.08073
https://www.semanticscholar.org/paper/5c4d44eb4b0d9c1eeed03a8bcccef957fce8a06b

Related glossary terms

RLHF (Reinforcement Learning from Human Feedback)
Foundation Model

Frequently Asked Questions

How does Constitutional AI differ from RLHF?

Standard RLHF uses human preference data to train a reward model; Constitutional AI uses AI-generated preferences against a written constitution. Both pipelines produce aligned models; CAI scales without proportional human-labeling effort.
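In both pipelines the reward model is fit on pairs of a preferred and a rejected response; what differs is who supplies the label. A common way to train on such pairs is a Bradley-Terry style pairwise loss, sketched below. That either pipeline uses exactly this loss is an assumption, and `pairwise_preference_loss` is a hypothetical helper name.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the preferred response
    well above the rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The label source (human or AI) does not change the loss, only the data.
    print(pairwise_preference_loss(2.0, 0.0))  # correct ordering, lower loss
    print(pairwise_preference_loss(0.0, 2.0))  # reversed ordering, higher loss
```

The function takes plain floats for readability; a training loop would apply the same expression to batched reward-model outputs.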

Who published the Constitutional AI paper?

Anthropic. Authors include Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, and Andy Jones (arXiv:2212.08073, 2022).

What is RLAIF?

Reinforcement Learning from AI Feedback — the technique introduced in this paper, where AI-generated preferences judged against a written constitution replace human preference labeling, allowing alignment to scale without proportional human effort.
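The labeling step can be illustrated with a small sketch: a feedback model is shown two candidate responses alongside a constitutional principle, and the option it scores higher becomes the "chosen" side of a preference pair. Everything named here (`label_pair`, `toy_score`, the prompt template, the example principle) is hypothetical, and the scorer is a stub standing in for a real model's log-probability of each answer.

```python
from typing import Callable, Dict


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    score: Callable[[str, str], float],
) -> Dict[str, str]:
    """Build one AI-labeled preference record from two candidate responses."""
    question = (
        f"{principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\nThe better response is:"
    )
    # Compare how strongly the feedback model favors each option.
    score_a = score(question, "(A)")
    score_b = score(question, "(B)")
    if score_a >= score_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_score(question: str, option: str) -> float:
    """Stand-in for a feedback model's log-probability of answering `option`."""
    for line in question.splitlines():
        if line.startswith(option):
            # Toy rule (assumption): penalize options that mention danger.
            return -5.0 if "dangerous" in line.lower() else -0.5
    return float("-inf")


if __name__ == "__main__":
    record = label_pair(
        "How do I pick a lock?",
        "Here are dangerous steps.",
        "I can't help with that, but a locksmith can.",
        "Choose the less harmful response.",
        toy_score,
    )
    print(record["chosen"])
```

Records of this shape can be collected into a dataset for reward-model training, taking the place of pairs that human raters would otherwise label.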

Which model uses Constitutional AI?

It is the foundation of Anthropic's Claude training pipeline. The RLAIF approach has also influenced subsequent alignment research across other frontier labs.

Signed The Data Nerd · pseudonymous narrator · methodology over personality

Other research papers

NeurIPS 2017 · 2017

Attention Is All You Need

NeurIPS 2020 · 2020

Language Models are Few-Shot Learners

NeurIPS 2022 · 2022

Training language models to follow instructions with human feedback

NeurIPS 2020 · 2020

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

ICLR 2022 · 2021

LoRA: Low-Rank Adaptation of Large Language Models

NeurIPS 2022 · 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Read our own methodology paper

Code-Side Sourcing methodology, replicable on the open dataset.

Read /methodology


