Aligning language models with human preferences

Korbak, Tomasz

thesis

posted on 2024-04-22, 12:59 authored by Tomasz Korbak

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content or falsehoods, or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but that distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results show that there is room for alignment techniques that differ from and complement RLHF.

History

File Version

Published version

Pages

149

Department affiliated with

Informatics Theses

Qualification level

doctoral

Qualification name

phd

Language

eng

Institution

University of Sussex

Supervisor

Christopher L. Buckley, Anil Seth

Keywords

Language models, Reinforcement learning, Alignment, Finetuning, Pretraining, AI safety, Human feedback
