Deep Reinforcement Learning from Human Preferences
The authors propose to train a reward model from human preference comparisons. The dataset consists of triples \((\sigma^1, \sigma^2, \mu)\), where \(\sigma^1\) and \(\sigma^2\) are trajectory segments and \(\mu\) is a distribution indicating the human preference between them.
The authors model the reward function \(r\) with a neural network and define the parameterized preference as
\begin{align} \Pr(\sigma^1 \succ \sigma^2; r) := \frac{\exp \sum_t r(o_t^1, a_t^1)}{\sum_{i=1,2} \exp \sum_t r(o_t^i, a_t^i)}. \end{align}
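As a minimal sketch of this preference model (PyTorch, assuming a hypothetical `reward_model` that maps each encoded \((o_t, a_t)\) pair to a scalar reward), the probability is a softmax over the two trajectories' summed predicted rewards:

```python
import torch
import torch.nn as nn

def preference_prob(reward_model: nn.Module,
                    traj1: torch.Tensor,
                    traj2: torch.Tensor) -> torch.Tensor:
    """Pr(sigma^1 > sigma^2; r): softmax over the two trajectories' summed rewards.

    traj1, traj2: tensors of shape (T, d) holding a (hypothetical) encoding
    of the (o_t, a_t) pairs along each trajectory segment.
    """
    # Sum the predicted per-step rewards over each trajectory.
    ret1 = reward_model(traj1).sum()
    ret2 = reward_model(traj2).sum()
    # Softmax over the two returns; index 0 is Pr(sigma^1 > sigma^2).
    return torch.softmax(torch.stack([ret1, ret2]), dim=0)[0]
```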
The loss is the cross-entropy between the predicted preference \(\Pr\) and the human label \(\mu\):
\begin{align} \mathcal L(r) = - \sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal D} \Big[ \mu(1) \log \Pr(\sigma^1 \succ \sigma^2; r) + \mu(2) \log \Pr(\sigma^2 \succ \sigma^1; r) \Big]. \end{align}
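Continuing the sketch above, a rough illustration of this loss over a batch of labeled comparisons (the batching and regularization details in the paper differ):

```python
def preference_loss(reward_model: nn.Module, batch) -> torch.Tensor:
    """Cross-entropy between predicted preferences and human labels.

    batch: iterable of (traj1, traj2, (mu1, mu2)) triples, where the label
    (mu1, mu2) sums to 1, e.g. (1, 0), (0, 1), or (0.5, 0.5) for a tie.
    """
    loss = torch.tensor(0.0)
    for traj1, traj2, (mu1, mu2) in batch:
        p1 = preference_prob(reward_model, traj1, traj2)  # Pr(sigma^1 > sigma^2)
        p2 = 1.0 - p1                                      # Pr(sigma^2 > sigma^1)
        loss = loss - (mu1 * torch.log(p1) + mu2 * torch.log(p2))
    return loss
```

Minimizing this loss with respect to the reward model's parameters fits \(\Pr\) to the human labels \(\mu\).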
The authors mention several practical modifications to this basic algorithm.
@article{christiano2017deep,
title = {Deep reinforcement learning from human preferences},
author = {Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario},
journal = {Advances in Neural Information Processing Systems},
year = {2017},
}