Algorithmic Foundations for Safe and Efficient Reinforcement Learning from Human Feedback
Open access
Author
Date
2023
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Reinforcement learning (RL) has shown remarkable success in applications with well-defined reward functions, such as maximizing the score in a video game or optimizing an algorithm’s run-time. However, in many real-world applications, there is no well-defined reward function. Instead, Reinforcement Learning from Human Feedback (RLHF) allows RL agents to learn from human-provided data, such as evaluations or rankings of trajectories. In many applications, human feedback is expensive to collect; therefore, learning robust policies from limited data is crucial. In this dissertation, we propose novel algorithms to enhance the sample efficiency and robustness of RLHF.
First, we propose active learning algorithms to improve the sample efficiency of RLHF by selecting the most informative data points for the user to label and by exploring the environment guided by uncertainty about the user’s preferences. Our approach provides conceptual clarity about active learning for RLHF and theoretical sample complexity results, drawing inspiration from multi-armed bandits and Bayesian optimization. Moreover, we provide extensive empirical evaluations in simulations that demonstrate the benefit of active learning for RLHF.
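To make the idea of uncertainty-guided query selection concrete, here is a minimal sketch, not the thesis's algorithm: it assumes linear reward features, a pairwise Bradley-Terry preference model, and a small ensemble of reward models, and it queries the human on the trajectory pair where the ensemble's preference predictions disagree most. All function names and modeling choices below are illustrative assumptions.

```python
# Illustrative sketch of uncertainty-based query selection for preference-based
# reward learning (assumed linear rewards and Bradley-Terry preferences).
import numpy as np

rng = np.random.default_rng(0)

def ensemble_preference_probs(ensemble, traj_a, traj_b):
    """P(traj_a preferred over traj_b) under each reward model in the ensemble."""
    probs = []
    for w in ensemble:
        r_a, r_b = traj_a @ w, traj_b @ w          # linear trajectory-feature rewards (assumption)
        probs.append(1.0 / (1.0 + np.exp(-(r_a - r_b))))
    return np.array(probs)

def most_informative_pair(ensemble, candidate_pairs):
    """Pick the pair whose preference label the ensemble is most uncertain about."""
    best_pair, best_score = None, -np.inf
    for traj_a, traj_b in candidate_pairs:
        p = ensemble_preference_probs(ensemble, traj_a, traj_b)
        score = p.std()                             # ensemble disagreement as an uncertainty proxy
        if score > best_score:
            best_pair, best_score = (traj_a, traj_b), score
    return best_pair

# Toy usage: 5 reward models over 4-dimensional trajectory features, 20 candidate pairs.
ensemble = [rng.normal(size=4) for _ in range(5)]
pairs = [(rng.normal(size=4), rng.normal(size=4)) for _ in range(20)]
query = most_informative_pair(ensemble, pairs)
```

The disagreement criterion stands in for the information-theoretic acquisition rules discussed in the dissertation; the point is only that the learner, not the annotator, decides which comparison is worth a label.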
Second, we extend RLHF to learning constraints from human preferences instead of, or in addition to, rewards. We argue that constraints are a particularly natural representation of human preferences, especially in safety-critical applications. We develop algorithms that learn constraints effectively from demonstrations with unknown rewards and that actively learn constraints from human feedback. Our results suggest that representing human preferences as constraints can lead to safer policies and extend the potential applications of RLHF.
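As a rough illustration of constraint inference from demonstrations with unknown rewards, the sketch below (an illustrative assumption, not the thesis's method) flags states that a nominal reward-maximizing policy visits often but the human demonstrations never visit as candidate constraints.

```python
# Illustrative sketch: flag candidate constrained states by comparing state
# visitation frequencies of human demonstrations against a nominal policy.
import numpy as np

def candidate_constraints(demo_visits, nominal_visits, threshold=0.05):
    """demo_visits, nominal_visits: per-state visitation frequencies (same shape).
    Returns indices of states flagged as likely constrained."""
    demo_visits = np.asarray(demo_visits, dtype=float)
    nominal_visits = np.asarray(nominal_visits, dtype=float)
    # A state the demonstrator never enters, but the nominal policy relies on,
    # is a candidate constraint (threshold is an illustrative hyperparameter).
    flagged = (demo_visits == 0) & (nominal_visits > threshold)
    return np.flatnonzero(flagged)

# Toy usage: 6 states; the nominal policy cuts through state 2, demonstrations avoid it.
demo = [0.4, 0.3, 0.0, 0.2, 0.1, 0.0]
nominal = [0.2, 0.1, 0.5, 0.1, 0.1, 0.0]
print(candidate_constraints(demo, nominal))   # -> [2]
```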
The proposed algorithms for reward and constraint learning serve as a foundation for future research to enhance the efficiency, safety, and applicability of RLHF.
Permanent link
https://doi.org/10.3929/ethz-b-000635156
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
reinforcement learning; inverse reinforcement learning; preference learning; reinforcement learning from human feedback
Organisational unit
03908 - Krause, Andreas / Krause, Andreas