Open access
Author
Date
2022
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
Reinforcement Learning (RL) has advanced the state of the art in many applications over the last decade.
Its success stems from access to high-quality simulators, controlled environments, and massive computing power.
Nonetheless, when the goal is to apply RL algorithms to real-world problems, many challenges remain open.
This dissertation focuses on three of them: data efficiency, robustness, and safety.
On the one hand, practical algorithms that address these issues lack theoretical guarantees.
On the other hand, theoretically sound algorithms are impractical.
This thesis aims to develop algorithms that achieve the best of both worlds.
Namely, we propose theoretically sound algorithms that can be scaled using state-of-the-art neural networks and are easy to implement.
We take a model-based approach and learn models distinguishing between aleatoric and epistemic uncertainty.
The former is uncertainty inherent to the system, such as sensor noise.
In contrast, the latter stems from data scarcity, decreasing as we collect more data and expand our knowledge about the environment.
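To make the distinction concrete, here is a minimal sketch of a decomposition commonly used with ensembles of probabilistic models (our illustration; the function name and interface are hypothetical, and the thesis's specific models may differ):

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """Decompose predictive uncertainty for an ensemble of probabilistic
    dynamics models evaluated at a single (state, action) pair.

    means, variances: arrays of shape (n_models, state_dim) holding each
        ensemble member's predictive mean and variance.
    """
    # Aleatoric: noise the system itself carries, estimated by averaging
    # the members' predicted variances. It does not shrink with more data.
    aleatoric = variances.mean(axis=0)
    # Epistemic: disagreement between members, which shrinks as data
    # accumulates and the members converge.
    epistemic = means.var(axis=0)
    return aleatoric, epistemic
```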
It is well-known that one needs to plan using epistemic uncertainty to achieve data-efficient exploration, robustness, and safety.
Unfortunately, the algorithms that do so are impractical as they require optimizing over the set of plausible models.
We reparameterize the set of plausible models to overcome this limitation.
In particular, we add a hallucinating control policy that directly acts on the model's outputs and has as much authority as the epistemic uncertainty that the model affords.
The reparameterization increases the dimensionality of the action space but reduces the intractable planning problem to one that standard RL algorithms can handle.
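As a sketch of this idea (names are ours), assume the learned model provides a predictive mean and an epistemic confidence interval; the hallucinated input then selects one plausible model inside that interval:

```python
def hallucinated_dynamics(mean_fn, epistemic_std_fn, beta=1.0):
    """Reparameterize the set of plausible models with a hallucinated
    input eta in [-1, 1]^state_dim.

    mean_fn, epistemic_std_fn: callables mapping (state, action) to the
        model's predictive mean and epistemic standard deviation.
    beta: width of the confidence interval around the mean.
    """
    def dynamics(state, action, eta):
        # eta has exactly as much authority as the epistemic uncertainty:
        # it can move the prediction anywhere inside the confidence set,
        # but no further.
        return mean_fn(state, action) + beta * epistemic_std_fn(state, action) * eta
    return dynamics
```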
We first consider the problem of data-efficient exploration.
In this setting, the objective is to find an optimal policy with few interactions with the environment.
A principled approach to this problem is optimism: an agent plans a policy using the most optimistic dynamics within the set of plausible models.
Unfortunately, this requires jointly optimizing policies and dynamics, which is intractable.
We propose the Hallucinated Upper Confidence RL (H-UCRL) algorithm.
By augmenting the action space with the hallucinated inputs, the H-UCRL objective can be optimized with standard planners.
Hence, H-UCRL is practical while retaining its theoretical guarantees.
In particular, we show that H-UCRL attains near-optimal sample complexity guarantees, and we apply it to large-scale environments.
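A minimal sketch of the resulting optimistic rollout, building on the hallucinated dynamics above (names hypothetical): the planner jointly optimizes the control policy and a hallucinating policy that selects eta, instead of searching over the set of models directly.

```python
def optimistic_return(policy, hallucination_policy, dynamics, reward,
                      initial_state, horizon):
    """H-UCRL-style objective: maximizing this return over both policies
    with any standard RL planner replaces the intractable joint
    optimization over policies and plausible dynamics."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        eta = hallucination_policy(state)     # hallucinated input in [-1, 1]^d
        total += reward(state, action)
        state = dynamics(state, action, eta)  # optimistic next state
    return total
```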
In real-world tasks, RL agents frequently encounter situations that were not present during training.
To ensure reliable performance, agents must be robust to such worst-case situations.
The robust RL framework addresses this challenge via a worst-case optimization between an agent and an adversary.
Previous robust RL algorithms are either sample inefficient, lack robustness guarantees, or do not scale to larger problems.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm, which provably solves this problem.
RH-UCRL combines optimism with pessimism when planning with the model to output a robust policy.
Experimentally, we demonstrate that RH-UCRL outperforms other robust deep RL algorithms in various adversarial environments.
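A rough sketch of the max-min structure (our simplification; names and the exact interface are ours): the agent's hallucinated inputs are chosen optimistically while the adversary's are chosen pessimistically, and training alternates ascent steps on the agent with descent steps on the adversary.

```python
def adversarial_rollout(agent, adversary, hallucination, dynamics,
                        reward, initial_state, horizon):
    """Shared rollout for an RH-UCRL-style max-min objective. The agent
    is trained to maximize this return under an optimistic hallucination
    policy; the adversary is trained to minimize it under a pessimistic
    one. Alternating the two updates approximates the worst-case
    optimization of robust RL."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        u, v = agent(state), adversary(state)  # protagonist / antagonist
        eta = hallucination(state)             # selects a plausible model
        total += reward(state, u, v)
        state = dynamics(state, (u, v), eta)
    return total
```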
Finally, we address the problem of constraint satisfaction in RL.
This challenge is crucial for the safe deployment of RL agents in real-world environments.
We develop confidence-based safety filters, a control-theoretic approach for certifying state safety constraints for nominal policies learned via standard RL techniques.
We reformulate state constraints in terms of cost functions to reduce safety verification to a standard RL task.
The central idea of the safety filter is to filter the actions of the policy to ensure constraint satisfaction.
The safety filter executes a backup policy when we cannot verify that the constraints are satisfied.
Most prior work assumes such a backup policy is given; instead, we leverage the hallucinated inputs and learn the backup policy by solving a robust RL problem.
We provide formal safety guarantees for the safety filter and empirically demonstrate the effectiveness of our approach.
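In pseudocode terms, the filter's action selection can be sketched as follows (interface names are hypothetical; the certification step checks constraint satisfaction under all plausible models):

```python
def filtered_action(state, nominal_policy, backup_policy, certify):
    """Confidence-based safety filter (sketch). The nominal action is
    executed only if we can certify that, from the resulting state, the
    backup policy keeps the state constraints satisfied under every
    plausible model; otherwise the backup policy takes over."""
    action = nominal_policy(state)
    if certify(state, action):
        return action
    return backup_policy(state)
```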
Permanent link
https://doi.org/10.3929/ethz-b-000579085
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Reinforcement Learning; Deep Learning; Learning control; Robustness; Exploration
Organisational unit
03908 - Krause, Andreas / Krause, Andreas
Funding
815943 - Reliable Data-Driven Decision Making in Cyber-Physical Systems (EC)