We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer that stores the most beneficial trajectories generated by the policy, selected by their episode return. A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator’s verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT acts as a surrogate that steers policy optimization towards more valuable regions of the reward landscape, thereby learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments, as it provides a surrogate reward to the policy and directs the optimization towards valuable areas.
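The abstract describes the mechanism only at a high level. As a rough illustration (not the authors' implementation), the sketch below shows one way the three ingredients could fit together in Python/PyTorch: an imitation buffer that keeps the highest-return trajectories, a discriminator trained to separate buffered transitions from current rollouts, and a surrogate reward derived from the discriminator's verdict. Buffer capacity, network sizes, and the log-ratio reward transform are assumptions made for this sketch, not details taken from the paper.

```python
# Illustrative sketch of the DIRECT idea described in the abstract (assumptions
# noted inline); not the authors' code.
import heapq
import torch
import torch.nn as nn

class ImitationBuffer:
    """Keeps the highest-return trajectories seen so far (capacity is an assumption)."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self._heap = []      # min-heap of (return, counter, transitions)
        self._counter = 0    # tie-breaker so trajectories are never compared directly

    def add(self, trajectory, episode_return):
        item = (episode_return, self._counter, trajectory)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)  # drop the lowest-return trajectory

    def sample(self, n):
        transitions = [t for _, _, traj in self._heap for t in traj]
        idx = torch.randint(len(transitions), (n,))
        return torch.stack([transitions[i] for i in idx])

class Discriminator(nn.Module):
    """Outputs the probability that a transition stems from the imitation buffer."""
    def __init__(self, input_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

def discriminator_step(disc, optimizer, buffer_batch, policy_batch):
    """One co-training step: buffered transitions are labeled 1, current rollouts 0."""
    inputs = torch.cat([buffer_batch, policy_batch])
    labels = torch.cat([torch.ones(len(buffer_batch), 1),
                        torch.zeros(len(policy_batch), 1)])
    loss = nn.functional.binary_cross_entropy(disc(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def surrogate_reward(disc, policy_batch, eps=1e-8):
    """Reward the policy for transitions the discriminator deems 'buffer-like'."""
    with torch.no_grad():
        d = disc(policy_batch)
    return torch.log(d + eps) - torch.log(1.0 - d + eps)  # illustrative transform
```

The log-ratio transform mirrors the discriminator-based reward common in adversarial imitation learning and is only one plausible choice here; how DIRECT combines the surrogate reward with the environment reward is not specified in the abstract.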
@inproceedings{ altmannALA23,
author = "Philipp Altmann and Thomy Phan and Fabian Ritz and Thomas Gabor and Claudia Linnhoff-Popien",
title = "DIRECT: Learning from Sparse and Shifting Rewards Using Discriminative Reward Co-Training",
year = "2023",
abstract = "We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer that stores the most beneficial trajectories generated by the policy, selected by their episode return. A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator’s verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT acts as a surrogate that steers policy optimization towards more valuable regions of the reward landscape, thereby learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments, as it provides a surrogate reward to the policy and directs the optimization towards valuable areas.",
url = "https://thomyphan.github.io/files/2023-ala.pdf",
eprint = "https://thomyphan.github.io/files/2023-ala.pdf",
location = "London, UK",
booktitle = "15th Adaptive and Learning Agents Workshop",
keywords = "reinforcement learning, curriculum learning, exploration",
}
Related Articles
- P. Altmann et al., “Challenges for Reinforcement Learning in Quantum Computing”, QCE 2024
- P. Altmann et al., “CROP: Towards Distributional-Shift Robust Reinforcement Learning using Compact Reshaped Observation Processing”, IJCAI 2023
- R. Müller et al., “Towards Anomaly Detection in Reinforcement Learning”, AAMAS BlueSky Ideas 2022
- F. Ritz et al., “Specification Aware Multi-Agent Reinforcement Learning”, Book of ICAART 2021 (journal version)
- F. Ritz et al., “SAT-MARL: Specification Aware Training in Multi-Agent Reinforcement Learning”, ICAART 2021 (conference version)