Olivier Delalleau, Maxim Peter, Eloi Alonso, Adrien Logut
While most current research in Reinforcement Learning (RL) focuses on improving the performance of the algorithms in controlled environments, the use of RL under constraints like those met in the video game industry is rarely studied. Operating under such constraints, we propose Hybrid SAC, an extension of the Soft Actor-Critic algorithm able to handle discrete, continuous and parameterized actions in a principled way. We show that Hybrid SAC can successfully solve a high-speed driving task in one of our games, and is competitive with the state-of-the-art on parameterized actions benchmark tasks. We also explore the impact of using normalizing flows to enrich the expressiveness of the policy at minimal computational cost, and identify a potential undesired effect of SAC when used with normalizing flows, that may be addressed by optimizing a different objective.
Reinforcement Learning (RL) applications in video games have recently seen massive advances coming from the research community, with agents trained to play Atari games from pixels (Mnih et al. 2015) or to be competitive with the best players in the world in complicated imperfect information games like DOTA 2 (OpenAI 2018) or StarCraft II (Vinyals et al. 2019a; 2019b). These systems have comparatively seen little use within the video game industry, and we believe lack of accessibility to be a major reason behind this. Indeed, really impressive results like those cited above are produced by large research groups with computational resources well beyond what is typically available within video game studios.
Our contributions are geared towards industry practitioners, by sharing experiments and practical advice for using RL with a different set of constraints than those met in the research community. For example, in the industry, experience collection is usually a lot slower, and there are time budget constraints over the runtime performance of RL agents. We thus favor off-policy algorithms to improve data efficiency by re-using past experience, and constrain our architectures
The approach we propose in this paper is based on Soft Actor-Critic (Haarnoja et al. 2018b), which was originally designed for continuous actions problems. We explore ways to extend it to a hybrid setting with both continuous and discrete actions, a situation commonly encountered in video games. We also attempt to use normalizing flows (Rezende and Mohamed 2015) to improve the quality of the resulting policy with roughly the same number of parameters, and analyze why this approach may not be working as well as we initially expected.
Results in a commercial video game
We trained a vehicle in a Ubisoft game, using the proposed Hybrid SAC with two continuous actions (acceleration and steering) and one binary discrete action (hand brake). The objective of the car is to follow a given path as fast as possible. Note that the agent operates in a test environment that it did not see during training, and that the discrete hand brake action plays a key role in staying on the road at such a high speed.