Distributed APPO With Flexible Number of Workers and Custom Environment

Hi,

A friend and I are working on an RL agent (currently looking at LSTM/PPO) to play a specific game; let's call it a card game. Since the game is an external application, we have built a way to interact with it using simulated mouse movements/clicks, plus some image recognition to read a few key values of the current game state; the rest of the state is calculated and stored internally.

These games can take a while, so we were hoping to spin up VMs on our computers and, depending on the time of day (say 2pm vs. 3am), have a varying number of VMs actively playing the game, with training happening on my computer, which will be running 24/7. As a side note, because we have to control the mouse, we can only run one player/environment per VM.
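To give an idea of what we picture for the cluster side (just a sketch, assuming we start the head node on the always-on machine with `ray start --head` and each VM runs `ray start --address=<head-ip>:6379` when it comes online):

```python
import ray

# On the always-on training machine: attach to the already-running Ray cluster
# instead of starting a fresh local one, so every VM that has joined shows up
# as an extra node able to host its single rollout worker.
ray.init(address="auto")
print(f"Nodes currently in the cluster: {len(ray.nodes())}")
```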

There is a lot to unpack here, so I would like some guided opinions on the best way to approach this. From what I've seen, Ray has everything necessary for this to work; what we would need to do is:

  1. Wrap our current game-interaction code in a custom environment and implement its methods, as in this example: https://github.com/ray-project/ray/blob/master/rllib/examples/env/parametric_actions_cartpole.py (a rough sketch of what we have in mind is below the list).
  2. Create a custom evaluation function for our env, as per: https://github.com/ray-project/ray/blob/82f9c7014e2d0acd3e3869066f5dc3142ec9e7a7/rllib/agents/trainer.py#L730 (also sketched further down, together with the config).
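To make #1 concrete, here is roughly what we picture (just a sketch; `GameInterface` stands in for our existing mouse/image-recognition layer and is not a real class, and the spaces/shapes are placeholders):

```python
import gym
import numpy as np
from gym import spaces


class CardGameEnv(gym.Env):
    """Sketch of wrapping our mouse/image-recognition layer as a gym.Env.

    `GameInterface` is a placeholder for the interaction code we already have
    (simulated clicks + image recognition), not a real class.
    """

    def __init__(self, env_config=None):
        self.game = GameInterface()  # hypothetical wrapper around mouse + OCR
        self.action_space = spaces.Discrete(10)  # e.g. 10 possible plays
        self.observation_space = spaces.Box(
            low=0.0, high=1.0, shape=(64,), dtype=np.float32)  # encoded state

    def reset(self):
        self.game.start_new_game()
        return self.game.read_state()

    def step(self, action):
        self.game.perform(action)     # simulated mouse movements/clicks
        obs = self.game.read_state()  # image recognition + internal bookkeeping
        reward, done = self.game.score()
        return obs, reward, done, {}
```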

After that is done, I'd need to set up the distributed logic (initial sights set on APPO). However, I'm not sure whether this is the best approach, or which examples or documentation pages are most applicable to our use case.
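To show where my head is at, this is roughly the training script I imagine for the distributed/APPO part plus the custom evaluation from #2. Again just a sketch: it assumes the config keys from the trainer.py version linked above (`custom_eval_function`, `evaluation_interval`, `evaluation_num_workers`), and `card_game_env` is a hypothetical module holding the env sketched earlier.

```python
import ray
from ray import tune
from ray.rllib.evaluation.metrics import collect_episodes, summarize_episodes

from card_game_env import CardGameEnv  # hypothetical module with the env above


def card_game_eval_fn(trainer, eval_workers):
    # Hypothetical custom evaluation: let each eval worker play a round of the
    # game, then summarize the resulting episodes into RLlib's metrics dict.
    ray.get([w.sample.remote() for w in eval_workers.remote_workers()])
    episodes, _ = collect_episodes(remote_workers=eval_workers.remote_workers())
    return summarize_episodes(episodes)


config = {
    "env": CardGameEnv,
    "num_workers": 4,             # one rollout worker per VM; adjust per time of day
    "num_envs_per_worker": 1,     # only one game instance fits on a VM
    "model": {"use_lstm": True},  # LSTM wrapper on top of the default model
    "custom_eval_function": card_game_eval_fn,
    "evaluation_interval": 10,
    "evaluation_num_workers": 1,
}

if __name__ == "__main__":
    ray.init(address="auto")  # attach to the cluster the VMs joined
    tune.run("APPO", config=config, stop={"training_iteration": 1000})
```

My main uncertainty is the `num_workers` part, since the number of available VMs changes over the day.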

So if anyone could confirm whether #1 and #2 are enough to port our current code to RLlib, and point us at how to pair that with the distributed part, we would be very thankful!