Expanding an RLlib learning environment to multiple simulators and machines while reducing communication overhead

Hello,
I started using RLlib to solve a problem, but I’m having difficulty scaling up the learning environment. I’m using a custom Gymnasium environment (gym.Env) together with a simulator, and training with RLlib, Tune, and the SAC algorithm.

Now, I want to use Ray’s capabilities to run multiple simulators and environments on multiple machines to speed up training. However, due to licensing constraints, only one simulator can run per machine, so I want to reduce communication overhead by co-locating one rollout worker and one simulator on each machine.

In short, I want a trainer worker that learns on a GPU, plus a separate rollout worker pinned to each machine. I have tried the following:

  1. Ray cluster: When I ran a script with num_rollout_workers = n on the head node, the rollout workers were not pinned to particular machines but were placed automatically by the scheduler.
  2. Client/server: I tried to write a rollout-worker client script and a trainer server script. However, it was difficult, and reference material was also hard to find.

What is the best way to solve this problem? Thank you in advance.

Hi @woosangbum ,

  1. You need to define a custom resource that your rollout workers consume: please read this. On each node that runs a simulator, make that custom resource available when the node joins the cluster.
    Then, when you define your AlgorithmConfig, set custom_resources_per_worker so that each rollout worker requests this custom resource; see the sketch after this list.
  2. Is it not possible to wrap the simulator in a gym environment? That’s the most straightforward way; the server/client scripts are not the preferred approach. A sketch of such a wrapper follows below.
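
A minimal sketch of the custom-resource approach, assuming a recent Ray 2.x API (SACConfig, .rollouts(), .resources()). The resource name "simulator", the environment name, and the worker count are placeholders you would replace with your own values:

```python
# On each machine that holds a simulator license, join the cluster and
# advertise one unit of a custom resource, e.g.:
#   ray start --address=<head-ip>:6379 --resources='{"simulator": 1}'

import ray
from ray.rllib.algorithms.sac import SACConfig

ray.init(address="auto")  # connect to the running cluster from the head node

config = (
    SACConfig()
    # Your gym.Env that wraps the simulator (placeholder name).
    .environment(env="MySimulatorEnv-v0")
    # One rollout worker per simulator machine (placeholder count).
    .rollouts(num_rollout_workers=3)
    .resources(
        # Keep the learner/trainer on the GPU node.
        num_gpus=1,
        # Each rollout worker requests one "simulator" resource, so the
        # scheduler can only place it on a node that advertises it.
        custom_resources_per_worker={"simulator": 1},
    )
)

algo = config.build()
```

Because every simulator node advertises exactly one unit of the resource and every worker requests one unit, at most one rollout worker lands on each simulator machine, which is the pinning behavior you are after.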
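
And a hedged sketch of wrapping the simulator in a gymnasium.Env so RLlib can sample from it like any other environment. SimulatorClient and its reset/step methods are hypothetical stand-ins for your simulator’s real API, and the observation/action spaces are placeholders:

```python
import gymnasium as gym
import numpy as np


class SimulatorEnv(gym.Env):
    """Thin gym.Env wrapper around a locally running simulator."""

    def __init__(self, config=None):
        super().__init__()
        # Hypothetical: connect to the simulator running on this machine.
        self.sim = SimulatorClient()
        # Placeholder spaces; match them to your simulator's state/action layout.
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = self.sim.reset()  # hypothetical simulator call
        return np.asarray(obs, dtype=np.float32), {}

    def step(self, action):
        # Hypothetical simulator call returning next state, reward, and done flag.
        obs, reward, done = self.sim.step(action)
        return np.asarray(obs, dtype=np.float32), float(reward), bool(done), False, {}
```

With such a wrapper registered (e.g. via ray.tune.registry.register_env), each rollout worker creates its own env instance and therefore talks only to the simulator on its own machine, keeping all simulator traffic local.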