Expanding RLlib learning environment with multiple simulators and machines while reducing communication overhead

woosangbum · March 31, 2023, 10:11am

Hello
I started using RLlib to solve a problem, but I’m having difficulty scaling up the learning environment. I’m using a custom Gymnasium environment (gym.Env) backed by a simulator, and training with RLlib, Tune, and the SAC algorithm.

Now, I want to use Ray’s capabilities to run multiple simulators and environments on multiple machines to speed up learning. However, due to licensing restrictions, only one simulator can run per computer, so I want to reduce communication overhead by placing one rollout worker and one simulator together on each computer.

In short, I want one trainer worker that learns on a GPU, plus a separate rollout worker on each simulator computer. I have tried the following:

Ray cluster: When I ran a script with num_rollout_workers = n on the head node, the rollout workers were not pinned to particular computers but were placed automatically by the scheduler.
Client-server: I tried to write a rollout worker client script and a trainer server script. However, it was difficult, and reference materials were hard to find.

What is the best way to solve the problem in the above situation? Thank you in advance.

arturn · June 23, 2023, 7:38pm

Hi @woosangbum,

You need to define a custom resource that your rollout workers consume (please read this). Each node that runs a simulator should make that custom resource available.
Then, when you define your AlgorithmConfig, you need to set custom_resources_per_worker to use this custom resource.
Is it not possible to wrap the simulator in a gym environment? That’s the most straightforward way. The server/client scripts are not preferred.
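
For anyone landing here later, the custom resource approach described above could look roughly like the sketch below. It is not a tested setup: the resource name "simulator", the helper names, and the environment name are made up for illustration, and it assumes the Ray 2.x AlgorithmConfig API from around the time of this thread (newer releases may have renamed some of these settings). Each simulator machine joins the cluster advertising one unit of the resource, and every rollout worker requests one unit, so the scheduler can place at most one rollout worker per simulator machine.

```python
import json
import shlex

# Hypothetical resource name; any string works as long as the nodes
# and the algorithm config agree on it.
SIM_RESOURCE = "simulator"


def ray_start_command(head_address, sims_on_node=1):
    """Build the command a simulator machine runs to join the cluster.

    The node advertises `sims_on_node` units of the custom resource.
    """
    resources = json.dumps({SIM_RESOURCE: sims_on_node})
    return (
        f"ray start --address={head_address} "
        f"--resources={shlex.quote(resources)}"
    )


def build_sac_config(env_name, num_sim_nodes):
    """SAC config whose rollout workers each consume one simulator unit."""
    # Imported here so the command helper above can be used on its own.
    from ray.rllib.algorithms.sac import SACConfig

    return (
        SACConfig()
        .environment(env_name)  # assumes the env was registered under this name
        .rollouts(num_rollout_workers=num_sim_nodes)
        .resources(
            num_gpus=1,  # GPU for the trainer process
            custom_resources_per_worker={SIM_RESOURCE: 1},
        )
    )


if __name__ == "__main__":
    # Placeholder address; replace with your head node's address.
    print(ray_start_command("10.0.0.1:6379"))
```

On the head (GPU) machine you would then build the config with something like build_sac_config("MySimEnv", num_sim_nodes=4) and pass it to Tune as usual. If the head machine does not advertise the "simulator" resource, no rollout worker should be scheduled there.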
