Example using RLlib via KubeRay

drj · May 14, 2024, 3:34pm

Hi y’all,

I’m trying to scale an RLlib training job on k8s using a RayCluster or RayJob spec via KubeRay.

However, it’s tricky to figure out how the parts of the RLlib training code map onto the parts of the spec.

I want one GPU-enabled node to act as my learner node, with many rollout nodes gathering experience asynchronously. Should I spec this as:

one big head node, with a GPU, that spins up all the workers inside itself?
one head node, with a GPU, acting as the learner, plus a workerGroupSpec for my rollout nodes?
one head node that only manages the training job, plus one workerGroupSpec with a GPU for my learner and a second workerGroupSpec for my rollout nodes?

Is there any example code I could look at for running RLlib on k8s? I’m not sure how to make the two fit together.

PhilippWillms · May 30, 2024, 9:02pm

Indeed, the Ray docs are not very detailed on that point. Perhaps you could start from the RLlib CLI docs and learn by experimenting?
https://docs.ray.io/en/latest/rllib/rllib-cli.html
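
As a starting point for such experiments, here is a rough sketch of the third layout from the question (a head pod that only coordinates, one GPU worker group for the learner, one CPU worker group for rollouts), written as a Python script that emits a RayCluster manifest as JSON. The `RayCluster` kind, `headGroupSpec`, `workerGroupSpecs`, `rayStartParams`, and the `nvidia.com/gpu` resource are standard KubeRay/Kubernetes fields; everything else (the cluster and group names, the image tags, the CPU and memory sizes, the function name `build_raycluster`) is a made-up placeholder. Ray places actors by resource request, so the idea is that whatever part of the RLlib job asks for a GPU can only land on the learner pod, while CPU-only rollout workers spread over the rollout pods. Which RLlib config options request those resources depends on the Ray version, so that side is left out here.

```python
import json


def pod_template(name, image, cpu, memory, gpus=0):
    """Build a minimal pod template with one Ray container."""
    limits = {"cpu": str(cpu), "memory": memory}
    if gpus:
        # Standard Kubernetes extended resource for NVIDIA GPUs.
        limits["nvidia.com/gpu"] = str(gpus)
    container = {
        "name": name,
        "image": image,
        "resources": {"limits": limits, "requests": dict(limits)},
    }
    return {"spec": {"containers": [container]}}


def build_raycluster(num_rollout_pods,
                     cpu_image="rayproject/ray:latest",       # placeholder image
                     gpu_image="rayproject/ray:latest-gpu"):  # placeholder image
    """Return a RayCluster manifest (plain dict) for the head/learner/rollout layout."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": "rllib-cluster"},  # hypothetical name
        "spec": {
            "headGroupSpec": {
                # Advertise zero CPUs so Ray schedules no workers on the head pod.
                "rayStartParams": {"num-cpus": "0"},
                "template": pod_template("ray-head", cpu_image, 2, "8Gi"),
            },
            "workerGroupSpecs": [
                {
                    # Single GPU pod intended to host the learner.
                    "groupName": "learner-group",
                    "replicas": 1,
                    "minReplicas": 1,
                    "maxReplicas": 1,
                    "rayStartParams": {},
                    "template": pod_template("ray-learner", gpu_image, 8, "32Gi", gpus=1),
                },
                {
                    # CPU-only pods intended to host the rollout workers.
                    "groupName": "rollout-group",
                    "replicas": num_rollout_pods,
                    "minReplicas": num_rollout_pods,
                    "maxReplicas": num_rollout_pods,
                    "rayStartParams": {},
                    "template": pod_template("ray-rollout", cpu_image, 8, "16Gi"),
                },
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_raycluster(num_rollout_pods=4), indent=2))
```

Redirecting the output to a file and running `kubectl apply -f` on it should create the cluster, assuming the KubeRay operator is already installed. The same `spec` could be embedded under `rayClusterSpec` in a RayJob if you prefer the job-style workflow.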
