Example using RLlib via KubeRay

Hi y’all,

I’m trying to scale an RLlib training job on k8s using a RayCluster spec or a RayJob via KubeRay.

However, I’m finding it tricky to figure out how to map the parts of the RLlib training code onto the different parts of the spec.

I want one GPU-enabled node to act as my learner node, with many rollout nodes gathering experience asynchronously. Should I spec this as:

  • one big head node, with a GPU, that spins up all the workers within it?
  • one head node, with a GPU to act as the learner, plus a workerGroupSpec for my rollout nodes?
  • one head node, just to manage the training job, with one workerGroupSpec with a GPU for my learner and a second workerGroupSpec for my rollout nodes?

Is there some sort of example code I could look at for leveraging RLlib on k8s? I’m not sure how the pieces fit together.
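
For concreteness, the kind of training code I want to scale looks roughly like this (just a sketch: the env and worker counts are placeholders, and the method names assume a recent RLlib with the AlgorithmConfig API; older versions spell these knobs .rollouts(num_rollout_workers=...) and .resources(num_gpus=...)):

```python
# Rough sketch, not my real training script.
# Assumes a recent RLlib (AlgorithmConfig / new API stack); method names
# differ in older versions.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")        # placeholder env
    .env_runners(num_env_runners=16)   # rollout workers gathering experience
    .learners(num_learners=1,          # one learner process...
              num_gpus_per_learner=1)  # ...that should land on the GPU node
)

algo = config.build()
for _ in range(100):
    print(algo.train())
```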


Indeed, the Ray docs are not very detailed on that point. Maybe start by following the RLlib CLI guide and trying some learning-by-doing experiments?
https://docs.ray.io/en/latest/rllib/rllib-cli.html
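
For the learning-by-doing part, one cheap first experiment is to check that the cluster KubeRay assembled actually exposes the CPUs and GPUs you expect, before layering RLlib on top. A minimal sketch, assuming the script runs inside the cluster (e.g. as a RayJob entrypoint or from a pod) so that ray.init() attaches to the existing cluster rather than starting a local one:

```python
# Sanity check: what resources does the Ray cluster see?
# Assumes this runs inside the KubeRay-managed cluster (e.g. as a RayJob
# entrypoint), so ray.init() attaches to it instead of starting a new one.
import ray

ray.init()

print(ray.cluster_resources())   # aggregate CPU/GPU counts across all pods
for node in ray.nodes():         # each KubeRay pod shows up as one Ray node
    print(node["NodeManagerAddress"], node["Resources"])
```

If the GPU shows up there, a learner that requests a GPU in the RLlib config should get scheduled onto that pod, regardless of whether it lives in the head group or in a workerGroupSpec.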