Example using RLlib via KubeRay

Hi y’all,

I’m trying to scale an RLlib training job on Kubernetes (k8s) using a RayCluster or RayJob spec via KubeRay.

However, it’s tricky to figure out how the parts of the RLlib training code map to the different parts of the spec.

I want one GPU-enabled node to act as my learner node, with many separate rollout nodes gathering experience asynchronously. Should I spec this as:

  • one big head node, with a GPU, that spins up the rollout workers within it?
  • one head node, with a GPU to act as the learner, plus a workerGroupSpec for my rollout nodes?
  • one head node, just to manage the training job, plus a workerGroupSpec with a GPU for my learner and a second workerGroupSpec for my rollout nodes?

Is there some example code I could look at for running RLlib on k8s? I’m not sure how to make the two fit together.

Indeed, the Ray docs don’t say much on that point. You could start from the RLlib CLI docs and experiment from there:
https://docs.ray.io/en/latest/rllib/rllib-cli.html
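
For what it’s worth, here is a rough sketch of what the third layout from the question (a management-only head, a GPU learner group, and a CPU rollout group) might look like as a RayCluster manifest. It is built as a plain Python dict and printed as JSON, so it needs nothing beyond the standard library. The cluster name, group names, image tag, and CPU/memory sizes are all placeholders I made up, and `ray.io/v1` assumes a recent KubeRay release (older ones use `ray.io/v1alpha1`). Treat it as a starting point for experiments, not a tested recipe.

```python
import json

# Placeholder image: use whichever Ray image has RLlib and your dependencies.
IMAGE = "rayproject/ray-ml:latest-gpu"


def pod_template(container_name, cpu, memory, gpus=0):
    """Return a minimal pod template with a single Ray container."""
    resources = {"cpu": str(cpu), "memory": memory}
    if gpus:
        # Standard Kubernetes resource name for NVIDIA GPUs.
        resources["nvidia.com/gpu"] = gpus
    container = {
        "name": container_name,
        "image": IMAGE,
        "resources": {"limits": dict(resources), "requests": dict(resources)},
    }
    return {"spec": {"containers": [container]}}


def worker_group(group_name, replicas, cpu, memory, gpus=0):
    """Return one entry for workerGroupSpecs with a fixed replica count."""
    return {
        "groupName": group_name,
        "replicas": replicas,
        "minReplicas": replicas,
        "maxReplicas": replicas,
        "rayStartParams": {},
        "template": pod_template("ray-worker", cpu, memory, gpus),
    }


def build_raycluster(name="rllib-cluster", rollout_replicas=8):
    """Build a RayCluster manifest: head + GPU learner group + rollout group."""
    return {
        "apiVersion": "ray.io/v1",  # assumption: KubeRay v1 CRD
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                # num-cpus "0" advertises no CPUs on the head, so Ray does not
                # schedule CPU-requesting tasks or actors there.
                "rayStartParams": {"num-cpus": "0"},
                "template": pod_template("ray-head", cpu=2, memory="8Gi"),
            },
            "workerGroupSpecs": [
                # One GPU pod intended for the learner.
                worker_group("learner-group", 1, cpu=4, memory="16Gi", gpus=1),
                # CPU-only pods intended for the rollout workers.
                worker_group("rollout-group", rollout_replicas, cpu=4, memory="8Gi"),
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_raycluster(), indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output can be saved to a file and applied with `kubectl apply -f`. Note that the group names are only labels: Ray places actors according to the resources they request, so on the RLlib side the learner has to request a GPU (which only the learner group offers here) and the rollout workers have to request CPUs only. The exact config options for that differ between RLlib versions, so check the docs for the version you run.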