Hi y’all,
I’m trying to scale an RLlib training job on k8s using a RayCluster spec or a RayJob via KubeRay.
However, it’s tricky to figure out how the parts of the RLlib training code map onto the parts of the spec.
I want one GPU-enabled node to act as my learner node, with many separate rollout nodes gathering experience asynchronously. Should I spec this as:
- one big head node, with GPU, that spins up workers within it?
- one head node, with GPU to act as `learner`, with a `workerGroupSpec` for my `rollout` nodes?
- one head node, just to manage the training job, with a `workerGroupSpec` with GPU for my `learner`, and a second `workerGroupSpec` for my `rollout` nodes?
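
To make the third option concrete, here is a rough sketch of the layout I have in mind, written as a small Python script that prints a RayCluster manifest as JSON (which `kubectl apply -f` accepts). Everything specific in it is an assumption on my part: the image name, the cluster and group names, and the CPU/GPU counts and replica numbers are placeholders, and the `ray.io/v1` apiVersion depends on the KubeRay version installed. I'm also only guessing that RLlib would place the learner on the GPU group through its resource requests. That's the part I'm unsure about.

```python
import json

# Hypothetical image name; substitute whatever image holds your training code.
IMAGE = "example.com/rllib-train:latest"


def pod_template(image, cpu, gpu=0):
    """Build a minimal pod template with one Ray container."""
    limits = {"cpu": str(cpu)}
    if gpu:
        # Standard Kubernetes resource name for NVIDIA GPUs.
        limits["nvidia.com/gpu"] = str(gpu)
    return {
        "spec": {
            "containers": [
                {"name": "ray", "image": image, "resources": {"limits": limits}}
            ]
        }
    }


def worker_group(name, replicas, image, cpu, gpu=0):
    """Build one fixed-size workerGroupSpecs entry (no autoscaling)."""
    return {
        "groupName": name,
        "replicas": replicas,
        "minReplicas": replicas,
        "maxReplicas": replicas,
        "rayStartParams": {},
        "template": pod_template(image, cpu, gpu),
    }


ray_cluster = {
    "apiVersion": "ray.io/v1",  # assumption: depends on the installed KubeRay version
    "kind": "RayCluster",
    "metadata": {"name": "rllib-cluster"},  # placeholder name
    "spec": {
        "headGroupSpec": {
            # Advertise zero CPUs so no actors or tasks get scheduled on the head.
            "rayStartParams": {"num-cpus": "0"},
            "template": pod_template(IMAGE, cpu=2),
        },
        "workerGroupSpecs": [
            worker_group("learner", 1, IMAGE, cpu=8, gpu=1),  # one GPU pod
            worker_group("rollout", 8, IMAGE, cpu=4),  # CPU-only rollout pods
        ],
    },
}

print(json.dumps(ray_cluster, indent=2))
```

The first two options would be the same manifest with the GPU limit moved onto the head pod, and with the `rollout` group either dropped (option 1) or kept (option 2).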
Is there any example code I could look at for running RLlib on k8s? I’m not sure how to make the two fit together.