Hi y’all,
I’m trying to scale an RLlib training job on k8s using a `RayCluster` spec or `RayJob` via KubeRay.
However, it’s tricky to figure out how to map parts of the RLlib training code to different parts of the spec.
I want one GPU-enabled node to act as my `learner` node, with many different `rollout` nodes gathering experience asynchronously. Should I spec this as:
- one big head node, with GPU, that spins up workers within it?
- one head node, with GPU to act as `learner`, with a `workerGroupSpec` for my `rollout` nodes?
- one head node, just to manage the training job, with a `workerGroupSpec` with GPU for my `learner`, and a second `workerGroupSpec` for my `rollout` nodes?
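
To make the third option concrete, here is roughly the layout I have in mind, written as a Python dict that can be dumped to JSON for `kubectl apply -f`. This is only a sketch: the field names follow the KubeRay `RayCluster` CRD as I understand it, while the cluster name, group names, image tag, resource sizes, and replica count are all placeholders I made up. I also don't know whether RLlib would actually place the learner on the GPU group with this layout, which is part of what I'm asking.

```python
import json


def build_ray_cluster(num_rollout_pods: int = 8) -> dict:
    """Return a RayCluster manifest (head + learner group + rollout group) as a dict."""
    image = "rayproject/ray-ml:latest"  # placeholder image, not a recommendation

    def pod(limits: dict) -> dict:
        # Minimal pod template with a single Ray container and resource limits.
        return {
            "spec": {
                "containers": [
                    {"name": "ray", "image": image, "resources": {"limits": limits}}
                ]
            }
        }

    def group(name: str, replicas: int, limits: dict) -> dict:
        # One workerGroupSpec with a fixed size (no autoscaling in this sketch).
        return {
            "groupName": name,
            "replicas": replicas,
            "minReplicas": replicas,
            "maxReplicas": replicas,
            "rayStartParams": {},
            "template": pod(limits),
        }

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": "rllib-cluster"},  # hypothetical name
        "spec": {
            "headGroupSpec": {
                # Advertise zero CPUs so Ray does not schedule work on the head.
                "rayStartParams": {"num-cpus": "0"},
                "template": pod({"cpu": "2", "memory": "8Gi"}),
            },
            "workerGroupSpecs": [
                # Single GPU pod intended for the learner.
                group("learner", 1, {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}),
                # CPU-only pods intended for the rollout workers.
                group("rollout", num_rollout_pods, {"cpu": "4", "memory": "8Gi"}),
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ray_cluster(), indent=2))
```

The second option would be the same thing with the GPU limit moved to the head pod and the `learner` group dropped.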
Is there some sort of example code I could look at for running RLlib on k8s? I’m not sure how to make the two fit together.