Hello, I just wanted to clarify a few things about the usage of Ray clusters on K8s.
- If there’s only one worker type, would you recommend having multiple worker pods on the same K8s node, or is it preferable to set it up so that only one worker pod gets scheduled on each K8s node? For example, is it better to have 2 pods with 7 GB and 2 vCPUs each on a 15 GB, 4 vCPU machine, or 1 pod with 15 GB and 4 vCPUs on the same machine type?
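  To make the two layouts concrete, here is a sketch of the pod-level resource requests for each option (plain Kubernetes container resources; the exact field path depends on how your Ray operator exposes the worker pod template):

  ```yaml
  # Option A: two smaller worker pods per 15 GB / 4 vCPU node
  resources:
    requests:
      cpu: "2"
      memory: 7Gi
  # Option B: one larger worker pod that fills the node
  resources:
    requests:
      cpu: "4"
      memory: 15Gi
  ```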
- How many Ray clusters is the Ray operator able to manage? If we use a cluster-scoped Ray operator and deploy Ray in `n` namespaces, at what value of `n` (roughly: 10, 50, 100, etc.) would the operator start facing issues? Assume each Ray cluster is actively in use and can scale from 1 to 50 pods. And does the operator scale these clusters concurrently when it deals with so many Ray clusters at once?
- This is more of a general Ray cluster question: how many resources should the head node be assigned if we ensure that no user task gets scheduled on it by setting `rayResources` to zero? In other words, what could cause heavy memory or CPU usage on the head node? Heavy scaling activity? Lots of data stored in the object store? The concern is that head node resource usage might shoot up as more worker pods are added to the Ray cluster, so a head node resource allocation that works during testing might break in production when dealing with a high number of worker pods.
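  For reference, this is roughly the configuration I mean (field names assume the Ray Helm chart’s `podTypes` schema; adjust to your operator version):

  ```yaml
  podTypes:
    - name: rayHeadType
      # Kubernetes-level requests still reserve real resources for
      # Ray's own system processes (GCS, dashboard, autoscaler).
      CPU: 2
      memory: 8Gi
      # Advertise zero CPUs to the Ray scheduler so no user tasks
      # or actors land on the head node.
      rayResources: {"CPU": 0}
  ```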
- Considering that losing the Ray head node causes the entire Ray cluster to restart, it sounds like the head node alone might be better off scheduled on an on-demand node (not spot). This is probably a huge stretch, but is there any possibility of setting up a fault-tolerant head node? If not, is something like that on Ray’s roadmap (leader election, etc.)?
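  In case it helps, what I have in mind for pinning the head to on-demand capacity is a nodeSelector on the head pod spec, something like the sketch below. The label key/value here are an assumption based on EKS managed node groups; other clouds use different capacity-type labels:

  ```yaml
  # Head pod only: schedule on on-demand nodes, leaving workers free
  # to land on spot capacity. Label is EKS-specific; substitute your
  # provider's equivalent.
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  ```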