Hello! I am new here and currently evaluating Ray as a possible replacement for our current batch scheduling solution. I want to make sure I understand how Ray schedules tasks.
How do tasks get scheduled onto worker nodes when the head node has CPU=0? I am assuming Ray clients always connect to the head node via ray.init(<head node address>).
From the Ray 1.0 Architecture Whitepaper…
### Scheduling policy

A raylet always attempts to grant a resource request using local resources first. When there are no local resources available, there are three other possibilities:

1. Another node has enough resources, according to the possibly stale information published by the GCS. The raylet will spillback the request to the other raylet.
2. No node currently has enough resources. The task is queued locally until resources on the local or remote node become available again.
So does this mean that all tasks go through the head node's raylet, and that this raylet always tries to find another node with enough resources and returns that node to the client, so that the client can retry its request on that worker node (assuming there is a worker with enough resources)?
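To make my current reading concrete, here is a toy model of the quoted policy in plain Python. It is only a sketch of how I understand the text, not Ray's actual implementation: `Node`, `schedule`, and the node names are hypothetical, and it only models CPU counts.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a raylet's view of one node."""
    name: str
    free_cpus: float
    queue: List[float] = field(default_factory=list)


def schedule(local: Node, cluster: List[Node], cpus_needed: float) -> Tuple[str, str]:
    """Toy version of the quoted policy, as seen from the `local` raylet."""
    # Local resources are always tried first.
    if local.free_cpus >= cpus_needed:
        local.free_cpus -= cpus_needed
        return ("granted_locally", local.name)

    # Otherwise look for another node that appears to have enough resources.
    # Nothing is reserved here: the request is retried at that node, which
    # may reject it because this view can be stale.
    for node in cluster:
        if node is not local and node.free_cpus >= cpus_needed:
            return ("spillback", node.name)

    # No node currently has enough resources: queue on the local node.
    local.queue.append(cpus_needed)
    return ("queued", local.name)


head = Node("head", free_cpus=0)  # head node started with CPU=0
workers = [Node("worker-1", free_cpus=2), Node("worker-2", free_cpus=4)]
cluster = [head, *workers]

print(schedule(head, cluster, 1))  # ('spillback', 'worker-1')
print(schedule(head, cluster, 8))  # ('queued', 'head')
```

If this model is right, a head node with CPU=0 would never grant anything locally and would only ever spill back or queue, which is the behavior I am trying to confirm.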