Hi,
My Ray application requires that certain tasks be assigned to a specific group of nodes. For example,
task1 assigned to group1: node1,2,3
task2 assigned to group2: node4,5,6,7,8
task3 assigned to group3: node9
In case of a node failure (e.g., node 1 fails), I’m thinking of relying on the Ray autoscaler to add a new node to the cluster, and then injecting that node’s resources into group1.
To be more concrete about the underlying problem here, we would ideally like to:
1. Launch Ray cluster A to run tasks that collect stats for N table partitions.
2. Based on these stats, create or modify node groups in Ray cluster A to assign min/max node group sizes for the upcoming per-partition transforms that will be assigned to each node group.
3. Run a transform for each partition on Ray cluster A, where each transform uses nodes from only one group.
I see. I’m assuming the nodes in your cluster are all homogeneous (or at least that you don’t care which node group a particular node belongs to). This doesn’t give you a strong guarantee, but you may consider using placement groups as your node groups with a PACK policy, though there are caveats to this (you may want all tasks to be num_cpus=1, etc.).
You can also dynamically modify the autoscaler config, but that’s probably not what you want to be doing.
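For reference, node groups with per-group min/max sizes can also be expressed statically in the autoscaler config via `available_node_types`, each carrying a custom resource label (the type name, sizes, and resource name below are illustrative):

```yaml
available_node_types:
  group1.worker:
    min_workers: 3
    max_workers: 3
    # Custom resource label identifying this node group; tasks target it
    # with resources={"group1": ...}.
    resources: {"group1": 1}
    # node_config: cloud-provider-specific instance settings go here.
```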
Right - a dynamic-resource-based solution is actually where we started our discussion, and resurrecting it may be the best way forward here.
Placement groups could work, but they seem to make us concede (1) having a static-sized node group per partition (we’d prefer autoscaling node groups) and (2) losing dynamic specification of each task’s memory requirements at execution time. This is based on the assumptions that (1) placement group creation always triggers on-demand autoscaling, and (2) dynamic task resource requirements cannot be specified together with a placement group (e.g., some_task.options(scheduling_strategy=PlacementGroupSchedulingStrategy(...), resources={...}).remote()).
Specifying custom resources during ray start also seems theoretically possible, but the process of determining which resource to attach to a newly started node may be complex, since we’d need to introspect the current state of the cluster first to decide which shard the node should be placed in.