Slurm Autoscaler

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity.
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task.

Severity: Medium

I've just set up a cluster with a Ray autoscaler on Slurm, largely following the instructions from

with some modifications, mostly because the interfaces are out of date. Overall the autoscaler does mostly work, but I'm running into a problem where not all of the worker nodes' CPUs are being made available. ray status reports 90 workers but only 13 CPUs (it should be 90), with min_workers=0 and max_workers=90:

======== Autoscaler status: 2024-01-22 03:27:08.871544 ========
Node status

Healthy:
1 head_node
90 worker_node
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
13.0/13.0 CPU
0B/13.09TiB memory
1.90GiB/2.45TiB object_store_memory

Demands:
{'CPU': 1.0}: 83+ pending tasks/actors
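
For context, the demand line above comes from nothing exotic: just a large number of num_cpus=1 remote tasks, roughly like this (a generic sketch, not my actual job):

    import ray

    ray.init(address="auto")  # connect to the running cluster from the head node

    @ray.remote(num_cpus=1)
    def work(i):
        # stand-in for a CPU-bound task
        return i * i

    # Submitting far more tasks than there are available CPUs is what shows up
    # as "{'CPU': 1.0}: N+ pending tasks/actors" in ray status.
    futures = [work.remote(i) for i in range(200)]
    print(sum(ray.get(futures)))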

If I run the worker.slurm script manually, it always adds a worker and an available CPU, so I don't think the issue is with the Slurm command. ray down also brings the cluster down correctly. I've been running Ray clusters on Slurm for the last three months using the instructions on the Ray site and never had an issue like this, but that was without autoscaling.
On stdout on the job console I'm seeing the following message pop up regularly:

(autoscaler +24m50s, ip=10.66.12.1) Restarting 16 nodes of type worker_node (lost contact with raylet).

Any advice on how to debug this?
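
One way to see what each raylet is actually reporting is the public ray.nodes() API (a minimal sketch; run it on the head node):

    import ray

    ray.init(address="auto")  # attach to the running cluster

    for node in ray.nodes():
        # Each entry is one raylet; note that several Slurm workers
        # can end up sharing the same NodeManagerAddress (IP).
        print(
            node["NodeManagerAddress"],
            node["Alive"],
            node["Resources"].get("CPU", 0),
        )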

Thanks for any reply!

Problem solved, kind of. For others who might run into this (and I sincerely hope that someone on this support group can explain the approved way of doing this), the problem stems from load_metrics tracking resources, both total and used, by node IP address. Because multiple Slurm workers can share the same IP address, you can get an underestimate of resources, which gives an erroneous ray status and an erroneous state to the autoscaler. To get this working correctly I added two fields to load_metrics:

        # Which IP each raylet reported from; lets us count raylets sharing an IP.
        self.ip_by_raylet_id = {}
        # CPU load per IP, accumulated across updates from raylets on that IP.
        self.cpu_load = {}
        

then modified update() to do the following:

    def update(
        self,
        ip: str,
        raylet_id: bytes,
        static_resources: Dict[str, Dict],
        dynamic_resources: Dict[str, Dict],
        resource_load: Dict[str, Dict],
        waiting_bundles: List[Dict[str, float]] = None,
        infeasible_bundles: List[Dict[str, float]] = None,
        pending_placement_groups: List[PlacementGroupTableData] = None,
        cluster_full_of_actors_detected: bool = False,
    ):
        # Accumulate CPU load per IP instead of overwriting it, since several
        # Slurm workers (raylets) can report from the same IP address.
        if 'CPU' in resource_load:
            if ip not in self.cpu_load:
                self.cpu_load[ip] = resource_load['CPU']
            else:
                self.cpu_load[ip] += resource_load['CPU']

        self.resource_load_by_ip[ip] = resource_load
        self.static_resources_by_ip[ip] = static_resources
        self.raylet_id_by_ip[ip] = raylet_id
        self.cluster_full_of_actors_detected = cluster_full_of_actors_detected
        self.ip_by_raylet_id[raylet_id] = ip

        # Report the total CPU count for this IP as the number of raylets that
        # share it (each of my Slurm workers provides exactly one CPU).
        if 'CPU' in self.static_resources_by_ip[ip]:
            total = 0
            for other_ip in self.ip_by_raylet_id.values():
                if ip == other_ip:
                    total += 1
            self.static_resources_by_ip[ip]['CPU'] = float(total)

        # Propagate the accumulated per-IP CPU load back into the per-IP view.
        for load_ip in self.cpu_load:
            self.resource_load_by_ip[load_ip]['CPU'] = self.cpu_load[load_ip]

The cluster I'm using only has CPUs, so I could hardcode this. I hope someone tells me there is a simple modification of the YAML that takes care of everything so I can undo this mess, but if there isn't, I do think that tracking these two extra quantities would fix a lot of Ray-on-Slurm issues and shouldn't harm other architectures.
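
To make the underestimate concrete, here is a toy illustration (plain Python with made-up addresses, not actual Ray internals) of what happens when per-node bookkeeping is keyed by IP while several Slurm workers share one IP:

    # Two raylets land on the same Slurm node, so they share an IP address.
    reports = [
        {"ip": "10.66.12.7", "raylet_id": b"raylet-a", "CPU": 1.0},
        {"ip": "10.66.12.7", "raylet_id": b"raylet-b", "CPU": 1.0},
    ]

    # Keyed by IP: the second report overwrites the first, so only 1 CPU is counted.
    static_by_ip = {}
    for r in reports:
        static_by_ip[r["ip"]] = {"CPU": r["CPU"]}
    print(sum(res["CPU"] for res in static_by_ip.values()))  # 1.0

    # Keyed by raylet_id (or aggregated per IP, as in the patch above): 2 CPUs.
    static_by_raylet = {r["raylet_id"]: {"CPU": r["CPU"]} for r in reports}
    print(sum(res["CPU"] for res in static_by_raylet.values()))  # 2.0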