Launching a cluster where multiple workers may end up on same hosts

varga · April 1, 2022, 12:57pm

Hi,
I’m trying to set up Ray in a Sun Grid Engine environment where a typical host has 32 cores.
However, a job/worker has the best chance of being dispatched when it requests 4 cores or fewer.
As a result, it is very likely that multiple workers will sometimes be launched on the same host.
Does launching a Ray cluster work correctly in this scenario?

Maybe I’m looking at this all wrong, but in our environment neither I nor ray up can ssh to a machine on the farm until a job has been started on that machine using qsub (with a requested number of CPUs). Once all the jobs have started, I can get the list of hosts, but because of the situation above, several jobs may be running on the same host.

Alternatively, is there a way to specify a max_workers per worker_ip? That would work because I know in advance how many jobs were started on each host.

Thanks.

varga · April 3, 2022, 5:52pm

Alternatively, if the script that starts my workers detects that a worker is already running on the same host, is there a way to increase num-cpus for the first worker so that I don’t have to start the second worker at all?
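
One way to approach the "one worker per host with a larger num-cpus" idea is to aggregate the job list before starting anything. The following is only a minimal sketch, not a confirmed Ray recipe for grid engines: it assumes the list of hosts (one entry per started qsub job) is already available, that every job was granted the same number of cores, and that workers are started by hand with ray start rather than by ray up. The function name plan_ray_workers, the host names, and the head address are hypothetical; --address and --num-cpus are existing ray start options.

```python
from collections import Counter


def plan_ray_workers(job_hosts, cpus_per_job, head_address):
    """Build one `ray start` command per distinct host.

    job_hosts    -- one host name per started grid job (duplicates allowed)
    cpus_per_job -- cores granted to each job (assumed identical for all jobs)
    head_address -- "ip:port" of the already running Ray head node
    """
    jobs_per_host = Counter(job_hosts)
    commands = {}
    for host, n_jobs in sorted(jobs_per_host.items()):
        # Advertise the combined cores of all jobs that landed on this host,
        # so a single worker stands in for several separate ones.
        num_cpus = n_jobs * cpus_per_job
        commands[host] = (
            f"ray start --address={head_address} --num-cpus={num_cpus}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical outcome of five 4-core jobs spread over three hosts.
    hosts = ["node01", "node02", "node01", "node03", "node01"]
    for host, cmd in plan_ray_workers(hosts, 4, "10.0.0.1:6379").items():
        print(f"{host}: {cmd}")
```

Each command would then be run once on its host, inside one of the jobs already running there. Note that num-cpus is a logical resource Ray uses for scheduling; it does not by itself pin the worker to the cores the grid engine assigned, so whether this fits depends on how strictly the farm enforces per-job core limits.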
