Hi,
I’m trying to set up Ray in a Sun Grid Engine (SGE) environment where a typical host has 32 cores.
However, a job/worker is most likely to be dispatched when it requests 4 cores or fewer.
As a result, multiple workers will often end up being launched on the same host.
Does launching a Ray cluster work correctly in this scenario?
Maybe I’m looking at this all wrong, but in our environment neither I nor ray up can ssh to a machine on the farm until a job has been started on that machine with qsub (requesting a number of CPUs). Once all the jobs have started, I can get the list of hosts, but because of the dispatch limit above, several of those jobs may be running on the same host.
Alternatively, is there a way to specify a max_workers per worker_ip? This would work because I know in advance how many jobs got started on each host.
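To make that last idea concrete, here is a rough sketch of the bookkeeping I have in mind. It assumes I already have a list with one host name per started qsub job, that every job was granted 4 cores, and that starting a single Ray node per host with the combined CPU count would be acceptable (I have not verified that this is the recommended approach). The function names, host names, and head address are made up for illustration; the only Ray pieces used are the `ray start` options `--address` and `--num-cpus`.

```python
from collections import Counter

CORES_PER_JOB = 4  # assumption: every qsub job requested 4 cores


def cpus_per_host(job_hosts, cores_per_job=CORES_PER_JOB):
    """Collapse a one-entry-per-job host list into {host: total CPUs granted}."""
    jobs = Counter(job_hosts)
    return {host: count * cores_per_job for host, count in jobs.items()}


def ray_start_commands(job_hosts, head_address, cores_per_job=CORES_PER_JOB):
    """Build one `ray start` command per host, capped at the CPUs granted there."""
    return {
        host: f"ray start --address={head_address} --num-cpus={cpus}"
        for host, cpus in cpus_per_host(job_hosts, cores_per_job).items()
    }


if __name__ == "__main__":
    # Hypothetical scheduler result: two jobs landed on node01, one on node02.
    hosts = ["node01", "node01", "node02"]
    for host, cmd in ray_start_commands(hosts, "headnode:6379").items():
        print(host, "->", cmd)
```

With this, a host that received two 4-core jobs would get one worker limited to 8 CPUs instead of two separate workers.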
Thanks.