How does Ray decide where to run a function?

rkube · December 22, 2021, 9:09pm

Hi,
I’m working with Ray on a SLURM-managed cluster and am puzzled by how Ray distributes work among its workers. In a test case I’m executing a benchmark function 8 times on a cluster of 4 nodes (1 head node, 3 workers). From the output it seems that the function is always executed on the same node.
The head node and all workers are identical, with num_cpus=32, and there are no special requirements in the benchmark function’s decorator. What am I missing?

I’m also observing that the same benchmark code runs about 3x slower on the Ray cluster than locally. Is this because the same worker is executing multiple instances of run_benchmark(A) simultaneously?

Source code and output are here: gist:4d1ff3d8ead4a46cf51cb750759e8a21 (GitHub)

rkube · December 23, 2021, 10:53am

I think I solved the problem. The missing part was to change the decorator to
@ray.remote(num_cpus=32).

How exactly is this information used internally by Ray? Does it tag the function as taking up 32 CPU cores of a worker, and does that inform how other functions are distributed across actors? That is, if a worker with 32 CPU cores is already executing a function decorated with @ray.remote(num_cpus=32), will it not execute another one of these functions at the same time?
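To make the question concrete, here is a toy model of how I understand the resource accounting. It is a sketch under assumptions, not Ray's actual scheduler: the `Node` class, `place_tasks`, `make_cluster`, and the first-fit placement rule are all made up for illustration. The only Ray facts it relies on are that a task requests 1 CPU by default and that `num_cpus` is a logical reservation the scheduler checks against each node's free CPUs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node with a logical CPU budget."""
    name: str
    total_cpus: int
    used_cpus: int = 0
    tasks: list = field(default_factory=list)

    def free_cpus(self):
        return self.total_cpus - self.used_cpus


def place_tasks(nodes, num_tasks, cpus_per_task):
    """Assumed first-fit rule: a task goes to the first node that still has
    enough free logical CPUs. Tasks that fit nowhere are returned as pending."""
    pending = []
    for task_id in range(num_tasks):
        for node in nodes:
            if node.free_cpus() >= cpus_per_task:
                node.used_cpus += cpus_per_task
                node.tasks.append(task_id)
                break
        else:
            # No node has room right now; the task has to wait.
            pending.append(task_id)
    return pending


def make_cluster():
    # 4 identical nodes with 32 logical CPUs each, as in my setup.
    return [Node(f"node{i}", 32) for i in range(4)]


if __name__ == "__main__":
    for cpus in (1, 32):
        cluster = make_cluster()
        pending = place_tasks(cluster, num_tasks=8, cpus_per_task=cpus)
        print(f"cpus_per_task={cpus}")
        for node in cluster:
            print(f"  {node.name}: tasks {node.tasks}")
        print(f"  pending: {pending}")
```

In this model, 8 tasks that each request 1 CPU all fit on the first node, which would match what I saw originally. With 32 CPUs requested per task, each node takes exactly one task and the remaining four wait until a node is free again.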
