Communication cost of job scheduling

Hi, say I want to dispatch tasks to multiple Slurm nodes. Is there any difference in the communication cost of job scheduling between defining one atomic task per unit of work and defining grouped atomic tasks?

1. define the atomic task

import ray

@ray.remote(num_cpus=1)
def atomic_task(args):
    return some_heavy_work(args)

2. define grouped atomic tasks

@ray.remote(num_cpus=1)
def grouped_atomic_tasks(list_of_args):
    return [some_heavy_work(args) for args in list_of_args]

list_of_args is a chunk of args manually grouped by the user for each worker.
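For concreteness, this is roughly how I would dispatch both versions; all_args and chunk_size below are just placeholders for my real workload:

import ray

ray.init(address="auto")  # assuming a Ray cluster is already running on the Slurm nodes

all_args = list(range(1000))  # placeholder arguments
chunk_size = 50               # placeholder chunk size

# Option 1: one Ray task per atomic unit of work
refs = [atomic_task.remote(args) for args in all_args]
results = ray.get(refs)

# Option 2: one Ray task per manually grouped chunk
chunks = [all_args[i:i + chunk_size] for i in range(0, len(all_args), chunk_size)]
refs = [grouped_atomic_tasks.remote(chunk) for chunk in chunks]
results = [r for chunk_results in ray.get(refs) for r in chunk_results]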

There’s definitely some task-scheduling overhead, but it shouldn’t be large (at most on the order of milliseconds per task, even in a large cluster). If each task is doing heavy work, it’s unlikely to be a problem.

Thanks! How about the cost of putting objects into the object store and having each worker get them back? Is there a significant difference between these two ways of defining jobs?

They should be really fast if you are using objects that support zero-copy serialization (e.g., NumPy arrays), because there is no deserialization cost when a worker accesses them.

If zero-copy is not supported, there is still a serialization cost, but the overhead of the put and get calls themselves shouldn’t be significant if your objects are reasonably large.
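As a minimal sketch of the zero-copy case (the array shape and the uses_array task below are made up): a large NumPy array is put into the object store once, and every task reads it without copying or deserializing it:

import numpy as np
import ray

ray.init(address="auto")

# Put the array into the object store once. Tasks running on the same node
# read it via shared memory as a zero-copy, read-only view.
shared_array = np.zeros((10_000, 1_000))  # made-up shape
array_ref = ray.put(shared_array)

@ray.remote(num_cpus=1)
def uses_array(arr, i):
    # arr arrives as a read-only NumPy view backed by the object store
    return float(arr[i].sum())

print(ray.get([uses_array.remote(array_ref, i) for i in range(8)]))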
