Performance overhead numbers

[Attached plots: ex_1000_1.1GB_files_0.25cpu_per_function40_nodes_8x32, ex_1000_1.1GB_files_0.125cpu_per_function40_nodes_8x32, ex_awss3_300M_chunksize_40_nodes_1.0cpus_per_call]
Hi — I have a quick performance/scale-related question.

I created a Ray cluster with 50 VMs, each having 8 vCPUs and 32 GB of memory. In the cluster YAML I specified that each VM has 7 vCPUs available. In the code you can see below, I specified that each function call should get 0.5 vCPUs.

So in an ideal world, it should be possible to run 7*50/0.5=700 function calls in parallel. In that ideal world, 10k of these calls would then take 10k/700= ~14 seconds.
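Just to make that back-of-the-envelope calculation explicit (a quick sketch, using only the numbers from this post):

nodes = 50
cpus_per_node = 7          # advertised per VM in the cluster YAML
cpus_per_call = 0.5

max_parallel = nodes * cpus_per_node / cpus_per_call  # 700 calls in flight
waves = 10_000 / max_parallel                         # ~14 waves of tasks,
                                                      # i.e. seconds rather than minutes
print(max_parallel, waves)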

Keeping in mind that this is just the theoretical optimum, assuming no overhead at all from the scheduler, it feels quite long that my test takes a bit more than 2 minutes.

Are you seeing the same, or am I missing something, or not using things correctly?

In particular, I'm wondering whether part of the problem could be that scheduling seems to be done in a somewhat sequential fashion. That is also suggested by some graphs I generated while running some other logic on Ray. Normally I'd expect the lines representing when each function invocation starts to ramp up in a more parallel way than what this graph implies. WDYT?

import time

import ray

# Connect to the existing cluster started from the cluster YAML.
ray.init(address="auto")

# Each call reserves 0.5 vCPUs, so a node advertising 7 vCPUs can run
# 14 of these tasks at once.
@ray.remote(num_cpus=0.5)
def f(x, i):
    print(i)
    time.sleep(0.5)
    return i * i

x = 2
futures = [f.remote(x, i) for i in range(10000)]
print(ray.get(futures))

Hi @mbehrendt, thanks for the post! I am also curious whether you see the same behavior if you set num_cpus=1 instead of 0.5. In that case, would it take 1 minute (half of what you observe), or would it be a lot faster than that?

Currently, much of the scheduling overhead we have found comes from worker process startup time, and one of the causes could be that we limit the number of workers starting concurrently to the node's CPU count (so with num_cpus=0.5 on an 8-CPU node, 8 workers are started first, and the other 8 are blocked until those have started). I am curious whether this is one of the factors behind the scheduling behavior you observed.
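(For concreteness, a minimal sketch of the comparison being asked for here — it reuses the toy task from the first post and simply times the whole batch end-to-end for both num_cpus settings:)

import time

import ray

ray.init(address="auto")

def run_batch(num_cpus, n_tasks=10000):
    # Re-declare the toy task with the given per-call CPU request.
    @ray.remote(num_cpus=num_cpus)
    def f(x, i):
        time.sleep(0.5)
        return i * i

    start = time.time()
    ray.get([f.remote(2, i) for i in range(n_tasks)])
    return time.time() - start

# Compare end-to-end wall-clock time for the two settings.
for cpus in (0.5, 1.0):
    print(cpus, run_batch(cpus))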

Hi @sangcho – thx a lot for the feedback.

I think this can be broken down into 2 sub-problems:

  1. Time a single scheduling and execution operation takes
  2. Time many of them take when invoked in parallel.

While work can probably be done on (1), the data I've been collecting so far seems to imply that the scheduling of workloads is a sequential operation rather than one done in parallel. Based on that, at least for me personally, it would be interesting to better understand whether the sequential behavior of the scheduler is as designed, or whether it is unexpected. So even if we could cut the execution time in half by, say, using twice the number of CPUs, to my mind it would be more interesting to see whether we can achieve parallelism in scheduling, since that could bring execution times well below one minute without having to increase the number of vCPUs assigned to this.
WDYT?
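(For reference, one minimal way to measure the ramp-up behind those graphs — a sketch only, reusing the toy task from above. It records when each task actually starts running and counts how many have started over time; clocks on the different nodes are assumed to be roughly in sync:)

import time

import ray

ray.init(address="auto")

@ray.remote(num_cpus=0.5)
def f(x, i):
    start = time.time()  # wall-clock time when this task began running
    time.sleep(0.5)
    return start

starts = sorted(ray.get([f.remote(2, i) for i in range(10000)]))
t0 = starts[0]

# If scheduling were fully parallel, the count should approach the cluster's
# capacity (~700 concurrent calls here) within a few seconds.
for sec in range(0, 121, 10):
    started = sum(1 for s in starts if s - t0 <= sec)
    print(f"{sec:>3}s: {started} tasks started")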

I did a quick test: changing num_cpus to 1 improved things a lot. The time for 10k tasks went down from 1m52s to 45s, although I haven't fully understood the reason yet.

Did you do that in a cluster of multiple machines, or on a single large one?

I think the primary challenge I'm seeing in my tests is that it takes 40-80 seconds to reach maximum concurrency. Is that something you're looking into? @jjyao

I'm using a cluster of multiple machines. What num_cpus are you using? Could you try num_cpus=1?

"it would be interesting to better understand whether the sequential behavior of the scheduler is as designed"

Regarding this comment: I believe this is intentional. IIUC, we currently submit tasks one by one. There's work in progress to submit as many tasks at a time as there are nodes, instead of one by one. cc @Alex for confirmation (and for why it is done this way).

Cool, thanks a lot for sharing these details. Is there already a rough target timeframe for this enhancement?

I already implemented it (if that's what Sang was referring to): https://github.com/ray-project/ray/pull/18647. You can change RAY_max_pending_lease_requests_per_scheduling_category to increase the parallelism (you can set it to the number of nodes).
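(A minimal sketch of one way to apply that override — assuming it is picked up from the environment of the process that starts Ray; on a multi-node cluster it may also need to be exported before `ray start` on each node, e.g. in the cluster YAML's start commands:)

import os

# Assumption: RayConfig overrides are read from RAY_-prefixed environment
# variables when the Ray core starts, so set this before importing Ray.
# 50 matches the number of nodes in the cluster described above.
os.environ["RAY_max_pending_lease_requests_per_scheduling_category"] = "50"

import ray  # imported after the override so it takes effect

ray.init(address="auto")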

I used 0.5 vCPUs, but I just also ran the scenario with 1.0 CPUs. That improved the ramp-up during startup, but also resulted in things being slower overall.