Some question about Ray's task scheduling

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

Hi, I have some questions about task scheduling in Ray. While using Ray (with Java API), I noticed that the end-to-end latencies for invoking an empty normal task (3.4ms on my servers) are higher compared to invoking an empty actor task (2ms). To investigate the reason, I went through Ray’s source code and discovered that the latency gap is due to the RequestWorkerLease RPC required for normal task scheduling.

According to the architecture paper, “The caller may schedule any number of tasks onto the leased worker, as long as the tasks are compatible with the granted resource request. Hence, leases can be thought of as an optimization to avoid communication with the scheduler for similar scheduling requests”. However, I observed that each time the caller invokes a normal task, it sends a RequestWorkerLease RPC to the raylet to obtain a leased worker. The lease time doesn’t seem to be effective. I would like to know if there’s a way to make the lease time work, allowing multiple normal tasks to be scheduled with a single RequestWorkerLease RPC.

If Ray currently requires a separate RequestWorkerLease RPC for each normal task invocation, I would like to inquire whether this is an implementation issue. Is it possible to eliminate the latency cost associated with it?

Hi @Zhiyuan_Dong,

Sorry for the late reply. The worker lease works in such a way: once a leased worker finishes a task, we check if there is any pending tasks, if there is one AND the lease is not expired then we will schedule the new task to the leased worker otherwise, the leased worker is returned back to the raylet immediately.