What is the expected startup time of worker processes?

Hello everyone, we are using Ray in maybe a bit of an unconventional way. We have calculations that are a bit too big to run on a single node, so we distribute them over multiple nodes using Ray. Because the software needs some environment set up around it (files on the right node), we use a placement group for that, and the application always uses the same number of workers as there are worker nodes. We run the Ray components inside containers on K8s and start them via the CLI; we do not use the operator.
Now the caveat: the calculations are used interactively, so we are quite sensitive to the latency of the setup.
During testing I found that if I call the endpoint again shortly after a previous call, Ray does not add any noticeable overhead. But when I wait a bit longer (around 2 seconds), the next request takes between 2.5 and 3.1 seconds to execute.
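For reference, the latencies above can be measured with a minimal stdlib-only harness like the sketch below; nothing here is Ray-specific, and `fn` just stands in for whatever wraps the actual remote call:

```python
import time

def time_call(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def probe_latency(fn, idle_gap_s, repeats=3):
    """Call fn repeatedly with a fixed idle gap between calls, collecting
    per-call latencies so warm and cold paths can be compared.
    With idle_gap_s below the worker idle timeout the calls stay warm;
    above it, each call should pay the worker startup cost again."""
    latencies = []
    for _ in range(repeats):
        _, elapsed = time_call(fn)
        latencies.append(elapsed)
        time.sleep(idle_gap_s)
    return latencies
```

Running it once with a short gap and once with a gap above ~2 seconds is how I get the numbers quoted above.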
I have already pinned it down to the way Ray handles worker leases.
When I run into the rather short idle timeout, the worker process is killed by the raylet, so a subsequent request has to start a new worker process. I already set RAY_enable_worker_prestart to 1 and RAY_worker_lease_timeout_milliseconds to 50000, but neither had any effect on the timings.
So my actual question is: what is the expected time until a calculation is executed on a worker, when there is no lease and the worker and head are on different nodes?
I am trying to get a feeling for how long that process normally takes and how to tune those values.
From the architecture document I gathered that there is quite a bit of back and forth between the raylet and the head, so the time does not look completely unreasonable, but 3 seconds still seems slow for such a startup in my view.
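For completeness, this is roughly how I apply the overrides. The RAY_* variables have to be in the raylet's environment before `ray start` runs; exporting them afterwards has no effect on an already-running raylet. The idle-worker threshold name is my assumption based on Ray's internal config defaults and may differ between versions, so please verify it against your Ray release:

```python
import os

# Overrides for the raylet process. The first two are the flags I tried;
# the third is an assumed name for the idle-worker kill threshold
# (from Ray's config defaults -- verify it exists in your Ray version).
RAY_TUNING = {
    "RAY_enable_worker_prestart": "1",
    "RAY_worker_lease_timeout_milliseconds": "50000",
    "RAY_idle_worker_killing_time_threshold_ms": "600000",  # assumed name
}

def ray_start_env():
    """Environment for launching `ray start` with the overrides applied."""
    env = dict(os.environ)
    env.update(RAY_TUNING)
    return env

# e.g. subprocess.run(["ray", "start", "--head"], env=ray_start_env())
```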

Are you using a static K8s cluster?

We are managing the worker nodes and the head node ourselves, so I think it counts as static.
It does not scale automatically to new nodes.