Spill/Restore workers not registering within timeout

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hello,

I am getting many instances of the following error message:

worker_pool.cc:544: Some workers of the worker process(141872) have not registered within the timeout. The process is still alive, probably it's hanging during start.

When I look up the worker process IDs, they all belong to spill or restore workers whose logs look like the following:

[2023-08-13 08:30:10,296 I 139895 139895] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 139895
[2023-08-13 08:30:37,591 I 139895 139895] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
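The gap between those two lines can be read straight off the timestamps. Below is a minimal Python sketch that computes it, assuming every log line starts with the `[YYYY-MM-DD HH:MM:SS,mmm` prefix shown above. `parse_ts` and `startup_gap_seconds` are hypothetical helper names, not part of Ray.

```python
import re
from datetime import datetime

# Matches the leading "[2023-08-13 08:30:10,296 " timestamp of a log line.
# Assumption: the log prefix always has this shape.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def parse_ts(line: str) -> datetime:
    """Extract the timestamp at the start of one log line."""
    match = TS_RE.match(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def startup_gap_seconds(first_line: str, second_line: str) -> float:
    """Seconds elapsed between two log lines."""
    return (parse_ts(second_line) - parse_ts(first_line)).total_seconds()


log_lines = [
    "[2023-08-13 08:30:10,296 I 139895 139895] core_worker_process.cc:107: "
    "Constructing CoreWorkerProcess. pid: 139895",
    "[2023-08-13 08:30:37,591 I 139895 139895] io_service_pool.cc:35: "
    "IOServicePool is running with 1 io_service.",
]

gap = startup_gap_seconds(log_lines[0], log_lines[1])
print(f"{gap:.3f} s between the two log lines")
```

For the excerpt above this comes to a little over 27 seconds between constructing the CoreWorkerProcess and the IOServicePool starting.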

My program is still running, so it isn’t a blocker, but I’m curious whether there’s any way to resolve this in case it becomes one later in the job. I am running on a SLURM cluster with Ray 2.6.2. Thanks in advance!

cc: @sangcho @rickyyx

Is this consistently reproducible?

Hello @sangcho. I have not run into the issue again, so I have not tried to reproduce it. I was mostly curious why it may have occurred.

This happens when a worker does not start in time (within 30 seconds). There can be many reasons for that. Some possibilities:

  • The node is under heavy load, which makes starting a new worker slow.
  • The worker uses a bad dependency and breaks before it finishes starting.

But it is hard to pin down the exact root cause if the issue is not reproducible.
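If it does come back, one rough way to tell the two possibilities apart is to time how long a fresh Python process on the affected node takes to import the worker's dependencies. The sketch below is only an approximation, since a real Ray worker does more at startup than import modules. `time_fresh_import` is a hypothetical helper name, and the 30-second default simply mirrors the timeout mentioned above.

```python
import subprocess
import sys
import time


def time_fresh_import(module: str, timeout_s: float = 30.0) -> float:
    """Return the seconds a brand-new Python process needs to import `module`.

    Raises subprocess.CalledProcessError if the import fails (a broken
    dependency) and subprocess.TimeoutExpired if it exceeds `timeout_s`.
    """
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        check=True,           # a failing import becomes an exception
        timeout=timeout_s,    # give up after the chosen limit
        capture_output=True,  # keep the child's output off the terminal
    )
    return time.perf_counter() - start


# On the affected node you would call time_fresh_import("ray"); a standard
# library module is used here so the snippet runs anywhere.
elapsed = time_fresh_import("json")
print(f"fresh 'import json' took {elapsed:.2f} s")
```

A slow but successful import points toward load or a slow filesystem, while a `CalledProcessError` points toward a broken dependency.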