Spill/Restore workers not registering within timeout

adityatv · August 13, 2023, 8:00pm

How severe does this issue affect your experience of using Ray?

Low: It annoys or frustrates me for a moment.

Hello,

I am getting a lot of error messages like the following:

worker_pool.cc:544: Some workers of the worker process(141872) have not registered within the timeout. The process is still alive, probably it's hanging during start.

When I look at the worker process IDs, they are all spill or restore workers whose logs look like the following:

[2023-08-13 08:30:10,296 I 139895 139895] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 139895
[2023-08-13 08:30:37,591 I 139895 139895] io_service_pool.cc:35: IOServicePool is running with 1 io_service.

My program is still running, so it isn’t a blocker, but I’m curious whether there’s any way to resolve this in case it becomes one later in the job. I am running on a SLURM cluster with Ray 2.6.2. Thanks in advance!

XIE · August 15, 2023, 5:27am

cc: @sangcho @rickyyx

sangcho · August 21, 2023, 1:38pm

Is this consistently reproducible?

adityatv · August 21, 2023, 10:09pm

Hello @sangcho. I have not run into the issue again, so I have not tried to reproduce it. I was mostly curious about why it may have occurred.

sangcho · August 22, 2023, 2:46pm

This happens if the worker somehow does not start in time (within 30 seconds). There can be many reasons for this. Some possibilities:

The load makes it slow to start a new worker.
The worker is using a bad dependency and is broken before it starts.

But it is hard to figure out the exact root cause if it is not reproducible.
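
To check the first possibility (slow startup under load), one option is to measure how long each worker took between the two log lines quoted in the original post. The following is a minimal sketch, not an official Ray tool: the function names are hypothetical, and it assumes the worker log keeps the timestamp format shown above.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a core-worker log line, e.g.
# [2023-08-13 08:30:10,296 I 139895 139895] core_worker_process.cc:107: ...
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def parse_timestamp(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def startup_gap_seconds(lines):
    """Seconds between 'Constructing CoreWorkerProcess' and the first
    'IOServicePool is running' line that follows it, or None if either
    line is missing."""
    start = None
    for line in lines:
        if start is None and "Constructing CoreWorkerProcess" in line:
            start = parse_timestamp(line)
        elif start is not None and "IOServicePool is running" in line:
            end = parse_timestamp(line)
            if end is not None:
                return (end - start).total_seconds()
    return None


# The two log lines from the original post.
sample = [
    "[2023-08-13 08:30:10,296 I 139895 139895] core_worker_process.cc:107: "
    "Constructing CoreWorkerProcess. pid: 139895",
    "[2023-08-13 08:30:37,591 I 139895 139895] io_service_pool.cc:35: "
    "IOServicePool is running with 1 io_service.",
]

gap = startup_gap_seconds(sample)
print(f"startup gap: {gap:.3f} s")
```

For the log in the original post the gap is about 27.3 seconds, which is close to the 30-second limit mentioned above and fits the slow-startup explanation. To apply it to a real worker, pass the lines of that worker's log file (for example `open(path).readlines()`); where those files live depends on your Ray setup.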
