(raylet core_worker.cc:451: Failed to register worker to Raylet. Invalid: Invalid: Unknown worker

HashBrown · December 15, 2021, 2:37pm

Whenever I push the cluster a little, I am first hit with logs like this:

Available resources on this node: {0.000000/80.000000 CPU, 6553970890.039062 GiB/10590988490.039062 GiB memory, 4748258459.960938 GiB/4748258459.960938 GiB object_store_memory, 1.000000/1.000000}
 In total there are 1 pending tasks and 33 pending actors on this node.
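To make a line like the one above easier to read, here is a minimal sketch of a parser for it. It assumes each entry inside the braces is `available/total` followed by an optional name, which matches how the line reads (0 of 80 CPUs free) but is not confirmed anywhere in this thread. The function name `parse_resources` and the `unnamed_<index>` fallback label are hypothetical, made up for this illustration.

```python
import re

# One entry looks like "0.000000/80.000000 CPU" or
# "<avail> GiB/<total> GiB memory"; the trailing name may be missing.
ENTRY = re.compile(
    r"(?P<avail>\d+(?:\.\d+)?)(?:\s*GiB)?/"
    r"(?P<total>\d+(?:\.\d+)?)(?:\s*GiB)?"
    r"(?:\s+(?P<label>\S+))?"
)


def parse_resources(line):
    """Return {name: (available, total)} parsed from the braces in `line`.

    Assumption: the first number of each pair is what is still available
    and the second is the node total.
    """
    start = line.index("{")
    end = line.index("}", start)
    resources = {}
    for i, part in enumerate(line[start + 1:end].split(",")):
        match = ENTRY.fullmatch(part.strip())
        if match is None:
            continue  # skip anything that does not look like a/b
        # Entries without a name get a placeholder label (hypothetical).
        label = match.group("label") or f"unnamed_{i}"
        resources[label] = (float(match.group("avail")), float(match.group("total")))
    return resources


if __name__ == "__main__":
    sample = (
        "Available resources on this node: {0.000000/80.000000 CPU, "
        "6553970890.039062 GiB/10590988490.039062 GiB memory, "
        "4748258459.960938 GiB/4748258459.960938 GiB object_store_memory, "
        "1.000000/1.000000}"
    )
    for name, (avail, total) in parse_resources(sample).items():
        print(f"{name}: {avail} available of {total}")
```

Under that assumption, the CPU entry comes out as `(0.0, 80.0)`: every CPU on the node is taken, which fits with the pending actors reported on the next line.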

This isn’t so scary, because I know that I am pushing the cluster, but all my actors exit once they finish to ensure progress (no deadlocks).

But later I am hit with

(raylet, ip=XX.XX.XX.XXX) [2021-12-15 08:24:54] core_worker.cc:451: Failed to register worker worker-id to Raylet. Invalid: Invalid: Unknown worker

This is a little more scary, as I have no idea what it means, and my job doesn’t seem to do everything that it was supposed to (I am assuming some of the actors didn’t make it).

So I was just wondering if anyone could give me some insight into this error. I understand that it is likely a result of pushing the cluster a little too hard, but please note that I am unable to gauge the exact resources my application needs ahead of time, so I just make a rough guess. However, it is designed so that when the cluster becomes full, forward progress is still ensured, as occupying actors exit and release their resources.

How can I avoid it? Is it just a warning or a true error?

cxy990729 · January 9, 2022, 1:22am

I have the same problem. Have you solved it?

Clark_Zinzow · January 10, 2022, 7:05pm

Note that this potential worker pool bug is captured in this GitHub issue: [Core] [Bug] Failed to register worker to Raylet for single node, multi-GPU · Issue #21226 · ray-project/ray · GitHub

cc @ericl @sangcho (assignees) for follow-up.
