(raylet core_worker.cc:451: Failed to register worker to Raylet. Invalid: Invalid: Unknown worker

Whenever I push the cluster a little, I am first hit with logs like this:

Available resources on this node: {0.000000/80.000000 CPU, 6553970890.039062 GiB/10590988490.039062 GiB memory, 4748258459.960938 GiB/4748258459.960938 GiB object_store_memory, 1.000000/1.000000}
 In total there are 1 pending tasks and 33 pending actors on this node.

This isn’t so scary: I know that I am pushing the cluster, but all my actors exit once they finish, which ensures forward progress (no deadlocks).
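In case it helps anyone watching for the same saturation, here is a minimal sketch of how a line like the one above could be parsed to see which resources are exhausted. It assumes the log format is exactly as pasted (entries of the form `available/total NAME`, optionally with `GiB` units); the function name `parse_available_resources` and the `"unnamed"` key for the entry without a resource name are my own placeholders, not anything from Ray.

```python
import re

# One entry looks like "0.000000/80.000000 CPU" or
# "6553970890.039062 GiB/10590988490.039062 GiB memory".
# The trailing resource name is optional because the pasted log
# ends with a bare "1.000000/1.000000" entry.
_ENTRY = re.compile(r"([\d.]+)(?:\s+GiB)?/([\d.]+)(?:\s+GiB)?(?:\s+(\w+))?")


def parse_available_resources(line):
    """Return {name: (available, total)} parsed from a resource log line.

    Hypothetical helper: assumes the format shown in the post above.
    """
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        return {}
    resources = {}
    for entry in line[start + 1:end].split(","):
        match = _ENTRY.fullmatch(entry.strip())
        if match is None:
            continue  # skip anything that does not look like an entry
        available, total, name = match.groups()
        resources[name or "unnamed"] = (float(available), float(total))
    return resources


def exhausted(resources):
    """Names of resources with nothing left available."""
    return [name for name, (available, _) in resources.items() if available == 0]
```

Run on the line I pasted, `exhausted(...)` would report only `CPU`, which matches my reading that the node is CPU-bound at that point while memory and object store still have headroom.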

But later I am hit with:

(raylet, ip=XX.XX.XX.XXX) [2021-12-15 08:24:54] core_worker.cc:451: Failed to register worker worker-id to Raylet. Invalid: Invalid: Unknown worker

This is a little scarier, as I have no idea what it means, and my job doesn’t seem to do everything that it was supposed to (I am assuming some of the actors didn’t make it).

So I was just wondering if anyone could give me some insight into this error. I understand that it is likely a result of pushing the cluster a little too hard, but please note that I am unable to gauge the exact resources my application needs ahead of time, so I just make a rough guess. However, it is designed such that when the cluster becomes full, forward progress is still ensured, as occupying actors exit and release their resources.

How can I avoid it? Is it just a warning or a true error?

I have the same problem. Have you solved it?

Note that this potential worker pool bug is captured in this GitHub issue: [Core] [Bug] Failed to register worker to Raylet for single node, multi-GPU · Issue #21226 · ray-project/ray · GitHub

cc @ericl @sangcho (assignees) for follow-up.