Whenever I push the cluster a little, I first see log messages like these:
Available resources on this node: {0.000000/80.000000 CPU, 6553970890.039062 GiB/10590988490.039062 GiB memory, 4748258459.960938 GiB/4748258459.960938 GiB object_store_memory, 1.000000/1.000000}
In total there are 1 pending tasks and 33 pending actors on this node.
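For anyone trying to read these numbers, here is a minimal sketch of pulling the per-resource values out of a line like the one above. It assumes each entry has the form `available/total name` (optionally with a `GiB` unit), which is my reading of the log rather than documented behavior. `parse_resources` is a hypothetical helper and not part of Ray.

```python
import re

# One entry, e.g. "0.000000/80.000000 CPU" or "1.0 GiB/2.0 GiB memory".
# The trailing name is optional because the last entry in the log has none.
ENTRY = re.compile(r"^([\d.]+)(?: GiB)?/([\d.]+)(?: GiB)?(?: (\S+))?$")


def parse_resources(line):
    """Return {name: (available, total)} for an 'Available resources' line.

    Assumption: the first number is what is available and the second is the total.
    """
    body = line[line.index("{") + 1 : line.rindex("}")]
    resources = {}
    for item in body.split(", "):
        match = ENTRY.match(item.strip())
        if match is None:
            continue  # skip anything that does not fit the assumed format
        available, total, name = match.groups()
        resources[name or "unnamed"] = (float(available), float(total))
    return resources


line = (
    "Available resources on this node: {0.000000/80.000000 CPU, "
    "6553970890.039062 GiB/10590988490.039062 GiB memory, "
    "4748258459.960938 GiB/4748258459.960938 GiB object_store_memory, "
    "1.000000/1.000000}"
)
parsed = parse_resources(line)
print(parsed["CPU"])  # (0.0, 80.0)
```

Read this way, 0 of 80 CPUs are available, so the node is fully allocated, which matches the pending tasks and actors reported alongside it.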
These messages aren’t too scary: I know that I am pushing the cluster, and all my actors exit once they finish, which ensures progress (no deadlocks).
But later I am hit with:
(raylet, ip=XX.XX.XX.XXX) [2021-12-15 08:24:54] core_worker.cc:451: Failed to register worker worker-id to Raylet. Invalid: Invalid: Unknown worker
This is a little scarier, as I have no idea what it means, and my job doesn’t seem to do everything that it was supposed to (I am assuming some of the actors didn’t make it).
So I was wondering if anyone could give me some insight into this error. I understand that it is likely a result of pushing the cluster a little too hard, but please note that I am unable to gauge the exact resources my application needs ahead of time, so I just make a rough guess. However, it is designed so that when the cluster becomes full, forward progress is still ensured, because the actors occupying it exit and release their resources.
How can I avoid it? Is it just a warning or a true error?