ray.exceptions.ObjectLostError: Object xxx is lost due to node failure

After migrating part of our codebase from Ray 0.8.0 to 1.1.0, we now get fatal errors like:

ray.exceptions.ObjectLostError: Object xxx is lost due to node failure

It then prints the local object store memory usage, which shows plenty of room, i.e. less than 5% used.

Suggestions on how to diagnose the issue?

We have a lot of Ray functions that store ObjectRefs in Python lists/dictionaries before returning them, and it initially appears that if we remove that construct and return the actual objects, the error goes away. Unfortunately, 1) doing this everywhere will involve a decent amount of refactoring, and 2) doing so makes the job execution time 3-5x worse (15 minutes -> over an hour, as one example).

I have not been able to reproduce this with a shareable example yet. Shorter jobs with similar patterns seem to run fine.

thanks,
Luke

OS: redhat 7.7
python: 3.8.6

Unfortunately I’m not sure if we have very good tracing tools for this. You can try ray memory to see the current references that are in scope, but that’s generally better for tracing memory leaks rather than missing references.

Are you running this on a single node? If so, could you start/run ray with the environment variable RAY_BACKEND_LOG_LEVEL=debug? Then, you could collect the logs in /tmp/ray/session_latest/logs and I can try to take a look and see if there’s anything immediately fishy. It’s possible we have a ref counting bug, but we can’t help much unless we have a reproducible example.

I am not running on a single node. Currently, this particular code is running on a 12 node cluster with 672 total cores. I guess I could set RAY_BACKEND_LOG_LEVEL=debug before the ray start command for each node and look at the logs when it crashes.
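As a rough sketch of that plan: after starting each node with the variable set (for example `RAY_BACKEND_LOG_LEVEL=debug ray start ...` with your usual arguments), something like the following could be run on every node after the crash to bundle its logs for attaching here. The function name `archive_ray_logs` and the archive naming scheme are made up for illustration; only the log directory comes from the suggestion above.

```python
import socket
import tarfile
from pathlib import Path


def archive_ray_logs(log_dir="/tmp/ray/session_latest/logs", out_dir="."):
    """Pack one node's Ray log directory into a tar.gz named after the host."""
    # resolve() follows symlinks, so the archive holds the real session's logs.
    log_path = Path(log_dir).resolve()
    if not log_path.is_dir():
        raise FileNotFoundError(f"no log directory at {log_path}")

    host = socket.gethostname()
    out_path = Path(out_dir) / f"ray-logs-{host}.tar.gz"
    with tarfile.open(str(out_path), "w:gz") as tar:
        # Prefix entries with the hostname so archives from the
        # different nodes can be unpacked side by side.
        tar.add(str(log_path), arcname=f"{host}-logs")
    return out_path


if __name__ == "__main__":
    print(archive_ray_logs())
```

Debug-level logs from a long job can get large, so it may be worth checking free space in the output directory first.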

Have you tried running a smaller version of the job? It would be good to check whether this causes the same crash on a smaller cluster size and on a single node. If this isn’t doable, then you can attach the logs for all of the nodes, but it would really be preferable to have it run on a single node or have a reproducible example.

I have not been able to get a smaller version, or one that runs on a single node, to show the issue (yet). I do have a version that reliably crashes 100% of the time, but it is a large, fairly complicated program, and I have been unable to write a representative program that displays the issue. When I change all of the returns to return the objects, and make all the subsequent changes needed to handle the new return types, I no longer have the issue, apart from the extreme slowness in comparison.


I am also facing this issue currently.