ray.exceptions.ObjectLostError: Object xxx is lost due to node failure

virtualluke · January 21, 2021, 1:28pm

After migrating part of our codebase from Ray 0.8.0 to 1.1.0, we now get fatal errors like:

ray.exceptions.ObjectLostError: Object xxx is lost due to node failure

It then dumps the local object store memory usage, which shows plenty of room, i.e., less than 5% used.

Suggestions on how to diagnose the issue?

We do have a lot of Ray functions that store ObjectRefs in Python lists/dictionaries before returning them, and it initially appears that if we remove that construct and return the actual objects instead, the error goes away. Unfortunately, 1) doing this everywhere would involve a decent amount of refactoring, and 2) doing so makes the job execution time 3-5x worse (15 minutes -> over an hour, as one example).

I have not been able to reproduce this with a shareable example yet. It seems to run fine with shorter jobs that use similar patterns.

thanks,
Luke

OS: redhat 7.7
python: 3.8.6

Stephanie_Wang · January 21, 2021, 5:11pm

Unfortunately, I’m not sure we have very good tracing tools for this. You can try ray memory to see the current references that are in scope, but that’s generally better for tracing memory leaks than for finding missing references.

Are you running this on a single node? If so, could you start/run ray with the environment variable RAY_BACKEND_LOG_LEVEL=debug? Then, you could collect the logs in /tmp/ray/session_latest/logs and I can try to take a look and see if there’s anything immediately fishy. It’s possible we have a ref counting bug, but we can’t help much unless we have a reproducible example.
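
Collecting the logs for sharing could be sketched roughly as follows. The helper name `archive_ray_logs` and the output filename are assumptions for illustration. Only the log directory path comes from the post above.

```python
import tarfile
from pathlib import Path


def archive_ray_logs(log_dir="/tmp/ray/session_latest/logs",
                     out_path="ray_logs.tar.gz"):
    """Bundle a Ray log directory into one gzip-compressed tarball."""
    # resolve() follows the session_latest symlink to the real session dir.
    log_dir = Path(log_dir).resolve()
    if not log_dir.is_dir():
        raise FileNotFoundError(f"log directory not found: {log_dir}")
    with tarfile.open(out_path, "w:gz") as tar:
        # Store entries under "logs/..." rather than the absolute path.
        tar.add(log_dir, arcname=log_dir.name)
    return Path(out_path)
```

Run on the node after the crash, this leaves a single file that can be attached to a reply.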

virtualluke · January 21, 2021, 7:57pm

I am not running on a single node. Currently, this particular code is running on a 12 node cluster with 672 total cores. I guess I could set RAY_BACKEND_LOG_LEVEL=debug before the ray start command for each node and look at the logs when it crashes.
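
Once debug logs exist on each node, one way to narrow them down is to search for the ID of the lost object (shown as xxx in the error above). This is a rough sketch. The helper `find_mentions` is hypothetical, and it assumes nothing about Ray's log format beyond the ID appearing as plain text.

```python
from pathlib import Path


def find_mentions(log_dir, needle):
    """Return (file name, line number, line) for each log line containing needle."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has stray bytes.
        with path.open("r", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line:
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

Running this against each node's copy of /tmp/ray/session_latest/logs would show which nodes ever logged the object, which may help decide whose logs to share first.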

Stephanie_Wang · January 21, 2021, 8:42pm

Have you tried running a smaller version of the job? It would be good to check whether this causes the same crash on a smaller cluster size and on a single node. If this isn’t doable, then you can attach the logs for all of the nodes, but it would really be preferable to have it run on a single node or have a reproducible example.

virtualluke · January 22, 2021, 5:39am

I have not been able to get a smaller version, or one that runs on a single node, to show the issue (yet). I do have a version that reliably crashes 100% of the time, but it is a large, fairly complicated program and I have been unable to write a representative program that displays the issue. When I change all of the returns to return the objects, and make all the subsequent changes to handle the new return types, I no longer have the issue, apart from the extreme slowness in comparison.

Javier_Bosch · July 26, 2021, 9:15pm

I am also facing this issue currently.
