How to find the reason for a node failure?

I am running a cluster on 2 nodes, each equipped with 8 GPUs. I am using a remote function to load and evaluate tensorflow.keras models in parallel. However, I am getting the following error, which does not happen when I run on a single CPU. Even after digging into the Ray logs, I am not able to find helpful information that would explain what I am doing wrong.

2021-05-11 14:01:03,505	INFO worker.py:640 -- Connecting to existing Ray cluster at address: 10.230.2.207:6379
Uncaught exception <class 'ray.exceptions.RayTaskError'>: ray::evaluate_model() (pid=238430, ip=10.230.2.196)
  File "python/ray/_raylet.pyx", line 458, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 349, in ray._raylet.raise_if_dependency_failed
ray.exceptions.ObjectLostError: Object ffffffffffffffffffffffffffffffffffffffff010000001b000000 is lost due to node failure.
Traceback (most recent call last):
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/dhgpu/bin/deephyper", line 33, in <module>
    sys.exit(load_entry_point('deephyper', 'console_scripts', 'deephyper')())
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/deephyper/deephyper/core/cli/cli.py", line 51, in main
    func(**kwargs)
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/deephyper/deephyper/core/cli/ensemble.py", line 54, in main
    ensemble_obj.fit(vx, vy)
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/deephyper/deephyper/nas/ensemble/uq_bagging_ensemble.py", line 133, in fit
    self.greedy_selection(X, y)
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/deephyper/deephyper/nas/ensemble/uq_bagging_ensemble.py", line 144, in greedy_selection
    model_files, model_losses = self.sort_models_by_min_loss(model_files, X, y)
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/deephyper/deephyper/nas/ensemble/uq_bagging_ensemble.py", line 114, in sort_models_by_min_loss
    model_losses = ray.get(model_losses)
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/dhgpu/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/lus/grand/projects/datascience/regele/thetagpu/deep-ensembles/dhgpu/lib/python3.8/site-packages/ray/worker.py", line 1481, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::evaluate_model() (pid=238430, ip=10.230.2.196)
  File "python/ray/_raylet.pyx", line 458, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 349, in ray._raylet.raise_if_dependency_failed
ray.exceptions.ObjectLostError: Object ffffffffffffffffffffffffffffffffffffffff010000001b000000 is lost due to node failure.
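
For context, the parallel evaluation pattern is roughly the sketch below (evaluate_model matches the function named in the traceback; the exact arguments and body are simplified, and model_files, X, y come from the ensemble-selection code):

import ray
import tensorflow as tf

ray.init(address="auto")  # driver attaches to the running cluster

@ray.remote(num_gpus=1)
def evaluate_model(model_file, X, y):
    # Each task loads one saved tf.keras model and returns its loss on (X, y).
    model = tf.keras.models.load_model(model_file)
    return model.evaluate(X, y, verbose=0)

# One task per saved model; ray.get blocks until all losses are collected.
model_losses = ray.get([evaluate_model.remote(f, X, y) for f in model_files])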

What version of Ray are you using?

@Deathn0t are you also using the Ray Client?

I am using Ray 1.3.0.

I am not sure about the Ray Client. I start the cluster with ray start and connect the driver with ray.init(address="auto"). I am looking at the logs in /tmp/ray/session_latest.
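
Concretely, my setup looks roughly like this (the port matches the address in the log above; <head_ip> is a placeholder):

# On the head node:    ray start --head --port=6379
# On each worker node: ray start --address=<head_ip>:6379
import ray

# The driver connects directly to the running cluster; this is not the Ray Client.
ray.init(address="auto")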

I am curious: if you use just a single node, can you reproduce the issue?