What happens if one of the workers goes down in the middle of execution?

I have 3 workers and a head node in my cluster. If one of the workers goes down in the middle of execution (to mimic the scenario, I stopped Ray on that worker), the current job fails with the error below.

I have some questions:

  1. Is this the right behaviour?
  2. Is it possible to create replicas at the task level, so that if a worker has an issue while executing, another worker takes over the tasks that were running on the failed worker node?
  3. What is the default heartbeat time? Is it configurable? Can you provide some pointers?

2021-02-04 02:20:06,664 WARNING worker.py:1072 -- The node with node id 517ec2d9b3accfad8b70f7d466bde3030071f852 has been marked dead because the detector has missed too many heartbeats from it.

RayActorError                             Traceback (most recent call last)
~/miniconda3/envs/ray132/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    939     if getattr(self._backend, 'supports_timeout', False):
--> 940         self._output.extend(job.get(timeout=self.timeout))
    941     else:

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/util/multiprocessing/pool.py in get(self, timeout)
    147     elif isinstance(result, Exception):
--> 148         raise result
    149     results.extend(batch)

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/util/multiprocessing/pool.py in run(self)
     74     try:
-->  75         batch = ray.get(ready_id)
     76     except ray.exceptions.RayError as e:

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
   1429     else:
--> 1430         raise value
   1431

RayActorError: The actor died unexpectedly before finishing this task.

During handling of the above exception, another exception occurred:

Environment:

ray --version

ray, version 1.0.0

python -V

Python 3.8.5

  1. Yes, this is the right behavior. To understand our fault-tolerance mechanism, please check Fault Tolerance — Ray v1.1.0.

  2. We don’t have a replication mechanism, but retries are supported (see the sketch after this list). Also, please check Fault Tolerance — Ray v1.1.0.

  3. The default heartbeat timeout is 30 seconds. Your log appeared simply because our central data plane could not receive heartbeats from a dead worker (which makes sense!). So configuring the heartbeat time is not necessary in this case (but please ask again if you still would like to do that, and I can tell you how).
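For question 2, here is a minimal sketch of what "retries instead of replicas" looks like in practice. It assumes a Ray version that supports `max_retries` for tasks and `max_restarts`/`max_task_retries` for actors; the function and class names are illustrative, not from your job, and the exact defaults may differ in Ray 1.0:

```python
import ray

ray.init(address="auto")  # connect to the existing cluster

# Tasks: if the node executing this task dies, Ray re-submits the task
# on a healthy node, up to max_retries times.
@ray.remote(max_retries=3)
def process_chunk(chunk):
    return sum(chunk)

# Actors: max_restarts lets Ray restart the actor process on another node
# after a failure; max_task_retries re-runs the interrupted actor method
# on the restarted actor.
@ray.remote(max_restarts=2, max_task_retries=2)
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

results = ray.get(
    [process_chunk.remote(range(i, i + 10)) for i in range(0, 100, 10)]
)
counter = Counter.remote()
print(ray.get(counter.add.remote(5)))
```

With this, if the node running `process_chunk` or hosting the `Counter` actor dies, Ray reschedules the work on a surviving node instead of failing the whole job.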


Also, for our distributed object fault-tolerance mechanism, we use "reconstruction" rather than replication as well (we store the lineage of objects and replay tasks to reconstruct the lost object).
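To make that concrete, a small sketch of what lineage replay means from the user's side. The names here are illustrative, and it assumes a Ray version where lineage-based object reconstruction is enabled for task outputs:

```python
import ray

ray.init(address="auto")

@ray.remote(max_retries=3)
def produce():
    # A large intermediate result that lives in the object store
    # of whichever worker node ran this task.
    return [x * x for x in range(10_000)]

ref = produce.remote()

# If the node holding the object dies before we fetch it, Ray does not
# read a replica; it replays produce() from the recorded lineage
# (subject to max_retries) and returns the reconstructed object here.
data = ray.get(ref)
```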