What happens if one of the workers goes down in the middle of execution?

I have 3 workers and a head node in my cluster. If one of the workers goes down in the middle of execution (to mimic the scenario, I stopped Ray on that worker), the current job fails with the error below.

I have some questions:

  1. Is this the right behaviour?
  2. Is it possible to create replicas at the task level, so that if a worker has an issue while executing, another worker takes over the tasks that were running on the failed worker node?
  3. What is the default heartbeat time? Is it configurable? Can you provide some pointers?

2021-02-04 02:20:06,664 WARNING worker.py:1072 -- The node with node id 517ec2d9b3accfad8b70f7d466bde3030071f852 has been marked dead because the detector has missed too many heartbeats from it.

RayActorError                             Traceback (most recent call last)
~/miniconda3/envs/ray132/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    939     if getattr(self._backend, 'supports_timeout', False):
--> 940         self._output.extend(job.get(timeout=self.timeout))
    941     else:

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/util/multiprocessing/pool.py in get(self, timeout)
    147     elif isinstance(result, Exception):
--> 148         raise result
    149     results.extend(batch)

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/util/multiprocessing/pool.py in run(self)
     74     try:
-->  75         batch = ray.get(ready_id)
     76     except ray.exceptions.RayError as e:

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
   1429     else:
--> 1430         raise value
   1431

RayActorError: The actor died unexpectedly before finishing this task.

During handling of the above exception, another exception occurred:

Environment:

ray --version

ray, version 1.0.0

python -V

Python 3.8.5

  1. Yes, this is the right behavior. To understand our fault-tolerance mechanism, please check Fault Tolerance — Ray v1.1.0.

  2. We don’t have a replication mechanism, but retries are supported (see the sketch after this list). Also, please check Fault Tolerance — Ray v1.1.0.

  3. The default heartbeat timeout is 30 seconds. Your log appeared simply because our central data plane could not receive heartbeats from a dead worker (which makes sense!). So configuring the heartbeat time is not necessary in this case (but please ask again if you still would like to do that, and I can tell you how).
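For question 2, here is a minimal sketch of what "retries instead of replicas" looks like in practice. It assumes a Ray version that supports `max_retries` for tasks and `max_restarts`/`max_task_retries` for actors; the function and class names are illustrative, not from your job, and the exact defaults may differ in Ray 1.0:

```python
import ray

ray.init(address="auto")  # connect to the existing cluster

# Tasks: if the node executing this task dies, Ray re-submits the task
# on a healthy node, up to max_retries times.
@ray.remote(max_retries=3)
def process_chunk(chunk):
    return sum(chunk)

# Actors: max_restarts lets Ray restart the actor process on another node
# after a failure; max_task_retries re-runs the interrupted actor method
# on the restarted actor.
@ray.remote(max_restarts=2, max_task_retries=2)
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total

results = ray.get(
    [process_chunk.remote(range(i, i + 10)) for i in range(0, 100, 10)]
)
counter = Counter.remote()
print(ray.get(counter.add.remote(5)))
```

With this, if the node running `process_chunk` or hosting the `Counter` actor dies, Ray reschedules the work on a surviving node instead of failing the whole job.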


Also, for our distributed object fault-tolerance mechanism, we use "reconstruction" rather than replication as well (we store the lineage of objects and replay tasks to reconstruct the lost object).
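To make that concrete, a small sketch of what lineage replay means from the user's side. The names here are illustrative, and it assumes a Ray version where lineage-based object reconstruction is enabled for task outputs:

```python
import ray

ray.init(address="auto")

@ray.remote(max_retries=3)
def produce():
    # A large intermediate result that lives in the object store
    # of whichever worker node ran this task.
    return [x * x for x in range(10_000)]

ref = produce.remote()

# If the node holding the object dies before we fetch it, Ray does not
# read a replica; it replays produce() from the recorded lineage
# (subject to max_retries) and returns the reconstructed object here.
data = ray.get(ref)
```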