I have 3 worker nodes and a head node in my cluster. If one of the workers goes down in the middle of execution (to mimic this scenario I stopped Ray on that worker), the current job fails with the error pasted below.
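For context, the driver is roughly of this shape. This is a simplified sketch rather than the actual code; `work` is just a placeholder for the real per-task function:

```python
import ray
from joblib import Parallel, delayed, parallel_backend
from ray.util.joblib import register_ray

ray.init(address="auto")  # connect to the running cluster (head node + 3 workers)
register_ray()            # register Ray as a joblib backend


def work(i):
    # placeholder for the real per-task computation
    return i * i


with parallel_backend("ray"):
    results = Parallel(n_jobs=-1)(delayed(work)(i) for i in range(1000))
```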
I have some questions:
- Is this the expected behaviour?
- Is it possible to get replication/retries at the task level, so that if a worker node fails mid-execution, another worker takes over the tasks that were running on the failed node? (See the sketch after this list for what I mean.)
- What is the default heartbeat timeout? Is it configurable? Can you provide some pointers?
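To make the last two questions concrete, this is the kind of thing I have in mind. It is only a guess on my side: I am not sure whether `max_retries` even applies in this situation (the traceback mentions an actor rather than a plain task), and I am not sure the `_system_config` keys below are the supported way to change the heartbeat behaviour (or whether the names differ between Ray versions), so please correct me:

```python
import ray

# Guess for question 3: tune the heartbeat-based failure detector when starting
# the cluster. These keys come from browsing the Ray source; I am not sure this
# is the supported knob, and the kwarg name may differ across Ray versions.
ray.init(
    _system_config={
        "raylet_heartbeat_timeout_milliseconds": 100,
        "num_heartbeats_timeout": 30,
    }
)


# Guess for question 2: task-level retries, so that a task lost together with
# its node is re-executed on another worker.
@ray.remote(max_retries=3)
def work(i):
    return i * i


results = ray.get([work.remote(i) for i in range(100)])
```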
2021-02-04 02:20:06,664 WARNING worker.py:1072 -- The node with node id 517ec2d9b3accfad8b70f7d466bde3030071f852 has been marked dead because the detector has missed too many heartbeats from it.

RayActorError                             Traceback (most recent call last)
~/miniconda3/envs/ray132/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    939         if getattr(self._backend, 'supports_timeout', False):
--> 940             self._output.extend(job.get(timeout=self.timeout))
    941         else:

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/util/multiprocessing/pool.py in get(self, timeout)
    147         elif isinstance(result, Exception):
--> 148             raise result
    149         results.extend(batch)

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/util/multiprocessing/pool.py in run(self)
     74         try:
-->  75             batch = ray.get(ready_id)
     76         except ray.exceptions.RayError as e:

~/miniconda3/envs/ray132/lib/python3.8/site-packages/ray/worker.py in get(object_refs, timeout)
   1429         else:
--> 1430             raise value
   1431

RayActorError: The actor died unexpectedly before finishing this task.

During handling of the above exception, another exception occurred:
Environment:
- ray --version: ray, version 1.0.0
- python -V: Python 3.8.5