We are running a Dask on Ray workload and occasionally see the error below: the computation fails when a node is lost. My expectation was that if a node were lost, Ray could regenerate any missing intermediate results. Are Dask on Ray workloads safe to run on Spot instances, where node failures are expected?
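For context, the script is essentially the following minimal sketch (the bucket path, column names, and partition count are placeholders, and the real data loading is trimmed; it assumes s3fs is installed for the S3 write):

```python
import ray
import dask
import dask.dataframe as dd
import pandas as pd
from ray.util.dask import ray_dask_get

def main():
    ray.init(address="auto")  # connect to the running cluster
    # Route Dask's compute() calls through the Ray scheduler.
    dask.config.set(scheduler=ray_dask_get)

    pdf = pd.DataFrame({"a": range(10_000), "b": range(10_000)})
    df = dd.from_pandas(pdf, npartitions=16)
    df = df.assign(c=df["a"] + df["b"])  # the failing graph contains an 'assign' step
    df.to_json("s3://my-bucket/ray/dask/output", compression="gzip")

if __name__ == "__main__":
    main()
```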
Thanks
Traceback (most recent call last):
  File "/Users/xxx/PycharmProjects/RelayTxRay/ray-test.py", line 37, in <module>
    main()
  File "/Users/xxx/PycharmProjects/RelayTxRay/ray-test.py", line 30, in main
    df.to_json('s3://xxx/xxx/ray/dask/xxx', compression='gzip')
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/dask/dataframe/core.py", line 1510, in to_json
    return to_json(self, filename, *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/dask/dataframe/io/json.py", line 88, in to_json
    dask_compute(parts, **compute_kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/dask/base.py", line 567, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/ray/util/dask/scheduler.py", line 457, in ray_dask_get_sync
    result = ray_get_unpack(object_refs)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/ray/util/dask/scheduler.py", line 381, in ray_get_unpack
    computed_result = ray.get(object_refs)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 61, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/ray/util/client/worker.py", line 225, in get
    res = self._get(obj_ref, op_timeout)
  File "/usr/local/Caskroom/miniconda/base/envs/ray-test2/lib/python3.7/site-packages/ray/util/client/worker.py", line 248, in _get
    raise err
ray.exceptions.RayTaskError: ray::dask:write_json_partition-be1172be-97d9-45e2-85d2-726504b22e68 (pid=181, ip=10.16.65.108)
  File "python/ray/_raylet.pyx", line 460, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::dask:('assign-a1ecc1c1a981a7ee5ea34bf021e663e3', 0) (pid=187, ip=10.16.65.108)
  File "python/ray/_raylet.pyx", line 460, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 481, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.raise_if_dependency_failed
ray.exceptions.ObjectLostError: Object a87b3e758b494778ffffffffffffffffffffffff0500000001000000 is lost due to node failure.
Process finished with exit code 1