How severe does this issue affect your experience of using Ray?
High: It blocks me to complete my task. Downgraded… now have a workaround
I send a task to a Ray cluster via ray.remote(callable).remote(**kwargs).
Everything works fine until I add an import from another module inside the module where this callable resides.
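A minimal sketch of the setup described above, not the actual application code. The names my_task and submit, the commented-out helper import, and the address placeholder are all hypothetical; the ray:// address form is assumed because the traceback goes through ray.util.client.

```python
# tasks.py (hypothetical module name)

# The failing variant adds an import from another module of the same
# project at this point, for example (hypothetical name):
# from myapp import helpers


def my_task(**kwargs):
    """Plain callable that is later wrapped with ray.remote()."""
    return sorted(kwargs)


def submit(address, **kwargs):
    """Send my_task to a Ray cluster and wait for its result.

    `address` is assumed to be a Ray Client address such as
    "ray://<head-node-host>:10001".
    """
    import ray  # imported lazily so this module loads without Ray installed

    ray.init(address=address)
    ref = ray.remote(my_task).remote(**kwargs)
    return ray.get(ref)
```

Calling submit("ray://<head-node-host>:10001", a=1) once with the extra import commented out and once with it enabled would show whether the import alone triggers the failure.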
With the import in place, I get this traceback:
File "/home/fboon/code/app/ray/python/ray/remote_function.py", line 250, in remote
return func_cls._remote(args=args, kwargs=kwargs, **updated_options)
File "/home/fboon/code/app/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/fboon/code/app/ray/python/ray/util/tracing/tracing_helper.py", line 310, in _invocation_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/home/fboon/code/app/ray/python/ray/remote_function.py", line 272, in _remote
return client_mode_convert_function(self, args, kwargs, **task_options)
File "/home/fboon/code/app/ray/python/ray/_private/client_mode_hook.py", line 164, in client_mode_convert_function
return client_func._remote(in_args, in_kwargs, **kwargs)
File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 308, in _remote
return self.options(**option_args).remote(*args, **kwargs)
File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 599, in remote
return return_refs(ray.call_remote(self, *args, **kwargs))
File "/home/fboon/code/app/ray/python/ray/util/client/api.py", line 100, in call_remote
return self.worker.call_remote(instance, *args, **kwargs)
File "/home/fboon/code/app/ray/python/ray/util/client/worker.py", line 555, in call_remote
task = instance._prepare_client_task()
File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 605, in _prepare_client_task
task = self._remote_stub._prepare_client_task()
File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 334, in _prepare_client_task
self._ensure_ref()
File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 329, in _ensure_ref
self._ref = ray.worker._put_pickled(
File "/home/fboon/code/app/ray/python/ray/util/client/worker.py", line 506, in _put_pickled
resp = self.data_client.PutObject(req)
File "/home/fboon/code/app/ray/python/ray/util/client/dataclient.py", line 568, in PutObject
resp = self._blocking_send(datareq)
File "/home/fboon/code/app/ray/python/ray/util/client/dataclient.py", line 458, in _blocking_send
self._check_shutdown()
File "/home/fboon/code/app/ray/python/ray/util/client/dataclient.py", line 511, in _check_shutdown
raise ConnectionError(msg)
ConnectionError: Request can't be sent because the Ray client has already been disconnected due to an error. Last exception: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.NOT_FOUND
details = "Attempted to reconnect a session that has already been cleaned up"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Attempted to reconnect a session that has already been cleaned up", grpc_status:5, created_time:"2024-09-05T14:03:36.081482247+01:00"}"
The error happens even if I put the import within a try/except.
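One observation that may explain why the try/except makes no difference: the traceback above surfaces inside .remote(), not at the import statement, so a guard around the import only covers exceptions raised while the import itself runs. The sketch below illustrates that with standard-library code only; guarded_import and submit_stub are hypothetical stand-ins, not Ray APIs.

```python
import importlib


def guarded_import(module_name):
    """Import a module by name; return None if the import itself raises."""
    try:
        return importlib.import_module(module_name)
    except Exception as exc:
        print(f"import of {module_name!r} failed: {exc!r}")
        return None


def submit_stub():
    """Stand-in for the later .remote() call, where the error actually surfaces."""
    raise ConnectionError("simulated: client already disconnected")


if __name__ == "__main__":
    guarded_import("json")  # succeeds, nothing to catch
    try:
        submit_stub()  # raised here, outside the import guard
    except ConnectionError as exc:
        print(f"caught at submission time: {exc}")
```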
I was initially using 2.9.3 but I see the exact same issue with 2.35.0.
I did an editable install of a local version of 2.35.0 to add some debugging, but that hasn’t isolated the problem beyond showing that it happens in this line:
So it is within cv.wait_for().
I can’t see how to debug inside that. RAY_PDB=1 isn’t opening a debugger on this crash.
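As far as I understand, RAY_PDB targets exceptions raised inside remote tasks and actors, whereas this ConnectionError is raised in the local client process. A possible way to inspect it locally is sketched below using only the standard library: it prints the chained exceptions, dumps the stack of every thread (the wait happens while another thread talks to the server), and optionally opens a post-mortem debugger. The helper names exception_chain and call_with_post_mortem are hypothetical.

```python
import pdb
import sys
import traceback


def exception_chain(exc):
    """Return exc followed by the exceptions linked via __cause__/__context__."""
    chain, seen = [], set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(exc)
        exc = exc.__cause__ or exc.__context__
    return chain


def call_with_post_mortem(fn, *args, interactive=True, **kwargs):
    """Call fn; on failure print diagnostics, optionally start pdb, then re-raise."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        for link in exception_chain(exc):
            print(f"{type(link).__name__}: {link}", file=sys.stderr)
        # Show what every thread is doing at the moment of the failure.
        for thread_id, frame in sys._current_frames().items():
            print(f"--- thread {thread_id} ---", file=sys.stderr)
            traceback.print_stack(frame, file=sys.stderr)
        if interactive:
            pdb.post_mortem(exc.__traceback__)
        raise
```

Wrapping the submission, for example call_with_post_mortem(lambda: ray.remote(callable).remote(**kwargs)), would drop into pdb at the frame that raised, where the surrounding client state can be inspected.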
Pointers on how to debug very welcome!