Debugging inside cv.wait_for()

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task. Downgraded…now have a workaround

There is a task that I send to a Ray cluster via ray.remote(callable).remote(**kwargs).

Everything works fine until I add an import of another module to the module where this callable resides.
Once that import is there, I get this traceback:

  File "/home/fboon/code/app/ray/python/ray/remote_function.py", line 250, in remote
    return func_cls._remote(args=args, kwargs=kwargs, **updated_options)
  File "/home/fboon/code/app/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/fboon/code/app/ray/python/ray/util/tracing/tracing_helper.py", line 310, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/fboon/code/app/ray/python/ray/remote_function.py", line 272, in _remote
    return client_mode_convert_function(self, args, kwargs, **task_options)
  File "/home/fboon/code/app/ray/python/ray/_private/client_mode_hook.py", line 164, in client_mode_convert_function
    return client_func._remote(in_args, in_kwargs, **kwargs)
  File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 308, in _remote
    return self.options(**option_args).remote(*args, **kwargs)
  File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 599, in remote
    return return_refs(ray.call_remote(self, *args, **kwargs))
  File "/home/fboon/code/app/ray/python/ray/util/client/api.py", line 100, in call_remote
    return self.worker.call_remote(instance, *args, **kwargs)
  File "/home/fboon/code/app/ray/python/ray/util/client/worker.py", line 555, in call_remote
    task = instance._prepare_client_task()
  File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 605, in _prepare_client_task
    task = self._remote_stub._prepare_client_task()
  File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 334, in _prepare_client_task
    self._ensure_ref()
  File "/home/fboon/code/app/ray/python/ray/util/client/common.py", line 329, in _ensure_ref
    self._ref = ray.worker._put_pickled(
  File "/home/fboon/code/app/ray/python/ray/util/client/worker.py", line 506, in _put_pickled
    resp = self.data_client.PutObject(req)
  File "/home/fboon/code/app/ray/python/ray/util/client/dataclient.py", line 568, in PutObject
    resp = self._blocking_send(datareq)
  File "/home/fboon/code/app/ray/python/ray/util/client/dataclient.py", line 458, in _blocking_send
    self._check_shutdown()
  File "/home/fboon/code/app/ray/python/ray/util/client/dataclient.py", line 511, in _check_shutdown
    raise ConnectionError(msg)
ConnectionError: Request can't be sent because the Ray client has already been disconnected due to an error. Last exception: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.NOT_FOUND
	details = "Attempted to reconnect a session that has already been cleaned up"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Attempted to reconnect a session that has already been cleaned up", grpc_status:5, created_time:"2024-09-05T14:03:36.081482247+01:00"}"

The error happens even if I put the import within a try/except

I was initially using 2.9.3 but I see the exact same issue with 2.35.0.
I did an editable install of a local copy of 2.35.0 to add some debugging, but that hasn’t narrowed it down beyond showing that it happens on this line:

So within cv.wait_for()

I can’t see how to debug inside that.

RAY_PDB=1 isn’t opening a debugger on this crash

Pointers on how to debug very welcome!
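
One general way to see past a blocked cv.wait_for() is to dump the stack of every thread, since the thread that actually fails is often not the one that is waiting. Below is a minimal sketch using only the standard library; `dump_all_thread_stacks` and the `waiter` thread are hypothetical names for illustration and are not part of Ray.

```python
import sys
import threading
import traceback


def dump_all_thread_stacks():
    """Return a dict mapping thread name -> formatted stack for every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks


# Demo: a background thread that blocks in Condition.wait_for().
cv = threading.Condition()
state = {"ready": False}
started = threading.Event()


def waiter():
    with cv:
        started.set()
        # The timeout is only a safety net so the demo cannot hang forever.
        cv.wait_for(lambda: state["ready"], timeout=5)


t = threading.Thread(target=waiter, name="waiter", daemon=True)
t.start()
started.wait()

# Inspect what every thread is doing while the waiter is blocked.
stacks = dump_all_thread_stacks()
print(stacks["waiter"])

# Release the waiter so the demo exits cleanly.
with cv:
    state["ready"] = True
    cv.notify_all()
t.join()
```

Calling the helper from a timer or a signal handler in the real client process would show which background thread is active when the wait is interrupted. The standard library's faulthandler.dump_traceback_later() offers a similar dump after a fixed delay.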

This seems to be a Ray Core question, so I’m moving it to the Ray Core category.
You could also try the VS Code debugger integration, the Ray Distributed Debugger (see “Easily Debug Ray Applications with Ray Distributed Debugger”).

Thanks, a handy thing to know about for sure, however not useful here as I’ve tried a breakpoint()…this is outside that…

Interestingly, I can do the import inside the function just fine…unfortunately that’s no use in this case, as the import is needed to add decorators to the function.

This may be related…this is Python inside the Docker image which runs the cluster nodes:

# python
>>> from hamilton import function_modifiers
>>> import ray
>>> ray.init()
ERROR: Flag 'grpc_experiments' was defined more than once but with differing types. Defined in files 'home/coder/grpc/src/core/lib/config/config_vars.cc' and 'src/core/lib/config/config_vars.cc'.
#

i.e. Python just exits with no message
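
When the interpreter dies like this, running the same statements in a child process at least captures the exit code and anything written to stderr. This is a minimal sketch; `probe_import` is a hypothetical helper, and the statements passed to it here are a harmless stand-in so that the example runs anywhere.

```python
import subprocess
import sys


def probe_import(statements, timeout=120):
    """Run Python statements in a child interpreter.

    Returns (returncode, stdout, stderr) so a hard exit in the child
    cannot take the calling process down with it.
    """
    result = subprocess.run(
        [sys.executable, "-c", statements],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in statements; on the cluster image one might instead pass
# "from hamilton import function_modifiers; import ray; ray.init()".
code, out, err = probe_import(
    "import sys; print('boom', file=sys.stderr); sys.exit(3)"
)
print("exit code:", code)
print("stderr:", err.strip())
```

On POSIX systems a negative return code means the child was killed by a signal, which helps distinguish an abort in native code from a normal Python exit.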

Tried pip uninstall xgboost (I don’t need it, it came with the base image)

See this link for xgboost hint:

I still think that this is some problem with the base image: nvcr.io/nvidia/pytorch:24.07-py3

Although why importing this module would trigger that I don’t know

This is the module being imported:

OK, all works if I disable hamilton plugins before importing it:

>>> import os
>>> os.environ["HAMILTON_AUTOLOAD_EXTENSIONS"] = "0"
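
The order matters here: the variable has to be set before the module is first imported in that process. Below is a minimal sketch of wrapping that up; `import_with_env` is a hypothetical helper, and "json" stands in for the real module so that the sketch runs without Hamilton installed. The `runtime_env` dictionary follows Ray's `env_vars` option for passing environment variables to workers; whether that alone is enough for this particular crash is an assumption and has not been tested here.

```python
import importlib
import os


def import_with_env(module_name, env):
    """Set environment variables, then import the module so it sees them at import time.

    Has no effect on modules already imported in this process.
    """
    os.environ.update(env)
    return importlib.import_module(module_name)


# "json" is a stand-in; the real call would name the Hamilton module instead.
mod = import_with_env("json", {"HAMILTON_AUTOLOAD_EXTENSIONS": "0"})

# Assumption: the same variable can be handed to the Ray workers, e.g.
#   ray.init(runtime_env=runtime_env)
runtime_env = {"env_vars": {"HAMILTON_AUTOLOAD_EXTENSIONS": "0"}}
```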

I don’t know if Ray can catch this error & work around it, or at least make it display in the logs…