Trainable not found -- 1.9.0

Our tune.run flow fails with the following error on Ray 1.9.0:

DEBUG:ray.tune.registry:Detected class for trainable.
DEBUG:ray.worker:Automatically increasing RLIMIT_NOFILE to max value of 1048576
Entering tune.run
2021-12-04 13:16:46,364 ERROR trial_runner.py:958 -- Trial <trial_name>: Error processing event.
Traceback (most recent call last):
  File "/home/ubuntu/Envs/<env>/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 924, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ubuntu/Envs/<env>/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 787, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ubuntu/Envs/<env>/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/Envs/<env>/lib/python3.7/site-packages/ray/worker.py", line 1715, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::<trainable_name>.__init__() (pid=25273, ip=<ip address>)
RuntimeError: The actor with name <trainable_name> failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

ray::<trainable_name>.__init__() (pid=25273, ip=172.31.27.245)
ModuleNotFoundError: No module named '<folder>.<script in which trainable resides>'

I tried adding a runtime_env to ray.init with working_dir set, but I couldn’t get that to work because of an error that directed me to pip install ray[default] first. Even after installing that, it still didn’t work.
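For reference, this is roughly the shape of what I tried (the path is a placeholder for our project directory):

import ray

# Roughly what I attempted: ship the project directory so workers
# can import the module the trainable lives in.
ray.init(
    runtime_env={"working_dir": "/home/ubuntu/<project_root>"}  # placeholder path
)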

I switched back to a 1.8 installation and it ran without issues; I didn’t have to specify any runtime environments or anything else. I wonder what changed?

Hi @Vishnu, I’m not sure what’s causing the regression; perhaps someone from the Ray Tune team can help out.

I can help debug the ModuleNotFoundError in isolation. When you say it didn’t work after installing ray[default], do you mean that the runtime_env setup didn’t work, or that the same ModuleNotFoundError appeared?

Are you running on a remote Ray Cluster and using Ray Client to connect to it from a local machine?
You should be able to use either working_dir or py_modules to ensure your local modules are importable on the cluster: Handling Dependencies — Ray v1.9.0
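As a rough sketch (the address and paths below are placeholders), either option is passed through the runtime_env in ray.init:

import ray

# Ship local code to the workers via runtime_env.
ray.init(
    address="ray://<head_node_ip>:10001",  # Ray Client address; use "auto" if running on the head node
    runtime_env={
        "working_dir": ".",  # upload the current project directory
        # "py_modules": ["/path/to/<folder>"],  # alternative: upload specific local packages
    },
)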

One other thing to try: import that module inside a plain Ray task and see whether it works or raises the same ModuleNotFoundError; that might help narrow things down.
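Something like this minimal sketch would tell us whether a regular worker process can import it (the module name is the placeholder from your error message):

import importlib
import ray

ray.init(address="auto")  # connect to the running cluster on this instance

@ray.remote
def try_import():
    # "<folder>.<script>" stands in for the module the trainable lives in
    importlib.import_module("<folder>.<script>")
    return "import ok"

print(ray.get(try_import.remote()))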

This doesn’t look like a Ray Tune-specific issue, though, but rather a general problem with Ray object serialization / runtime env propagation. I second the suggestion to run this in a regular Ray task and see whether the problem reproduces there.

  1. I meant the runtime_env setup didn’t work.
  2. It was a Ray cluster on an EC2 instance and I was trying to connect to it from the same instance.

In hindsight, I definitely rushed the 1.9.0 tests. I’ll try out these suggestions and then document what I find in detail here; I just wanted to flag it early in case it was a common issue.

Thanks, let us know what the results are! As for the regression, I’m linking another ModuleNotFoundError regression from 1.8 to 1.9 that we’re still investigating; we don’t know the root cause yet, but it might be related: Trainable not found -- 1.9.0 - #3 by kai