ModuleNotFoundError for torch

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m new to Ray. To try some sample code, I’m following Fine-tune a Hugging Face Transformers Model — Ray 2.40.0 step by step. Everything looks good until I get to trainer.fit(); the error message I get is:

ModuleNotFoundError: No module named 'torch'.

(I’ve added the full error at the end of the message.)

I can import torch and import ray.train.torch without any problem; the only failure is in trainer.fit().
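
For reference, the setup from that tutorial boils down to something like this (a simplified sketch, not my exact cells; the num_workers/use_gpu values here are placeholders):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # In the tutorial this builds the Hugging Face model, datasets and
    # transformers.Trainer and calls trainer.train(). It runs on the Ray
    # workers, not in the notebook process.
    ...


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()  # <- this is the call that fails for me
```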

Additional information:
PyTorch was installed as root for all users; the path looks like this:
/opt/data/python/pytorch/venv/lib/python3.9/site-packages/
PyTorch version: 2.5.1+cu124
Ray version: 2.40.0
Python version: 3.9
I’m running the code in a Jupyter notebook.

Can you let me know what the problem is?
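
If it helps, this is the kind of check I can run to compare the notebook environment with whatever the Ray workers see (a minimal sketch):

```python
import sys

import ray
import torch

# In the notebook (driver) process torch imports fine:
print(sys.executable)     # which Python the notebook kernel is using
print(torch.__version__)  # 2.5.1+cu124


@ray.remote
def check_worker_env():
    # Runs on a Ray worker, which may use a different Python environment
    # than the notebook kernel.
    import importlib.util
    import sys
    return {
        "python": sys.executable,
        "torch_found": importlib.util.find_spec("torch") is not None,
    }


print(ray.get(check_worker_env.remote()))
```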

trainer.fit()


ModuleNotFoundError Traceback (most recent call last)
Cell In[37], line 1
----> 1 trainer.fit()

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/train/base_trainer.py:580, in BaseTrainer.fit(self)
577 from ray.tune import ResumeConfig, TuneError
578 from ray.tune.tuner import Tuner
--> 580 trainable = self.as_trainable()
581 param_space = self._extract_fields_for_tuner_param_space()
583 self.run_config.name = (
584 self.run_config.name or StorageContext.get_experiment_dir_name(trainable)
585 )

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/train/base_trainer.py:827, in BaseTrainer.as_trainable(self)
824 trainable_cls = self._generate_trainable_cls()
826 # Wrap with tune.with_parameters to handle very large values in base_config
--> 827 return tune.with_parameters(trainable_cls, **base_config)

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/tune/trainable/util.py:107, in with_parameters(trainable, **kwargs)
105 prefix = f"{str(trainable)}_"
106 for k, v in kwargs.items():
--> 107 parameter_registry.put(prefix + k, v)
109 trainable_name = getattr(trainable, "__name__", "tune_with_parameters")
110 keys = set(kwargs.keys())

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/tune/registry.py:301, in _ParameterRegistry.put(self, k, v)
299 self.to_flush[k] = v
300 if ray.is_initialized():
--> 301 self.flush()

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/tune/registry.py:313, in _ParameterRegistry.flush(self)
311 self.references[k] = v
312 else:
--> 313 self.references[k] = ray.put(v)
314 self.to_flush.clear()

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/_private/auto_init_hook.py:21, in wrap_auto_init.<locals>.auto_init_wrapper(*args, **kwargs)
18 @wraps(fn)
19 def auto_init_wrapper(*args, **kwargs):
20 auto_init_ray()
--> 21 return fn(*args, **kwargs)

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py:102, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
98 if client_mode_should_convert():
99 # Legacy code
100 # we only convert init function if RAY_CLIENT_MODE=1
101 if func.__name__ != "init" or is_client_mode_enabled_by_default:
--> 102 return getattr(ray, func.__name__)(*args, **kwargs)
103 return func(*args, **kwargs)

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/util/client/api.py:52, in _ClientAPI.put(self, *args, **kwargs)
44 def put(self, *args, **kwargs):
45 """put is the hook stub passed on to replace ray.put
46
47 Args:
(...)
50 kwargs: opaque keyword arguments
51 """
--> 52 return self.worker.put(*args, **kwargs)

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/util/client/worker.py:495, in Worker.put(self, val, client_ref_id, _owner)
487 raise TypeError(
488 "Calling 'put' on an ObjectRef is not allowed "
489 "(similarly, returning an ObjectRef from a remote "
(...)
492 "call 'put' on it (or return it)."
493 )
494 data = dumps_from_client(val, self._client_id)
--> 495 return self._put_pickled(data, client_ref_id, _owner)

File /opt/data/python/ray/venv/lib/python3.9/site-packages/ray/util/client/worker.py:509, in Worker._put_pickled(self, data, client_ref_id, owner)
507 if not resp.valid:
508 try:
--> 509 raise cloudpickle.loads(resp.error)
510 except (pickle.UnpicklingError, TypeError):
511 logger.exception("Failed to deserialize {}".format(resp.error))

ModuleNotFoundError: No module named 'torch'

From the stacktrace it seems like you are using Ray Client, which may be causing the issue. Can you try running without Ray Client?
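
Ray Client is what you get when ray.init() points at a ray:// address: objects are pickled in your local notebook process and unpickled on the cluster, so every module they reference (including torch) also has to be importable in the cluster's Python environment. A rough sketch of the difference (the address is a placeholder):

```python
import ray

# Ray Client: note the ray:// prefix. Objects passed to the trainer are
# pickled in the notebook and unpickled on the cluster, so the cluster's
# environment must also have torch installed.
# ray.init(address="ray://<head-node-host>:10001")  # placeholder address

# Without Ray Client: run the script on a machine that is part of the
# cluster (or submit it as a Ray Job) and connect to the local Ray instance.
ray.init(address="auto")
```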

I’m using a server where Ray has been installed for us as users. I want to be able to access different nodes in our cluster and simply train my model. Before doing that, I started with the sample code from the post to get familiar with Ray. I don’t know how Ray was actually installed for us, since I’m not the admin. But what would the difference be? If I’m running through Ray Client, shouldn’t I be able to access my cluster with ray.init(), or is the setup different? I need this information so I can tell the admin to install it differently.
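
For example, would something along these lines be enough to make torch available to the workers when going through Ray Client, or does the admin need to install it into the cluster's environment directly? (Just a sketch based on the runtime_env docs; the address and package list are guesses.)

```python
import ray

# Guess at a workaround: ask Ray to install the training dependencies into
# the worker environment when connecting, via a runtime_env.
ray.init(
    address="ray://<head-node-host>:10001",  # placeholder address
    runtime_env={"pip": ["torch==2.5.1", "transformers", "datasets"]},
)
```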

Second attempt, on my local machine: I got training to start (after so many solvable errors :smiley: ). But now I get:

"RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from Download The Latest Official NVIDIA Drivers"

Does that training sample only run on GPU, even though I’ve set gpu==False?
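
For context, my understanding is that use_gpu=False in the ScalingConfig only controls where Ray schedules the workers; below is a sketch of what I suspect might also be needed inside the training function, though I’m not sure:

```python
from transformers import TrainingArguments

# Inside the tutorial's train_func: my guess is that the Hugging Face
# Trainer may still try to initialize CUDA unless told otherwise, even
# when the Ray ScalingConfig has use_gpu=False.
training_args = TrainingArguments(
    output_dir="out",
    no_cuda=True,  # or use_cpu=True on newer transformers versions
)
```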