Module not found in tuning jobs

pamparana · April 15, 2022, 6:38am

I am using Ray Tune to optimize a deep learning model.

I am currently getting an error like:

(TemporaryActor pid=90906) Traceback (most recent call last):
(TemporaryActor pid=90906)   File "/Users/luca/opt/anaconda3/envs/mlmod/lib/python3.9/site-packages/ray/_private/function_manager.py", line 594, in _load_actor_class_from_gcs
(TemporaryActor pid=90906)     actor_class = pickle.loads(pickled_class)
(TemporaryActor pid=90906) ModuleNotFoundError: No module named 'mlmod'

mlmod is my package. I had a similar setup before for optimizing time series models, and that always worked.

So, my code is something like:

ray.init(ignore_reinit_error=True)
result = tune.run(
    tune.with_parameters(train_model, data=data, hydra_config=config, hydra_state=state),
    resources_per_trial=resources_per_trial,
    config=search_config,
    num_samples=num_samples,
    metric="loss",
    mode="min",
    scheduler=scheduler,
    # TODO: We will probably need to add this if we run ray on the cloud.
    # sync_config=tune.SyncConfig(upload_dir="s3://something"),
    resume="AUTO",
)

def train_model(ray_config, data, hydra_config: DictConfig, hydra_state: Any):
    # required to avoid  https://github.com/facebookresearch/hydra/issues/903
    Singleton.set_state(hydra_state)
    # map ray tune parameters to hydra parameters
    for param, value in ray_config.items():
        OmegaConf.update(hydra_config, param, value, merge=False)

    
    from mlmod.apps.train import train
    loss = train(hydra_config, None)
    tune.report(loss=loss)

The train function it calls currently just does:

# file: mlmod/apps/train.py
def train(config: DictConfig, datamodule: LightningDataModule) -> float:
    import numpy as np

    # Placeholder: return a random loss until real training is implemented.
    return np.random.random()

I do not quite understand what is being serialized here or why this error occurs. I am at a loss as to what to try next and how to debug this.

amogkam · April 18, 2022, 9:02pm

Hey @pamparana thanks for raising the issue! Can you tell me a bit more about your setup? Is this being run on multiple nodes? Is mlmod installed on every single node?
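To illustrate why that question matters: the traceback above fails in `pickle.loads(pickled_class)`, and standard pickling stores a class from an importable module by reference (module name plus class name), not by value. The process that unpickles it must therefore be able to import that module. The following is a minimal sketch using only the standard library; it does not use Ray, and the module name `fakemod_for_pickle_demo` is a hypothetical stand-in for a package such as mlmod.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a user package such as mlmod.
MODULE_NAME = "fakemod_for_pickle_demo"


def pickle_then_hide_module() -> str:
    """Pickle a class, make its module unimportable, and return the unpickling error."""
    with tempfile.TemporaryDirectory() as tmp:
        # Create a tiny module on disk and make it importable.
        Path(tmp, f"{MODULE_NAME}.py").write_text("class Trainer:\n    pass\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module(MODULE_NAME)
            # The payload holds a reference (module name + class name),
            # not the class body itself.
            payload = pickle.dumps(module.Trainer)
        finally:
            # Simulate a worker process that cannot see the package.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)

    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return "unpickled without error"


error_message = pickle_then_hide_module()
print(error_message)
```

If the same thing is happening here, the worker process cannot import mlmod even though the driver can. One thing to check is whether mlmod is installed (for example with `pip install -e .`) in the exact environment the workers run in, rather than only being importable from the directory the script is launched from.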

amogkam · April 18, 2022, 9:15pm

Here are some other threads which might provide useful information:

github.com/ray-project/ray

[tune] ModuleNotFoundError for tuning script

opened 08:57AM - 12 Aug 20 UTC · closed 05:07AM - 15 Aug 20 UTC · guoxuxu · question

### What is the problem?

ModuleNotFoundError: No module named 'functions' …

Does it seem the ray_tune.py file must stay in the same directory with data/ ??? Otherwise, it doesn't work even if I put ```'../'``` in the data loader function...

Specifically, I run the same ray_tune.py in the parent folder and child folder respectively, but only got success when I run under the parent folder. Cannot provide reproducible code here, because I used my own dataset. This issue occurred whenever I switched the ray_tune.py to a child directory (has ```../```)...

*Ray version and other system information (Python version, TensorFlow version, OS):*

Ray 0.8.5
Python 3.7.4
Pytorch 1.6
TensorFlow: 2.2.0
OS linux

![image](https://user-images.githubusercontent.com/29363464/89991868-0937d000-dcb7-11ea-9637-e5bdf15f847b.png)

### Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have **no external library dependencies** (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [ ] I have verified the issue also occurs with the [latest wheels](https://docs.ray.io/en/latest/installation.html).

github.com/ray-project/ray

Ray Cluster ModuleNotFoundError

opened 12:55PM - 04 Sep 19 UTC · closed 10:17PM - 21 Jan 20 UTC · mynkpl1998

### System information

- **OS Platform and Distribution**: Ubuntu 16.04.2 LTS
- **Ray installed from (source or binary)**: Binary
- **Ray version**: 0.7.2
- **Python version**: 3.6.8

I am trying to build a manual cluster of the machines with IP Addresses. However, When I tried to run the PPO algorithm on the cluster I got an error message from one of the workers complaining about ModuleNotFoundError: No module named "v2i". Here the main module is my custom gym environment. It looks like ray could not able to sync the files between different nodes. Here is the complete traceback. **wsl** is my worker hostname.

```
Traceback (most recent call last):
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 436, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 323, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/worker.py", line 2195, in get
    raise value
ray.exceptions.RayTaskError: ray_PPO:train() (pid=30729, host=rlmac)
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 364, in train
    raise e
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/rllib/agents/trainer.py", line 353, in train
    result = Trainable.train(self)
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/tune/trainable.py", line 150, in train
    result = self._train()
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/rllib/agents/trainer_template.py", line 126, in _train
    fetches = self.optimizer.step()
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 130, in step
    self.num_envs_per_worker, self.train_batch_size)
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/rllib/optimizers/rollout.py", line 29, in collect_samples
    next_sample = ray_get_and_free(fut_sample)
  File "/home/mayank/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
ray.exceptions.RayTaskError: ray_RolloutWorker:sample() (pid=11974, host=wsl)
  File "pyarrow/serialization.pxi", line 461, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 424, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 275, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
  File "/media/win/MayankPal/miniconda3/envs/v2i/lib/python3.6/site-packages/ray/cloudpickle/cloudpickle.py", line 965, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'v2i'
```

### Describe the problem

### Source code / logs

* First start the ray head `ray start --head --redis-port=6666 --num-cpus=22 --num-gpus=1`
* Start ray on worker machine with above redis address `ray start --redis-address=xxx.xxx.xxx.xxx:6666`
* Start PPO training `python train.py`

In general, it is recommended not to rely on relative paths or relative imports with Ray Tune, since the working directory of the training function is changed and is not the same as the driver's working directory.
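As a rough illustration of the working-directory point, here is a small standard-library sketch (no Ray involved). It simulates the change of directory with `os.chdir`; the directory names `project` and `trial_0` and the file `data/train.txt` are made up for the example. A relative path stops resolving once the working directory changes, while a path resolved to an absolute one beforehand keeps working.

```python
import os
import tempfile
from pathlib import Path


def compare_relative_and_absolute() -> dict:
    """Show that a relative path breaks after a chdir while a resolved one does not."""
    original_cwd = Path.cwd()
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp, "project")    # hypothetical driver directory
        trial_dir = Path(tmp, "trial_0")  # hypothetical per-trial directory
        (project / "data").mkdir(parents=True)
        trial_dir.mkdir()
        (project / "data" / "train.txt").write_text("1,2,3\n")

        relative = Path("data/train.txt")
        try:
            os.chdir(project)
            absolute = relative.resolve()  # resolve once, in the "driver" directory
            os.chdir(trial_dir)            # simulate the changed working directory
            results["relative_exists"] = relative.exists()
            results["absolute_exists"] = absolute.exists()
        finally:
            os.chdir(original_cwd)         # leave the process as we found it
    return results


results = compare_relative_and_absolute()
print(results)
```

Under that assumption, resolving data paths to absolute paths on the driver before passing them into the training function (for example through `tune.with_parameters`, as in the original post) avoids depending on whatever directory the trial ends up running in.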
