Module not found in tuning jobs

I am using Ray Tune to optimize a deep learning model.

I am currently getting an error like:

(TemporaryActor pid=90906) Traceback (most recent call last):
(TemporaryActor pid=90906)   File "/Users/luca/opt/anaconda3/envs/mlmod/lib/python3.9/site-packages/ray/_private/function_manager.py", line 594, in _load_actor_class_from_gcs
(TemporaryActor pid=90906)     actor_class = pickle.loads(pickled_class)
(TemporaryActor pid=90906) ModuleNotFoundError: No module named 'mlmod'

mlmod is my package's module. I had a similar setup before when optimizing time series models, and that always worked.

So, my code is something like:

from typing import Any

import ray
from hydra.core.singleton import Singleton
from omegaconf import DictConfig, OmegaConf
from ray import tune


def train_model(ray_config, data, hydra_config: DictConfig, hydra_state: Any):
    # required to avoid https://github.com/facebookresearch/hydra/issues/903
    Singleton.set_state(hydra_state)
    # map ray tune parameters to hydra parameters
    for param, value in ray_config.items():
        OmegaConf.update(hydra_config, param, value, merge=False)

    # imported inside the trainable, so it is resolved in the worker process
    from mlmod.apps.train import train
    loss = train(hydra_config, None)
    tune.report(loss=loss)


ray.init(ignore_reinit_error=True)
result = tune.run(
    tune.with_parameters(train_model, data=data, hydra_config=config, hydra_state=state),
    resources_per_trial=resources_per_trial,
    config=search_config,
    num_samples=num_samples,
    metric="loss",
    mode="min",
    scheduler=scheduler,
    # TODO: We will probably need to add this if we run ray on the cloud.
    # sync_config=tune.SyncConfig(upload_dir="s3://something"),
    resume="AUTO",
)

The train function it calls currently just does:

# file: mlmod/apps/train.py
def train(config: DictConfig, datamodule: LightningDataModule) -> float:
    import numpy as np

    # placeholder body: return a random value as the loss
    return np.random.random()

I do not quite understand what is being serialized here or why this is happening. I am at a loss as to what to try next and how to debug this.

Hey @pamparana thanks for raising the issue! Can you tell me a bit more about your setup? Is this being run on multiple nodes? Is mlmod installed on every single node?
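One quick way to answer the "is it installed" question is to check, from the same Python environment the workers use, whether the package can be found at all and where it resolves from. The helper below is only a sketch: `describe_import` is a hypothetical name, not part of Ray, and it uses nothing but the standard library. If `origin` points into your project folder rather than site-packages, the import probably only works because of the current directory.

```python
import importlib.util
import os
import sys


def describe_import(module_name: str) -> dict:
    """Report whether a top-level module can be located, and from where.

    Hypothetical helper for debugging; pass a top-level name such as "mlmod".
    """
    spec = importlib.util.find_spec(module_name)
    return {
        "module": module_name,
        "found": spec is not None,
        # File the module would be loaded from, if any.
        "origin": getattr(spec, "origin", None),
        # The working directory and interpreter affect how imports resolve.
        "cwd": os.getcwd(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    print(describe_import("mlmod"))
```

Running this once on the driver and once inside the training function (printing the result) should show whether the two processes see the package differently.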

Here are some other threads which might provide some useful information

In general, it is best not to rely on relative paths or imports with Ray Tune, because the working directory of the training function is changed and is not the same as the one on the driver.
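If installing the package into the environment (for example with `pip install -e .`) is not an option, one thing that may help is shipping it to the workers through Ray's `runtime_env` with the `py_modules` field, using an absolute path. This is a sketch under assumptions: `build_runtime_env` is a hypothetical helper, the path in the comment is a placeholder, and whether this resolves the error depends on your setup and Ray version.

```python
from pathlib import Path


def build_runtime_env(package_dir: str) -> dict:
    """Build a runtime_env dict that ships a local package directory to workers.

    Hypothetical helper; the path is resolved to an absolute one so it does
    not depend on the current working directory.
    """
    package_path = Path(package_dir).resolve()
    if not package_path.is_dir():
        raise FileNotFoundError(f"Package directory not found: {package_path}")
    return {"py_modules": [str(package_path)]}


# Assumed usage on the driver (placeholder path, adjust to your checkout):
# import ray
# ray.init(
#     runtime_env=build_runtime_env("/abs/path/to/mlmod"),
#     ignore_reinit_error=True,
# )
```

Resolving the path up front keeps the driver from passing a location that only makes sense relative to wherever the script happened to be started.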