TorchTrainer fails if train_func imports functions from a different file

I am trying to run Ray training on a cluster, based on the sample scripts from the documentation. I am hitting an error: if any function used by the training function is imported from another module, the code fails. Below are two very simplified versions that reproduce the issue. The first one, which has no such import, works; the second one fails:

Works:

import ray
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
import logging

ray.init(address='ray://xxx.xxx.xxx.xxx:10001')

def sample_func(x):
    return x + 10

def train_func(config):
    for i in range(config["num_epochs"]):
        session.report({"epoch": i,
                        "val": sample_func(i)
                       })

config = {'num_epochs': 3}

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False)
)
result = trainer.fit()

Does not work:

import ray
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
import logging
from utils import sample_func

ray.init(address='ray://xxx.xxx.xxx.xxx:10001')


def train_func(config):
    for i in range(config["num_epochs"]):
        session.report({"epoch": i,
                        "val": sample_func(i)
                       })

config = {'num_epochs': 3}

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False)
)
result = trainer.fit()

I get a

ModuleNotFoundError: No module named 'utils'

in the second example. Can someone explain how to avoid this issue when launching cluster jobs? I am using Ray 2.1, and the scripts are executed from a JupyterLab notebook.

Thanks in advance.

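The training function runs on the worker nodes, and an imported function such as sample_func is serialized by reference, so each worker has to be able to import utils itself. You’d need to create an installable package and install it on every node in your cluster, or include the module in a Ray runtime environment.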
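For the runtime environment route, here is a minimal sketch, assuming utils.py sits in the folder the notebook is launched from. The helper names build_runtime_env and connect are hypothetical, not part of Ray; working_dir is the runtime_env field that uploads a local directory to the cluster so workers can import from it. Treat this as a starting point rather than a verified fix for every Ray Client setup.

```python
from pathlib import Path


def build_runtime_env(project_dir: str) -> dict:
    """Return a runtime_env dict that ships project_dir to the workers.

    Hypothetical helper: it only builds the dict, it does not talk to Ray.
    """
    root = Path(project_dir).resolve()
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    # "working_dir" tells Ray to upload this folder; workers run inside
    # the uploaded copy, so "from utils import sample_func" can resolve.
    return {"working_dir": str(root)}


def connect(address: str, project_dir: str = "."):
    """Connect to the cluster with the project folder attached."""
    import ray  # imported lazily so the helper above works without Ray

    return ray.init(address=address, runtime_env=build_runtime_env(project_dir))


# Example call, using the placeholder address from the question:
# connect("ray://xxx.xxx.xxx.xxx:10001")
```

The runtime environment has to be passed when the connection is first made, so in a notebook this would replace the existing ray.init(...) cell rather than follow it.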

Thanks @Yard1 - would copying the modules and setting PYTHONPATH correctly on each worker node also work?

Hey @Asad_Hasan, that should work, yes. You can use Ray runtime environments (Environment Dependencies — Ray 2.1.0) to make sure the env var is set correctly across the cluster.
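A rough sketch of the PYTHONPATH approach via a runtime environment, assuming the modules have already been copied to the same location on every node. The directory /opt/project and the helper name pythonpath_runtime_env are placeholders for illustration; env_vars is the runtime_env field for setting environment variables on workers. The sketch also assumes Linux nodes, hence the ":" separator.

```python
def pythonpath_runtime_env(module_dirs):
    """Build a runtime_env dict that sets PYTHONPATH for worker processes.

    Hypothetical helper. module_dirs are paths as they exist ON THE NODES,
    not on the machine running the notebook.
    """
    # ":" is the PYTHONPATH separator on Linux (assumed cluster OS).
    # A value set this way likely replaces any PYTHONPATH already
    # defined on the nodes, so list every directory that is needed.
    return {"env_vars": {"PYTHONPATH": ":".join(module_dirs)}}


runtime_env = pythonpath_runtime_env(["/opt/project"])  # placeholder path

# Then, with the placeholder address from the question:
# ray.init(address="ray://xxx.xxx.xxx.xxx:10001", runtime_env=runtime_env)
```

Unlike uploading a working directory, this does not move any files: it only helps if utils.py is already present under the listed directory on each node.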

Thanks for the quick clarification!

Let us know how that goes!

Thanks a lot @Yard1 - I was able to reach a solution similar to the one you mentioned: copying the files and folders to each of the nodes and setting PYTHONPATH correctly.
