TorchTrain fails if train_func imports functions from a different file

indrajitsg · November 23, 2022, 7:19am

I am trying to run Ray Train on a cluster, based on the sample scripts from the documentation. I get an error whenever any function is imported from another module. Below are two very simplified versions that reproduce the issue: the first, which has no such import, works, but the second fails:

Works:

import ray
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
import logging

ray.init(address='ray://xxx.xxx.xxx.xxx:10001')

def sample_func(x):
    return x + 10

def train_func(config):
    for i in range(config["num_epochs"]):
        session.report({"epoch": i,
                        "val": sample_func(i)
                       })

config = {'num_epochs': 3}

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False)
)
result = trainer.fit()

Does not work:

import ray
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
import logging
from utils import sample_func

ray.init(address='ray://xxx.xxx.xxx.xxx:10001')


def train_func(config):
    for i in range(config["num_epochs"]):
        session.report({"epoch": i,
                        "val": sample_func(i)
                       })

config = {'num_epochs': 3}

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False)
)
result = trainer.fit()

I get a

ModuleNotFoundError: No module named 'utils'

in the second example. Can someone explain how to avoid this issue when launching cluster jobs? I am using Ray 2.1 and the scripts are executed from a JupyterLab notebook.

Thanks in advance.

Yard1 · November 29, 2022, 6:24pm

You’d need to create an installable package and install it on every node in your cluster or include it in a Ray runtime environment.
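For the runtime environment option, one possibility is the `working_dir` field, which asks Ray to upload a local directory to the cluster so that workers can import modules from it. A minimal sketch, assuming `utils.py` sits in the directory the notebook runs from; `make_runtime_env` and `connect` are made-up helper names, and the address is the placeholder from the question:

```python
import os


def make_runtime_env(project_dir="."):
    """Build a runtime_env dict that ships project_dir to the cluster.

    Hypothetical helper: only the "working_dir" key belongs to Ray.
    """
    project_dir = os.path.abspath(project_dir)
    if not os.path.isdir(project_dir):
        raise ValueError(f"not a directory: {project_dir}")
    return {"working_dir": project_dir}


def connect(address, project_dir="."):
    # Imported lazily so make_runtime_env can be used without Ray installed.
    import ray

    return ray.init(address=address, runtime_env=make_runtime_env(project_dir))


# Usage (not run here; needs a reachable cluster):
# connect("ray://xxx.xxx.xxx.xxx:10001")
```

With this in place, `from utils import sample_func` inside the training script should resolve on the workers too, because they start from their own copy of the uploaded directory.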

Asad_Hasan · November 29, 2022, 7:25pm

Thanks @Yard1 - would copying the modules and setting PYTHONPATH correctly on each worker node also work?

Yard1 · November 29, 2022, 7:42pm

Hey @Asad_Hasan, that should work, yes. You can use Ray runtime environments (Environment Dependencies — Ray 2.1.0) to make sure the env var is set correctly across the cluster.
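A runtime environment can carry the variable through its `env_vars` field. A sketch, assuming the modules were already copied to the same path on every node; `/opt/shared_code` is a made-up path and `pythonpath_env` is a hypothetical helper:

```python
def pythonpath_env(*module_dirs):
    """Return a runtime_env dict that sets PYTHONPATH for Ray workers."""
    if not module_dirs:
        raise ValueError("need at least one directory")
    # The paths are resolved on the worker nodes, so they are joined
    # with ":" (this assumes Linux or macOS nodes).
    return {"env_vars": {"PYTHONPATH": ":".join(module_dirs)}}


runtime_env = pythonpath_env("/opt/shared_code")  # made-up path
# Not run here; needs a reachable cluster:
# ray.init(address="ray://xxx.xxx.xxx.xxx:10001", runtime_env=runtime_env)
```

Note that, by default, a value set this way replaces any PYTHONPATH already defined for the workers rather than appending to it.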

Asad_Hasan · November 29, 2022, 8:00pm

Thanks for the quick clarification!

Yard1 · November 29, 2022, 8:03pm

Let us know how that goes!

indrajitsg · November 30, 2022, 3:13pm

Thanks a lot @Yard1 - I was able to reach a solution similar to what you had mentioned: copying the files and folders to each of the nodes and setting PYTHONPATH correctly.
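To confirm that the copied module is actually visible, a small standard-library check like the one below could be run on each node. This is only a sketch: `module_visible` is a made-up name, and it is meant for top-level module names such as `utils`.

```python
import importlib.util
import socket


def module_visible(name):
    """Report whether top-level module `name` is on this machine's import path."""
    found = importlib.util.find_spec(name) is not None
    return socket.gethostname(), found


# Local check with a standard-library module. On a cluster, the same
# function could be wrapped with ray.remote and run as a task on the workers.
host, found = module_visible("json")
```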
