I am trying to run Ray training on a cluster, based on the sample scripts from the documentation. The problem: if any function used inside the training function is imported from another module, the code fails. Below are two simplified versions that reproduce the issue. The first, which has no import, works; the second, which imports `sample_func` from `utils`, fails:
Works:
import ray
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
import logging
ray.init(address='ray://xxx.xxx.xxx.xxx:10001')
def sample_func(x):
    return x + 10

def train_func(config):
    for i in range(config["num_epochs"]):
        session.report({"epoch": i,
                        "val": sample_func(i)
                        })

config = {'num_epochs': 3}
trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False)
)
result = trainer.fit()
Does not work:
import ray
from ray.air import session, ScalingConfig
from ray.train.torch import TorchTrainer
import logging
from utils import sample_func
ray.init(address='ray://xxx.xxx.xxx.xxx:10001')
def train_func(config):
    for i in range(config["num_epochs"]):
        session.report({"epoch": i,
                        "val": sample_func(i)
                        })

config = {'num_epochs': 3}
trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False)
)
result = trainer.fit()
The second example fails with:
ModuleNotFoundError: No module named 'utils'
Can someone explain how to avoid this when launching jobs on a cluster? I am using Ray 2.1, and the scripts are executed from a JupyterLab notebook.
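To narrow things down, here is a Ray-free sketch of what I suspect is happening. This is my assumption, not something I have confirmed in Ray's internals. A function that lives in an importable module is serialized by reference (module name plus function name), so whichever process deserializes it must be able to import that module. A function defined in the notebook itself lives in `__main__`, and cloudpickle (which Ray uses) ships it by value, which would explain why the first example works. The sketch uses the standard `pickle` module and a hypothetical throwaway module named `demo_utils` as a stand-in for my `utils.py`; the "worker" is simulated by removing the module from the import path.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# demo_utils is a hypothetical stand-in for the real utils.py next to the notebook.
with tempfile.TemporaryDirectory() as project_dir:
    Path(project_dir, "demo_utils.py").write_text(
        "def sample_func(x):\n    return x + 10\n"
    )

    # "Driver" side: the module is importable, so pickling succeeds.
    sys.path.insert(0, project_dir)
    importlib.invalidate_caches()
    demo_utils = importlib.import_module("demo_utils")
    payload = pickle.dumps(demo_utils.sample_func)

    # The payload carries the module and function names, not the function body.
    payload_has_module_name = b"demo_utils" in payload

    # Simulated "worker" side: same bytes, but the module is not on the import path.
    sys.path.remove(project_dir)
    del sys.modules["demo_utils"]
    importlib.invalidate_caches()
    try:
        pickle.loads(payload)
        worker_error = None
    except ModuleNotFoundError as exc:
        worker_error = str(exc)

print(payload_has_module_name, worker_error)
```

If this is indeed the cause, the workers need their own copy of `utils.py`. The `runtime_env` argument of `ray.init` (for example `runtime_env={"working_dir": "."}` or a `py_modules` entry) looks like it is meant for shipping local files to the cluster, but I have not verified that it resolves the error in my setup.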
Thanks in advance.