I am using a two-machine Ray cluster and debugging from the master node. The issue is that a module (sage.py in what follows) is not updated on the slave node.
In the main script, I define this remote actor with some debug output:
@ray.remote
class Client(object):
    def __init__(self, args, rank):
        self.args = args
        self.rank = rank

    def work(self):
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=1,
            rdzv_endpoint="localhost:0",
            rdzv_backend="c10d",
        )
        print('DDDDDD', flush=True)
        print([x for x in dir(sage) if 'e' in x])
        sage.test()
        outputs = elastic_launch(config, sage.sage_main_routine)(
            Path(FOLDER), args, self.rank
        )
        return True
Running it produces the following output:
Traceback (most recent call last):
  File "train_ray.py", line 174, in <module>
    ray.get([client1.work.remote(), client2.work.remote()])
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/worker.py", line 1481, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::Client.work() (pid=7905, ip=10.231.21.63)
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "train_ray.py", line 104, in work
AttributeError: module 'sage' has no attribute 'test'
(pid=None, ip=10.231.21.63) DDDDDD
**(pid=None, ip=10.231.21.63) ['DataLoader', 'DistDataLoader', 'NeighborSampler', 'Process', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'argparse', 'compute_acc', 'create_model', 'evaluate', 'load_subtensor', 'register_data_args', 'time']**
(pid=None) DDDDDD
**(pid=None) ['DataLoader', 'DistDataLoader', 'NeighborSampler', 'Process', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'argparse', 'compute_acc', 'create_model', 'evaluate', 'load_subtensor', 'register_data_args', 'sage_main_routine', 'test', 'time']**
Note the difference between the two dir() outputs: the one from the master node contains "test" and "sage_main_routine", but the one from the slave node (ip=10.231.21.63) does not. It looks like an older version of the sage.py module is still being picked up on the slave node. Do I need to clear a cache somewhere? Any suggestions?
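
To narrow this down, a small helper along the following lines could be called from inside work() on each node. It reports which file a module was actually imported from and a short hash of that file, so a stale copy on one machine becomes visible. This is only a sketch: module_fingerprint is a hypothetical helper name (not part of Ray), and it assumes the module is loaded from a regular source file on disk.

```python
import hashlib
import importlib
import socket
from pathlib import Path


def module_fingerprint(module_name):
    """Describe where a module was loaded from on the current machine.

    Hypothetical debugging helper, not a Ray API. Assumes the module
    has a __file__ that points at a readable file.
    """
    module = importlib.import_module(module_name)
    path = Path(module.__file__).resolve()
    # Hash the file contents so two nodes can be compared at a glance.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "host": socket.gethostname(),
        "module": module_name,
        "path": str(path),
        "sha256": digest[:12],
        # Public names only, sorted for easy diffing between nodes.
        "attrs": sorted(n for n in dir(module) if not n.startswith("_")),
    }


if __name__ == "__main__":
    # Demonstrated on a stdlib module; inside work() this would be "sage".
    print(module_fingerprint("json"))
```

If print(module_fingerprint("sage"), flush=True) in work() shows a different path or sha256 on the two nodes, the slave is importing its own local copy of sage.py rather than the edited one.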