Why is a .py module not updated on the slave node?

I am using a 2-machine cluster and debugging from the master node. The issue is that a submodule (called sage.py below) is not updated on the slave node.

In the main routine, I define this remote actor with debug output:

from pathlib import Path

import ray
# LaunchConfig/elastic_launch assumed to come from the torch elastic launcher API (PyTorch >= 1.9)
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

import sage


@ray.remote
class Client(object):
    def __init__(self, args, rank):
        self.args = args
        self.rank = rank

    def work(self):
        # one single-process elastic launch per actor
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=1,
            rdzv_endpoint="localhost:0",
            rdzv_backend="c10d",
        )

        print('DDDDDD', flush=True)
        print([x for x in dir(sage) if 'e' in x])
        sage.test()
        outputs = elastic_launch(config, sage.sage_main_routine)(
            Path(FOLDER), self.args, self.rank
        )

        return True

And I got output like this:

Traceback (most recent call last):
  File "train_ray.py", line 174, in <module>
    ray.get([client1.work.remote(), client2.work.remote()])
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/worker.py", line 1481, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::Client.work() (pid=7905, ip=10.231.21.63)
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "train_ray.py", line 104, in work
AttributeError: module 'sage' has no attribute 'test'
(pid=None, ip=10.231.21.63) DDDDDD
**(pid=None, ip=10.231.21.63) ['DataLoader', 'DistDataLoader', 'NeighborSampler', 'Process', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'argparse', 'compute_acc', 'create_model', 'evaluate', 'load_subtensor', 'register_data_args', 'time']**
(pid=None) DDDDDD
**(pid=None) ['DataLoader', 'DistDataLoader', 'NeighborSampler', 'Process', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'argparse', 'compute_acc', 'create_model', 'evaluate', 'load_subtensor', 'register_data_args', 'sage_main_routine', 'test', 'time']**

See the difference between the two dir() calls: the master node has “test” and “sage_main_routine” in it, but the slave does not. It looks like an older version of the sage.py module got stuck somewhere in the system, and we have to clean up a cache somewhere? Suggestions?
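A minimal check, using only the standard library, would be to print where the module was actually loaded from inside work(), so the master and slave outputs can be compared side by side (report_module_source is just an illustrative helper name):

import os
import time

def report_module_source(mod):
    # Print where this module was imported from and when that file last changed.
    path = getattr(mod, "__file__", None)
    mtime = time.ctime(os.path.getmtime(path)) if path else "n/a"
    print(f"{mod.__name__} loaded from {path} (mtime {mtime})", flush=True)

# e.g. call report_module_source(sage) at the top of Client.work()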

Looks like I have to update the sub-module .py file on the slave node, at the same path.

Are we supposed to always keep all the sub-module .py files synced up on all nodes?

Yes, you are right! I believe the rsync option in cluster.yaml does this, but you should make sure all the paths are set properly on all nodes. The new runtime env APIs will solve this, but they are not GA yet. (Advanced Usage — Ray v2.0.0.dev0)
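For reference, the file_mounts section of cluster.yaml is what the cluster launcher rsyncs from the local machine to every node, so listing the directory that contains sage.py there keeps all nodes on the same copy. With the newer runtime_env API the same thing can be done per job; a minimal sketch, assuming a Ray version where runtime_env is available and that sage.py sits in the driver's working directory:

import ray

# Ship the driver's working directory (including sage.py) to every node,
# so remote actors import this copy instead of whatever is on the node's disk.
ray.init(address="auto", runtime_env={"working_dir": "."})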
