Why py module not updated on the slave node?

HuangLED · June 9, 2021, 8:36pm

I am using a two-machine cluster and debugging from the master node. The issue is that a submodule (called sage.py below) is not updated on the slave node.

In the main routine, I define this remote actor with debug output:

@ray.remote
class Client(object):
    def __init__(self, args, rank):
        self.args = args
        self.rank = rank

    def work(self):
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=1,
            rdzv_endpoint="localhost:0",
            rdzv_backend="c10d",
        )

        print('DDDDDD', flush=True)
        print([x for x in dir(sage) if 'e' in x])
        sage.test()
        outputs = elastic_launch(config, sage.sage_main_routine)(
            Path(FOLDER), args, self.rank
        )

        return True

And I got output like this:

Traceback (most recent call last):
  File "train_ray.py", line 174, in <module>
    ray.get([client1.work.remote(), client2.work.remote()])
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/worker.py", line 1481, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::Client.work() (pid=7905, ip=10.231.21.63)
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "train_ray.py", line 104, in work
AttributeError: module 'sage' has no attribute 'test'
(pid=None, ip=10.231.21.63) DDDDDD
(pid=None, ip=10.231.21.63) ['DataLoader', 'DistDataLoader', 'NeighborSampler', 'Process', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'argparse', 'compute_acc', 'create_model', 'evaluate', 'load_subtensor', 'register_data_args', 'time']
(pid=None) DDDDDD
(pid=None) ['DataLoader', 'DistDataLoader', 'NeighborSampler', 'Process', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'argparse', 'compute_acc', 'create_model', 'evaluate', 'load_subtensor', 'register_data_args', 'sage_main_routine', 'test', 'time']

Note the difference between the two dir() calls. The master node's output includes “test” and “sage_main_routine”, but the slave node's does not. It looks like an older version of the sage.py module is stuck somewhere in the system. Do we have to clean up a cache somewhere? Suggestions?
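One way to narrow this down is to print, from inside the actor, which file each node actually imported and a hash of its contents. The sketch below is only an illustration: module_fingerprint is a hypothetical helper, not part of Ray, and it assumes the module has a readable __file__ on disk.

```python
import hashlib
import importlib
import socket


def module_fingerprint(module_name):
    """Report where a module was loaded from on this machine and hash its file."""
    module = importlib.import_module(module_name)
    path = getattr(module, "__file__", None)
    digest = None
    if path is not None:
        # Hash the file contents so two nodes can be compared byte for byte.
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
    return {"host": socket.gethostname(), "path": path, "sha256": digest}


# Inside Client.work() this could be called as, for example:
#     print(module_fingerprint("sage"), flush=True)
# The standard-library module "json" stands in here so the sketch runs anywhere.
info = module_fingerprint("json")
print(info)
```

If the two nodes report different hashes for the same module name, each machine is importing its own local copy of the file, which would point to out-of-sync files rather than a cache.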

HuangLED · June 10, 2021, 1:19am

It looks like I have to update the submodule .py file on the slave node, at the same path.

Are we supposed to always keep all the submodule .py files in sync on all nodes?

sangcho · June 16, 2021, 11:51pm

Yes, you are right! I believe the rsync option in cluster.yaml does this, but you should make sure the paths are set up correctly on all nodes. The new runtime env APIs will solve this, but they are not GA yet. (Advanced Usage — Ray v2.0.0.dev0)
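As a rough sketch of the runtime env approach: in Ray versions that support runtime environments, a working_dir entry asks Ray to ship the driver's project directory to the workers, so sage.py would not need to be copied by hand. The helper names build_runtime_env and connect are hypothetical, and the sketch assumes sage.py lives in the directory passed in and that a cluster is already running.

```python
import os


def build_runtime_env(project_dir):
    """Build a runtime_env dict that asks Ray to ship project_dir to the workers."""
    return {"working_dir": os.path.abspath(project_dir)}


def connect(project_dir="."):
    # Imported lazily so the helper above can be used without Ray installed.
    import ray

    # Assumption: a Ray version with runtime environment support and an
    # already running cluster, so address="auto" can find it.
    ray.init(address="auto", runtime_env=build_runtime_env(project_dir))


env = build_runtime_env(".")
print(env)
```

Calling connect() from the directory that holds train_ray.py and sage.py would then replace a plain ray.init() in the driver script. Whether this is available depends on the installed Ray version.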
