$ ray start --head --port=6379
...
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379' --redis-password='<password>'
Then, I ran my script on the head node, and the following error occurred.
2021-04-23 20:47:33,509 WARNING worker.py:1107 -- Failed to unpickle the remote function 'sampler_multi.one_episode' with function ID fd6574e8a41423133de16b0bbc6a11c71911311819a020d0ab918b91. Traceback:
Traceback (most recent call last):
File "/home/temp_user/.conda/envs/cluster/lib/python3.7/site-packages/ray/function_manager.py", line 180, in fetch_and_register_remote_function
function = pickle.loads(serialized_function)
ModuleNotFoundError: No module named 'sampler_multi'
sampler_multi is one of my script files. I guess the error is caused by the script not being present on the worker nodes, but I don’t know how to distribute the script to all of them.
I’m looking forward to your answer. Thanks.
Right now, if you’re setting up a cluster manually, you’d also have to sync code files manually.
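For context on why the file has to exist on every node: standard pickle serializes a top-level function by reference (module name plus qualified name), not by value, so the unpickling side must be able to import that module. A minimal sketch with plain pickle (the function name here just mirrors the one in the traceback):

```python
import pickle

# Top-level functions are pickled as a reference (module + name),
# not as bytecode, so the process that calls pickle.loads must be
# able to import the same module -- hence ModuleNotFoundError on
# workers that don't have the file.
def one_episode():
    return "done"

payload = pickle.dumps(one_episode)
restored = pickle.loads(payload)
assert restored() == "done"
# The name reference is visible right in the pickle stream.
assert b"one_episode" in payload
```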
There’s ongoing work that will soon allow Ray to handle file syncing internally.
In the meantime, another alternative is to use the Ray autoscaler, which has a file_mounts setting for this purpose.
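A minimal sketch of what the file_mounts section of a cluster YAML looks like (the paths below are placeholders, not taken from your setup): it maps a path on each node to a local path on the machine running ray up, and the autoscaler copies the files over when nodes start.

```yaml
# cluster.yaml (fragment) -- paths are examples only
file_mounts:
    # <path on each remote node>: <path on the local machine>
    "/home/temp_user/project": "./project"
```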
I am trying to configure the .yaml to sync code files, but in fact the .yaml only starts the head node, not the worker nodes, on my private cluster.
Is this a problem with my configuration?
And another question: how do I sync code files manually? I’m not clear on where to put my code on the worker nodes.