$ ray start --head --port=6379
...
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379' --redis-password='<password>'
Then, I ran my script on the head node, and the following error occurred.
2021-04-23 20:47:33,509 WARNING worker.py:1107 -- Failed to unpickle the remote function 'sampler_multi.one_episode' with function ID fd6574e8a41423133de16b0bbc6a11c71911311819a020d0ab918b91. Traceback:
Traceback (most recent call last):
File "/home/temp_user/.conda/envs/cluster/lib/python3.7/site-packages/ray/function_manager.py", line 180, in fetch_and_register_remote_function
function = pickle.loads(serialized_function)
ModuleNotFoundError: No module named 'sampler_multi'
sampler_multi is one of my script files. I guess the error is caused by the script not being present on the worker nodes, but I don’t know how to distribute the script to all of them.
I’m looking forward to your answer. Thanks.
Right now, if you’re setting up a cluster manually, you’d also have to sync code files manually.
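For context on why the file has to exist on every node: standard pickle serializes a top-level function by reference (module name plus qualified name), not by value, so the unpickling side must be able to import that module. A minimal sketch with plain pickle (the function name here just mirrors the one in the traceback):

```python
import pickle

# Top-level functions are pickled as a reference (module + name),
# not as bytecode, so the process that calls pickle.loads must be
# able to import the same module -- hence ModuleNotFoundError on
# workers that don't have the file.
def one_episode():
    return "done"

payload = pickle.dumps(one_episode)
restored = pickle.loads(payload)
assert restored() == "done"
# The name reference is visible right in the pickle stream.
assert b"one_episode" in payload
```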
There’s ongoing work that will soon allow Ray to handle file syncing internally.
In the meantime, another alternative is to use the Ray autoscaler, which has a file_mounts setting for this purpose.
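A minimal sketch of what the file_mounts section of a cluster YAML looks like (the paths below are placeholders, not taken from your setup): it maps a path on each node to a local path on the machine running ray up, and the autoscaler copies the files over when nodes start.

```yaml
# cluster.yaml (fragment) -- paths are examples only
file_mounts:
    # <path on each remote node>: <path on the local machine>
    "/home/temp_user/project": "./project"
```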
I am trying to configure the .yaml to sync code files, but in fact the .yaml only starts the head node, not the worker nodes, on my private cluster.
Is this a problem with my configuration?
And another question: how do I sync code files manually? I’m not clear on where to put my code on the worker nodes.