@rliaw, I really appreciate your quick answer.
I think there are multiple errors here.
After enabling verbose logging, I may have found what is causing the error.
First, for anyone who hits the same error later: I wasn't sure where to set the environment variable, so I set it in ~/.bashrc on the server machine by adding this line at the end:
TUNE_SYNCER_VERBOSITY=3
and also in the python script:
import os
os.environ["TUNE_SYNCER_VERBOSITY"] = "3"
It turned out that setting it in the Python script alone is enough.
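A minimal sketch of why setting the variable inside the script can be enough: processes started afterwards inherit `os.environ`. It uses only the standard library, and the helper `read_in_child` is a hypothetical name for illustration. I have not checked at which point Ray reads `TUNE_SYNCER_VERBOSITY`, so setting it before importing `ray` is simply the cautious choice.

```python
import os
import subprocess
import sys

# Set the variable before any child process is started
# (and, to be safe, before importing ray).
os.environ["TUNE_SYNCER_VERBOSITY"] = "3"


def read_in_child(name):
    """Start a fresh Python process and return the value it sees for `name`."""
    code = "import os; print(os.environ.get({!r}, ''))".format(name)
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()


child_value = read_in_child("TUNE_SYNCER_VERBOSITY")
print(child_value)
```

A plain `TUNE_SYNCER_VERBOSITY=3` line in ~/.bashrc without `export` would not be inherited this way, which may be why the in-script setting is the one that matters.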
Anyway, here is the error for the syncer:
click.exceptions.ClickException: SSH command failed.
Trial run_for_one_param_and_yield_2938b_00081 completed. Last result:
2021-03-15 05:30:29,071 INFO command_runner.py:357 -- Fetched IP: 172.31.29.20
2021-03-15 05:30:29,071 INFO log_timer.py:27 -- NodeUpdater: i-06a46443bc950cfc0: Got IP [LogTimer=60ms]
2021-03-15 05:30:29,072 VINFO command_runner.py:509 -- Running `mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00`
2021-03-15 05:30:29,072 VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.29.20 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00)'`
Shared connection to 172.31.29.20 closed.
2021-03-15 05:30:29,425 VINFO command_runner.py:509 -- Running `docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/`
2021-03-15 05:30:29,426 VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.29.20 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/)'`
**bash: syntax error near unexpected token `('**
Shared connection to 172.31.29.20 closed.
2021-03-15 05:30:29,644 ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
self._local_dir)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
use_internal_ip=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
all_nodes=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
rsync_to_node(node_id, is_head_node=(node_id == head_node))
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
rsync(source, target, is_file_mount)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
self.cmd_runner.run_rsync_down(source, target, options=options)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
silent=is_rsync_silent())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 517, in run
final_cmd, with_output, exit_on_fail, silent=silent)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
As far as I can see, it fails because the directory path contains the special character '(', which the script should escape with a backslash (or quote) but does not. You may ask why there is an ObjectRef(…) in the directory name. I guess it is because I pass a Ray object reference as a search-space parameter to the function. I know that kind of breaks the rules, but I had no other choice: when I used tune.with_parameters() to pass the extra parameters the normal way, I got a different error: KeyError: 'trial_id'.
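Coming back to the unescaped '(': here is a minimal sketch of the kind of quoting I would expect, using Python's standard `shlex.quote`. The path is shortened and hypothetical, and this is not Ray's actual command-building code, only an illustration of the idea.

```python
import shlex

# Hypothetical, shortened trial directory containing characters
# that a remote shell treats specially.
trial_dir = "/home/ray/ray_results/exp/trial_00081_agdata_ref=ObjectRef(ffff),col_comb"

# Unquoted: the shell sees a bare '(' and reports a syntax error.
unquoted_cmd = "docker cp ray_container:{0}/. /tmp/mount{0}/".format(trial_dir)

# Quoted: each path is passed to the shell as a single literal word.
quoted_cmd = "docker cp {} {}".format(
    shlex.quote("ray_container:" + trial_dir + "/."),
    shlex.quote("/tmp/mount" + trial_dir + "/"),
)

print(unquoted_cmd)
print(quoted_cmd)
```

With the quoted form, splitting the command line again gives back exactly two path arguments, parentheses included.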
Example code for the KeyError: 'trial_id' error:
import pandas as pd
import numpy as np
import ray
from ray import tune
from itertools import combinations

if not ray.is_initialized():
    import multiprocessing
    num_cores = multiprocessing.cpu_count()
    ray.init(num_cpus=num_cores)


def run_for_one_param(params, agdata=None):
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    # I know this is a bit ugly: agdata_ref should not be in params,
    # but we have several of these dynamic parameters.
    #agdata=ray.get(params["agdata_ref"])
    df = agdata.data
    y = df.pop("y")
    X = df
    #for i in range(10):
    #for i in range(100):
    linreg = LinearRegression()
    # I know I should use a train/test/validation split...
    linreg.fit(X, y)
    pred_y = linreg.predict(X)
    report_dict = dict()
    report_dict["score_r2"] = r2_score(y, pred_y)
    return report_dict


def run_for_one_param_and_yield(params, agdata=None):
    result = run_for_one_param(params, agdata)
    print(result)
    return result


class AGData():
    def __init__(self):
        self.storage_dict = dict()

    def set_to_storage(self, key, value):
        self.storage_dict[key] = value

    def get_from_storage(self, key):
        return self.storage_dict[key]

    def set_data(self, data):
        self.data = data


def generate_combinations(features, min_len=1, max_len=10):
    # Collect every combination of the features with min_len..max_len elements.
    all_comb = []
    for i in range(min_len - 1, max_len):
        comb = combinations(features, i + 1)
        for act_comb in list(comb):
            all_comb.append(list(act_comb))
    return all_comb


class Runray():
    def __init__(self):
        self.agdata = AGData()
        # Add more letters here to increase the number of trials.
        self.colums = 'ABCD'  #EFGHIJ'
        df = pd.DataFrame(
            np.random.randint(0, 100, size=(100, len(self.colums) + 1)),
            columns=list(self.colums + "y"))
        self.agdata.set_data(df)

    def get_search_space_ray(self):
        combinations = generate_combinations(list(self.colums))
        print("nr trial will run: ", len(combinations))
        search_space = {
            'col_combinations': tune.grid_search(combinations),
            #'other params': tune.grid_search(some_list)
        }
        #agdata_ref=ray.put(self.agdata)
        #print("new agdata reference put into params.",agdata_ref)
        # I know this is totally ugly, but this is how we pass the reference now:
        #search_space["agdata_ref"]=tune.choice([agdata_ref])
        return search_space

    def run_ray(self):
        # RUN RAY
        analysis = tune.run(
            tune.with_parameters(run_for_one_param_and_yield, agdata=self.agdata),
            config=self.get_search_space_ray(),
            num_samples=1,
            verbose=1,
            #checkpoint_freq=500,
            #checkpoint_at_end=False,
            #keep_checkpoints_num=10,
        )
        print("ray.version: ", ray.__version__)
        print(analysis.results_df)


Runray().run_ray()
That leads to this output, including the error:
== Status ==
Memory usage on this node: 1.4/15.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/68 CPUs, 0/0 GPUs, 0.0/92.41 GiB heap, 0.0/40.24 GiB objects
Result logdir: /home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02
Number of trials: 15/15 (15 TERMINATED)
2021-03-15 05:54:22,085 INFO tune.py:549 -- Total run time: 19.25 seconds (18.40 seconds for the tuning loop).
ray.version: 2.0.0.dev0
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
**KeyError: 'trial_id'**
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ray/ray_example.py", line 135, in <module>
Runray().run_ray()
File "/home/ray/ray_example.py", line 132, in run_ray
print(analysis.results_df)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py", line 508, in results_df
index="trial_id")
File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 1871, in from_records
i = columns.get_loc(index)
File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'trial_id'
Shared connection to 34.218.81.250 closed.
Error: Command failed:
ssh -tt -i /Users/miklostoth/.ssh/ray-autoscaler_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_9fee58ce4e/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.218.81.250 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/ray_example.py)'"'"'"'"'"'"'"'"''"'"' )'
BTW, even the example script above leads to a similar sync error:
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.33.254) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
2021-03-15 05:54:05,511 INFO command_runner.py:357 -- Fetched IP: 172.31.46.205
2021-03-15 05:54:05,512 INFO log_timer.py:27 -- NodeUpdater: i-09be5ad7129024265: Got IP [LogTimer=0ms]
2021-03-15 05:54:05,512 VINFO command_runner.py:509 -- Running `mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02`
2021-03-15 05:54:05,512 VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.46.205 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02)'`
Warning: Permanently added '172.31.46.205' (ECDSA) to the list of known hosts.
Shared connection to 172.31.46.205 closed.
2021-03-15 05:54:06,391 VINFO command_runner.py:509 -- Running `docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['A']_2021-03-15_05-54-03/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['A']_2021-03-15_05-54-03/`
2021-03-15 05:54:06,391 VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.46.205 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['"'"'A'"'"']_2021-03-15_05-54-03/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['"'"'A'"'"']_2021-03-15_05-54-03/)'`
Error: No such container:path: ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=[A]_2021-03-15_05-54-03/.
Shared connection to 172.31.46.205 closed.
2021-03-15 05:54:06,770 ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
self._local_dir)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
use_internal_ip=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
all_nodes=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
rsync_to_node(node_id, is_head_node=(node_id == head_node))
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
rsync(source, target, is_file_mount)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
self.cmd_runner.run_rsync_down(source, target, options=options)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
silent=is_rsync_silent())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 517, in run
final_cmd, with_output, exit_on_fail, silent=silent)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
Trial run_for_one_param_and_yield_84b87_00010 completed. Last result:
2021-03-15 05:54:06,926 INFO command_runner.py:357 -- Fetched IP: 172.31.44.39
2021-03-15 05:54:06,926 INFO log_timer.py:27 -- NodeUpdater: i-0d2d7c6c2b3b12ed9: Got IP [LogTimer=69ms]
2021-03-15 05:54:06,927 VINFO command_runner.py:509 -- Running `mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02`
2021-03-15 05:54:06,927 VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.44.39 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02)'`
Warning: Permanently added '172.31.44.39' (ECDSA) to the list of known hosts.
Shared connection to 172.31.44.39 closed.
2021-03-15 05:54:07,825 VINFO command_runner.py:509 -- Running `docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['A', 'B', 'C']_2021-03-15_05-54-04/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['A', 'B', 'C']_2021-03-15_05-54-04/`
2021-03-15 05:54:07,826 VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.44.39 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['"'"'A'"'"', '"'"'B'"'"', '"'"'C'"'"']_2021-03-15_05-54-04/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['"'"'A'"'"', '"'"'B'"'"', '"'"'C'"'"']_2021-03-15_05-54-04/)'`
**"docker cp" requires exactly 2 arguments.**
See 'docker cp --help'.
Usage: docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH
Copy files/folders between a container and the local filesystem
Shared connection to 172.31.44.39 closed.
2021-03-15 05:54:08,218 ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
self._local_dir)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
use_internal_ip=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
all_nodes=False)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
rsync_to_node(node_id, is_head_node=(node_id == head_node))
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
rsync(source, target, is_file_mount)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
self.cmd_runner.run_rsync_down(source, target, options=options)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
silent=is_rsync_silent())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 517, in run
final_cmd, with_output, exit_on_fail, silent=silent)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
raise click.ClickException(fail_msg) from None
`click.exceptions.ClickException: SSH command failed.`
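The two docker messages above ("No such container:path … [A]" and "requires exactly 2 arguments") look consistent with the remote shell stripping the quotes and splitting on the spaces in the directory name. Below is a rough sketch of that, using `shlex.split` as an approximation of POSIX word splitting. The names are shortened and hypothetical, `remote_argv` is a helper made up for illustration, and a real bash would additionally treat `[A]` as a glob pattern.

```python
import shlex

# Shortened, hypothetical versions of the two directory names from the log.
one_col = "trial_00000_0_col_combinations=['A']_2021-03-15/."
three_cols = "trial_00010_10_col_combinations=['A', 'B', 'C']_2021-03-15/."


def remote_argv(src, dest="/tmp/dest/"):
    """Tokenize an unquoted `docker cp` command line roughly as a POSIX shell would."""
    return shlex.split("docker cp ray_container:{} {}".format(src, dest))


argv_one = remote_argv(one_col)
argv_three = remote_argv(three_cols)

# One column: still two path arguments, but the quotes around A are gone.
print(argv_one)
# Three columns: the spaces split the source path into several arguments.
print(argv_three)
```

In the first case the source path no longer matches the directory on disk; in the second, `docker cp` receives four path-like arguments instead of two.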
So what am I doing wrong here?
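One workaround I am considering, as a sketch only: generate trial directory names that contain nothing but shell-safe characters. The function below is plain Python. It assumes the trial object exposes `trainable_name` and `trial_id` attributes, and that `tune.run` accepts such a callable through a `trial_dirname_creator` argument; I have not verified either assumption against this Ray build. `safe_trial_dirname` and `fake_trial` are hypothetical names.

```python
import re
from types import SimpleNamespace


def safe_trial_dirname(trial):
    """Build a directory name from trial attributes using only shell-safe characters."""
    # Assumption: the trial object has `trainable_name` and `trial_id` attributes.
    raw = "{}_{}".format(trial.trainable_name, trial.trial_id)
    # Replace every run of characters outside letters, digits, '.', '_', '-' with '_'.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw)


# Stand-in object for illustration; a real trial would be supplied by Tune.
fake_trial = SimpleNamespace(
    trainable_name="run_for_one_param_and_yield",
    trial_id="84b87_00010",
)
print(safe_trial_dirname(fake_trial))
```

If the assumption about `trial_dirname_creator` holds, passing `trial_dirname_creator=safe_trial_dirname` to `tune.run` should keep parentheses, quotes, brackets and spaces out of the paths that the syncer later hands to the shell.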