[Tune] Error when using docker containers and Sync

When using

ray up

to create an AWS cluster, I’m having trouble getting my log files to sync from the worker node(s) to the head node. Previously I received the error outlined in this thread. Having since incorporated the DockerSyncer and moved to the nightly release, I now get the following:

2020-12-15 18:29:34,543 VINFO command_runner.py:474 – Running mkdir -p /tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000
2020-12-15 18:29:34,543 VVINFO command_runner.py:477 – Full command is ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/0c606eb3ac/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.1.23.105 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000)'
Shared connection to 10.1.23.105 closed.
2020-12-15 18:29:34,639 VINFO command_runner.py:474 – Running docker cp eta-trainer:/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18/. /tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18/
2020-12-15 18:29:34,639 VVINFO command_runner.py:477 – Full command is ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/0c606eb3ac/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@10.1.23.105 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp eta-trainer:/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18/. /tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18/)'
invalid output path: directory "/tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18" does not exist
Shared connection to 10.1.23.105 closed.
2020-12-15 18:29:34,789 ERROR syncer.py:181 – Sync execution failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/syncer.py", line 178, in sync_down
    self._local_dir)
  File "/opt/conda/lib/python3.6/site-packages/ray/tune/integration/docker.py", line 101, in sync_down
    use_internal_ip=True)
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/sdk.py", line 139, in rsync
    all_nodes=False)
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/_private/commands.py", line 953, in rsync
    rsync_to_node(node_id, is_head_node=(node_id == head_node))
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/_private/commands.py", line 936, in rsync_to_node
    rsync(source, target, is_file_mount)
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/_private/updater.py", line 427, in rsync_down
    self.cmd_runner.run_rsync_down(source, target, options=options)
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 626, in run_rsync_down
    host_source))
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 481, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail)
  File "/opt/conda/lib/python3.6/site-packages/ray/autoscaler/_private/command_runner.py", line 416, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

Note that the start of the log indicates that the directory has been created.

More confusingly, I tried to create a minimal example using example_full.yaml and a simple hyperband example. Not only was I unable to reproduce the error, but I could also omit the DockerSyncer altogether without errors. In addition, I do not see any INFO statements related to sync operations.

Any help is appreciated!

According to the source code comments, appending a period to the directory path (the trailing /.) makes docker cp copy the contents of the directory rather than the directory itself. From the shared logs:

docker cp eta-trainer:/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18/.

However, it is attempting to copy the contents of the trial directory:

SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18

into the directory:

/tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000/SequenceTrainable_71ab973a_1_batch_size=1024,debug=False,destination=082494000,device=cuda,dropout=0.35565,embedding_size=9,kernel_2020-12-15_18-29-18/

where in a previous INFO statement we had only created:

mkdir -p /tmp/ray_tmp_mount/eta-trainer/root/ray_results/test_easy_ox_093808000_082494000

so the trial directory itself does not exist. It looks like we need to create the trial directory as well. On master, the relevant statement is at line 644 of command_runner.py.
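For anyone following along, here is a minimal sketch of that idea. It is not Ray's actual implementation: the helper name `build_sync_down_commands` and its parameters are hypothetical, and the example paths are shortened. It builds a `mkdir -p` for the full destination directory (trial subdirectory included) before the `docker cp`, and uses `shlex.quote` so each path stays a single shell argument.

```python
import shlex


def build_sync_down_commands(container, container_dir, host_dir):
    """Return shell commands that copy a directory's contents out of a
    container. Hypothetical helper, not part of Ray."""
    # Normalize so the trailing "/." and "/" below are added exactly once.
    src = container_dir.rstrip("/")
    dst = host_dir.rstrip("/")

    # Create the full destination directory, not only its parent. The log
    # above shows docker cp rejecting a destination that does not exist.
    mkdir_cmd = "mkdir -p {}".format(shlex.quote(dst))

    # The trailing "/." asks docker cp for the directory's contents.
    cp_cmd = "docker cp {} {}".format(
        shlex.quote("{}:{}/.".format(container, src)),
        shlex.quote(dst + "/"),
    )
    return [mkdir_cmd, cp_cmd]


commands = build_sync_down_commands(
    "eta-trainer",
    "/root/ray_results/exp/trial_1",
    "/tmp/ray_tmp_mount/eta-trainer/root/ray_results/exp/trial_1",
)
for cmd in commands:
    print(cmd)
```

Running the two commands in order means the copy never targets a directory that has not been created yet.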

BTW, @Christopher_Fusting, what version of Ray are you on? Also, what does your SequenceTrainable checkpoint look like?

@rliaw The version is nightly. SequenceTrainable writes a simple PyTorch model.pth in save_checkpoint.

Speaking of versions, I noticed my version of Ax was 0.1.9 versus ray-ml’s Ax 0.1.18. Given that I was not able to reproduce the error when using that image and a toy example, we’re going to try using ray-ml as a base to see if that resolves the issue.

Hmm, ok – that’s odd; do feel free to follow up if you have any issues, @Christopher_Fusting.

@rliaw Once we moved to the ray images the error was resolved. Feels like it was a dependency issue. On that note, thanks for providing images and Dockerfiles! Huge help (also in building a toy env / example).

Hi,
I have a very similar error:

  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 520, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
2021-03-14 05:55:15,056	WARNING util.py:162 -- The `callbacks.on_trial_result` operation took 0.836 s, which may be a performance bottleneck.
2021-03-14 05:55:15,062	WARNING util.py:162 -- The `process_trial_result` operation took 0.845 s, which may be a performance bottleneck.
2021-03-14 05:55:15,092	INFO logger.py:690 -- Removed the following hyperparameter values when logging to tensorboard: {'agdata_ref': ObjectRef(ffffffffffffffffffffffffffffffffffffffff0200000018000000), 'pipeline_elements_ref': ObjectRef(ffffffffffffffffffffffffffffffffffffffff0200000019000000), 'debug_ref': ObjectRef(ffffffffffffffffffffffffffffffffffffffff020000001a000000)}
2021-03-14 05:55:15,234	INFO command_runner.py:357 -- Fetched IP: 172.31.24.179
2021-03-14 05:55:15,234	INFO log_timer.py:27 -- NodeUpdater: i-02817c4582788593b: Got IP  [LogTimer=60ms]
2021-03-14 05:55:15,842	ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
    self._local_dir)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
    use_internal_ip=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
    all_nodes=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
    rsync_to_node(node_id, is_head_node=(node_id == head_node))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
    rsync(source, target, is_file_mount)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
    self.cmd_runner.run_rsync_down(source, target, options=options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
    silent=is_rsync_silent())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 520, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

I am running the latest nightly version. I first ran ‘ray up’ to start a small cluster using a modified version of the example yaml file, then submitted my Python script. I got the error above after each trial.

What could be wrong? How can I check the version of the Ax library? I don’t see such a library in pip freeze.

cc: @rliaw

Can you try setting TUNE_SYNCER_VERBOSITY=3 as an env var and post the logging output?

@rliaw I really appreciate your quick answer.
I think there are multiple errors here.

After enabling verbosity, the cause of the error may have become clear…

First, for anyone who hits the same error later: I wasn’t sure where to set the env variable, so I set it in ~/.bashrc on the server machine by adding a line at the end:
TUNE_SYNCER_VERBOSITY=3
and also in the python script:
import os
os.environ["TUNE_SYNCER_VERBOSITY"] = "3"

It turned out that setting it in the Python script alone is enough.

Anyway, here is the error for the syncer:

click.exceptions.ClickException: SSH command failed.
Trial run_for_one_param_and_yield_2938b_00081 completed. Last result:
2021-03-15 05:30:29,071	INFO command_runner.py:357 -- Fetched IP: 172.31.29.20
2021-03-15 05:30:29,071	INFO log_timer.py:27 -- NodeUpdater: i-06a46443bc950cfc0: Got IP  [LogTimer=60ms]
2021-03-15 05:30:29,072	VINFO command_runner.py:509 -- Running `mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00`
2021-03-15 05:30:29,072	VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.29.20 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00)'`
Shared connection to 172.31.29.20 closed.
2021-03-15 05:30:29,425	VINFO command_runner.py:509 -- Running `docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/`
2021-03-15 05:30:29,426	VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.29.20 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-30-00/run_for_one_param_and_yield_2938b_00081_81_agdata_ref=ObjectRef(ffffffffffffffffffffffffffffffffffffffff0500000002000000),col_comb_2021-03-15_05-30-27/)'`
**bash: syntax error near unexpected token `('**
Shared connection to 172.31.29.20 closed.
2021-03-15 05:30:29,644	ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
    self._local_dir)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
    use_internal_ip=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
    all_nodes=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
    rsync_to_node(node_id, is_head_node=(node_id == head_node))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
    rsync(source, target, is_file_mount)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
    self.cmd_runner.run_rsync_down(source, target, options=options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
    silent=is_rsync_silent())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 517, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

As far as I can see, it fails because the directory name contains the special character ‘(’, which the script should escape (for example with a backslash) but does not. Now you may ask: why is there an ObjectRef(…) in that directory name? I guess it is because I pass a Ray object reference as a search space parameter to the function. I know that kind of breaks the rules, but I had no other choice: when I used tune.with_parameters() to pass multiple parameters the normal way, I got another error: KeyError: ‘trial_id’.
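To illustrate the escaping point, here is a small sketch using Python's standard `shlex.quote`. The path is a shortened, made-up stand-in for the trial directory in the log, and `ls` is only a placeholder command. This shows the general technique and is not a claim about how Ray builds its commands.

```python
import shlex

# Shortened stand-in for the trial directory name from the log above.
trial_dir = "/home/ray/ray_results/exp/trial_agdata_ref=ObjectRef(ffff),col_comb"

# Unquoted: bash sees a bare "(" and reports a syntax error.
unsafe_cmd = "ls " + trial_dir
print(unsafe_cmd)

# Quoted: shlex.quote wraps the path in single quotes, so the shell
# treats it as one literal argument.
safe_cmd = "ls " + shlex.quote(trial_dir)
print(safe_cmd)

# shlex.split mimics POSIX tokenization; the quoted path survives intact.
assert shlex.split(safe_cmd) == ["ls", trial_dir]
```

Quoting the whole path is usually more robust than escaping individual characters with backslashes, because it also covers spaces, brackets, and quotes.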

Example code for the latter error:

import pandas as pd
import numpy as np
import ray
from ray import tune
from itertools import combinations

if not ray.is_initialized():
    import multiprocessing
    num_cores = multiprocessing.cpu_count()
    ray.init(num_cpus=num_cores)
    

    
def run_for_one_param(params,agdata=None):
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    
    # I know this is a bit ugly: agdata_ref should not be in params, but we have multiple of these dynamic parameters.
    #agdata=ray.get(params["agdata_ref"])
    df=agdata.data
    y=df.pop("y")
    X=df
    #for i in range(10):
    #for i in range(100):
    linreg=LinearRegression()
    # I should use a train/test/validation split, I know...
    linreg.fit(X,y)
    pred_y=linreg.predict(X)

    report_dict = dict()
    report_dict["score_r2"]=r2_score(y,pred_y)
    return report_dict

def run_for_one_param_and_yield(params,agdata=None):
    result=run_for_one_param(params,agdata)
    print(result)
    return result

class AGData():

    def __init__(self):
        self.storage_dict=dict()
        
    def set_to_storage(self,key,value):
        self.storage_dict[key] = value

    def get_from_storage(self,key):
        return self.storage_dict[key]
    
    def set_data(self,data):
        self.data=data  
        
    pass

def generate_combinations(features,min_len=1,max_len=10):
    # Collect every combination of `features` with min_len..max_len elements.
    # Uses `combinations` from the itertools import at the top of the script.
    all_comb=[]
    for i in range(min_len-1,max_len):
        for act_comb in combinations(features, i+1):
            all_comb.append(list(act_comb))

    return all_comb

class Runray():
    def __init__(self):
        self.agdata=AGData()
        #Add here new letters to grow trial numbers!!!!!
        self.colums='ABCD' #EFGHIJ'
        df = pd.DataFrame(np.random.randint(0,100,size=(100, len(self.colums)+1)), columns=list(self.colums+"y"))
        self.agdata.set_data(df)
    
    def get_search_space_ray(self):
        
        combinations=generate_combinations(list(self.colums))
        print("nr trial will run: ",len(combinations))
        search_space = {
            'col_combinations': tune.grid_search(combinations),
            #'other params': tune.grid_search(some_list)
        }
        
        #agdata_ref=ray.put(self.agdata)
        #print("new agdata reference put into params.",agdata_ref)
        # I know this is totally ugly, but this is how we pass the reference now:
        #search_space["agdata_ref"]=tune.choice([agdata_ref])
        return search_space
    
    def run_ray(self):
        # RUN RAY
        analysis = tune.run(
            tune.with_parameters(run_for_one_param_and_yield,agdata=self.agdata)
            , config=self.get_search_space_ray()
            , num_samples=1
            ,verbose=1
            #,checkpoint_freq=500
            #,checkpoint_at_end=False
            #,keep_checkpoints_num=10
        )
        
        print("ray.version: ",ray.__version__)
        print(analysis.results_df)
        
        
Runray().run_ray()

So that leads to this output with the error:

== Status ==
Memory usage on this node: 1.4/15.2 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/68 CPUs, 0/0 GPUs, 0.0/92.41 GiB heap, 0.0/40.24 GiB objects
Result logdir: /home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02
Number of trials: 15/15 (15 TERMINATED)


2021-03-15 05:54:22,085	INFO tune.py:549 -- Total run time: 19.25 seconds (18.40 seconds for the tuning loop).
ray.version:  2.0.0.dev0
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
**KeyError: 'trial_id'**

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/ray_example.py", line 135, in <module>
    Runray().run_ray()
  File "/home/ray/ray_example.py", line 132, in run_ray
    print(analysis.results_df)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/analysis/experiment_analysis.py", line 508, in results_df
    index="trial_id")
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 1871, in from_records
    i = columns.get_loc(index)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'trial_id'
Shared connection to 34.218.81.250 closed.
Error: Command failed:

  ssh -tt -i /Users/miklostoth/.ssh/ray-autoscaler_us-west-2.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_9fee58ce4e/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.218.81.250 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/ray_example.py)'"'"'"'"'"'"'"'"''"'"' )'

BTW, even the example script above leads to a similar sync error:

(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.33.254) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
(pid=None, ip=172.31.36.188) {'score_r2': 0.02295957796592607}
2021-03-15 05:54:05,511	INFO command_runner.py:357 -- Fetched IP: 172.31.46.205
2021-03-15 05:54:05,512	INFO log_timer.py:27 -- NodeUpdater: i-09be5ad7129024265: Got IP  [LogTimer=0ms]
2021-03-15 05:54:05,512	VINFO command_runner.py:509 -- Running `mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02`
2021-03-15 05:54:05,512	VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.46.205 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02)'`
Warning: Permanently added '172.31.46.205' (ECDSA) to the list of known hosts.
Shared connection to 172.31.46.205 closed.
2021-03-15 05:54:06,391	VINFO command_runner.py:509 -- Running `docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['A']_2021-03-15_05-54-03/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['A']_2021-03-15_05-54-03/`
2021-03-15 05:54:06,391	VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.46.205 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['"'"'A'"'"']_2021-03-15_05-54-03/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=['"'"'A'"'"']_2021-03-15_05-54-03/)'`
Error: No such container:path: ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00000_0_col_combinations=[A]_2021-03-15_05-54-03/.
Shared connection to 172.31.46.205 closed.
2021-03-15 05:54:06,770	ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
    self._local_dir)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
    use_internal_ip=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
    all_nodes=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
    rsync_to_node(node_id, is_head_node=(node_id == head_node))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
    rsync(source, target, is_file_mount)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
    self.cmd_runner.run_rsync_down(source, target, options=options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
    silent=is_rsync_silent())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 517, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.
Trial run_for_one_param_and_yield_84b87_00010 completed. Last result:
2021-03-15 05:54:06,926	INFO command_runner.py:357 -- Fetched IP: 172.31.44.39
2021-03-15 05:54:06,926	INFO log_timer.py:27 -- NodeUpdater: i-0d2d7c6c2b3b12ed9: Got IP  [LogTimer=69ms]
2021-03-15 05:54:06,927	VINFO command_runner.py:509 -- Running `mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02`
2021-03-15 05:54:06,927	VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.44.39 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02 && chown -R ubuntu /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02)'`
Warning: Permanently added '172.31.44.39' (ECDSA) to the list of known hosts.
Shared connection to 172.31.44.39 closed.
2021-03-15 05:54:07,825	VINFO command_runner.py:509 -- Running `docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['A', 'B', 'C']_2021-03-15_05-54-04/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['A', 'B', 'C']_2021-03-15_05-54-04/`
2021-03-15 05:54:07,826	VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/4ea88e9fa8/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@172.31.44.39 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker cp ray_container:/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['"'"'A'"'"', '"'"'B'"'"', '"'"'C'"'"']_2021-03-15_05-54-04/. /tmp/ray_tmp_mount/alphagen_ray/home/ray/ray_results/run_for_one_param_and_yield_2021-03-15_05-54-02/run_for_one_param_and_yield_84b87_00010_10_col_combinations=['"'"'A'"'"', '"'"'B'"'"', '"'"'C'"'"']_2021-03-15_05-54-04/)'`
**"docker cp" requires exactly 2 arguments.**
See 'docker cp --help'.

Usage:  docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
	docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH

Copy files/folders between a container and the local filesystem
Shared connection to 172.31.44.39 closed.
2021-03-15 05:54:08,218	ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 187, in sync_down
    self._local_dir)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 108, in sync_down
    use_internal_ip=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 147, in rsync
    all_nodes=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1054, in rsync
    rsync_to_node(node_id, is_head_node=(node_id == head_node))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1037, in rsync_to_node
    rsync(source, target, is_file_mount)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/updater.py", line 477, in rsync_down
    self.cmd_runner.run_rsync_down(source, target, options=options)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 679, in run_rsync_down
    silent=is_rsync_silent())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 517, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 451, in _run_helper
    raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

So what am I doing wrong here?
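A hedged sketch of what seems to be happening in the last two failures, plus one possible workaround. `shlex.split` is used here only to mimic how a POSIX shell tokenizes an unquoted string. The helper `safe_dirname` is hypothetical and not part of Ray. If your Ray version's `tune.run` accepts a `trial_dirname_creator` callable (worth checking against your installed version), a function like this could be used there to keep quotes, brackets, spaces, and parentheses out of trial directory names.

```python
import re
import shlex

# Shortened trial directory name modelled on the log above.
name = "trial_00010_10_col_combinations=['A', 'B', 'C']_2021-03-15"

# Unquoted, a POSIX shell strips the single quotes and splits on spaces,
# so one path becomes three separate arguments.
tokens = shlex.split(name)
print(tokens)


def safe_dirname(raw):
    """Replace each run of characters outside [A-Za-z0-9.=-] with a single
    underscore. Hypothetical helper, not part of Ray."""
    cleaned = re.sub(r"[^A-Za-z0-9.=-]+", "_", raw)
    return cleaned.strip("_")


safe = safe_dirname(name)
print(safe)

# The sanitized name is a single shell token even without quoting.
assert shlex.split(safe) == [safe]
```

The split into several tokens matches the "requires exactly 2 arguments" message, and the stripped quotes match the `[A]` in the "No such container:path" message.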

Hi,

In the meantime, I debugged a bit, and it turned out that if you use

analysis = tune.run(
            tune.with_parameters(your_func),
   ...
   )

Then you MUST use

tune.report(metric=value)

and cannot use

yield {metric:value}

at the end of your function; otherwise, you will get an error when accessing analysis.results_df.

If you run

analysis = tune.run(
            your_func,
   ...
   )

there is no such issue, and you can use yield.
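A minimal sketch of the reporting pattern described above, assuming the `tune.report` API of the Ray versions discussed in this thread. The names `compute_score` and `trainable` are hypothetical, and the score is a placeholder rather than a real metric.

```python
def compute_score(config):
    """Placeholder metric (hypothetical): the number of selected columns."""
    return float(len(config["col_combinations"]))


def trainable(config, agdata=None):
    # Imported lazily so compute_score can be tried without Ray installed.
    from ray import tune

    score = compute_score(config)
    # Report through the Tune API instead of using `return` or `yield`.
    tune.report(score_r2=score)


# Intended usage, per the thread (not executed here):
# analysis = tune.run(
#     tune.with_parameters(trainable, agdata=my_agdata),
#     config={"col_combinations": tune.grid_search([["A"], ["A", "B"]])},
# )
```

Here `my_agdata` stands for whatever large object you would otherwise have passed through the search space.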