Ray Tune training hangs

My hyperparameter search hangs: there is no error, but also no progress. When I terminate it with Ctrl+C, it prints the following traceback:

Traceback (most recent call last):
  File "/home/resnetTrace.py", line 450, in <module>
    main(max_num_epochs=100, gpus_per_trial=1) # Change this to activate training on GPUs
  File "/home/resnetTrace.py", line 432, in main
    local_dir=os.path.abspath("/home/lynnl/traces/ray_result"))
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 419, in run
    runner.step()
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    self._process_events()  # blocking
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 469, in _process_events
    trial = self.trial_executor.get_next_available_trial()  # blocking
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 472, in get_next_available_trial
    [result_id], _ = ray.wait(shuffled_results)
  File "/home/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1513, in wait
    worker.current_task_id,
  File "python/ray/_raylet.pyx", line 1001, in ray._raylet.CoreWorker.wait
  File "python/ray/_raylet.pyx", line 142, in ray._raylet.check_status
KeyboardInterrupt

Basically, I just call ray.init and then tune.run to start the training on the cluster. What could be causing this? Thanks!

Hi @kovjxkjvklz, does this do anything at all (i.e. are some trials reporting something) or is there completely no output?

At the location you reported, Ray Tune is waiting for trials to report results (e.g. through tune.report()). So it might also be that your trial is still processing and just hasn't reported any results yet, e.g. if it takes a long time to fit a model.
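To illustrate the reporting point, here is a minimal sketch of a trainable that reports at a fixed interval, so a long-running trial does not look like a hang. The names train_sketch, report, and report_every are hypothetical, the loss is a stand-in for a real training step, and the sketch assumes the function-based API in which tune.report accepts metrics as keyword arguments. The report callable is injected so the snippet runs without a cluster.

```python
from typing import Callable, Dict, List


def train_sketch(config: Dict, report: Callable[..., None], report_every: int = 100) -> None:
    """Hypothetical trainable that reports a metric every `report_every` steps."""
    running_loss = 0.0
    for step in range(1, config["num_iters"] + 1):
        loss = 1.0 / step  # stand-in for one real training step
        running_loss += loss
        if step % report_every == 0:
            # Under Ray Tune this would be tune.report(...) (assumed keyword API).
            report(step=step, loss=running_loss / report_every)
            running_loss = 0.0


# Local dry run: collect the reported metrics in a list instead of sending them to Tune.
reported: List[Dict] = []
train_sketch({"num_iters": 300}, report=lambda **metrics: reported.append(metrics))
print(len(reported), reported[-1]["step"])
```

If the real trainable only reports once per epoch and an epoch is long, Tune will sit in ray.wait for that whole time, which would match the traceback above.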

What are you trying to tune, and what does your trainable look like?

Hi @kai, I think there's no output at all from any of the trials, which is why it's weird… Before all the trials stopped progressing, they were printing pretty frequently. Here's the configuration and how I start the training:

config = {
    "momentum": tune.grid_search([0.5, 0.9]),
    "lr": tune.grid_search([0.1, 0.01]),
    "batch_size": tune.grid_search([64, 256]),
    "weight_decay": tune.grid_search([1e-4]),
    "n": tune.grid_search([20, 50, 110]),
    "activation": tune.grid_search(["relu"]),
    "pool": tune.grid_search(["avg"]),
    "inplanes": tune.grid_search([16]),
    "opt": tune.grid_search(["O0", "O2"]),
    "num_iters": tune.grid_search([78000]),
    "multiplier": tune.grid_search([4, 16]),
    "num_worker": tune.grid_search([1]),
    "log_sys_usage": True
}

data_dir = os.path.abspath("/home/data")
result = tune.run(
    partial(train, data_dir=data_dir),
    name="Result_1_28",
    resources_per_trial={"cpu": 8, "gpu": config["num_worker"]},
    config=config,
    sync_to_driver=False,
    loggers=None,
    local_dir=os.path.abspath("/home/traces/ray_result"))
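
Two things in this setup can be checked without a cluster. The sketch below uses a hypothetical stand-in for tune.grid_search, on the assumption that it returns a dict of the form {"grid_search": values}, to count how many trials the grid expands to and to show what config["num_worker"] evaluates to when it is passed to resources_per_trial. The helper count_grid_trials is illustrative and not part of Ray.

```python
def grid_search(values):
    # Stand-in for tune.grid_search (assumed to return this dict shape).
    return {"grid_search": list(values)}


config = {
    "momentum": grid_search([0.5, 0.9]),
    "lr": grid_search([0.1, 0.01]),
    "batch_size": grid_search([64, 256]),
    "weight_decay": grid_search([1e-4]),
    "n": grid_search([20, 50, 110]),
    "activation": grid_search(["relu"]),
    "pool": grid_search(["avg"]),
    "inplanes": grid_search([16]),
    "opt": grid_search(["O0", "O2"]),
    "num_iters": grid_search([78000]),
    "multiplier": grid_search([4, 16]),
    "num_worker": grid_search([1]),
    "log_sys_usage": True,
}


def count_grid_trials(cfg):
    """Multiply the lengths of all grid_search lists to get the trial count."""
    total = 1
    for value in cfg.values():
        if isinstance(value, dict) and "grid_search" in value:
            total *= len(value["grid_search"])
    return total


num_trials = count_grid_trials(config)
gpu_request = config["num_worker"]
print(num_trials)   # 96 trials for the grid above
print(gpu_request)  # a grid_search spec (dict), not a number
```

With 8 CPUs and 1 GPU requested per trial, only as many of those 96 trials can run at once as the cluster has free resources for, and the rest stay pending. Under the assumption above, the "gpu" entry also receives the grid_search spec rather than a number, so passing an explicit value such as the gpus_per_trial argument seen in the traceback may be what was intended.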