Ray Tune training hangs

kovjxkjvklz · January 28, 2021, 6:50am

My hyperparameter exploration task hangs, with no error and no progress. When I press Ctrl+C to terminate it, it shows the following:

Traceback (most recent call last):
  File "/home/resnetTrace.py", line 450, in <module>
    main(max_num_epochs=100, gpus_per_trial=1) # Change this to activate training on GPUs
  File "/home/resnetTrace.py", line 432, in main
    local_dir=os.path.abspath("/home/lynnl/traces/ray_result"))
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 419, in run
    runner.step()
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    self._process_events()  # blocking
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 469, in _process_events
    trial = self.trial_executor.get_next_available_trial()  # blocking
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 472, in get_next_available_trial
    [result_id], _ = ray.wait(shuffled_results)
  File "/home/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1513, in wait
    worker.current_task_id,
  File "python/ray/_raylet.pyx", line 1001, in ray._raylet.CoreWorker.wait
  File "python/ray/_raylet.pyx", line 142, in ray._raylet.check_status
KeyboardInterrupt

Basically, I just call ray.init and then tune.run to start the training on a cluster. What could be causing this? Thanks!

kai · January 28, 2021, 8:24am

Hi @kovjxkjvklz, does this do anything at all (i.e. are some trials reporting something) or is there completely no output?

At the location you reported, Ray Tune is waiting for trials to report results (e.g. through tune.report()). So it may be that your trial is still processing and just hasn’t reported any results yet, e.g. if it takes a long time to fit a model.
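
To illustrate the point about reporting, here is a minimal sketch of a function trainable that reports once per epoch, so the driver has something to receive between long stretches of work. The names `train_with_heartbeat`, `fake_report`, and the metric keys are hypothetical and not taken from the original script. The `report` callable is injected so the sketch runs without a cluster; with the function API of the Ray version used in this thread, `tune.report` would presumably be passed in its place.

```python
import time


def train_with_heartbeat(config, report, num_epochs=3):
    """Hypothetical trainable that reports metrics after every epoch.

    `report` is any callable accepting keyword metrics. Under Ray Tune's
    function API (Ray 1.x, as in this thread) this is assumed to be
    `tune.report`; here it is injected so the sketch runs without Ray.
    """
    for epoch in range(num_epochs):
        start = time.time()
        # Placeholder for one epoch of real training work.
        loss = config["lr"] / (epoch + 1)
        # Reporting each epoch gives the driver a result to pick up,
        # instead of blocking until the whole trial has finished.
        report(epoch=epoch, loss=loss, epoch_seconds=time.time() - start)


# Stand-in for tune.report that simply records what was reported.
collected = []


def fake_report(**metrics):
    collected.append(metrics)


train_with_heartbeat({"lr": 0.1}, report=fake_report)
print(len(collected), "results reported")
```

If no result arrives for a long time even with per-epoch reporting, the trial itself is likely stuck (for example in data loading) rather than merely slow.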

What are you trying to tune, and what does your trainable look like?

kovjxkjvklz · January 28, 2021, 2:52pm

Hi @kai, I think there’s no output at all from any trial, which is why it’s weird… Before all the trials stopped progressing, they printed output quite frequently. Here’s the configuration and how I start the training:

config = {
    "momentum": tune.grid_search([0.5, 0.9]),
    "lr": tune.grid_search([0.1, 0.01]),
    "batch_size": tune.grid_search([64, 256]),
    "weight_decay": tune.grid_search([1e-4]),
    "n": tune.grid_search([20, 50, 110]),
    "activation": tune.grid_search(["relu"]),
    "pool": tune.grid_search(["avg"]),
    "inplanes": tune.grid_search([16]),
    "opt": tune.grid_search(["O0", "O2"]),
    "num_iters": tune.grid_search([78000]),
    "multiplier": tune.grid_search([4, 16]),
    "num_worker": tune.grid_search([1]),
    "log_sys_usage": True
}

data_dir = os.path.abspath("/home/data")
result = tune.run(
    partial(train, data_dir=data_dir),
    name="Result_1_28",
    resources_per_trial={"cpu": 8, "gpu": config["num_worker"]},
    config=config,
    sync_to_driver=False,
    loggers=None,
    local_dir=os.path.abspath("/home/traces/ray_result"))
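
As a rough sanity check on this setup, the sketch below counts how many trials the grid expands to and how many could run at once for a given resource budget. The grid values and the per-trial request are copied from the post. The cluster size (`hypothetical_cluster`) and the helper `max_concurrent_trials` are assumptions made only for illustration, since the thread does not say how large the cluster is.

```python
# Grid values copied from the config in the post (plain lists, no Ray needed).
grid = {
    "momentum": [0.5, 0.9],
    "lr": [0.1, 0.01],
    "batch_size": [64, 256],
    "weight_decay": [1e-4],
    "n": [20, 50, 110],
    "activation": ["relu"],
    "pool": ["avg"],
    "inplanes": [16],
    "opt": ["O0", "O2"],
    "num_iters": [78000],
    "multiplier": [4, 16],
    "num_worker": [1],
}

# A grid search runs every combination, so the trial count is the
# product of the number of choices for each key.
total_trials = 1
for values in grid.values():
    total_trials *= len(values)


def max_concurrent_trials(cluster, per_trial):
    """Hypothetical helper: how many trials fit into the cluster at once."""
    return min(
        int(cluster.get(name, 0) // amount)
        for name, amount in per_trial.items()
        if amount > 0
    )


per_trial = {"cpu": 8, "gpu": 1}           # request from the post
hypothetical_cluster = {"cpu": 32, "gpu": 4}  # assumed, not from the thread

concurrent = max_concurrent_trials(hypothetical_cluster, per_trial)
print(total_trials, "trials in total,", concurrent, "at a time")
```

Note also that `config["num_worker"]` in the `tune.run` call above is the grid-search specification itself rather than the integer 1, so passing a plain number for `"gpu"` may be what was intended.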

ghoshs · October 18, 2023, 3:12am

I am also having similar issues. Were you able to fix it?
