My training is endless with tune.run()

Hello,

I have a question related to tune.run(). As a sanity check, I run it with one epoch, 2 training samples, and 2 validation samples before scaling to the whole dataset. However, the training seems to run forever. Any clue?
Here is my code:

    analysis = tune.run(
        train_fn_with_parameters,
        metric="loss_validation",
        mode="min",
        config=config,
        num_samples=1,
        resources_per_trial=resources_per_trial,  # 16 CPUs and 1 GPU
        name="tune_model",
        max_concurrent_trials=1,
        scheduler=tune_scheduler,
    )

On my screen I have the following:
== Status ==

Current time: 2022-04-06 08:34:53 (running for 00:02:30.33)
Memory usage on this node: 16.8/58.9 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 1.000: None
Resources requested: 14.0/16 CPUs, 1.0/1 GPUs, 0.0/26.36 GiB heap, 0.0/13.18 GiB objects
Result logdir: /home/tune_model
Number of trials: 1/1 (1 RUNNING)

+--------------------------+----------+-------------------+----------+---------+
| Trial name               | status   | loc               | kernel   | lr      |
|--------------------------+----------+-------------------+----------+---------|
| run_training_f38ce_00000 | RUNNING  | 10.132.0.48:25795 | 16       | 0.32865 |
+--------------------------+----------+-------------------+----------+---------+

The trial has been in RUNNING status for several hours, whereas I would expect the training to finish in less than a minute.

    train_fn_with_parameters = ray.tune.function_runner.with_parameters(
        build_model,
        fixed_params=params,
        train_paths=train_paths,
        val_paths=val_paths,
        saving_folder=saving_folder,
        tensorboard=tensorboard,
    )

    tune_scheduler = tune.schedulers.ASHAScheduler(
        max_t=params["nb_epochs"],
        grace_period=1,
    )
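
For context, my understanding is that with_parameters wraps build_model so that Tune calls it with the sampled config as its first argument and the fixed keyword arguments after it. A simplified sketch of the expected signature (not my actual implementation):

    def build_model(config, fixed_params, train_paths, val_paths, saving_folder, tensorboard):
        # config: the hyperparameters Tune samples for this trial (e.g. kernel, lr)
        # remaining arguments: the constants passed through with_parameters above
        ...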

Thanks for your help.

Hey @AhmedM do you mind also sharing the train_fn_with_parameters function?

As soon as that function finishes, the trial should terminate, so the endless running behavior is a bit odd.
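
For reference, a function trainable usually follows this pattern (a minimal sketch, not your actual code): loop over the epochs, report the metric after each one, and return at the end. If the function never returns (for example, it blocks on something after training), the trial stays in RUNNING forever.

    from ray import tune

    def train_fn(config, fixed_params, **kwargs):
        # Sketch only: run a fixed number of epochs, report the metric each
        # epoch, then return so Tune can mark the trial TERMINATED.
        for epoch in range(fixed_params["nb_epochs"]):
            # ... run one training + validation epoch here ...
            val_loss = 1.0 / (epoch + 1)  # dummy value standing in for the real validation loss
            tune.report(loss_validation=val_loss)
        # No blocking call or infinite loop after the last epoch.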

Hi @amogkam, thank you for your answer. I added the train_fn_with_parameters function to the description.

In fixed_params=params, I set params["epochs"]=1

Oh @AhmedM can you also share your build_model function, and ideally as much of your code as possible? I mainly want to see how your training logic is defined.

@amogkam - this looks similar to a problem I have. I’ve tried with python==3.8.12 and ray==1.12, and also ray==1.11 (I had heard ray 1.12 has problems), running in Jupyter Lab 3.3.3. I’ve tried the Breakout and Pong Atari environments. I’m running on JarvisLabs.ai, which uses Ubuntu 20.04 LTS and runs on Docker (I have very little understanding of how Docker works and whether this could cause problems). Any help or ideas would be really helpful!

This is my code:

import ray
from ray import tune
import ray.rllib.agents.dqn as dqn
from ray.tune.logger import pretty_print
import gc

config = dqn.DEFAULT_CONFIG.copy()
config['env'] = 'PongDeterministic-v4'
config['framework'] = 'torch'
config["dueling"] = False
config["double_q"] = tune.grid_search([True, False])
config['num_atoms'] = 1
config['noisy'] = False
config['prioritized_replay'] = False
config['n_step'] = 1
config['target_network_update_freq'] = 8000
config['lr'] = 0.000625
config['adam_epsilon'] = 0.00015
config['hiddens'] = [512]
config['learning_starts'] = 20000
config['replay_buffer_config']['capacity'] = 1000000 # config['buffer_size'] has been deprecated
config['rollout_fragment_length'] = 4
config['train_batch_size'] = 32
config['exploration_config'] = {'type': 'EpsilonGreedy',
                                'initial_epsilon': 1.0,
                                'final_epsilon': 0.01,
                                'epsilon_timesteps': 200000}
config['prioritized_replay_alpha'] = 0.5
config['num_gpus'] = 0.2
config['num_workers'] = 6 # this depends on number of CPUs available
config['timesteps_per_iteration'] = 10000

def evaluation_fn(result):
    # for tuning
    return result['episode_reward_mean']

def objective_fn(config):
    trainer = dqn.DQNTrainer(config=config)

    for i in range(1):
        # Perform one iteration of training the policy with DQN
        result = trainer.train()
        intermediate_score = evaluation_fn(result)
      
        # Feed the score back to Tune.
        tune.report(iterations=i, mean_reward=intermediate_score)

        if i % 10 == 0:
            checkpoint = trainer.save()
            print("checkpoint saved at", checkpoint)
            print("cpu utilisation: {:.1%}".format(result['perf']['cpu_util_percent']/100))
            print("ram utilisation: {:.1%}".format(result['perf']['ram_util_percent']/100))

analysis = tune.run(objective_fn,
                    metric="mean_reward",
                    mode="max",
                    num_samples=1,
                    config=config)


When I tune with a minimal search space and a small number of iterations, tune.run() seems to work fine, with regular status updates printed. However, after a while it just hangs and doesn’t seem to do anything. This is an example of the last message I see:

I’m only allowed to embed one image per post: sorry!

I do have a few warning messages when first running the cell, as follows:

(I tried importing gputil, but that actually crashed the notebook for some reason)

The final warning I get after the very first status update is below. After that, everything appears to run smoothly except it just stops updating.

Hi @alexxcollins,
Do you mind opening a new thread for this issue? Hanging can happen for different reasons, so it’s better to keep the discussions separate.

I am looking at your script. Is there any specific reason you wrote a custom function rather than using tune.run("DQN", ...) directly? Any supported RLlib algorithm is pre-registered and ready to be used in Tune runs like this, which may be more straightforward and less error-prone.
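
Something roughly like this (a sketch; the stop criterion and metric are just examples and depend on your setup):

    from ray import tune

    analysis = tune.run(
        "DQN",                           # use the pre-registered RLlib trainable
        config=config,
        stop={"training_iteration": 1},  # example stop criterion; adjust as needed
        metric="episode_reward_mean",
        mode="max",
        num_samples=1,
    )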

I’m also having this problem with tune.run("PPO", ...)