Hello, I have a PyTorch Lightning module running on a Google Cloud VM with 24 CPU cores and 2 NVIDIA A100 GPUs. I use Ray for hyperparameter tuning.
At first the status output reports the requested resources, including 0.0/1.0 of accelerator_type:A100:
Resources requested: 16.0/24 CPUs, 2.0/2 GPUs, 0.0/107.2 GiB heap, 0.0/49.93 GiB objects (0.0/1.0 accelerator_type:A100)
Then the data loading code executes, and I constantly get:
== Status ==
Current time: 2022-09-20 19:07:03 (running for 00:07:39.81)
Memory usage on this node: 17.7/167.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/24 CPUs, 2.0/2 GPUs, 0.0/107.2 GiB heap, 0.0/49.93 GiB objects (0.0/1.0 accelerator_type:A100)
Result logdir: /home/sliceruser/ray_results/picai-hyperparam-search-30
Number of trials: 1/1 (1 RUNNING)
+-----------------------+----------+----------------+
| Trial name            | status   | loc            |
|-----------------------+----------+----------------|
| mainTrain_578ab_00000 | RUNNING  | 10.164.0.3:923 |
+-----------------------+----------+----------------+
The Tune configuration is defined as below:
config = {
    "lr": 1e-3,
    "dropout": 0.2,
    "accumulate_grad_batches": 3,
    "spacing_keyword": "_one_spac_c",  # alternative: "_med_spac_b"
    "gradient_clip_val": 10.0,
    "RandGaussianNoised_prob": 0.01,
    "RandAdjustContrastd_prob": 0.4,
    "RandGaussianSmoothd_prob": 0.01,
    "RandRicianNoised_prob": 0.4,
    "RandFlipd_prob": 0.4,
    "RandAffined_prob": 0.2,
    "RandCoarseDropoutd_prob": 0.01,
    "RandomElasticDeformation_prob": 0.1,
    "RandomAnisotropy_prob": 0.1,
    "RandomMotion_prob": 0.1,
    "RandomGhosting_prob": 0.1,
    "RandomSpike_prob": 0.1,
    "RandomBiasField_prob": 0.1,
}
pb2_scheduler = PB2(
    time_attr="training_iteration",
    metric="avg_val_acc",
    mode="max",
    perturbation_interval=10.0,
    hyperparam_bounds={
        "lr": [1e-5, 1e-2],  # bounds as [min, max]
        "gradient_clip_val": [0.0, 100.0],
        "RandGaussianNoised_prob": [0.0, 1.0],
        "RandAdjustContrastd_prob": [0.0, 1.0],
        "RandGaussianSmoothd_prob": [0.0, 1.0],
        "RandRicianNoised_prob": [0.0, 1.0],
        "RandFlipd_prob": [0.0, 1.0],
        "RandAffined_prob": [0.0, 1.0],
        "RandCoarseDropoutd_prob": [0.0, 1.0],
        "RandomElasticDeformation_prob": [0.0, 1.0],
        "RandomAnisotropy_prob": [0.0, 1.0],
        "RandomMotion_prob": [0.0, 1.0],
        "RandomGhosting_prob": [0.0, 1.0],
        "RandomSpike_prob": [0.0, 1.0],
        "RandomBiasField_prob": [0.0, 1.0],
        "dropout": [0.0, 0.6],
    },
)
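As a sanity check (a minimal stdlib sketch, with both dicts abbreviated for illustration), every key in hyperparam_bounds should also carry a starting value in config, and each bound should be ordered [min, max], since PB2 perturbs trials starting from those initial values:

```python
# Abbreviated copies of the two dicts above, for illustration only.
config = {"lr": 1e-3, "dropout": 0.2, "gradient_clip_val": 10.0}
hyperparam_bounds = {
    "lr": [1e-5, 1e-2],
    "dropout": [0.0, 0.6],
    "gradient_clip_val": [0.0, 100.0],
}

# Every perturbed hyperparameter needs a starting value in the search config.
missing = set(hyperparam_bounds) - set(config)
assert not missing, f"bounds keys with no initial value in config: {missing}"

# Each bound must be ordered as [min, max].
for key, (lo, hi) in hyperparam_bounds.items():
    assert lo <= hi, f"{key}: bounds must be [min, max], got [{lo}, {hi}]"
```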
experiment_name = "picai-hyperparam-search-30"
num_gpu = 2
cpu_num = 8  # per GPU
default_root_dir = "/home/sliceruser/data/lightning"
checkpoint_dir = "/home/sliceruser/data/tuneCheckpoints1"
num_cpus_per_worker = cpu_num
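For reference, the "16.0/24 CPUs, 2.0/2 GPUs" in the status output follows directly from these numbers: the placement group requests one bundle per GPU worker, each with num_cpus_per_worker CPUs and 1 GPU. A quick arithmetic sketch:

```python
num_gpu = 2
num_cpus_per_worker = 8  # cpu_num, per GPU

# One bundle per GPU worker, mirroring the PlacementGroupFactory argument.
bundles = [{"CPU": num_cpus_per_worker, "GPU": 1.0} for _ in range(num_gpu)]
total_cpus = sum(b["CPU"] for b in bundles)
total_gpus = sum(b["GPU"] for b in bundles)
print(f"requested: {total_cpus} CPUs, {total_gpus} GPUs")  # 16 CPUs, 2.0 GPUs
```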
tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(
            Three_chan_baseline.mainTrain,
            df=df,
            experiment_name=experiment_name,
            dummyDict=dummyDict,
            num_gpu=num_gpu,
            cpu_num=cpu_num,
            default_root_dir=default_root_dir,
            checkpoint_dir=checkpoint_dir,
            options=options,
            num_cpus_per_worker=num_cpus_per_worker,
        ),
        resources=tune.PlacementGroupFactory(
            [{"CPU": num_cpus_per_worker, "GPU": 1.0}] * num_gpu
        ),
    ),
    tune_config=tune.TuneConfig(
        scheduler=pb2_scheduler,
    ),
    run_config=air.RunConfig(
        name=experiment_name,
    ),
    param_space=config,
)
results = tuner.fit()
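To narrow down whether the GPUs are even visible inside the trial, here is a minimal stdlib check that can be dropped into the trainable (Ray sets CUDA_VISIBLE_DEVICES for each worker based on its assigned GPUs; the helper name below is my own):

```python
import os

def count_visible_gpus() -> int:
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES (empty/unset -> 0)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([g for g in visible.split(",") if g.strip()])

# Called inside the trainable: with the two-bundle placement group above,
# each worker would be expected to see at least one GPU id.
print("visible GPUs:", count_visible_gpus())
```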
The PyTorch Lightning trainer is defined as:
checkPointCallback = TuneReportCheckpointCallback(
    metrics={
        "loss": "avg_val_loss",
        "mean_accuracy": "avg_val_acc",
    },
    filename="checkpoint",
    on="validation_end",
)
strategy = RayShardedStrategy(
    num_workers=num_gpu,
    num_cpus_per_worker=num_cpus_per_worker,
    use_gpu=True,
)
callbacks = [checkPointCallback]
kwargs = {
    "max_epochs": max_epochs,
    "callbacks": callbacks,
    "logger": comet_logger,
    "default_root_dir": default_root_dir,
    "auto_lr_find": False,
    "check_val_every_n_epoch": 10,
    "accumulate_grad_batches": accumulate_grad_batches,
    "gradient_clip_val": gradient_clip_val,
    "log_every_n_steps": 2,
    "strategy": strategy,
}
if checkpoint_dir:
    kwargs["resume_from_checkpoint"] = os.path.join(checkpoint_dir, "checkpoint")
trainer = pl.Trainer(**kwargs)
However, no GPU is used; below is the nvidia-smi result:
![image](upload://9UsIVDXhZnCrCHis1ijl37flQ3M.png)
The full Docker container definition is available at GitHub - jakubMitura14/forPicaiDocker, and the full code (quite long) at GitHub - jakubMitura14/piCaiCode.