Hello, I have a PyTorch Lightning module running on a Google Cloud VM with 24 CPU cores and 2 NVIDIA A100 GPUs. I use Ray for hyperparameter tuning.
At first the status output reports the requested resources, including 0.0/1.0 of accelerator_type:A100:
Resources requested: 16.0/24 CPUs, 2.0/2 GPUs, 0.0/107.2 GiB heap, 0.0/49.93 GiB objects (0.0/1.0 accelerator_type:A100)
Then the data loading code executes, and I constantly get:
== Status ==
Current time: 2022-09-20 19:07:03 (running for 00:07:39.81)
Memory usage on this node: 17.7/167.0 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 16.0/24 CPUs, 2.0/2 GPUs, 0.0/107.2 GiB heap, 0.0/49.93 GiB objects (0.0/1.0 accelerator_type:A100)
Result logdir: /home/sliceruser/ray_results/picai-hyperparam-search-30
Number of trials: 1/1 (1 RUNNING)
+-----------------------+----------+----------------+
| Trial name            | status   | loc            |
|-----------------------+----------+----------------|
| mainTrain_578ab_00000 | RUNNING  | 10.164.0.3:923 |
+-----------------------+----------+----------------+
The Tune configuration is defined as below:
config = {
    "lr": 1e-3,
    "dropout": 0.2,
    "accumulate_grad_batches": 3,
    "spacing_keyword": "_one_spac_c",  # alternative: "_med_spac_b"
    "gradient_clip_val": 10.0,
    "RandGaussianNoised_prob": 0.01,
    "RandAdjustContrastd_prob": 0.4,
    "RandGaussianSmoothd_prob": 0.01,
    "RandRicianNoised_prob": 0.4,
    "RandFlipd_prob": 0.4,
    "RandAffined_prob": 0.2,
    "RandCoarseDropoutd_prob": 0.01,
    "RandomElasticDeformation_prob": 0.1,
    "RandomAnisotropy_prob": 0.1,
    "RandomMotion_prob": 0.1,
    "RandomGhosting_prob": 0.1,
    "RandomSpike_prob": 0.1,
    "RandomBiasField_prob": 0.1,
}
pb2_scheduler = PB2(
    time_attr="training_iteration",
    metric="avg_val_acc",
    mode="max",
    perturbation_interval=10.0,
    hyperparam_bounds={
        "lr": [1e-5, 1e-2],  # bounds as [min, max]
        "gradient_clip_val": [0.0, 100.0],
        "RandGaussianNoised_prob": [0.0, 1.0],
        "RandAdjustContrastd_prob": [0.0, 1.0],
        "RandGaussianSmoothd_prob": [0.0, 1.0],
        "RandRicianNoised_prob": [0.0, 1.0],
        "RandFlipd_prob": [0.0, 1.0],
        "RandAffined_prob": [0.0, 1.0],
        "RandCoarseDropoutd_prob": [0.0, 1.0],
        "RandomElasticDeformation_prob": [0.0, 1.0],
        "RandomAnisotropy_prob": [0.0, 1.0],
        "RandomMotion_prob": [0.0, 1.0],
        "RandomGhosting_prob": [0.0, 1.0],
        "RandomSpike_prob": [0.0, 1.0],
        "RandomBiasField_prob": [0.0, 1.0],
        "dropout": [0.0, 0.6],
    },
)
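As a sanity check (a minimal stdlib sketch, with both dicts abbreviated for illustration), every key in hyperparam_bounds should also carry a starting value in config, and each bound should be ordered [min, max], since PB2 perturbs trials starting from those initial values:

```python
# Abbreviated copies of the two dicts above, for illustration only.
config = {"lr": 1e-3, "dropout": 0.2, "gradient_clip_val": 10.0}
hyperparam_bounds = {
    "lr": [1e-5, 1e-2],
    "dropout": [0.0, 0.6],
    "gradient_clip_val": [0.0, 100.0],
}

# Every perturbed hyperparameter needs a starting value in the search config.
missing = set(hyperparam_bounds) - set(config)
assert not missing, f"bounds keys with no initial value in config: {missing}"

# Each bound must be ordered as [min, max].
for key, (lo, hi) in hyperparam_bounds.items():
    assert lo <= hi, f"{key}: bounds must be [min, max], got [{lo}, {hi}]"
```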
experiment_name = "picai-hyperparam-search-30"
num_gpu = 2
cpu_num = 8  # per GPU
default_root_dir = "/home/sliceruser/data/lightning"
checkpoint_dir = "/home/sliceruser/data/tuneCheckpoints1"
num_cpus_per_worker = cpu_num
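For reference, the "16.0/24 CPUs, 2.0/2 GPUs" in the status output follows directly from these numbers: the placement group requests one bundle per GPU worker, each with num_cpus_per_worker CPUs and 1 GPU. A quick arithmetic sketch:

```python
num_gpu = 2
num_cpus_per_worker = 8  # cpu_num, per GPU

# One bundle per GPU worker, mirroring the PlacementGroupFactory argument.
bundles = [{"CPU": num_cpus_per_worker, "GPU": 1.0} for _ in range(num_gpu)]
total_cpus = sum(b["CPU"] for b in bundles)
total_gpus = sum(b["GPU"] for b in bundles)
print(f"requested: {total_cpus} CPUs, {total_gpus} GPUs")  # 16 CPUs, 2.0 GPUs
```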
tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(
            Three_chan_baseline.mainTrain,
            df=df,
            experiment_name=experiment_name,
            dummyDict=dummyDict,
            num_gpu=num_gpu,
            cpu_num=cpu_num,
            default_root_dir=default_root_dir,
            checkpoint_dir=checkpoint_dir,
            options=options,
            num_cpus_per_worker=num_cpus_per_worker,
        ),
        resources=tune.PlacementGroupFactory(
            [{"CPU": num_cpus_per_worker, "GPU": 1.0}] * num_gpu
        ),
    ),
    tune_config=tune.TuneConfig(
        scheduler=pb2_scheduler,
    ),
    run_config=air.RunConfig(
        name=experiment_name,
    ),
    param_space=config,
)
results = tuner.fit()
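To narrow down whether the GPUs are even visible inside the trial, here is a minimal stdlib check that can be dropped into the trainable (Ray sets CUDA_VISIBLE_DEVICES for each worker based on its assigned GPUs; the helper name below is my own):

```python
import os

def count_visible_gpus() -> int:
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES (empty/unset -> 0)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([g for g in visible.split(",") if g.strip()])

# Called inside the trainable: with the two-bundle placement group above,
# each worker would be expected to see at least one GPU id.
print("visible GPUs:", count_visible_gpus())
```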
The PyTorch Lightning trainer is defined as:
checkPointCallback = TuneReportCheckpointCallback(
    metrics={
        "loss": "avg_val_loss",
        "mean_accuracy": "avg_val_acc",
    },
    filename="checkpoint",
    on="validation_end",
)
strategy = RayShardedStrategy(
    num_workers=num_gpu,
    num_cpus_per_worker=num_cpus_per_worker,
    use_gpu=True,
)
callbacks = [checkPointCallback]
kwargs = {
    "max_epochs": max_epochs,
    "callbacks": callbacks,
    "logger": comet_logger,
    "default_root_dir": default_root_dir,
    "auto_lr_find": False,
    "check_val_every_n_epoch": 10,
    "accumulate_grad_batches": accumulate_grad_batches,
    "gradient_clip_val": gradient_clip_val,
    "log_every_n_steps": 2,
    "strategy": strategy,
}
if checkpoint_dir:
    kwargs["resume_from_checkpoint"] = os.path.join(checkpoint_dir, "checkpoint")
trainer = pl.Trainer(**kwargs)
However, no GPU is used; below is the nvidia-smi result:
![image](upload://9UsIVDXhZnCrCHis1ijl37flQ3M.png)
The full Docker container definition is available at GitHub - jakubMitura14/forPicaiDocker, and the full code (quite long) at GitHub - jakubMitura14/piCaiCode.