How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello,
I am using Ray 2.2.0 on a Slurm cluster but am running into several problems with correct resource allocation. In this example, I reserve 4 nodes, each with 10 CPUs and 2 GPUs:
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --gpus-per-task=2
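For context, the four nodes are joined into a single Ray cluster before the tuning script runs (head and worker processes are started in the batch script with the usual ray start pattern), and the driver simply attaches to it. A minimal sketch of that connection step, with the address handling reduced to a placeholder:

import ray

# Attach the Tune driver to the already-running multi-node Ray cluster.
# "auto" is a placeholder; the real head-node address is resolved in the
# batch script.
ray.init(address="auto")

# Sanity check: the whole allocation should show up as one cluster,
# i.e. roughly 40 CPUs and 8 GPUs (matching the status report below).
print(ray.cluster_resources())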
I want to use the 4 nodes in parallel to tune my model, and each training run should use 8 CPUs and 2 GPUs with DistributedDataParallel. Therefore, the main function contains:
# Relevant imports (Ray 2.2) for the snippets in this post:
import ray
from ray import tune
from ray.air import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(
        # one Train worker per trial, with 8 CPUs and 2 GPUs
        resources_per_worker={"CPU": 8, "GPU": 2},
        use_gpu=True,
        num_workers=1,
        _max_cpu_fraction_per_node=0.8,
    ),
)
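My understanding (please correct me if it is wrong) is that this requests one Ray Train worker per trial with 8 CPUs and 2 GPUs, plus 1 CPU for the trainer actor itself, which would explain the 9 CPUs per trial in the status report below. If the two GPUs are instead supposed to be split across two DDP workers, I would have expected something more like the following sketch (not what I ran, just a guess):

# Hypothetical alternative, not what I ran: two DDP workers per trial,
# each with 1 GPU, so that world_size would become 2.
alt_scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"CPU": 4, "GPU": 1},
    _max_cpu_fraction_per_node=0.8,
)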
tuner = tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        scheduler=scheduler,        # AsyncHyperBand scheduler, defined earlier
        num_samples=num_samples,
    ),
    param_space=config,             # hyperparameter search space, defined earlier
    run_config=RunConfig(
        local_dir="./results",
        name="test_cifar",
        sync_config=ray.tune.SyncConfig(sync_on_checkpoint=False),
    ),
)
results = tuner.fit()
and the training function is:
from ray import train
from ray.air import session
import ray.train.torch  # provides train.torch.prepare_model / prepare_data_loader

def train_loop_per_worker(config):
    print("Local rank", session.get_local_rank())
    print("World size: ", session.get_world_size())
    # net, trainloader, and valloader are created in the elided parts of the function
    net = train.torch.prepare_model(net)
    ...
    trainloader = train.torch.prepare_data_loader(trainloader)
    valloader = train.torch.prepare_data_loader(valloader)
    ...
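The elided parts essentially follow Ray's CIFAR-10 tuning example: the network and the data loaders are built from config, and the loss/accuracy values shown in the table below are reported back to Tune at the end of each epoch. A heavily simplified sketch of that reporting step, where val_loss and val_accuracy are placeholder names for values computed in the elided code:

# Called at the end of each epoch inside train_loop_per_worker (simplified).
# This is what produces the "loss" and "accuracy" columns in the status table.
session.report({"loss": val_loss, "accuracy": val_accuracy})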
As you can see in the status report below, only two trials run in parallel, and each of them apparently uses 9 CPUs and 2 GPUs. However, it seems that each model is loaded and trained on a single GPU only, since session.get_world_size() is always 1 (see the log below).
== Status ==
Current time: 2022-12-16 16:13:51 (running for 00:03:41.15)
Memory usage on this node: 28.4/187.2 GiB
Using AsyncHyperBand: num_stopped=5
Bracket: Iter 8.000: -1.2921145160436631 | Iter 4.000: -1.3493594537615776 | Iter 2.000: -1.4290564769506455 | Iter 1.000: -1.5200191211223602
Resources requested: 18.0/40 CPUs, 4.0/8 GPUs, 0.0/433.98 GiB heap, 0.0/189.98 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: bb708_00001 with loss=1.2228425681114197 and parameters={'train_loop_config': {'l1': 32, 'l2': 8, 'lr': 0.0028581659786108215, 'batch_size': 8}}
Result logdir: /gpfsdswork/projects/rech/mwh/ujn44cd/rayTune/results/test_cifar
Number of trials: 10/10 (3 PENDING, 2 RUNNING, 5 TERMINATED)
+--------------------------+------------+---------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+---------$
| Trial name | status | loc | train_loop_config/ba | train_loop_config/l1 | train_loop_config/l2 | train_loop_config/lr | iter | total time (s) | loss | accura$
| | | | tch_size | | | | | | | $
|--------------------------+------------+---------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+---------$
| TorchTrainer_bb708_00005 | RUNNING | 10.148.7.112:132639 | 4 | 128 | 16 | 0.0180833 | | | | $
| TorchTrainer_bb708_00006 | RUNNING | 10.148.8.98:4021279 | 2 | 16 | 256 | 0.088919 | | | | $
| TorchTrainer_bb708_00007 | PENDING | | 8 | 8 | 32 | 0.000215026 | | | | $
| TorchTrainer_bb708_00008 | PENDING | | 4 | 4 | 128 | 0.0144106 | | | | $
| TorchTrainer_bb708_00009 | PENDING | | 4 | 4 | 128 | 0.00260305 | | | | $
| TorchTrainer_bb708_00000 | TERMINATED | 10.148.8.97:177132 | 4 | 4 | 64 | 0.0918294 | 1 | 42.4549 | 2.33702 | 0.09$
| TorchTrainer_bb708_00001 | TERMINATED | 10.148.8.99:605598 | 8 | 32 | 8 | 0.00285817 | 10 | 181.016 | 1.22284 | 0.58$
| TorchTrainer_bb708_00002 | TERMINATED | 10.148.7.112:130711 | 8 | 32 | 256 | 0.00518242 | 2 | 43.8126 | 1.42906 | 0.50$
| TorchTrainer_bb708_00003 | TERMINATED | 10.148.8.97:178801 | 8 | 32 | 8 | 0.00472392 | 2 | 43.7392 | 1.46415 | 0.45$
| TorchTrainer_bb708_00004 | TERMINATED | 10.148.7.112:132584 | 16 | 8 | 32 | 0.0032123 | 1 | 18.5921 | 1.65015 | 0.38$
+--------------------------+------------+---------------------+------------------------+------------------------+------------------------+------------------------+--------+------------------+---------+---------$
Part of the log is:
(RayTrainWorker pid=129533, ip=10.159.48.31) 2022-12-16 16:10:20,853 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=129533, ip=10.159.48.31) 2022-12-16 16:10:21,419 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=4014276, ip=10.159.48.65) 2022-12-16 16:10:26,291 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=4014276, ip=10.159.48.65) 2022-12-16 16:10:26,854 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=177233, ip=10.159.48.64) 2022-12-16 16:11:13,360 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=177233, ip=10.159.48.64) 2022-12-16 16:11:13,959 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=130787, ip=10.159.48.31) 2022-12-16 16:12:03,403 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=130787, ip=10.159.48.31) 2022-12-16 16:12:03,947 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=178936, ip=10.159.48.64) 2022-12-16 16:12:55,372 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=178936, ip=10.159.48.64) 2022-12-16 16:12:55,889 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=179804, ip=10.159.48.64) 2022-12-16 16:13:20,351 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=179804, ip=10.159.48.64) 2022-12-16 16:13:20,878 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=605738, ip=10.159.48.66) 2022-12-16 16:13:35,419 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=605738, ip=10.159.48.66) 2022-12-16 16:13:35,977 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
(RayTrainWorker pid=180681, ip=10.159.48.64) 2022-12-16 16:14:08,362 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=180681, ip=10.159.48.64) 2022-12-16 16:14:08,872 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=132822, ip=10.159.48.31) 2022-12-16 16:14:40,552 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=132822, ip=10.159.48.31) 2022-12-16 16:14:41,096 INFO train_loop_utils.py:270 -- Moving model to device: cuda:1
(RayTrainWorker pid=4021417, ip=10.159.48.65) 2022-12-16 16:14:58,410 INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=4021417, ip=10.159.48.65) 2022-12-16 16:14:58,995 INFO train_loop_utils.py:270 -- Moving model to device: cuda:0
What is wrong in my code? Am I assigning the resources to the TorchTrainer incorrectly? Thank you for your help!