I am trying to run my experiments on a private cluster. I am starting Ray on the cluster following the documentation: first on the head node, with
pyenv exec ray start --head --port=$head_port --num-cpus=$HEAD_NUM_CPUS --num-gpus=$HEAD_NUM_GPUS
and then on each of the other nodes I run the corresponding command:
pyenv exec ray start --address=$head_full_address --num-cpus=$WORKER_NUM_CPUS --num-gpus=$WORKER_NUM_GPUS
The variables HEAD_NUM_CPUS and HEAD_NUM_GPUS are set to the correct numbers of CPUs and GPUs available on the head node, and the same goes for WORKER_NUM_CPUS and WORKER_NUM_GPUS on the 'worker' nodes.
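For completeness, one way to double-check that every node registered the expected resources is to attach to the running cluster and print them (a minimal sketch; the numbers in the comment are just examples):

import ray

# Attach to the already-running cluster from any of its nodes.
ray.init(address="auto")

# Totals across all nodes; these should match the sums of the
# --num-cpus/--num-gpus values passed to `ray start` above.
print(ray.cluster_resources())    # e.g. {'CPU': 48.0, 'GPU': 6.0, ...}
print(ray.available_resources())  # whatever is not currently claimed by trials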
Looking at the trials, they are correctly set to RUNNING status, and the logs state that the GPU is available and used.
The problem is the following (I've masked the IP):
2023-03-15 15:38:32,710 ERROR trial_runner.py:1062 -- Trial train_with_parameters_f12f6_00003: Error processing event.
ray.exceptions.RayTaskError(AssertionError): ray::ImplicitFunc.train() (pid=12595, ip=000.000.000.000, repr=train_with_parameters)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 406, in _inner
    return inner(config, checkpoint_dir=None)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 398, in inner
    return trainable(config, **fn_kwargs)
  File "/work/user/hpc_training/ancient_docs_context_awareness/base_experiment.py", line 74, in train_with_parameters
    trainer.fit(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1048, in _run
    self.strategy.setup_environment()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 131, in setup_environment
    self.accelerator.setup_device(self.root_device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 43, in setup_device
    _check_cuda_matmul_precision(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/lightning_fabric/accelerators/cuda.py", line 346, in _check_cuda_matmul_precision
    major, _ = torch.cuda.get_device_capability(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
It seems that the device id is not valid, but in the PL Trainer I have set both accelerator="auto" and devices="auto".
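For reference, the Trainer is constructed roughly like this (simplified; max_epochs stands in for my real arguments, which are omitted):

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="auto",  # let PL detect CUDA when a GPU is visible
    devices="auto",      # let PL choose the visible device(s)
    max_epochs=10,       # placeholder for my other arguments
)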
Thus, in theory, PL should figure out by itself which devices to use. In my trials I request 1 GPU per trial.
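The per-trial GPU request follows the usual Ray 2.x pattern, something like this (simplified sketch; train_with_parameters is the trainable from the traceback above):

from ray import tune

# Reserve 1 CPU and 1 GPU for each trial; Ray then exports
# CUDA_VISIBLE_DEVICES for the trial process accordingly.
trainable = tune.with_resources(
    train_with_parameters,
    resources={"cpu": 1, "gpu": 1},
)

With that reservation, torch inside each trial should see exactly one device (cuda:0), so I don't understand where the invalid device id comes from.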
Thanks for any help,
S