Ray on a Private Cluster and PyTorch Lightning

I am trying to run my experiments on a private cluster. I start Ray on the cluster following the documentation: first on the head node with
pyenv exec ray start --head --port=$head_port --num-cpus=$HEAD_NUM_CPUS --num-gpus=$HEAD_NUM_GPUS
and then on every other node with the corresponding command:
pyenv exec ray start --address=$head_full_address --num-cpus=$WORKER_NUM_CPUS --num-gpus=$WORKER_NUM_GPUS
The variables HEAD_NUM_CPUS and HEAD_NUM_GPUS are set to the correct number of CPUs and GPUs available on the head node, and WORKER_NUM_CPUS and WORKER_NUM_GPUS to those available on the ‘worker’ nodes.
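
To double-check that the resources are registered as expected, a minimal sketch like the following can be run from any node once the cluster is up (assuming the head address is reachable from that node):

import ray

# connect to the already-running cluster started with `ray start`
ray.init(address="auto")

# aggregated resources across the head and worker nodes, e.g. {'CPU': 64.0, 'GPU': 4.0, ...}
print(ray.cluster_resources())

# per-node view, useful to verify that every node registered its CPUs/GPUs
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])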

Looking at the trials, they are correctly set to the RUNNING status, and the logs state that the GPU is available and used.

The problem is the following (I’ve masked the IP):
2023-03-15 15:38:32,710 ERROR trial_runner.py:1062 -- Trial train_with_parameters_f12f6_00003: Error processing event.
ray.exceptions.RayTaskError(AssertionError): ray::ImplicitFunc.train() (pid=12595, ip=000.000.000.000, repr=train_with_parameters)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 406, in _inner
    return inner(config, checkpoint_dir=None)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/tune/trainable/util.py", line 398, in inner
    return trainable(config, **fn_kwargs)
  File "/work/user/hpc_training/ancient_docs_context_awareness/base_experiment.py", line 74, in train_with_parameters
    trainer.fit(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1048, in _run
    self.strategy.setup_environment()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 131, in setup_environment
    self.accelerator.setup_device(self.root_device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/pytorch_lightning/accelerators/cuda.py", line 43, in setup_device
    _check_cuda_matmul_precision(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/lightning_fabric/accelerators/cuda.py", line 346, in _check_cuda_matmul_precision
    major, _ = torch.cuda.get_device_capability(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

It seems that the device id is not valid, but in the PL Trainer I’ve set both:
accelerator="auto"
devices="auto".

Thus, in theory, PL should figure out by itself which devices to use. In my trials I request 1 GPU per trial.
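
For context, a minimal sketch of the kind of check that could go at the top of the training function to see what each trial actually receives (hypothetical debugging code, not part of my script):

import os
import torch

# what Ray exposed to this trial
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# what PL's accelerator="auto" / devices="auto" will rely on
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # this is the call that raises "Invalid device id" in the traceback above
    print(i, torch.cuda.get_device_capability(i))
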
Thanks for any help,
S

@SFrr Can you include your training script (script.py)? It might also be easier to launch the cluster with the cluster launcher.

When you say you run experiments on a private cluster, is that on-premise or specific vendor cloud?

cc: @Yard1 @matthewtang any ideas about PL

Yes, I mean an on-premise cluster.
Here I attach script.py:

import argparse

import torch.nn as nn
from torch.utils.data import DataLoader

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint, DeviceStatsMonitor
from ray.tune.integration.pytorch_lightning import TuneReportCallback

from lightning.module import Recognizer

def train_with_parameters(
    config,
    args: argparse.Namespace,
    trload: DataLoader,
    vlload: DataLoader
):
    '''
    Args
    ----
    :param dict config: the search space for the parameters we want to tune
    :param argparse.Namespace args: all the arguments which are necessary
    for the Trainer
    :param DataLoader trload: the DataLoader for the training set
    :param DataLoader vlload: the DataLoader for the validation set
    '''
    args.accelerator = 'auto'
    args.devices = 'auto'
    args.precision = 16

    # we overwrite the parameters which we use for the HPO
    args.lr = config['lr']
    args.weight = config['weight']

    # define the callbacks
    args.callbacks = [
        TuneReportCallback(
            {
                "val_loss": "val_loss",
                "val_acc": "val_acc"
            }),
        ModelCheckpoint(
            save_top_k=1,
            monitor="val_acc",
            filename='{epoch}-{val_acc:.2f}-{val_loss:.4f}'
        ),
        DeviceStatsMonitor(),
    ]

    criterion = nn.CrossEntropyLoss

    model = Recognizer(
        criterion=criterion,
        **vars(args)
    )

    # NOTE: the Trainer construction was omitted in the snippet above; it is assumed
    # to be built from the argparse Namespace, along the lines of:
    trainer = Trainer.from_argparse_args(args)

    trainer.fit(
        model,
        trload,
        vlload
    )

This is called by:

    # tune.with_parameters() wraps the training function train_with_parameters so that
    # the constants (args and the DataLoaders) are passed to it directly,
    # rather than through the search configuration
    train_fn_with_parameters = tune.with_parameters(
        train_with_parameters,
        args=args,
        trload=trload,
        vlload=vlload
    )

which is used in:

    tuner = tune.Tuner(
        tune.with_resources(
            train_fn_with_parameters,
            resources={"cpu": CPUS_PER_TRIAL, "gpu": GPUS_PER_TRIAL},
        ),
        # search configuration: metric, scheduler, search algorithm and budget
        tune_config=tune.TuneConfig(
            metric="val_loss",
            mode="min",
            scheduler=scheduler,
            search_alg=search_algorithm,
            num_samples=8,
            time_budget_s=TIME_BUDGET_S_TOTAL,
            reuse_actors=False,
        ),
        run_config=air.RunConfig(
            local_dir=args.default_root_dir,
            name=f"{short_name_dset}_{args.backbone}_hpo",
            log_to_file=True,
            progress_reporter=CLIReporter(
                parameter_columns=[
                    'lr',
                    'weight'
                ],
                metric_columns=[
                    "val_loss",
                    "val_acc",
                    "training_iteration"
                ]
            ),
            sync_config=tune.SyncConfig(syncer=None),
        ),
        param_space=params_space,
    )
    results = tuner.fit()
    print(f"Best result config: {results.get_best_result().config}")
    print(f"Best result metrics: {results.get_best_result().metrics}")

One more thing:
I tried to overwrite “CUDA_VISIBLE_DEVICES” just before calling trainer.fit(), as follows

import os

# current value set for this trial
s = os.environ["CUDA_VISIBLE_DEVICES"]
print(s)

n_devices = len(s.split(','))
print('\n')
print(f'N. of detected CUDA devices {n_devices}')

# rebuild the variable as a plain comma-separated list of indices, e.g. "0, 1"
s_out = ''
for d in range(n_devices):
    if d != n_devices - 1:
        s_out += str(d) + ', '
    else:
        s_out += str(d)
os.environ["CUDA_VISIBLE_DEVICES"] = s_out

This makes the nodes with a single GPU work correctly, but for the nodes that definitely have more than one free GPU I get the error below. So it seems that CUDA_VISIBLE_DEVICES is not set properly. In my experience CUDA_VISIBLE_DEVICES should contain a string with the indices of the visible devices, such as “0, 1”; instead I get something like [‘GPU-2b3a5la9-f1fe-eb5e-214c-51dd4ea20d5c’] (a sketch of how such UUIDs could be mapped back to indices follows the traceback below).

I determine the available resources from PBS_NODELIST, which contains the names of the nodes allocated to my job.

Traceback (most recent call last):
  File "/opt/pyenv/versions/3.9.13/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/scripts/scripts.py", line 2422, in main
    return cli()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
    return f(*args, **kwargs)
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/scripts/scripts.py", line 868, in start
    node = ray._private.node.Node(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/_private/node.py", line 290, in __init__
    self.start_ray_processes()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/_private/node.py", line 1173, in start_ray_processes
    resource_spec = self.get_resource_spec()
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/_private/node.py", line 459, in get_resource_spec
    self._resource_spec = ResourceSpec(
  File "/opt/pyenv/versions/3.9.13/lib/python3.9/site-packages/ray/_private/resource_spec.py", line 169, in resolve
    raise ValueError(
ValueError: Attempting to start raylet with 2 GPUs, but CUDA_VISIBLE_DEVICES contains ['GPU-2b3a5la9-f1fe-eb5e-214c-51dd4ea20d5c'].
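
For reference, a minimal sketch of how the UUID-style entries in CUDA_VISIBLE_DEVICES could be translated back to numeric indices before running ray start (this assumes nvidia-smi is available on the node and is not part of the original script):

import os
import subprocess

# query the index/UUID pairs of all GPUs on this node
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
    text=True,
)
uuid_to_index = {}
for line in out.strip().splitlines():
    index, uuid = [field.strip() for field in line.split(',')]
    uuid_to_index[uuid] = index

# translate UUID entries (e.g. "GPU-2b3a...") into indices (e.g. "0,1");
# entries that are already indices are kept as they are
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
entries = [e.strip() for e in visible.split(',') if e.strip()]
indices = [uuid_to_index.get(e, e) for e in entries]
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(indices)
print(os.environ["CUDA_VISIBLE_DEVICES"])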