Multi-GPU Ray Tune for hparams not parallelizing and only using the first GPU

I’m running multiple trials of hparam optimization on an LSTM model built on Lightning (from u8darts). This is on a multi-GPU node with a ScalingConfig, but I’m having trouble getting it to parallelize properly. I’m also having a hard time finding a clear comparison between ScalingConfig and tune.with_resources - I went with the ScalingConfig setup for readability and modularity.
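
For reference, the tune.with_resources route I was comparing against would, as I understand it, look roughly like this (just a sketch - train_model would have to act as a plain Tune trainable here, and the resource numbers simply mirror my per-worker settings):

from ray import tune

# Rough sketch of the alternative: attach per-trial resources directly to the
# trainable instead of going through a ScalingConfig.
trainable = tune.with_resources(train_model, {"cpu": 2, "gpu": 1})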

The pl.Trainer in my train function has devices='auto', which should resolve to 1 since I pass 1 GPU per worker in the config. The node I’m running on has 8 GPUs. All trials train on the same data, only the hyperparameters change each run - I read a pkl file in main, store the datasets as train_ref, val_ref = ray.put(train), ray.put(val), and pass those references in through the parameter config (not ideal, but I couldn’t figure out a better way).
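
Roughly, the object-store plumbing looks like this (a simplified sketch - apart from train_ref/val_ref, the names and the hparam entry are placeholders, not my real search space):

import ray
from ray import tune

# In main: the datasets come out of the pkl read, then go into the object store once.
train_ref, val_ref = ray.put(train), ray.put(val)

# The references ride along with the hyperparameters in the parameter config.
param_space = {
    "train_loop_config": {
        "train_ref": train_ref,
        "val_ref": val_ref,
        "hidden_dim": tune.choice([32, 64, 128]),  # placeholder hparam
    }
}

# Inside the training loop, each worker pulls the datasets back out.
def train_model(config):
    train = ray.get(config["train_ref"])
    val = ray.get(config["val_ref"])
    ...  # build the darts/Lightning model from config and fit it

The trainer and scaling setup itself is: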

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig

# One Ray Train worker per visible GPU (fall back to 1 on a CPU-only machine).
num_devices = torch.cuda.device_count()
num_devices = 1 if num_devices == 0 else num_devices
print("num_devices: " + str(num_devices))

# Each worker gets 2 CPUs and 1 A100.
scaling_config = ScalingConfig(
    num_workers=num_devices,
    use_gpu=True,
    accelerator_type="A100",
    resources_per_worker={"CPU": 2, "GPU": 1},
)

ray_trainer = TorchTrainer(
    train_loop_per_worker=train_model,
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=scaling_config,
    run_config=run_config,
)

Skipping some things, I have a tune.Tuner that takes in ray_trainer, the parameter space, and a TuneConfig. In main, I run tuner.fit() with num_samples=16.
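
Concretely, the Tuner bit is essentially this (sketch - metric/mode and the scheduler are left out):

from ray import tune

tuner = tune.Tuner(
    ray_trainer,
    param_space=param_space,  # the dict with "train_loop_config" from above
    tune_config=tune.TuneConfig(num_samples=16),
)
results = tuner.fit()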

My expectation is that 8 trials should be running at any given time, yet I only see one trial running while all the others sit pending. While that single trial runs, Ray reports

Logical resource usage: 17.0/64 CPUs, 8.0/8 GPUs (0.008/1.0 accelerator_type:A100)

so the one trial appears to claim all 8 GPUs, and its output looks like this:

(RayTrainWorker pid=737690) Setting up process group for: env:// [rank=0, world_size=8]
(TorchTrainer pid=737575) Started distributed worker processes:
(TorchTrainer pid=737575) - (ip=x, pid=737690) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737691) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737692) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737693) world_rank=3, local_rank=3, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737694) world_rank=4, local_rank=4, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737696) world_rank=5, local_rank=5, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737697) world_rank=6, local_rank=6, node_rank=0
(TorchTrainer pid=737575) - (ip=x, pid=737701) world_rank=7, local_rank=7, node_rank=0
...
(RayTrainWorker pid=737701) LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7] [repeated 7x across cluster]
Finding best initial lr:   1%|          | 1/100 [00:03<05:37,  3.41s/it]
Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s] [repeated 7x across cluster]
Finding best initial lr:   4%|▍         | 4/100 [00:09<03:26,  2.15s/it] [repeated 24x across cluster]

I don’t understand why the same work is being repeated across the cluster instead of multiple trials running, since the workers clearly get initialized. How can I configure this so the workload is properly distributed across trials? I tried a GPU fraction of 0.5, which does get two trials running at once, but that’s definitely not what I’m looking for.
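
In case it clarifies what I’m asking: my guess is that each trial should be a single-worker, single-GPU run, something like the config below, but I haven’t been able to confirm that this is the right way to combine TorchTrainer with the Tuner (hence the question).

from ray.train import ScalingConfig

# My guess at a per-trial config: 1 Train worker with 1 GPU, so that 8 trials
# could fit on the node at once. Not verified.
scaling_config = ScalingConfig(
    num_workers=1,
    use_gpu=True,
    accelerator_type="A100",
    resources_per_worker={"CPU": 2, "GPU": 1},
)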

Weirdly enough, this also wouldn’t run with an nccl backend - I kept getting No backend type associated with device type cpu, although my money is on something not getting moved around properly in Lightning.

Pastebin of my somewhat cleaned-up code. It needs some odd edits to the imports because of pytorch_lightning vs lightning.pytorch conflicts.