How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello! I am trying to deploy my Tune application on Slurm following this tutorial. I have 4 nodes, each with 4 GPUs and 12 CPUs, and my batch script is the following:
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --partition=gpu_p2
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:4
#SBATCH --time=20:00:00
#SBATCH --hint=nomultithread
#SBATCH --account=mwh@v100
set -x
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
export head_node_ip
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --include-dashboard=True --head --node-ip-address="$head_node_ip" --port=$port \
--num-cpus 24 --num-gpus 8 --block &
sleep 10
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --num-cpus 24 --num-gpus 8 --block &
    sleep 5
done
python -u test.py
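For what it's worth, a quick check along these lines (just a sketch with standard Ray calls, not part of my batch script) should confirm that all four nodes have joined the cluster before the Tune script starts:

import os

import ray

# Connect to the cluster started by the batch script above.
ray.init(address=os.environ["ip_head"])

# Expect the aggregated CPUs/GPUs of all 4 nodes and 4 alive nodes.
print(ray.cluster_resources())
print(len([n for n in ray.nodes() if n["Alive"]]))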
My test script is based on the tutorial for using PyTorch Lightning with Tune, and the model used is exactly the same LightningMNISTClassifier.
import os

import pytorch_lightning as pl
import ray
from ray import air, tune
from ray.tune import CLIReporter
from ray.tune.integration.pytorch_lightning import TuneReportCallback
from ray.tune.schedulers import ASHAScheduler

# LightningMNISTClassifier is copied verbatim from the Tune + PyTorch Lightning tutorial.


def train_mnist_tune(
        config,
        num_epochs=10,
        devices=1,
        accelerator="gpu",
        data_dir="~/"):
    data_dir = os.path.expanduser(data_dir)
    model = LightningMNISTClassifier(config, data_dir)
    tune_callback = TuneReportCallback(
        {
            "loss": "ptl/val_loss",
            "mean_accuracy": "ptl/val_accuracy"
        },
        on="validation_end")
    trainer = pl.Trainer(
        max_epochs=num_epochs,
        devices=devices,
        accelerator=accelerator,
        enable_progress_bar=False,
        callbacks=[tune_callback],
        fast_dev_run=True,
    )
    trainer.fit(model)
if __name__ == "__main__":
ray.init(address=os.environ["ip_head"], _node_ip_address=os.environ["head_node_ip"])
config = {
"layer_1_size": tune.choice([32, 64, 128]),
"layer_2_size": tune.choice([64, 128, 256]),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([32, 64, 128]),
}
num_epochs = 10
num_samples = 10
# Selecting the scheduler
scheduler = ASHAScheduler(
max_t=num_epochs,
grace_period=1,
reduction_factor=2
)
reporter = CLIReporter(
parameter_columns=["layer_1_size", "layer_2_size", "lr", "batch_size"],
metric_columns=["loss", "mean_accuracy", "training_iteration"]
)
accelerator = "gpu"
gpus_per_trial = 2
cpus_per_trial = 6
data_dir = '~/'
train_fn_with_parameters = tune.with_parameters(train_mnist_tune,
num_epochs=num_epochs,
accelerator=accelerator,
devices=gpus_per_trial,
data_dir=data_dir)
tuner = tune.Tuner(
tune.with_resources(
train_fn_with_parameters,
resources={
"CPU": cpus_per_trial,
"GPU": gpus_per_trial}
),
tune_config=tune.TuneConfig(
metric="loss",
mode="min",
scheduler=scheduler,
num_samples=num_samples,
),
run_config=air.RunConfig(
name="test",
progress_reporter=reporter,
),
param_space=config,
)
results = tuner.fit()
print("Best hyperparameters found were: ", results.get_best_result().config)
The cluster is set up correctly and the first trials are assigned to the 4 nodes, but only the head node actually runs the training; the other 3 nodes seem blocked somewhere without giving any error. As you can see in the following log, the 3 worker nodes hang right after "Initializing distributed":
(train_mnist_tune pid=3679869, ip=10.148.8.9) GPU available: True (cuda), used: True
(train_mnist_tune pid=3679869, ip=10.148.8.9) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=3679869, ip=10.148.8.9) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=3679869, ip=10.148.8.9) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=3679869, ip=10.148.8.9) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=3181124, ip=10.148.8.11) GPU available: True (cuda), used: True
(train_mnist_tune pid=3181124, ip=10.148.8.11) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=3181124, ip=10.148.8.11) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=3181124, ip=10.148.8.11) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=3181124, ip=10.148.8.11) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=1745151, ip=10.148.8.10) GPU available: True (cuda), used: True
(train_mnist_tune pid=1745151, ip=10.148.8.10) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=1745151, ip=10.148.8.10) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=1745151, ip=10.148.8.10) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=1745151, ip=10.148.8.10) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=3005886) GPU available: True (cuda), used: True
(train_mnist_tune pid=3005886) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=3005886) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=3005886) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=3005886) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=3005886) ----------------------------------------------------------------------------------------------------
(train_mnist_tune pid=3005886) distributed_backend=nccl
(train_mnist_tune pid=3005886) All distributed processes registered. Starting with 1 processes
(train_mnist_tune pid=3005886) ----------------------------------------------------------------------------------------------------
(train_mnist_tune pid=3005886)
(train_mnist_tune pid=3005886) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
...
In the Tune report, all the TERMINATED trials ran on the head node, while the others stay RUNNING forever:
== Status ==
Current time: 2023-01-08 18:25:37 (running for 00:06:46.95)
Memory usage on this node: 28.9/754.2 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -0.11974430084228516 | Iter 4.000: -0.18529099225997925 | Iter 2.000: -0.21176335960626602 | Iter 1.000: -0.2983490526676178
Resources requested: 36.0/48 CPUs, 12.0/16 GPUs, 0.0/2213.81 GiB heap, 0.0/745.06 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 84964_00006 with loss=0.09475510567426682 and parameters={'layer_1_size': 32, 'layer_2_size': 256, 'lr': 0.002361851721705818, 'batch_size': 128}
Result logdir: /linkhome/rech/gendzm01/ujn44cd/ray_results/test_tb
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+------------------------------+------------+---------------------+----------------+----------------+-------------+--------------+-----------+-----------------+----------------------+
| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |
|------------------------------+------------+---------------------+----------------+----------------+-------------+--------------+-----------+-----------------+----------------------|
| train_mnist_tune_84964_00000 | RUNNING | 10.148.8.9:3679869 | 64 | 128 | 0.0454478 | 32 | | | |
| train_mnist_tune_84964_00001 | RUNNING | 10.148.8.11:3181124 | 32 | 128 | 0.00019643 | 64 | | | |
| train_mnist_tune_84964_00002 | RUNNING | 10.148.8.10:1745151 | 128 | 128 | 0.0033367 | 64 | | | |
| train_mnist_tune_84964_00003 | TERMINATED | 10.148.8.8:3005886 | 32 | 64 | 0.00155658 | 64 | 0.130796 | 0.960248 | 10 |
| train_mnist_tune_84964_00004 | TERMINATED | 10.148.8.8:3005886 | 64 | 64 | 0.000845919 | 32 | 0.30242 | 0.910937 | 1 |
| train_mnist_tune_84964_00005 | TERMINATED | 10.148.8.8:3005886 | 64 | 64 | 0.0172056 | 64 | 0.432142 | 0.871369 | 1 |
| train_mnist_tune_84964_00006 | TERMINATED | 10.148.8.8:3005886 | 32 | 256 | 0.00236185 | 128 | 0.0947551 | 0.971221 | 10 |
| train_mnist_tune_84964_00007 | TERMINATED | 10.148.8.8:3005886 | 128 | 128 | 0.00231834 | 32 | 0.130806 | 0.967969 | 10 |
| train_mnist_tune_84964_00008 | TERMINATED | 10.148.8.8:3005886 | 128 | 128 | 0.000190201 | 128 | 0.564611 | 0.861432 | 1 |
| train_mnist_tune_84964_00009 | TERMINATED | 10.148.8.8:3005886 | 128 | 256 | 0.00970075 | 128 | 0.219292 | 0.928922 | 2 |
+------------------------------+------------+---------------------+----------------+----------------+-------------+--------------+-----------+-----------------+----------------------+
I guess I am assigning the resources to the PyTorch Lightning trainers incorrectly in tune.with_resources(). Should I use PlacementGroupFactory(), ScalingConfig(), or something else?
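For example, this is the PlacementGroupFactory variant I was considering (just a sketch reusing the names from my script above; I am not sure it is the right way to reserve the GPUs for each trial):

from ray import tune

# Sketch only: reserve a single bundle of cpus_per_trial CPUs and
# gpus_per_trial GPUs per trial via a placement group.
trainable_with_pg = tune.with_resources(
    train_fn_with_parameters,
    resources=tune.PlacementGroupFactory(
        [{"CPU": cpus_per_trial, "GPU": gpus_per_trial}]
    ),
)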
Moreover, I also found the Ray Lightning library, which provides a RayStrategy for the Lightning Trainer, but unfortunately it is not compatible with the latest versions of PyTorch Lightning.
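For reference, this is roughly how I understand that strategy would be plugged in (a sketch only, assuming the ray_lightning package and the num_epochs / gpus_per_trial values from my script; it fails with recent PyTorch Lightning releases on my setup):

import pytorch_lightning as pl
from ray_lightning import RayStrategy  # extra dependency, not used in my current scripts

# Sketch: let Ray distribute the Lightning workers instead of requesting
# several GPUs per Tune trial directly.
trainer = pl.Trainer(
    max_epochs=num_epochs,
    strategy=RayStrategy(num_workers=gpus_per_trial, use_gpu=True),
)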
Can someone suggest the correct way to assign resources when using Tune + PyTorch Lightning on Slurm? Thank you!