How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hello! I am trying to deploy my Tune application on Slurm following this tutorial. I have 4 nodes, each with 4 GPUs and 12 CPUs, and my batch script is the following:
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --partition=gpu_p2
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:4
#SBATCH --time=20:00:00
#SBATCH --hint=nomultithread
#SBATCH --account=mwh@v100
set -x
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
export head_node_ip
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --include-dashboard=True --head --node-ip-address="$head_node_ip" --port=$port \
--num-cpus 24 --num-gpus 8 --block &
sleep 10
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --num-cpus 24 --num-gpus 8 --block &
    sleep 5
done
python -u test.py
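For what it's worth, a quick check along these lines (just a sketch with standard Ray calls, not part of my batch script) should confirm that all four nodes have joined the cluster before the Tune script starts:

import os

import ray

# Connect to the cluster started by the batch script above.
ray.init(address=os.environ["ip_head"])

# Expect the aggregated CPUs/GPUs of all 4 nodes and 4 alive nodes.
print(ray.cluster_resources())
print(len([n for n in ray.nodes() if n["Alive"]]))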
My test script is based on the tutorial for using PyTorch Lightning with Tune, and the model used is exactly the same LightningMNISTClassifier.
import os

import pytorch_lightning as pl
import ray
from ray import air, tune
from ray.tune import CLIReporter
from ray.tune.integration.pytorch_lightning import TuneReportCallback
from ray.tune.schedulers import ASHAScheduler

# LightningMNISTClassifier is copied verbatim from the Tune + PyTorch Lightning tutorial.


def train_mnist_tune(
        config,
        num_epochs=10,
        devices=1,
        accelerator="gpu",
        data_dir="~/"):
    data_dir = os.path.expanduser(data_dir)
    model = LightningMNISTClassifier(config, data_dir)
    tune_callback = TuneReportCallback(
        {
            "loss": "ptl/val_loss",
            "mean_accuracy": "ptl/val_accuracy"
        },
        on="validation_end")
    trainer = pl.Trainer(
        max_epochs=num_epochs,
        devices=devices,
        accelerator=accelerator,
        enable_progress_bar=False,
        callbacks=[tune_callback],
        fast_dev_run=True,
    )
    trainer.fit(model)
if __name__ == "__main__":
ray.init(address=os.environ["ip_head"], _node_ip_address=os.environ["head_node_ip"])
config = {
"layer_1_size": tune.choice([32, 64, 128]),
"layer_2_size": tune.choice([64, 128, 256]),
"lr": tune.loguniform(1e-4, 1e-1),
"batch_size": tune.choice([32, 64, 128]),
}
num_epochs = 10
num_samples = 10
# Selecting the scheduler
scheduler = ASHAScheduler(
max_t=num_epochs,
grace_period=1,
reduction_factor=2
)
reporter = CLIReporter(
parameter_columns=["layer_1_size", "layer_2_size", "lr", "batch_size"],
metric_columns=["loss", "mean_accuracy", "training_iteration"]
)
accelerator = "gpu"
gpus_per_trial = 2
cpus_per_trial = 6
data_dir = '~/'
train_fn_with_parameters = tune.with_parameters(train_mnist_tune,
num_epochs=num_epochs,
accelerator=accelerator,
devices=gpus_per_trial,
data_dir=data_dir)
tuner = tune.Tuner(
tune.with_resources(
train_fn_with_parameters,
resources={
"CPU": cpus_per_trial,
"GPU": gpus_per_trial}
),
tune_config=tune.TuneConfig(
metric="loss",
mode="min",
scheduler=scheduler,
num_samples=num_samples,
),
run_config=air.RunConfig(
name="test",
progress_reporter=reporter,
),
param_space=config,
)
results = tuner.fit()
print("Best hyperparameters found were: ", results.get_best_result().config)
The cluster is set up correctly and the first trials are assigned to the 4 nodes, but only the head node actually runs the training; the other 3 nodes seem blocked somewhere without giving any error. As you can see in the following log, the 3 worker nodes hang right after "Initializing distributed":
(train_mnist_tune pid=3679869, ip=10.148.8.9) GPU available: True (cuda), used: True
(train_mnist_tune pid=3679869, ip=10.148.8.9) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=3679869, ip=10.148.8.9) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=3679869, ip=10.148.8.9) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=3679869, ip=10.148.8.9) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=3181124, ip=10.148.8.11) GPU available: True (cuda), used: True
(train_mnist_tune pid=3181124, ip=10.148.8.11) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=3181124, ip=10.148.8.11) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=3181124, ip=10.148.8.11) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=3181124, ip=10.148.8.11) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=1745151, ip=10.148.8.10) GPU available: True (cuda), used: True
(train_mnist_tune pid=1745151, ip=10.148.8.10) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=1745151, ip=10.148.8.10) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=1745151, ip=10.148.8.10) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=1745151, ip=10.148.8.10) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=3005886) GPU available: True (cuda), used: True
(train_mnist_tune pid=3005886) TPU available: False, using: 0 TPU cores
(train_mnist_tune pid=3005886) IPU available: False, using: 0 IPUs
(train_mnist_tune pid=3005886) HPU available: False, using: 0 HPUs
(train_mnist_tune pid=3005886) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
(train_mnist_tune pid=3005886) ----------------------------------------------------------------------------------------------------
(train_mnist_tune pid=3005886) distributed_backend=nccl
(train_mnist_tune pid=3005886) All distributed processes registered. Starting with 1 processes
(train_mnist_tune pid=3005886) ----------------------------------------------------------------------------------------------------
(train_mnist_tune pid=3005886)
(train_mnist_tune pid=3005886) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
...
In the Tune report, all the TERMINATED trials ran on the head node, while the others stay RUNNING forever:
== Status ==
Current time: 2023-01-08 18:25:37 (running for 00:06:46.95)
Memory usage on this node: 28.9/754.2 GiB
Using AsyncHyperBand: num_stopped=7
Bracket: Iter 8.000: -0.11974430084228516 | Iter 4.000: -0.18529099225997925 | Iter 2.000: -0.21176335960626602 | Iter 1.000: -0.2983490526676178
Resources requested: 36.0/48 CPUs, 12.0/16 GPUs, 0.0/2213.81 GiB heap, 0.0/745.06 GiB objects (0.0/4.0 accelerator_type:V100)
Current best trial: 84964_00006 with loss=0.09475510567426682 and parameters={'layer_1_size': 32, 'layer_2_size': 256, 'lr': 0.002361851721705818, 'batch_size': 128}
Result logdir: /linkhome/rech/gendzm01/ujn44cd/ray_results/test_tb
Number of trials: 10/10 (3 RUNNING, 7 TERMINATED)
+------------------------------+------------+---------------------+----------------+----------------+-------------+--------------+-----------+-----------------+----------------------+
| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |
|------------------------------+------------+---------------------+----------------+----------------+-------------+--------------+-----------+-----------------+----------------------|
| train_mnist_tune_84964_00000 | RUNNING | 10.148.8.9:3679869 | 64 | 128 | 0.0454478 | 32 | | | |
| train_mnist_tune_84964_00001 | RUNNING | 10.148.8.11:3181124 | 32 | 128 | 0.00019643 | 64 | | | |
| train_mnist_tune_84964_00002 | RUNNING | 10.148.8.10:1745151 | 128 | 128 | 0.0033367 | 64 | | | |
| train_mnist_tune_84964_00003 | TERMINATED | 10.148.8.8:3005886 | 32 | 64 | 0.00155658 | 64 | 0.130796 | 0.960248 | 10 |
| train_mnist_tune_84964_00004 | TERMINATED | 10.148.8.8:3005886 | 64 | 64 | 0.000845919 | 32 | 0.30242 | 0.910937 | 1 |
| train_mnist_tune_84964_00005 | TERMINATED | 10.148.8.8:3005886 | 64 | 64 | 0.0172056 | 64 | 0.432142 | 0.871369 | 1 |
| train_mnist_tune_84964_00006 | TERMINATED | 10.148.8.8:3005886 | 32 | 256 | 0.00236185 | 128 | 0.0947551 | 0.971221 | 10 |
| train_mnist_tune_84964_00007 | TERMINATED | 10.148.8.8:3005886 | 128 | 128 | 0.00231834 | 32 | 0.130806 | 0.967969 | 10 |
| train_mnist_tune_84964_00008 | TERMINATED | 10.148.8.8:3005886 | 128 | 128 | 0.000190201 | 128 | 0.564611 | 0.861432 | 1 |
| train_mnist_tune_84964_00009 | TERMINATED | 10.148.8.8:3005886 | 128 | 256 | 0.00970075 | 128 | 0.219292 | 0.928922 | 2 |
+------------------------------+------------+---------------------+----------------+----------------+-------------+--------------+-----------+-----------------+----------------------+
I guess I am assigning the resources to the PyTorch Lightning trainers incorrectly in tune.with_resources(). Should I use PlacementGroupFactory(), ScalingConfig(), or something else?
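For example, this is the PlacementGroupFactory variant I was considering (just a sketch reusing the names from my script above; I am not sure it is the right way to reserve the GPUs for each trial):

from ray import tune

# Sketch only: reserve a single bundle of cpus_per_trial CPUs and
# gpus_per_trial GPUs per trial via a placement group.
trainable_with_pg = tune.with_resources(
    train_fn_with_parameters,
    resources=tune.PlacementGroupFactory(
        [{"CPU": cpus_per_trial, "GPU": gpus_per_trial}]
    ),
)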
Moreover, I also found the Ray Lightning library, which provides a RayStrategy for the Lightning Trainer, but unfortunately it is not compatible with the latest versions of PyTorch Lightning.
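For reference, this is roughly how I understand that strategy would be plugged in (a sketch only, assuming the ray_lightning package and the num_epochs / gpus_per_trial values from my script; it fails with recent PyTorch Lightning releases on my setup):

import pytorch_lightning as pl
from ray_lightning import RayStrategy  # extra dependency, not used in my current scripts

# Sketch: let Ray distribute the Lightning workers instead of requesting
# several GPUs per Tune trial directly.
trainer = pl.Trainer(
    max_epochs=num_epochs,
    strategy=RayStrategy(num_workers=gpus_per_trial, use_gpu=True),
)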
Can someone suggest the correct way to assign resources when using Tune + PyTorch Lightning on Slurm? Thank you!