How severely does this issue affect your experience of using Ray?
- Low: It annoys or frustrates me for a moment.
Versions:
Python 3.10.6
Ray 2.32.0
I have set up Ray to work on my SLURM cluster. Ray itself runs fine and I am able to run multi-node and multi-GPU training. The final issue I’m facing is that the TensorBoard log is not being written to the directory listed in the Ray output.
Here’s my test setup:
# Module-level imports (parse_args, train_cifar, and test_best_model are
# defined elsewhere in my script and omitted here)
import numpy as np

from ray import tune
from ray.tune.schedulers import ASHAScheduler


def main(num_samples=10, max_num_epochs=10, smoke_test=False):
    args = parse_args()
    print(f"CPUs per trial: {args.cpu_per_trial}")
    print(f"GPUs per trial: {args.gpu_per_trial}")

    # Hyperparameter search space
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
        "smoke_test": smoke_test,
    }

    from ray.train import ScalingConfig, CheckpointConfig  # , RunConfig
    from ray.air import RunConfig
    from ray.tune.logger import TBXLoggerCallback

    tensorboard_callback = TBXLoggerCallback()

    scaling_config = ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"CPU": args.cpu_per_trial, "GPU": args.gpu_per_trial},
    )

    run_config = RunConfig(
        storage_path="/n/home11/nswood/HPC_Parallel_Computing/storage",
        callbacks=[tensorboard_callback],
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="loss",
            checkpoint_score_order="max",
        ),
    )

    from ray.train.torch import TorchTrainer

    # Define a TorchTrainer without hyper-parameters for Tuner
    ray_trainer = TorchTrainer(
        train_cifar,
        scaling_config=scaling_config,
        run_config=run_config,
    )

    scheduler = ASHAScheduler(
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2,
    )

    tuner = tune.Tuner(
        ray_trainer,
        param_space={"train_loop_config": config},
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="max",
            num_samples=num_samples,
            scheduler=scheduler,
        ),
    )
    results = tuner.fit()

    best_result = results.get_best_result("loss", "min")
    print("Best trial config: {}".format(best_result.config))
    print("Best trial final validation loss: {}".format(
        best_result.metrics["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_result.metrics["accuracy"]))

    test_best_model(best_result, smoke_test=smoke_test)
When I run this test, I get the following output:
View detailed results here: /n/home11/nswood/HPC_Parallel_Computing/storage/TorchTrainer_2024-07-17_11-37-01
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/nswood/ray/session_2024-07-17_11-36-53_340582_4172020/artifacts/2024-07-17_11-37-01/TorchTrainer_2024-07-17_11-37-01/driver_artifacts
The logs in the storage folder appear as normal; however, the folder /tmp/nswood/ray is empty. By specifying storage_path I am able to change where the detailed results are written, but the TensorBoard log always seems to target /tmp/nswood/ray/ without success. This problem persists whether or not I explicitly pass callbacks=[tensorboard_callback] or specify the storage_path.
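For reference, here is a quick check I can run to see whether TensorBoard event files are actually landing in the storage path instead; this is just my own sketch (the storage_path value is the one from my RunConfig above), not something from the Ray docs. If event files do show up there, I would expect to be able to point TensorBoard at the storage directory directly.

import glob
import os

# The storage_path from my RunConfig above.
storage_path = "/n/home11/nswood/HPC_Parallel_Computing/storage"

# TBXLoggerCallback writes files named events.out.tfevents.*; search the
# whole storage tree for them.
event_files = glob.glob(
    os.path.join(storage_path, "**", "events.out.tfevents.*"), recursive=True
)
print(f"Found {len(event_files)} TensorBoard event files under {storage_path}")
for path in event_files[:10]:
    print(path)

# If any are found, I would expect this to work:
#   tensorboard --logdir /n/home11/nswood/HPC_Parallel_Computing/storage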
Here’s my SLURM configuration script, in case it’s relevant:
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=$port \
--node-manager-port=$nodeManagerPort \
--object-manager-port=$objectManagerPort \
--ray-client-server-port=$rayClientServerPort \
--redis-shard-ports=$redisShardPorts \
--min-worker-port=$minWorkerPort \
--max-worker-port=$maxWorkerPort \
--redis-password=$redis_password \
--num-cpus "${SLURM_CPUS_PER_TASK}" \
--num-gpus "${SLURM_GPUS_PER_TASK}" \
--block &
sleep 10
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address "$ip_head" \
--redis-password=$redis_password \
--num-cpus "${SLURM_CPUS_PER_TASK}" \
--num-gpus "${SLURM_GPUS_PER_TASK}" \
--block &
sleep 5
done
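Since /tmp is node-local under SLURM, here is a small diagnostic sketch I put together (my own code, not from the Ray docs) to report what /tmp/nswood/ray contains on each node of the cluster, in case the session directory is only populated on the head node rather than on the node I am checking from:

# Diagnostic sketch: report what /tmp/nswood/ray contains on each node.
import os
import socket

import ray

ray.init(address="auto")

@ray.remote
def inspect_tmp(path="/tmp/nswood/ray"):
    host = socket.gethostname()
    if not os.path.isdir(path):
        return f"{host}: {path} does not exist"
    return f"{host}: {len(os.listdir(path))} entries under {path}"

# Launch as many tasks as there are nodes and spread them across the cluster.
futures = [
    inspect_tmp.options(scheduling_strategy="SPREAD").remote()
    for _ in range(len(ray.nodes()))
]
print("\n".join(ray.get(futures)))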
Any advice on how to resolve this so I can monitor my Ray progress would be greatly appreciated!