TBXLoggerCallback output not appearing in the listed directory

How severe does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Versions:

Python 3.10.6
Ray 2.32.0

I have set up Ray to work on my SLURM cluster. Ray itself runs fine and I am able to run multi-node and multi-GPU jobs. The final issue I'm facing is that the TensorBoard logs are not written to the directory listed in the Ray output.
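
For context, the driver attaches to the cluster started by the SLURM script along these lines (a minimal sketch; whether you pass "auto" or the explicit head address depends on where the driver runs, and exporting ip_head to the driver's environment is an assumption on my part):

import os
import ray

# Attach to the already-running cluster started by `ray start` in the SLURM script.
# "auto" works when the driver runs on a node that is already part of the cluster;
# ip_head is the head-node address variable used in the SLURM script below.
ray.init(address=os.environ.get("ip_head", "auto"))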

Here’s my test setup:

import numpy as np

from ray import tune
from ray.air import RunConfig
from ray.train import ScalingConfig, CheckpointConfig
from ray.train.torch import TorchTrainer
from ray.tune.logger import TBXLoggerCallback
from ray.tune.schedulers import ASHAScheduler


def main(num_samples=10, max_num_epochs=10, smoke_test=False):
    args = parse_args()
    print(f"CPUs per trial: {args.cpu_per_trial}")
    print(f"GPUs per trial: {args.gpu_per_trial}")

    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([2, 4, 8, 16]),
        "smoke_test": smoke_test,
    }

    # Callback that writes TensorBoard event files for each trial
    tensorboard_callback = TBXLoggerCallback()

    scaling_config = ScalingConfig(
        num_workers=1, use_gpu=True, resources_per_worker={"CPU": args.cpu_per_trial, "GPU": args.gpu_per_trial}
    )
    # Write experiment results (and checkpoints) to shared storage
    run_config = RunConfig(
        storage_path="/n/home11/nswood/HPC_Parallel_Computing/storage",
        callbacks=[tensorboard_callback],
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="loss",
            checkpoint_score_order="max",
        ),
    )
    
    # Define a TorchTrainer without hyperparameters for the Tuner
    ray_trainer = TorchTrainer(
        train_cifar,
        scaling_config=scaling_config,
        run_config=run_config,
    )
    
    scheduler = ASHAScheduler(
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)
    
    tuner = tune.Tuner(
        ray_trainer,
        param_space={"train_loop_config": config},
        tune_config=tune.TuneConfig(
            metric="loss",
            mode="max",
            num_samples=num_samples,
            scheduler=scheduler,
        ),
    )
    
    results = tuner.fit()
    best_result = results.get_best_result("loss", "min")
    print("Best trial config: {}".format(best_result.config))
    print("Best trial final validation loss: {}".format(
        best_result.metrics["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_result.metrics["accuracy"]))

    test_best_model(best_result, smoke_test=smoke_test)

When I run this test, I get the following output:

View detailed results here: /n/home11/nswood/HPC_Parallel_Computing/storage/TorchTrainer_2024-07-17_11-37-01
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/nswood/ray/session_2024-07-17_11-36-53_340582_4172020/artifacts/2024-07-17_11-37-01/TorchTrainer_2024-07-17_11-37-01/driver_artifacts

The logs in the storage folder appear as normal; however, the folder /tmp/nswood/ray is empty.

By specifying storage_path I am able to change where the detailed results are written, but the TensorBoard log always seems to be directed to /tmp/nswood/ray/, without success.

This problem persists regardless of whether I explicitly pass callbacks=[tensorboard_callback] or specify the storage_path.
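
To verify where the event files actually land, something like the following after tuner.fit() should work (a rough sketch; events.out.tfevents is simply the prefix TensorBoard uses for its event files, and this assumes storage_path is a local or shared filesystem path):

import os

results = tuner.fit()  # as in the script above

# The experiment directory Ray actually used (should sit under storage_path)
print("Experiment path:", results.experiment_path)

# List any TensorBoard event files written under it
for root, _, files in os.walk(results.experiment_path):
    for name in files:
        if name.startswith("events.out.tfevents"):
            print(os.path.join(root, name))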

Here's my SLURM startup script, in case it's relevant:


echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" \
        --port=$port \
        --node-manager-port=$nodeManagerPort \
        --object-manager-port=$objectManagerPort \
        --ray-client-server-port=$rayClientServerPort \
        --redis-shard-ports=$redisShardPorts \
        --min-worker-port=$minWorkerPort \
        --max-worker-port=$maxWorkerPort \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &

sleep 10

worker_num=$((SLURM_JOB_NUM_NODES - 1))


for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --nodes=1 --ntasks=1 -w "$node_i" \
        ray start --address "$ip_head" \
        --redis-password=$redis_password \
        --num-cpus "${SLURM_CPUS_PER_TASK}" \
        --num-gpus "${SLURM_GPUS_PER_TASK}" \
        --block &
    sleep 5
done

Any advice on how to resolve this so I can monitor my Ray training progress would be greatly appreciated!

Update: I solved my own problem. It turns out the TensorBoard logs were being written to the supplied storage_path all along; the readout in the SLURM output file just isn't updated to reflect that.

This is possibly a bug where the TensorBoard command printed by Ray isn't updated to match the actual output directory. It seems worth a quick fix for the sake of new Ray users.
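
In case it helps anyone else: since the event files end up under storage_path, pointing TensorBoard at the experiment directory from the "View detailed results here" line works, e.g.:

tensorboard --logdir /n/home11/nswood/HPC_Parallel_Computing/storage/TorchTrainer_2024-07-17_11-37-01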