How to direct worker logging to slurm outputs?

Hi, I have a question regarding logging when launching distributed jobs on a Slurm cluster:

  • first I allocate learner and actor nodes on Slurm; on each node I run ray start and connect it to the Ray head:
# Submit the learner job and capture its Slurm job ID
# (sed strips the "Submitted batch job " prefix from sbatch's output).
learner_ID="$(sbatch learner.sbatch "$MAIN_NODE" "$RAY_PORT" | sed 's/Submitted batch job //')"
echo "learner slurm id is $learner_ID"
sleep 10

# Submit the actor job the same way.
actor_ID="$(sbatch actor.sbatch "$MAIN_NODE" "$RAY_PORT" | sed 's/Submitted batch job //')"
echo "actor slurm id is $actor_ID"
sleep 10
  • second, how can I direct the output printed on the learner and actor nodes to their respective Slurm output files? For example, with this test driver script:
import argparse
import socket

import ray

@ray.remote(scheduling_strategy="SPREAD", resources={"learner": 1})
def get_learner_socket_name():
    print('testing learner: this line should be in learners slurm output')
    return socket.gethostname()


@ray.remote(scheduling_strategy="SPREAD", resources={"actor": 1})
def get_actor_socket_name():
    print('testing actor: this line should be in actors slurm output')
    return socket.gethostname()



def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--address", required=True)
    args = parser.parse_args()
    print(f"Connecting to ray://{args.address}/")
    result = ray.init(address=args.address, log_to_driver=True)
    print(f"Connected to ray! {result}")
    print(ray.nodes())

    train_jobs = [get_learner_socket_name.remote() for _ in range(2)]
    train_results = [ray.get(j) for j in train_jobs]
    print(train_results)

    act_jobs = [get_actor_socket_name.remote() for _ in range(40)]
    act_results = [ray.get(j) for j in act_jobs]
    print(act_results)


if __name__ == "__main__":
    main()

Not familiar with Slurm. @sangcho do you know?

What do you mean by “slurm output”?

#!/bin/bash

#SBATCH -N 1
#SBATCH --tasks-per-node 1
#SBATCH --cpus-per-task 64
#SBATCH --constraint cpu
#SBATCH --array 1-10
#SBATCH --requeue
#SBATCH --output slurm_out/actor_%j_%a.log

MAIN_NODE=$1
RAY_PORT=$2

echo "Starting actor node on $MAIN_NODE:$RAY_PORT"

while true; do
    # Run the command in the background
    srun bash start_actor.sh "$MAIN_NODE" "$RAY_PORT"

    # Add a sleep period before re-running the command
    sleep 30
done

This is my actor Slurm script (start_actor.sh just launches ray start).
The Slurm output I'm referring to is the file set by #SBATCH --output slurm_out/actor_%j_%a.log.
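
For context, a sketch of what start_actor.sh presumably looks like; this is only an assumption on my part (the --resources tag is chosen to match the resources={"actor": 1} annotation in the driver script, and the actual file isn't shown in this thread):

#!/bin/bash
# start_actor.sh (sketch): join the existing Ray cluster as a worker node,
# tag it with a custom "actor" resource, and stay in the foreground so that
# srun keeps the step alive and captures its stdout/stderr.
MAIN_NODE=$1
RAY_PORT=$2

ray start --address="$MAIN_NODE:$RAY_PORT" --resources='{"actor": 1}' --block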

Is there any update on this?

direct the output on learner and actor nodes

Not sure what this means… Have you taken a look at the Ray logging doc: Configuring Logging — Ray 2.6.1? Are you talking about the log directory /tmp/ray/session_*/logs of a Ray node?

The Slurm output you are referring to is the stdout of the Slurm job?

Yes, I have, and NONE of those solutions work for Slurm.

This link explains how Slurm directs its output: https://slurm.schedmd.com/sbatch.html

“By default both standard output and standard error are directed to a file of the name “slurm-%j.out”, where the “%j” is replaced with the job allocation number. The file will be generated on the first node of the job allocation. Other than the batch script itself, Slurm does no movement of user files.”

But when running Ray, this output is effectively bypassed and Ray's own logging takes over, which is very annoying when (1) there is no access to the dashboard (not uncommon on HPC) and (2) I have many workers and would like to keep all of their logs.

Also note that the logs live under /tmp, meaning that once the Slurm job finishes or crashes and the node is released, /tmp (and the logs with it) is gone too.
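
One partial workaround for the ephemeral /tmp problem is to point Ray's session directory at a persistent, shared filesystem when starting the head node, via the --temp-dir option of ray start (head node only). This doesn't answer the Slurm-output question itself, and the path below is purely illustrative:

# On the head node only; /shared/ray_tmp stands in for a path on a
# persistent, shared filesystem so the session logs survive node release.
ray start --head --port="$RAY_PORT" --temp-dir=/shared/ray_tmp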

If we could simply disable Ray's log redirection and send everything back to standard output, then this issue would be solved.
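
For what it's worth, the configure-logging doc linked below also describes a RAY_LOG_TO_STDERR=1 environment variable that makes Ray components write their logs to stderr instead of to files under /tmp/ray. Assuming it behaves that way on your Ray version, exporting it in the per-node start scripts before ray start should land those logs in the Slurm --output/--error files; a sketch:

# In start_actor.sh / start_learner.sh (sketch): export before ray start so
# every Ray process on this node logs to stderr, which srun/sbatch then
# writes into the Slurm output file.
export RAY_LOG_TO_STDERR=1
ray start --address="$MAIN_NODE:$RAY_PORT" --resources='{"actor": 1}' --block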

Have you tried customizing the Python logger/handler and flushing the output to stdout/stderr?
https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#using-rays-logger
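
For example, here is a minimal sketch of configuring a handler inside the task itself (the task name and the "actor" resource tag are illustrative). With log_to_driver=True, whatever the worker writes to its stdout/stderr is forwarded to the driver process, and the driver's own stdout goes to whichever Slurm output file its job writes to:

import logging
import sys

import ray


@ray.remote(resources={"actor": 1})
def actor_task():
    # Attach a stream handler inside the worker process: records go to the
    # worker's stdout, which Ray captures and (with log_to_driver=True)
    # forwards to the driver's stdout.
    logger = logging.getLogger("actor")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("hello from an actor worker")
    return "ok"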