How to direct worker logging to slurm outputs?

Hi, I have a question regarding logging when launching distributed jobs on a Slurm cluster:

  • first I allocate learner and actor nodes on Slurm; on each node I run ray start and connect it to the Ray head:
# Submit the learner job and capture its Slurm job ID
# (sed strips the "Submitted batch job " prefix from sbatch's output).
learner_ID="$(sbatch learner.sbatch "$MAIN_NODE" "$RAY_PORT" | sed 's/Submitted batch job //')"
echo "learner slurm id is $learner_ID"
sleep 10

# Submit the actor job the same way.
actor_ID="$(sbatch actor.sbatch "$MAIN_NODE" "$RAY_PORT" | sed 's/Submitted batch job //')"
echo "actor slurm id is $actor_ID"
sleep 10
  • second, how can I direct the output printed on the learner and actor nodes to their respective Slurm output files? For example, with this test driver script:
import argparse
import socket

import ray

@ray.remote(scheduling_strategy="SPREAD", resources={"learner": 1})
def get_learner_socket_name():
    print('testing learner: this line should be in learners slurm output')
    return socket.gethostname()


@ray.remote(scheduling_strategy="SPREAD", resources={"actor": 1})
def get_actor_socket_name():
    print('testing actor: this line should be in actors slurm output')
    return socket.gethostname()



def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--address", required=True)
    args = parser.parse_args()
    print(f"Connecting to ray://{args.address}/")
    result = ray.init(address=args.address, log_to_driver=True)
    print(f"Connected to ray! {result}")
    print(ray.nodes())

    train_jobs = [get_learner_socket_name.remote() for _ in range(2)]
    train_results = [ray.get(j) for j in train_jobs]
    print(train_results)

    act_jobs = [get_actor_socket_name.remote() for _ in range(40)]
    act_results = [ray.get(j) for j in act_jobs]
    print(act_results)


if __name__ == "__main__":
    main()

Not familiar with Slurm. @sangcho do you know?

What do you mean by “slurm output”?

#!/bin/bash

#SBATCH -N 1
#SBATCH --tasks-per-node 1
#SBATCH --cpus-per-task 64
#SBATCH --constraint cpu
#SBATCH --array 1-10
#SBATCH --requeue
#SBATCH --output slurm_out/actor_%j_%a.log

MAIN_NODE=$1
RAY_PORT=$2

echo "Starting actor node on $MAIN_NODE:$RAY_PORT"

while true; do
    # Run the command in the background
    srun bash start_actor.sh "$MAIN_NODE" "$RAY_PORT"

    # Add a sleep period before re-running the command
    sleep 30
done

This is my actor Slurm script (start_actor.sh just launches ray start).
The Slurm output I'm referring to is the file set by #SBATCH --output slurm_out/actor_%j_%a.log.
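
For context, a sketch of what start_actor.sh presumably looks like; this is only an assumption on my part (the --resources tag is chosen to match the resources={"actor": 1} annotation in the driver script, and the actual file isn't shown in this thread):

#!/bin/bash
# start_actor.sh (sketch): join the existing Ray cluster as a worker node,
# tag it with a custom "actor" resource, and stay in the foreground so that
# srun keeps the step alive and captures its stdout/stderr.
MAIN_NODE=$1
RAY_PORT=$2

ray start --address="$MAIN_NODE:$RAY_PORT" --resources='{"actor": 1}' --block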

Is there any update on this?

direct the output on learner and actor nodes

Not sure what this means… Have you taken a look at the Ray logging doc: Configuring Logging — Ray 2.6.1? Are you talking about the log directory /tmp/ray/session_*/logs of a Ray node?

The Slurm output you are referring to is the stdout of the Slurm job?

Yes, I have, and NONE of those solutions work for Slurm.

This link explains how Slurm directs its output: https://slurm.schedmd.com/sbatch.html

“By default both standard output and standard error are directed to a file of the name “slurm-%j.out”, where the “%j” is replaced with the job allocation number. The file will be generated on the first node of the job allocation. Other than the batch script itself, Slurm does no movement of user files.”

But when running Ray, this output is effectively bypassed and Ray's own logging takes over, which is very annoying when (1) there is no access to the dashboard (not uncommon on HPC) and (2) I have many workers and would like to keep all of their logs.

Also note that the logs live under /tmp, meaning that once the Slurm job finishes or crashes and the node is released, /tmp (and the logs with it) is gone too.
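
One partial workaround for the ephemeral /tmp problem is to point Ray's session directory at a persistent, shared filesystem when starting the head node, via the --temp-dir option of ray start (head node only). This doesn't answer the Slurm-output question itself, and the path below is purely illustrative:

# On the head node only; /shared/ray_tmp stands in for a path on a
# persistent, shared filesystem so the session logs survive node release.
ray start --head --port="$RAY_PORT" --temp-dir=/shared/ray_tmp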

If we could simply disable Ray's log redirection and send everything back to standard output, then this issue would be solved.
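
For what it's worth, the configure-logging doc linked below also describes a RAY_LOG_TO_STDERR=1 environment variable that makes Ray components write their logs to stderr instead of to files under /tmp/ray. Assuming it behaves that way on your Ray version, exporting it in the per-node start scripts before ray start should land those logs in the Slurm --output/--error files; a sketch:

# In start_actor.sh / start_learner.sh (sketch): export before ray start so
# every Ray process on this node logs to stderr, which srun/sbatch then
# writes into the Slurm output file.
export RAY_LOG_TO_STDERR=1
ray start --address="$MAIN_NODE:$RAY_PORT" --resources='{"actor": 1}' --block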

Have you tried customizing the Python logger/handler and flushing the output to stdout/stderr?
https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#using-rays-logger
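
For example, here is a minimal sketch of configuring a handler inside the task itself (the task name and the "actor" resource tag are illustrative). With log_to_driver=True, whatever the worker writes to its stdout/stderr is forwarded to the driver process, and the driver's own stdout goes to whichever Slurm output file its job writes to:

import logging
import sys

import ray


@ray.remote(resources={"actor": 1})
def actor_task():
    # Attach a stream handler inside the worker process: records go to the
    # worker's stdout, which Ray captures and (with log_to_driver=True)
    # forwards to the driver's stdout.
    logger = logging.getLogger("actor")
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("hello from an actor worker")
    return "ok"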