Running Ray on Slurm Cluster

gstepan · January 28, 2021, 9:13pm

Hi, I am a new Ray user and am trying to run it on a slurm cluster. I have posted my slurm batch script below where I request 1 cpu per task, 1 task per node, 1 task per core, and 5 nodes. So ideally, after I run trainer.py attached below, I should see 4 nodes = (5 - 1 head node) scheduled and in the print output I should see that different nodes are being used. However, the ip address counter continues to show that only one of the nodes with ip address 10.1.0.32 is used by the processes. Any help would be appreciated!

SLURM BATCH SCRIPT

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=1GB
    #SBATCH --nodes=5
    #SBATCH --ntasks-per-node=1
    #SBATCH --ntasks-per-core=1
    #SBATCH --time=12:00:00
    #SBATCH -C centos7 #Request only Centos7 nodes
    #SBATCH -p sched_mit_hill #Run on partition
    #SBATCH -o output_%j.txt #redirect output to output_JOBID.txt
    #SBATCH -e error_%j.txt #redirect errors to error_JOBID.txt
    #SBATCH --mail-type=BEGIN,END #Mail when job starts and ends
    #SBATCH --mail-user=gstepan@mit.edu #email recipient

    let "worker_num=(${SLURM_NTASKS} - 1)"

    # Define the total number of CPU cores available to ray
    let "total_cores=${worker_num} * ${SLURM_CPUS_PER_TASK}"

    module add engaging/anaconda/2.3.0
    module add engaging/Ray/2.3.1
    source activate py36

    nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
    nodes_array=( $nodes )
    node1=${nodes_array[0]}

    ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
    suffix=':6379'
    ip_head=$ip_prefix$suffix
    redis_password=$(uuidgen)
    export ip_head # Exporting for latter access by trainer.py

    srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --port=6379 --redis-password=$redis_password & # Starting the head

    sleep 5

    # Make sure the head successfully starts before any worker does, otherwise
    # the worker will not be able to connect to redis. In case of longer delay,
    # adjust the sleeptime above to ensure proper order.

    for (( i=1; i<=$worker_num; i++ ))
        do
        node2=${nodes_array[$i]}
        srun --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers

        # Flag --block will keep ray process alive on each compute node.
        sleep 5
    done

    python -u trainer.py $redis_password ${total_cores} # Pass the total number of allocated CPUs

trainer.py CODE FILE

    from collections import Counter
    import os
    import sys
    import time
    import ray
    import psutil

    redis_password = sys.argv[1]
    num_cpus = int(sys.argv[2])

    ray.init(address=os.environ["ip_head"], _redis_password=redis_password)

    print("Nodes in the Ray cluster:")
    print(ray.nodes())

    print(ray.cluster_resources())

    @ray.remote(num_cpus=1)
    def f():
        print('hello')
        time.sleep(60)
        return ray._private.services.get_node_ip_address()

    # The following takes one second (assuming that ray was able to access all of the allocated nodes).
    for i in range(60):
        start = time.time()
        ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
        print(Counter(ip_addresses))
        end = time.time()
        print(end - start)

OUTPUT OF RUN

Nodes in the Ray cluster:

[{'NodeID': '0f318fe40c5c5e0360c67fe35c7708d7b031312310d42751b343ab7e', 'Alive': True, 'NodeManagerAddress': '10.1.0.32', 'NodeManagerHostname': 'node032', 'NodeManagerPort': 53181, 'ObjectManagerPort': 41626, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 60340, 'alive': True, 'Resources': {'memory': 739.0, 'node:10.1.0.32': 1.0, 'CPU': 16.0, 'object_store_memory': 255.0, 'GPU': 1.0}}, {'NodeID': '836df0105c707f63d0a315fc98769a5fd11b61c4907c17dbb59082df', 'Alive': True, 'NodeManagerAddress': '10.1.3.82', 'NodeManagerHostname': 'node382', 'NodeManagerPort': 47231, 'ObjectManagerPort': 39475, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 45048, 'alive': True, 'Resources': {'object_store_memory': 238.0, 'memory': 806.0, 'node:10.1.3.82': 1.0, 'CPU': 20.0}}, {'NodeID': '54d41a42d6d23cf3c5d90dc14e525ca698574036a89eeba472bfad47', 'Alive': True, 'NodeManagerAddress': '10.1.3.70', 'NodeManagerHostname': 'node370', 'NodeManagerPort': 54569, 'ObjectManagerPort': 43297, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 65395, 'alive': True, 'Resources': {'memory': 825.0, 'CPU': 20.0, 'node:10.1.3.70': 1.0, 'object_store_memory': 244.0}}, {'NodeID': '11a1f14ac4d5d998846f1e0e59491fad3464cbfb9df8b7a92d489e87', 'Alive': True, 'NodeManagerAddress': '10.1.3.83', 'NodeManagerHostname': 'node383', 'NodeManagerPort': 53786, 'ObjectManagerPort': 46172, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 57737, 'alive': True, 'Resources': {'memory': 854.0, 'object_store_memory': 252.0, 'node:10.1.3.83': 1.0, 'CPU': 20.0}}, {'NodeID': '72eaf797f9ca333ece9f1ae138022be20967d18a8f00bae513409fb2', 'Alive': True, 'NodeManagerAddress': '10.1.3.69', 'NodeManagerHostname': 'node369', 'NodeManagerPort': 62447, 'ObjectManagerPort': 43913, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 63544, 'alive': True, 'Resources': {'CPU': 20.0, 'object_store_memory': 248.0, 'memory': 840.0, 'node:10.1.3.69': 1.0}}]

{'CPU': 96.0, 'object_store_memory': 1237.0, 'node:10.1.0.32': 1.0, 'memory': 4064.0, 'GPU': 1.0, 'node:10.1.3.82': 1.0, 'node:10.1.3.70': 1.0, 'node:10.1.3.83': 1.0, 'node:10.1.3.69': 1.0}

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

67.03512859344482

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06458353996277

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.064570903778076

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06623387336731

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.07250666618347

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06637930870056

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06863617897034

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06762766838074

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06600785255432

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.065871477127075

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.064775228500366

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06629490852356

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.05808687210083

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.0690484046936

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.067530393600464

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.16660666465759

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.06879186630249

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.067288637161255

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.05646514892578

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

60.09117889404297

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

rliaw · January 28, 2021, 9:14pm

Hey @gstepan Can you post your sbatch script?

gstepan · January 28, 2021, 11:26pm

Hi @rliaw, just posted it lmk if you have any questions!

rliaw · January 29, 2021, 12:41am

gstepan:

    for i in range(60):
        start = time.time()
        ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
        print(Counter(ip_addresses))
        end = time.time()
        print(end - start)

Hi @gstepan, can you try doing this instead:

    start = time.time()
    ip_addresses = ray.get([f.remote() for _ in range(100)])
    print(Counter(ip_addresses))
    end = time.time()
    print(end - start)

gstepan · January 29, 2021, 1:01am

@rliaw interesting, now it ends up using all 4 nodes. How did running 100 tasks as opposed to 3 make a difference? Is this how ray internally schedules the jobs?

Nodes in the Ray cluster:

[{'NodeID': '83d38bbc5265d0aa70802dad08d1e00d0b3e472c2e79580f4db5e610', 'Alive': True, 'NodeManagerAddress': '10.1.3.70', 'NodeManagerHostname': 'node370', 'NodeManagerPort': 56682, 'ObjectManagerPort': 42727, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 55813, 'alive': True, 'Resources': {'memory': 823.0, 'CPU': 20.0, 'object_store_memory': 243.0, 'node:10.1.3.70': 1.0}}, {'NodeID': '2150071854496dc3eeac161a1085ebeea504559ce54ca231697f2b15', 'Alive': True, 'NodeManagerAddress': '10.1.3.68', 'NodeManagerHostname': 'node368', 'NodeManagerPort': 64333, 'ObjectManagerPort': 42182, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 57051, 'alive': True, 'Resources': {'CPU': 20.0, 'object_store_memory': 236.0, 'node:10.1.3.68': 1.0, 'memory': 798.0}}, {'NodeID': 'b16f5d91e9ed3f49e499b0f4fe0f46ac0bf36f7d28c776efa50f7b14', 'Alive': True, 'NodeManagerAddress': '10.1.0.16', 'NodeManagerHostname': 'node016', 'NodeManagerPort': 37460, 'ObjectManagerPort': 32917, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 58112, 'alive': True, 'Resources': {'CPU': 16.0, 'node:10.1.0.16': 1.0, 'object_store_memory': 254.0, 'memory': 737.0}}, {'NodeID': 'ca1712ad11ae13cdf952de6720e154acb97d6f5ddb3e975af9edc40d', 'Alive': True, 'NodeManagerAddress': '10.1.3.69', 'NodeManagerHostname': 'node369', 'NodeManagerPort': 58760, 'ObjectManagerPort': 40246, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 64445, 'alive': True, 'Resources': {'node:10.1.3.69': 1.0, 'memory': 854.0, 'object_store_memory': 252.0, 'CPU': 20.0}}, {'NodeID': '3c53e326385fdb34c90e5fce88c6b0c5dd7b2940ef134ef6572ccb49', 'Alive': True, 'NodeManagerAddress': '10.1.3.67', 'NodeManagerHostname': 'node367', 'NodeManagerPort': 62025, 'ObjectManagerPort': 45781, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 46641, 'alive': True, 'Resources': {'memory': 826.0, 'node:10.1.3.67': 1.0, 'CPU': 20.0, 'object_store_memory': 244.0}}]

{'CPU': 96.0, 'node:10.1.3.70': 1.0, 'memory': 4038.0, 'object_store_memory': 1229.0, 'node:10.1.3.68': 1.0, 'node:10.1.0.16': 1.0, 'node:10.1.3.69': 1.0, 'node:10.1.3.67': 1.0}

(pid=2430) hello

(pid=2432) hello

(pid=2433) hello

(pid=2434) hello

(pid=2435) hello

(pid=2436) hello

(pid=2437) hello

(pid=2438) hello

(pid=2439) hello

(pid=2440) hello

(pid=2441) hello

(pid=2442) hello

(pid=2444) hello

(pid=2443) hello

(pid=2445) hello

(pid=2446) hello

(pid=20357, ip=10.1.3.67) hello

(pid=20355, ip=10.1.3.67) hello

(pid=20358, ip=10.1.3.67) hello

(pid=20361, ip=10.1.3.67) hello

(pid=20359, ip=10.1.3.67) hello

(pid=20356, ip=10.1.3.67) hello

(pid=20360, ip=10.1.3.67) hello

(pid=20362, ip=10.1.3.67) hello

(pid=20363, ip=10.1.3.67) hello

(pid=20364, ip=10.1.3.67) hello

(pid=20369, ip=10.1.3.67) hello

(pid=20368, ip=10.1.3.67) hello

(pid=20371, ip=10.1.3.67) hello

(pid=20367, ip=10.1.3.67) hello

(pid=20366, ip=10.1.3.67) hello

(pid=20365, ip=10.1.3.67) hello

(pid=20370, ip=10.1.3.67) hello

(pid=20372, ip=10.1.3.67) hello

(pid=20373, ip=10.1.3.67) hello

(pid=20374, ip=10.1.3.67) hello

(pid=768, ip=10.1.3.69) hello

(pid=6960, ip=10.1.3.68) hello

(pid=6961, ip=10.1.3.68) hello

(pid=6962, ip=10.1.3.68) hello

(pid=6963, ip=10.1.3.68) hello

(pid=6964, ip=10.1.3.68) hello

(pid=6965, ip=10.1.3.68) hello

(pid=6967, ip=10.1.3.68) hello

(pid=6966, ip=10.1.3.68) hello

(pid=6968, ip=10.1.3.68) hello

(pid=6969, ip=10.1.3.68) hello

(pid=6971, ip=10.1.3.68) hello

(pid=6970, ip=10.1.3.68) hello

(pid=6972, ip=10.1.3.68) hello

(pid=6973, ip=10.1.3.68) hello

(pid=6974, ip=10.1.3.68) hello

(pid=6975, ip=10.1.3.68) hello

(pid=6976, ip=10.1.3.68) hello

(pid=6982, ip=10.1.3.68) hello

(pid=6980, ip=10.1.3.68) hello

(pid=6984, ip=10.1.3.68) hello

(pid=769, ip=10.1.3.69) hello

(pid=774, ip=10.1.3.69) hello

(pid=770, ip=10.1.3.69) hello

(pid=773, ip=10.1.3.69) hello

(pid=772, ip=10.1.3.69) hello

(pid=771, ip=10.1.3.69) hello

(pid=775, ip=10.1.3.69) hello

(pid=776, ip=10.1.3.69) hello

(pid=777, ip=10.1.3.69) hello

(pid=778, ip=10.1.3.69) hello

(pid=779, ip=10.1.3.69) hello

(pid=780, ip=10.1.3.69) hello

(pid=781, ip=10.1.3.69) hello

(pid=784, ip=10.1.3.69) hello

(pid=783, ip=10.1.3.69) hello

(pid=782, ip=10.1.3.69) hello

(pid=785, ip=10.1.3.69) hello

(pid=787, ip=10.1.3.69) hello

(pid=786, ip=10.1.3.69) hello

(pid=7541, ip=10.1.3.70) hello

(pid=7544, ip=10.1.3.70) hello

(pid=7542, ip=10.1.3.70) hello

(pid=7550, ip=10.1.3.70) hello

(pid=7551, ip=10.1.3.70) hello

(pid=7548, ip=10.1.3.70) hello

(pid=7545, ip=10.1.3.70) hello

(pid=7546, ip=10.1.3.70) hello

(pid=7543, ip=10.1.3.70) hello

(pid=7549, ip=10.1.3.70) hello

(pid=7547, ip=10.1.3.70) hello

(pid=7552, ip=10.1.3.70) hello

(pid=7553, ip=10.1.3.70) hello

(pid=7554, ip=10.1.3.70) hello

(pid=7555, ip=10.1.3.70) hello

(pid=7556, ip=10.1.3.70) hello

(pid=7559, ip=10.1.3.70) hello

(pid=7557, ip=10.1.3.70) hello

(pid=7558, ip=10.1.3.70) hello

(pid=7560, ip=10.1.3.70) hello

(pid=2430) hello

(pid=2432) hello

(pid=2433) hello

(pid=2434) hello

Counter({'10.1.0.16': 20, '10.1.3.67': 20, '10.1.3.69': 20, '10.1.3.68': 20, '10.1.3.70': 20})

127.05818796157837

rliaw · January 29, 2021, 1:03am

Yeah, basically, Ray will try to schedule jobs locally and only farm out if needed. Your previous workload was too lightweight and therefore would not actually require farming out. Hope that helps!

gstepan · January 29, 2021, 1:07am

I see, @rliaw is there a reason why the pid is changing for processes running on the same node? In my slurm batch script I specified that my processes should only use on cpu per task and one task per node. So where is it getting these other cpus from to run several instance of the remote function on the same node? In fact, since I am only using 4 nodes (with the remote function f running for one minute), I expected that this code would have taken much longer than 127 seconds to terminate.

Also, as a follow up question, I was wondering what a raylet is. For example, when I ssh into one of the allocated nodes in the cluster, I see a raylet appear under the top command. But for some reason, top shows that all of my remote functions f() are running on only the head node. Is there a reason for this? Thanks!

rliaw · January 29, 2021, 1:11am

I think you need to specify something like:

ray start --num-cpus=${SLURM_CPUS_PER_TASK}

in your ray start commands in order to isolate the job correctly and specify the right number of CPUs that Ray can use per node?

gstepan · January 29, 2021, 1:11am

Oh I see, I was doing this in my call to @ray.remote(num_cpus=1)

rliaw · January 29, 2021, 2:23am

That’s for each “ray task”, which is different than the SLURM CPUS per task, and also different from the number of CPUS that Ray is started with.

The Ray runtime will detect the CPUs on the node (checking os.cpu_count()). The Ray “runtime” is not the same as the ray “program” (which is a python script)
Thus, Ray basically ignores the SLURM CPUS per task (unless you specify --num-cpus)
Within the Ray program, Ray tasks are going to be scheduled according to the number of CPUs available in the Ray runtime.

Let me know if that helps!

gstepan · January 29, 2021, 4:33pm

Hi @rliaw thank you that makes a lot of sense! Just one final question, sorry!

Is there a reason why when I run top on my nodes in the cluster, I see a raylet appear but all of my remote functions f() are not displayed in top. I see all of my remote functions f() running only when I run top on the head node. Does ray somehow convert copies of the function f() into a raylet and that’s what it is running on the cluster nodes?

Thank you!

rliaw · January 31, 2021, 8:11am

Hmm, the raylet just spawns workers that execute those remote functions. You should be able to see the remote functions if you run top on the other nodes.

Topic		Replies	Views
Ray on slurm - different ip addresses of worker nodes Ray Clusters	14	3661	August 29, 2023
Cluster dosen't distribute workloads Ray Clusters	0	403	November 10, 2022
Ray crashes on Slurm Ray Clusters	6	1399	October 27, 2022
Processing performance of tasks Ray Core	14	822	March 8, 2021
Ray on SLURM/HPC: starting worker nodes simultaneously Ray Clusters	10	2143	June 15, 2022

Running Ray on Slurm Cluster

Related topics