Running Ray on a Slurm Cluster

Hi, I am a new Ray user trying to run Ray on a Slurm cluster. I have posted my Slurm batch script below, where I request 1 CPU per task, 1 task per node, 1 task per core, and 5 nodes. So ideally, after I run the trainer.py attached below, I should see 4 worker nodes (5 minus 1 head node) scheduled, and the printed output should show that different nodes are being used. However, the IP address counter keeps showing that only one node, with IP address 10.1.0.32, is used by the processes. Any help would be appreciated!

SLURM BATCH SCRIPT

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=1GB
    #SBATCH --nodes=5
    #SBATCH --ntasks-per-node=1
    #SBATCH --ntasks-per-core=1
    #SBATCH --time=12:00:00
    #SBATCH -C centos7 #Request only Centos7 nodes
    #SBATCH -p sched_mit_hill #Run on partition
    #SBATCH -o output_%j.txt #redirect output to output_JOBID.txt
    #SBATCH -e error_%j.txt #redirect errors to error_JOBID.txt
    #SBATCH --mail-type=BEGIN,END #Mail when job starts and ends
    #SBATCH --mail-user=gstepan@mit.edu #email recipient

    let "worker_num=(${SLURM_NTASKS} - 1)"

    # Define the total number of CPU cores available to ray
    let "total_cores=${worker_num} * ${SLURM_CPUS_PER_TASK}"

    module add engaging/anaconda/2.3.0
    module add engaging/Ray/2.3.1
    source activate py36

    nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
    nodes_array=( $nodes )
    node1=${nodes_array[0]}

    ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
    suffix=':6379'
    ip_head=$ip_prefix$suffix
    redis_password=$(uuidgen)
    export ip_head # Exporting for later access by trainer.py

    srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --port=6379 --redis-password=$redis_password & # Starting the head

    sleep 5

    # Make sure the head node successfully starts before any worker does; otherwise
    # the workers will not be able to connect to Redis. If startup takes longer,
    # increase the sleep time above to preserve the proper order.

    for (( i=1; i<=$worker_num; i++ ))
        do
        node2=${nodes_array[$i]}
        srun --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers

        # Flag --block will keep ray process alive on each compute node.
        sleep 5
    done

    python -u trainer.py $redis_password ${total_cores} # Pass the total number of allocated CPUs

trainer.py CODE FILE

    from collections import Counter
    import os
    import sys
    import time
    import ray
    import psutil

    redis_password = sys.argv[1]
    num_cpus = int(sys.argv[2])

    ray.init(address=os.environ["ip_head"], _redis_password=redis_password)

    print("Nodes in the Ray cluster:")
    print(ray.nodes())

    print(ray.cluster_resources())

    @ray.remote(num_cpus=1)
    def f():
        print('hello')
        time.sleep(60)
        return ray._private.services.get_node_ip_address()

    # Each iteration should take about one minute (each task sleeps for 60 seconds),
    # assuming Ray can run the tasks in parallel across the allocated nodes.
    for i in range(60):
        start = time.time()
        ip_addresses = ray.get([f.remote() for _ in range(num_cpus)])
        print(Counter(ip_addresses))
        end = time.time()
        print(end - start)

OUTPUT OF RUN

Nodes in the Ray cluster:

[{'NodeID': '0f318fe40c5c5e0360c67fe35c7708d7b031312310d42751b343ab7e', 'Alive': True, 'NodeManagerAddress': '10.1.0.32', 'NodeManagerHostname': 'node032', 'NodeManagerPort': 53181, 'ObjectManagerPort': 41626, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 60340, 'alive': True, 'Resources': {'memory': 739.0, 'node:10.1.0.32': 1.0, 'CPU': 16.0, 'object_store_memory': 255.0, 'GPU': 1.0}}, {'NodeID': '836df0105c707f63d0a315fc98769a5fd11b61c4907c17dbb59082df', 'Alive': True, 'NodeManagerAddress': '10.1.3.82', 'NodeManagerHostname': 'node382', 'NodeManagerPort': 47231, 'ObjectManagerPort': 39475, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 45048, 'alive': True, 'Resources': {'object_store_memory': 238.0, 'memory': 806.0, 'node:10.1.3.82': 1.0, 'CPU': 20.0}}, {'NodeID': '54d41a42d6d23cf3c5d90dc14e525ca698574036a89eeba472bfad47', 'Alive': True, 'NodeManagerAddress': '10.1.3.70', 'NodeManagerHostname': 'node370', 'NodeManagerPort': 54569, 'ObjectManagerPort': 43297, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 65395, 'alive': True, 'Resources': {'memory': 825.0, 'CPU': 20.0, 'node:10.1.3.70': 1.0, 'object_store_memory': 244.0}}, {'NodeID': '11a1f14ac4d5d998846f1e0e59491fad3464cbfb9df8b7a92d489e87', 'Alive': True, 'NodeManagerAddress': '10.1.3.83', 'NodeManagerHostname': 'node383', 'NodeManagerPort': 53786, 'ObjectManagerPort': 46172, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 57737, 'alive': True, 'Resources': {'memory': 854.0, 'object_store_memory': 252.0, 'node:10.1.3.83': 1.0, 'CPU': 20.0}}, {'NodeID': '72eaf797f9ca333ece9f1ae138022be20967d18a8f00bae513409fb2', 'Alive': True, 'NodeManagerAddress': '10.1.3.69', 'NodeManagerHostname': 'node369', 'NodeManagerPort': 62447, 'ObjectManagerPort': 43913, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_15-52-00_234710_13057/sockets/raylet', 'MetricsExportPort': 63544, 'alive': True, 'Resources': {'CPU': 20.0, 'object_store_memory': 248.0, 'memory': 840.0, 'node:10.1.3.69': 1.0}}]

{'CPU': 96.0, 'object_store_memory': 1237.0, 'node:10.1.0.32': 1.0, 'memory': 4064.0, 'GPU': 1.0, 'node:10.1.3.82': 1.0, 'node:10.1.3.70': 1.0, 'node:10.1.3.83': 1.0, 'node:10.1.3.69': 1.0}

(pid=13246) hello

(pid=13248) hello

(pid=13247) hello

(pid=13249) hello

Counter({'10.1.0.32': 4})

67.03512859344482

[… the same four hello lines and Counter({'10.1.0.32': 4}) repeat for every subsequent iteration, each taking roughly 60 seconds …]

Hey @gstepan, can you post your sbatch script?


Hi @rliaw, I just posted it. Let me know if you have any questions!

Hi @gstepan, can you try doing this instead:

    start = time.time()
    ip_addresses = ray.get([f.remote() for _ in range(100)])
    print(Counter(ip_addresses))
    end = time.time()
    print(end - start)

@rliaw interesting, now it ends up using all of the nodes. How did running 100 tasks as opposed to 4 make a difference? Is this how Ray internally schedules the jobs?

Nodes in the Ray cluster:

[{'NodeID': '83d38bbc5265d0aa70802dad08d1e00d0b3e472c2e79580f4db5e610', 'Alive': True, 'NodeManagerAddress': '10.1.3.70', 'NodeManagerHostname': 'node370', 'NodeManagerPort': 56682, 'ObjectManagerPort': 42727, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 55813, 'alive': True, 'Resources': {'memory': 823.0, 'CPU': 20.0, 'object_store_memory': 243.0, 'node:10.1.3.70': 1.0}}, {'NodeID': '2150071854496dc3eeac161a1085ebeea504559ce54ca231697f2b15', 'Alive': True, 'NodeManagerAddress': '10.1.3.68', 'NodeManagerHostname': 'node368', 'NodeManagerPort': 64333, 'ObjectManagerPort': 42182, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 57051, 'alive': True, 'Resources': {'CPU': 20.0, 'object_store_memory': 236.0, 'node:10.1.3.68': 1.0, 'memory': 798.0}}, {'NodeID': 'b16f5d91e9ed3f49e499b0f4fe0f46ac0bf36f7d28c776efa50f7b14', 'Alive': True, 'NodeManagerAddress': '10.1.0.16', 'NodeManagerHostname': 'node016', 'NodeManagerPort': 37460, 'ObjectManagerPort': 32917, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 58112, 'alive': True, 'Resources': {'CPU': 16.0, 'node:10.1.0.16': 1.0, 'object_store_memory': 254.0, 'memory': 737.0}}, {'NodeID': 'ca1712ad11ae13cdf952de6720e154acb97d6f5ddb3e975af9edc40d', 'Alive': True, 'NodeManagerAddress': '10.1.3.69', 'NodeManagerHostname': 'node369', 'NodeManagerPort': 58760, 'ObjectManagerPort': 40246, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 64445, 'alive': True, 'Resources': {'node:10.1.3.69': 1.0, 'memory': 854.0, 'object_store_memory': 252.0, 'CPU': 20.0}}, {'NodeID': '3c53e326385fdb34c90e5fce88c6b0c5dd7b2940ef134ef6572ccb49', 'Alive': True, 'NodeManagerAddress': '10.1.3.67', 'NodeManagerHostname': 'node367', 'NodeManagerPort': 62025, 'ObjectManagerPort': 45781, 'ObjectStoreSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2021-01-28_19-55-18_860460_2289/sockets/raylet', 'MetricsExportPort': 46641, 'alive': True, 'Resources': {'memory': 826.0, 'node:10.1.3.67': 1.0, 'CPU': 20.0, 'object_store_memory': 244.0}}]

{'CPU': 96.0, 'node:10.1.3.70': 1.0, 'memory': 4038.0, 'object_store_memory': 1229.0, 'node:10.1.3.68': 1.0, 'node:10.1.0.16': 1.0, 'node:10.1.3.69': 1.0, 'node:10.1.3.67': 1.0}

(pid=2430) hello
(pid=2432) hello
(pid=2433) hello
(pid=2434) hello

(pid=20357, ip=10.1.3.67) hello
(pid=20355, ip=10.1.3.67) hello

(pid=6960, ip=10.1.3.68) hello
(pid=6961, ip=10.1.3.68) hello

(pid=768, ip=10.1.3.69) hello
(pid=769, ip=10.1.3.69) hello

(pid=7541, ip=10.1.3.70) hello
(pid=7544, ip=10.1.3.70) hello

[… remaining hello lines omitted …]

Counter({'10.1.0.16': 20, '10.1.3.67': 20, '10.1.3.69': 20, '10.1.3.68': 20, '10.1.3.70': 20})

127.05818796157837

Yeah, basically Ray will try to schedule tasks locally and only farm them out to other nodes when the local node runs out of resources. Your previous workload (4 one-CPU tasks on a node with 16 CPUs) was too lightweight, so it never actually required farming out. Hope that helps!
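If you want to guarantee that work lands on every node regardless of load, one option is to reserve resources up front with a placement group and a spread strategy. Here is a rough sketch (not from your script; it assumes a Ray version with placement group support and reuses the ip_head / redis_password setup from your batch script):

    import os
    import sys
    import ray
    from ray.util.placement_group import placement_group

    redis_password = sys.argv[1]
    ray.init(address=os.environ["ip_head"], _redis_password=redis_password)

    # Reserve a 1-CPU bundle on each of the 5 nodes; STRICT_SPREAD forces
    # every bundle onto a different node.
    pg = placement_group([{"CPU": 1}] * 5, strategy="STRICT_SPREAD")
    ray.get(pg.ready())  # wait until all bundles have been reserved

    @ray.remote(num_cpus=1)
    def f():
        return ray._private.services.get_node_ip_address()

    # Pin one task to each bundle, i.e. exactly one task per node.
    ips = ray.get([
        f.options(placement_group=pg, placement_group_bundle_index=i).remote()
        for i in range(5)
    ])
    print(ips)  # should print five distinct node IPs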

I see. @rliaw, is there a reason why the PID changes for processes running on the same node? In my Slurm batch script I specified that my processes should only use one CPU per task and one task per node, so where is Ray getting the extra CPUs to run several instances of the remote function on the same node? In fact, since I am only using 4 worker nodes (with the remote function f running for one minute per call), I expected this code to take much longer than 127 seconds to finish.

Also, as a follow-up question, I was wondering what a raylet is. For example, when I ssh into one of the allocated nodes in the cluster, I see a raylet appear under the top command, but top shows all of my remote functions f() running only on the head node. Is there a reason for this? Thanks!

I think you need to specify something like:

    ray start --num-cpus=${SLURM_CPUS_PER_TASK}

in your ray start commands, so that the job is isolated correctly and Ray knows the right number of CPUs it can use on each node.
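Concretely, that would mean changing the two srun lines in your batch script to something like this (just a sketch, reusing the variables you already define there):

    # Head node: expose only the CPUs Slurm allocated to this task,
    # instead of letting Ray autodetect all 16-20 physical cores.
    srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --port=6379 \
        --redis-password=$redis_password --num-cpus=${SLURM_CPUS_PER_TASK} &

    # Worker nodes: same cap, inside the existing for loop.
    srun --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head \
        --redis-password=$redis_password --num-cpus=${SLURM_CPUS_PER_TASK} &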

Oh I see, I was doing this in my call to @ray.remote(num_cpus=1)

That’s for each “Ray task”, which is different from the Slurm CPUs per task, and also different from the number of CPUs that Ray is started with.

  • The Ray runtime will detect the CPUs on the node (by checking os.cpu_count()). The Ray “runtime” is not the same as the Ray “program” (which is a Python script).
  • Thus, Ray basically ignores the Slurm CPUs per task (unless you specify --num-cpus).
  • Within the Ray program, Ray tasks are scheduled according to the number of CPUs available in the Ray runtime (see the quick sketch below).
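You can see the difference on a single machine with a quick local sketch (nothing Slurm-specific here, just illustrating autodetection versus an explicit cap):

    import os
    import ray

    # With no arguments, Ray detects all cores on the machine
    # (essentially os.cpu_count()), regardless of any Slurm limits.
    ray.init()
    print(os.cpu_count(), ray.cluster_resources()["CPU"])
    ray.shutdown()

    # An explicit num_cpus caps what the Ray runtime will schedule against,
    # which is what --num-cpus does for `ray start` on each node.
    ray.init(num_cpus=1)
    print(ray.cluster_resources()["CPU"])  # 1.0
    ray.shutdown()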

Let me know if that helps!

Hi @rliaw, thank you, that makes a lot of sense! Just one final question, sorry!

Is there a reason why, when I run top on the other nodes in the cluster, I see a raylet but none of my remote functions f()? I only see my remote functions f() running when I run top on the head node. Does Ray somehow convert copies of the function f() into a raylet, and that is what runs on the cluster nodes?

Thank you!

Hmm, the raylet just spawns worker processes that execute those remote functions. You should be able to see the remote functions if you run top on the other nodes (the workers usually show up as processes named something like ray::IDLE or ray::f()).
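If it helps to see the workers directly, here is a tiny sketch (hypothetical, connected either via address="auto" or the same way as trainer.py) where every task reports the PID of the worker process that executed it:

    import os
    import ray

    # On a node of an existing cluster; alternatively connect the same way
    # as trainer.py (address=os.environ["ip_head"], _redis_password=...).
    ray.init(address="auto")

    @ray.remote
    def whoami():
        # This runs inside a worker process spawned by the local raylet, so the
        # PID belongs to a worker, not to the raylet or the driver.
        return (ray._private.services.get_node_ip_address(), os.getpid())

    # Distinct (ip, pid) pairs show multiple workers per node executing tasks.
    print(set(ray.get([whoami.remote() for _ in range(20)])))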