Configuring Ray - NUMA Domain

Hello all,

I am currently working on optimizing the configuration of a Ray cluster for running large sets of remote tasks while leveraging NUMA domains for enhanced performance. My goal is to segment each node I acquire into 4 distinct NUMA domains in order to improve memory access and overall computation efficiency. Each node, for my use case, has 96 cores and a total of 4 NUMA domains (24 cores per NUMA domain).

My current setup involves setting numactl before calling ray start and pinning each worker node to the correct NUMA domain. I followed this post for guidance. This approach works for me; however, my question is:

Does Ray have a way to restrict work (remote tasks) to specific NUMA domains without needing to set numactl at start? If so, how can I achieve that and verify it is actually doing the right thing?

Thank you in advance! Any examples would be greatly appreciated!

For reference I am using:

  • Ray 2.8.1
  • CentOS 7

Hi @max_ronda, so far I don't think we have done anything for NUMA support in Ray. :frowning:

One possible thing you could try is to start 4 containers, with each one running on a single NUMA domain. You can add a label to each one and use that label to schedule tasks on the specific NUMA domain.
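Roughly, as a sketch (I haven't tested this, and the "numa1" resource name is just a placeholder label): start each raylet with a custom resource via ray start ... --resources='{"numa1": 1}', and then have tasks request that resource so they can only be scheduled on the matching raylet:

import ray

ray.init(address="auto")

# Assumes the raylet pinned to NUMA domain 1 was started with
#   ray start ... --resources='{"numa1": 1}'
# "numa1" is just a placeholder custom-resource name.
@ray.remote(num_cpus=24, resources={"numa1": 1})
def run_on_numa1():
    # Only the raylet advertising the "numa1" resource can satisfy this request.
    return "scheduled on the numa1 raylet"

print(ray.get(run_on_numa1.remote()))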

Hi @yic, thanks for the reply!

I am currently running on an HPC center, so no containers. What I am doing, and please tell me if this is wrong, is starting a cluster this way:

My resources: each node has 96 cores and 4 NUMA domains (24 cores each). I split my node into 4 parts this way:

Head node:

$ numactl -N 0 --membind=0 ray start --head --port=6379 --num-cpus=24 ... --block

Worker node 1 (within the same node):

$ numactl -N 1 --membind=1 ray start --address=HEAD_NODE_IP:6379 --num-cpus=24 ... --block

Worker node 2 (within the same node):

$ numactl -N 2 --membind=2 ray start --address=HEAD_NODE_IP:6379 --num-cpus=24 ... --block

Worker node 3 (within the same node):

$ numactl -N 3 --membind=3 ray start --address=HEAD_NODE_IP:6379 --num-cpus=24 ... --block

Afterwards, when I connect to this cluster via ray.init() and run ray.nodes(), Ray tells me I have acquired 4 nodes with a total of 96 cores.
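For what it's worth, this is roughly the check I run after connecting (just a small sketch, nothing NUMA-specific about it):

import ray

ray.init(address="auto")  # connect to the running cluster

# Every `ray start` invocation shows up as its own entry in ray.nodes().
alive = [n for n in ray.nodes() if n["Alive"]]
total_cpus = sum(n["Resources"].get("CPU", 0) for n in alive)
print(f"{len(alive)} raylets, {total_cpus} CPUs total")  # I expect 4 and 96.0 here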

Then when running a simple task like this …

import os
import time
import ray

# Assumes ray.init(address="HEAD_NODE_IP:6379") has already been called.

@ray.remote(num_cpus=24)
def run_task(task_id):
    # dummy wait here to keep the task busy for a while
    time.sleep(90)

    # which physical cores this worker process is allowed to run on
    affinity = list(os.sched_getaffinity(0))
    return affinity

futures = [run_task.remote(i) for i in range(4)]
results = ray.get(futures)

It returns the correct affinity for each task I submit.
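To verify that each returned affinity set really lines up with a single NUMA domain, I compare it against the cpulists Linux exposes under /sys (a rough, Linux-only sketch; it works here because all 4 raylets sit on the same physical host):

import os
from glob import glob

def numa_domain_of(affinity):
    # Return the NUMA node whose cpulist contains the whole affinity set, else None.
    for path in sorted(glob("/sys/devices/system/node/node*/cpulist")):
        node_id = int(os.path.basename(os.path.dirname(path)).replace("node", ""))
        cpus = set()
        with open(path) as f:
            for part in f.read().strip().split(","):
                if "-" in part:
                    lo, hi = part.split("-")
                    cpus.update(range(int(lo), int(hi) + 1))
                elif part:
                    cpus.add(int(part))
        if set(affinity) <= cpus:
            return node_id
    return None

for i, affinity in enumerate(results):
    print(f"task {i}: {len(affinity)} cores, NUMA domain {numa_domain_of(affinity)}")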

I am thinking about scaling this solution to many more nodes, but I am curious whether this is the right way of doing it or whether I should be thinking of something else.

Thanks for your time and suggestions!

Hey @yic, any thoughts on the above?