How severe does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I want to run a single task (decorated with @ray.remote) that needs 16 GPUs across 2 nodes (8 GPUs on each node). The task hangs, as shown below:
/workspace# ray status
2025-02-08 14:36:33,146 - INFO - Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-02-08 14:36:33,147 - INFO - Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-02-08 14:36:33,147 - INFO - NumExpr defaulting to 16 threads.
======== Autoscaler status: 2025-02-08 14:36:31.736729 ========
Node status
---------------------------------------------------------------
Active:
1 node_9a8e4a65cdc318809db6101bfb99cae0e09295e4fe545ac266abefc3
1 node_9749a2bf68e5e5ceba3a489574454a0c6c4af617fc67f634b5b0e2a3
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/384.0 CPU
0.0/16.0 GPU
0B/3.56TiB memory
0B/372.53GiB object_store_memory
Demands:
{'GPU': 16.0}: 1+ pending tasks/actors
When you submit a task requiring 16 GPUs, Ray looks for a single node that can satisfy the entire request. Your cluster consists of two nodes with 8 GPUs each, so no single node has 16 GPUs, and the task stays pending indefinitely. That is what the "{'GPU': 16.0}: 1+ pending tasks/actors" line under Demands in your ray status output is showing.
Here are some possible solutions:
Modify Task Requirements – If possible, adjust your task to use fewer GPUs so it fits within a single node's resources. For example, if your workload can be split into tasks requiring 8 GPUs each, Ray can schedule them across both nodes (see the first sketch after this list).
Use Placement Groups – Ray supports placement groups to reserve resources across multiple nodes. Note that each bundle in a placement group must still fit on a single node, so for your cluster you would reserve two 8-GPU bundles and split the work into one task or actor per bundle (second sketch below).
Custom Scheduling Strategy – If your workload genuinely requires multiple nodes, you can use Ray's actor model to distribute the computation and manage state across nodes, for example one actor per node coordinated from the driver (third sketch below).
Check Resource Allocation – Make sure your cluster correctly detects and registers all available GPUs. Misconfiguration can prevent Ray from seeing resources, and you can verify what Ray sees from any driver (fourth sketch below).
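For the first option, here is a minimal sketch of replacing one 16-GPU task with two 8-GPU tasks that Ray can place on separate nodes. The function name train_shard and the two-way split are just illustrative placeholders for your workload:

import ray

ray.init(address="auto")

# Instead of one task asking for 16 GPUs, split the work into
# two tasks that each fit on a single 8-GPU node.
@ray.remote(num_gpus=8)
def train_shard(shard_id):
    # ... per-node portion of the workload goes here ...
    return shard_id

# Both tasks can now be scheduled, one on each node.
print(ray.get([train_shard.remote(i) for i in range(2)]))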
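For the placement-group route, a sketch along these lines reserves two 8-GPU bundles on different nodes and then schedules one 8-GPU task into each bundle. The function name worker and the CPU count per bundle are placeholder choices:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# Two 8-GPU bundles; STRICT_SPREAD forces them onto different nodes.
pg = placement_group([{"GPU": 8, "CPU": 1}] * 2, strategy="STRICT_SPREAD")
ray.get(pg.ready())

@ray.remote(num_gpus=8)
def worker(rank):
    # ... per-bundle share of the computation ...
    return rank

refs = [
    worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote(i)
    for i in range(2)
]
print(ray.get(refs))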
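For a multi-node workload built on the actor model, a rough sketch could look like this: one 8-GPU actor per node holds its own state, and the driver coordinates them. ShardWorker and its step method are hypothetical names for your own classes:

import ray

ray.init(address="auto")

# One 8-GPU actor per node; each actor keeps its own state.
@ray.remote(num_gpus=8)
class ShardWorker:
    def __init__(self, rank):
        self.rank = rank

    def step(self, payload):
        # ... do this node's part of the computation ...
        return (self.rank, payload)

workers = [ShardWorker.remote(i) for i in range(2)]
print(ray.get([w.step.remote("batch-0") for w in workers]))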
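And to double-check that Ray actually sees both nodes and all 16 GPUs, you can query the cluster from any driver, for example:

import ray

ray.init(address="auto")

print(ray.cluster_resources())    # total resources registered with the cluster
print(ray.available_resources())  # what is currently free

# Per-node view: each entry should report 8 GPUs for your two nodes.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))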
Here are some more sources in the docs that might help you out.
Sources: