How severe does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I want to run a single task (decorated with @ray.remote) that needs 16 GPUs across 2 nodes (8 GPUs on each node). The task hangs, as shown below:
/workspace# ray status
2025-02-08 14:36:33,146 - INFO - Note: detected 192 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2025-02-08 14:36:33,147 - INFO - Note: NumExpr detected 192 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2025-02-08 14:36:33,147 - INFO - NumExpr defaulting to 16 threads.
======== Autoscaler status: 2025-02-08 14:36:31.736729 ========
Node status
---------------------------------------------------------------
Active:
1 node_9a8e4a65cdc318809db6101bfb99cae0e09295e4fe545ac266abefc3
1 node_9749a2bf68e5e5ceba3a489574454a0c6c4af617fc67f634b5b0e2a3
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/384.0 CPU
0.0/16.0 GPU
0B/3.56TiB memory
0B/372.53GiB object_store_memory
Demands:
{'GPU': 16.0}: 1+ pending tasks/actors
When you submit a task requiring 16 GPUs, Ray looks for a single node that can satisfy the entire request. Your cluster consists of two nodes with 8 GPUs each, so no single node has 16 GPUs, and the task stays pending indefinitely. That is what the "{'GPU': 16.0}: 1+ pending tasks/actors" line under Demands in your ray status output is showing.
Here are some possible solutions:
Modify Task Requirements – If possible, adjust your task to use fewer GPUs so it fits within a single node's resources. For example, if your workload can be split into tasks requiring 8 GPUs each, Ray can schedule them across both nodes (see the first sketch after this list).
Use Placement Groups – Ray supports placement groups to reserve resources across multiple nodes. Note that each bundle in a placement group must still fit on a single node, so for your cluster you would reserve two 8-GPU bundles and split the work into one task or actor per bundle (second sketch below).
Custom Scheduling Strategy – If your workload genuinely requires multiple nodes, you can use Ray's actor model to distribute the computation and manage state across nodes, for example one actor per node coordinated from the driver (third sketch below).
Check Resource Allocation – Make sure your cluster correctly detects and registers all available GPUs. Misconfiguration can prevent Ray from seeing resources, and you can verify what Ray sees from any driver (fourth sketch below).
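For the first option, here is a minimal sketch of replacing one 16-GPU task with two 8-GPU tasks that Ray can place on separate nodes. The function name train_shard and the two-way split are just illustrative placeholders for your workload:

import ray

ray.init(address="auto")

# Instead of one task asking for 16 GPUs, split the work into
# two tasks that each fit on a single 8-GPU node.
@ray.remote(num_gpus=8)
def train_shard(shard_id):
    # ... per-node portion of the workload goes here ...
    return shard_id

# Both tasks can now be scheduled, one on each node.
print(ray.get([train_shard.remote(i) for i in range(2)]))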
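For the placement-group route, a sketch along these lines reserves two 8-GPU bundles on different nodes and then schedules one 8-GPU task into each bundle. The function name worker and the CPU count per bundle are placeholder choices:

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# Two 8-GPU bundles; STRICT_SPREAD forces them onto different nodes.
pg = placement_group([{"GPU": 8, "CPU": 1}] * 2, strategy="STRICT_SPREAD")
ray.get(pg.ready())

@ray.remote(num_gpus=8)
def worker(rank):
    # ... per-bundle share of the computation ...
    return rank

refs = [
    worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote(i)
    for i in range(2)
]
print(ray.get(refs))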
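For a multi-node workload built on the actor model, a rough sketch could look like this: one 8-GPU actor per node holds its own state, and the driver coordinates them. ShardWorker and its step method are hypothetical names for your own classes:

import ray

ray.init(address="auto")

# One 8-GPU actor per node; each actor keeps its own state.
@ray.remote(num_gpus=8)
class ShardWorker:
    def __init__(self, rank):
        self.rank = rank

    def step(self, payload):
        # ... do this node's part of the computation ...
        return (self.rank, payload)

workers = [ShardWorker.remote(i) for i in range(2)]
print(ray.get([w.step.remote("batch-0") for w in workers]))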
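And to double-check that Ray actually sees both nodes and all 16 GPUs, you can query the cluster from any driver, for example:

import ray

ray.init(address="auto")

print(ray.cluster_resources())    # total resources registered with the cluster
print(ray.available_resources())  # what is currently free

# Per-node view: each entry should report 8 GPUs for your two nodes.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))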
Here are some more sources in the docs that might help you out.
Sources: