1. Severity of the issue:
High: Completely blocks me.
2. Environment:
- Ray version: 2.x
- vLLM version: 0.9.2
- Python version: 3.9
- OS / Container base: Linux (RHEL-based UBI8 image in Kubernetes)
- Cloud / Infrastructure: AWS based Kubernetes cluster (pods scheduled on Tesla T4 GPU nodes)
- Other libs/tools:
  - CUDA 11.x
  - NCCL for cross-node GPU communication

**3. What happened vs. what you expected:**
| Expected | Actual |
|---|---|
| `--tensor-parallel-size 2` should shard my 16 GB Mamba-Codestral model across two 16 GB GPUs (one per pod) and launch successfully. | Ray logs show a pending placement group for two 1-GPU bundles (one pinned to the head, one "anywhere") that never schedules the second bundle. vLLM then errors out. |
| The worker pod joins the Ray cluster and stays Alive. | The worker's raylet process is repeatedly "marked dead" due to missed heartbeats (even with generous CPU/memory requests), then its core-worker processes crash. |
| Setting `VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}'` in the environment should force SPREAD placement. | Ray still uses the default PACK strategy and tries to place both shards on one node, so the placement group is unsatisfiable and vLLM blocks. |
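(For isolating this from the OpenAI server entrypoint: a minimal offline-driver sketch that should issue the same two-GPU Ray placement request, assuming `LLM()` accepts the same `tensor_parallel_size` / `distributed_executor_backend` arguments as the CLI flags and using the same model path as in the snippets below.)

```python
# Minimal sketch: reproduce the same 2-GPU Ray placement request from an
# offline driver instead of the OpenAI API server (same model path as below).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/model-cache/Mamba-Codestral-7B-v0.1",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)
print(llm.generate(["ping"], SamplingParams(max_tokens=8)))
```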
HEAD Pod Snippet:

NOTE: Pipeline parallelism is not supported for the mamba2 arch.

```bash
ray start --head \
  --disable-usage-stats \
  --include-dashboard=false \
  --port=6379 \
  --node-ip-address=$VLLM_HOST_IP \
  --node-manager-port=6380 \
  --object-manager-port=6381

export RAY_ADDRESS=$VLLM_HOST_IP:6379

# Wait until Ray sees 2 GPUs (see the sketch below), then:
python3 -m vllm.entrypoints.openai.api_server \
  --model /model-cache/Mamba-Codestral-7B-v0.1 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```
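For the "wait until Ray sees 2 GPUs" step, a minimal polling sketch (run on the head pod; it only checks the cluster's aggregate GPU count):

```python
# Sketch: block until the Ray cluster reports 2 GPUs before starting vLLM.
import time

import ray

ray.init(address="auto")  # attach to the head started by `ray start --head`

while ray.cluster_resources().get("GPU", 0) < 2:
    print("Waiting for the worker GPU to register:", ray.cluster_resources())
    time.sleep(5)
print("Ray sees 2 GPUs; launching the vLLM API server next.")
```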
WORKER Pod Snippet:

```bash
export RAY_ADDRESS=vllm-head:6379

ray start --disable-usage-stats \
  --include-dashboard=false \
  --address=$RAY_ADDRESS \
  --node-ip-address=$VLLM_HOST_IP \
  --node-manager-port=6380 \
  --object-manager-port=6381 \
  --block

sleep infinity
```
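To confirm both raylets are actually registered, alive, and exporting their GPU, the node table can be inspected from the head pod; a minimal sketch:

```python
# Sketch: list every node Ray knows about, its liveness, and its GPU count.
import ray

ray.init(address="auto")

for node in ray.nodes():
    print(
        node["NodeManagerAddress"],
        "Alive" if node["Alive"] else "DEAD",
        node["Resources"].get("GPU", 0),
        "GPU(s)",
    )
```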
KEY LOGS:

```
Warning: The number of required GPUs exceeds the total number of available GPUs in the placement group. specs=[{'node:10.42.22.33':0.001,'GPU':1.0},{'GPU':1.0}]
...
Total Demands:
  {'GPU':1.0,'node:10.42.22.33':0.001} * 1,
  {'GPU':1.0} * 1 (PACK): 1+ pending placement groups
```
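The demand above is exactly the two 1-GPU bundles vLLM requests, under PACK. A minimal standalone sketch to check whether Ray itself can satisfy the same request with SPREAD (i.e., whether this is a config-propagation problem rather than a capacity problem):

```python
# Sketch: request the same two 1-GPU bundles vLLM needs, but with an explicit
# SPREAD strategy, to see whether Ray can place one bundle on each node.
import ray
from ray.exceptions import GetTimeoutError
from ray.util.placement_group import placement_group, placement_group_table

ray.init(address="auto")

pg = placement_group([{"GPU": 1}, {"GPU": 1}], strategy="SPREAD")
try:
    ray.get(pg.ready(), timeout=60)
    print("SPREAD placement group scheduled:", placement_group_table(pg))
except GetTimeoutError:
    print("Still pending after 60s: Ray cannot place one 1-GPU bundle per node.")
```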
Steps I've tried:

- Exported `VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}'` in both the head and worker pods.
- Increased CPU/memory requests for the worker's raylet.
- Tuned Ray's health-check timeouts (`health_check_period_ms`, `health_check_timeout_ms`, etc.).
Despite all this, the second TP bundle never schedules (still PACK), and the worker raylet eventually dies of missed heartbeats.
Has anyone successfully run vLLM 0.9.2 + Ray 2.x in pure tensor-parallel multi-node mode? Any help is appreciated!