1. Severity of the issue:
High: Completely blocks me.
2. Environment:
- Ray version: 2.x
- vLLM version: 0.9.2
- Python version: 3.9
- OS / Container base: Linux (RHEL-based UBI8 image in Kubernetes)
- Cloud / Infrastructure: AWS based Kubernetes cluster (pods scheduled on Tesla T4 GPU nodes)
- Other libs/tools:
  - CUDA 11.x
  - NCCL for cross-node GPU communication

**3. What happened vs. what you expected:**
| Expected | Actual |
|---|---|
| `--tensor-parallel-size 2` should shard my 16 GB Mamba-Codestral model across two 16 GB GPUs (one per pod) and launch successfully. | Ray logs show a pending placement group for two 1-GPU bundles (one pinned to the head, one "anywhere") that never schedules the second bundle. vLLM then errors out. |
| The worker pod joins the Ray cluster and stays Alive. | The worker's raylet process is repeatedly "marked dead" due to missed heartbeats (even with generous CPU/memory requests), then its core-worker processes crash. |
| Setting `VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}'` in the environment should force SPREAD placement. | Ray still uses the default PACK strategy and tries to place both shards on one node, so the placement group is unsatisfiable and vLLM blocks. |
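(For isolating this from the OpenAI server entrypoint: a minimal offline-driver sketch that should issue the same two-GPU Ray placement request, assuming `LLM()` accepts the same `tensor_parallel_size` / `distributed_executor_backend` arguments as the CLI flags and using the same model path as in the snippets below.)

```python
# Minimal sketch: reproduce the same 2-GPU Ray placement request from an
# offline driver instead of the OpenAI API server (same model path as below).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/model-cache/Mamba-Codestral-7B-v0.1",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)
print(llm.generate(["ping"], SamplingParams(max_tokens=8)))
```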
HEAD Pod Snippet:

NOTE: Pipeline parallelism is not supported for the mamba2 arch.

```bash
ray start --head \
  --disable-usage-stats \
  --include-dashboard=false \
  --port=6379 \
  --node-ip-address=$VLLM_HOST_IP \
  --node-manager-port=6380 \
  --object-manager-port=6381

export RAY_ADDRESS=$VLLM_HOST_IP:6379

# Wait until Ray sees 2 GPUs (see the sketch below), then:
python3 -m vllm.entrypoints.openai.api_server \
  --model /model-cache/Mamba-Codestral-7B-v0.1 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```
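For the "wait until Ray sees 2 GPUs" step, a minimal polling sketch (run on the head pod; it only checks the cluster's aggregate GPU count):

```python
# Sketch: block until the Ray cluster reports 2 GPUs before starting vLLM.
import time

import ray

ray.init(address="auto")  # attach to the head started by `ray start --head`

while ray.cluster_resources().get("GPU", 0) < 2:
    print("Waiting for the worker GPU to register:", ray.cluster_resources())
    time.sleep(5)
print("Ray sees 2 GPUs; launching the vLLM API server next.")
```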
WORKER Pod Snippet:

```bash
export RAY_ADDRESS=vllm-head:6379

ray start --disable-usage-stats \
  --include-dashboard=false \
  --address=$RAY_ADDRESS \
  --node-ip-address=$VLLM_HOST_IP \
  --node-manager-port=6380 \
  --object-manager-port=6381 \
  --block

sleep infinity
```
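To confirm both raylets are actually registered, alive, and exporting their GPU, the node table can be inspected from the head pod; a minimal sketch:

```python
# Sketch: list every node Ray knows about, its liveness, and its GPU count.
import ray

ray.init(address="auto")

for node in ray.nodes():
    print(
        node["NodeManagerAddress"],
        "Alive" if node["Alive"] else "DEAD",
        node["Resources"].get("GPU", 0),
        "GPU(s)",
    )
```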
KEY LOGS:

```
Warning: The number of required GPUs exceeds the total number of available GPUs in the placement group. specs=[{'node:10.42.22.33':0.001,'GPU':1.0},{'GPU':1.0}]
...
Total Demands:
  {'GPU':1.0,'node:10.42.22.33':0.001} * 1,
  {'GPU':1.0} * 1 (PACK): 1+ pending placement groups
```
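The demand above is exactly the two 1-GPU bundles vLLM requests, under PACK. A minimal standalone sketch to check whether Ray itself can satisfy the same request with SPREAD (i.e., whether this is a config-propagation problem rather than a capacity problem):

```python
# Sketch: request the same two 1-GPU bundles vLLM needs, but with an explicit
# SPREAD strategy, to see whether Ray can place one bundle on each node.
import ray
from ray.exceptions import GetTimeoutError
from ray.util.placement_group import placement_group, placement_group_table

ray.init(address="auto")

pg = placement_group([{"GPU": 1}, {"GPU": 1}], strategy="SPREAD")
try:
    ray.get(pg.ready(), timeout=60)
    print("SPREAD placement group scheduled:", placement_group_table(pg))
except GetTimeoutError:
    print("Still pending after 60s: Ray cannot place one 1-GPU bundle per node.")
```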
Steps I've tried:

- Exported `VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}'` in both the head and worker pods.
- Increased CPU/memory requests for the worker's raylet.
- Tuned Ray's health-check timeouts (`health_check_period_ms`, `health_check_timeout_ms`, etc.).
Despite all this, the second TP bundle never schedules (still PACK), and the worker raylet eventually dies of missed heartbeats.
Has anyone successfully run vLLM 0.9.2 + Ray 2.x in pure tensor-parallel multi-node mode? Any help is appreciated!