vLLM + Ray multi-node tensor-parallel deployment completely blocked by pending placement groups and raylet heartbeat failures

1. Severity of the issue:
High: Completely blocks me.

2. Environment:

  1. Ray version: 2.x
  2. vLLM version: 0.9.2
  3. Python version: 3.9
  4. OS / Container base: Linux (CentOS-based UBI8 in Kubernetes)
  5. Cloud / Infrastructure: AWS-based Kubernetes cluster (pods scheduled on Tesla T4 GPU nodes)
  6. Other libs/tools:
    • CUDA 11.x
    • NCCL for cross-node GPU communication

3. What happened vs. what you expected:

| Expected | Actual |
| --- | --- |
| --tensor-parallel-size 2 should shard my 16 GB Mamba-Codestral model across two 16 GB GPUs (one per pod) and launch successfully. | Ray logs show a pending placement group for two 1-GPU bundles (one pinned to the head, one “anywhere”); the second bundle never schedules and vLLM errors out. |
| The worker pod joins the Ray cluster and stays Alive. | The worker’s raylet is repeatedly “marked dead” due to missed heartbeats (even with generous CPU/memory requests), then its core-worker processes crash. |
| Setting VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}' in the environment forces SPREAD placement. | Ray still uses the default PACK strategy, tries to place both shards on one node, the placement group becomes unsatisfiable, and vLLM blocks. |
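
To take vLLM out of the picture, here is a minimal sketch using only the plain Ray API (the node resource name is copied from the log further down; the 60-second timeout is arbitrary) that asks Ray to schedule the same two 1-GPU bundles with SPREAD and reports whether the group ever becomes ready:

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init(address="auto")

# Same bundle shape vLLM requests: one bundle pinned to the head node's
# custom "node:<ip>" resource, one bundle that can land anywhere.
bundles = [{"GPU": 1, "node:10.42.22.33": 0.001}, {"GPU": 1}]

pg = placement_group(bundles, strategy="SPREAD")
try:
    ray.get(pg.ready(), timeout=60)
    print("SPREAD placement group scheduled successfully")
except Exception as exc:  # GetTimeoutError if it stays pending
    print("placement group still pending after 60 s:", exc)
finally:
    remove_placement_group(pg)

If even this standalone group stays pending, the problem is purely on the Ray scheduling side rather than anything vLLM-specific.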

HEAD Pod Snippet:
NOTE: Pipeline parallelism is not supported for the Mamba2 architecture, so pure tensor parallelism is my only option here.

ray start --head \
    --disable-usage-stats \
    --include-dashboard=false \
    --port=6379 \
    --node-ip-address=$VLLM_HOST_IP \
    --node-manager-port=6380 \
    --object-manager-port=6381

export RAY_ADDRESS=$VLLM_HOST_IP:6379

# Wait until Ray sees 2 GPUs (see the sketch after this snippet), then:
python3 -m vllm.entrypoints.openai.api_server \
    --model /model-cache/Mamba-Codestral-7B-v0.1 \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
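
The wait referenced in the comment above is just a poll on ray.cluster_resources(); a simplified sketch of that check (the 5-second poll interval is arbitrary):

import time
import ray

ray.init(address="auto")

# Block until both the head GPU and the worker GPU are registered with Ray.
while ray.cluster_resources().get("GPU", 0) < 2:
    time.sleep(5)
print("Ray reports 2 GPUs; starting the vLLM API server")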

WORKER Pod Snippet:

export RAY_ADDRESS=vllm-head:6379

ray start --disable-usage-stats \
    --include-dashboard=false \
    --address=$RAY_ADDRESS \
    --node-ip-address=$VLLM_HOST_IP \
    --node-manager-port=6380 \
    --object-manager-port=6381 \
    --block

sleep infinity
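
For what it's worth, the worker flapping is easy to watch from the head by polling ray.nodes(); a minimal sketch (the printed fields are standard keys returned by ray.nodes()):

import time
import ray

ray.init(address="auto")

# Print each node's liveness and advertised GPU count so the exact moment
# the worker raylet gets marked dead shows up alongside the vLLM logs.
while True:
    for node in ray.nodes():
        print(node["NodeManagerAddress"],
              "alive" if node["Alive"] else "DEAD",
              node["Resources"].get("GPU", 0))
    print("---")
    time.sleep(10)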

KEY LOGS:

Warning: The number of required GPUs exceeds the total number of available GPUs in the placement group. specs=[{'node:10.42.22.33':0.001,'GPU':1.0},{'GPU':1.0}]
...
Total Demands:
 {'GPU':1.0,'node:10.42.22.33':0.001} * 1,
 {'GPU':1.0} * 1 (PACK): 1+ pending placement groups
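
The pending group can also be inspected programmatically; a short sketch that dumps each placement group's strategy, state, and bundles plus per-node GPU counts (ray.util.placement_group_table() and ray.nodes() are standard Ray APIs; exact field sets may differ slightly across 2.x versions):

import ray
from ray.util import placement_group_table

ray.init(address="auto")

# One entry per placement group: shows whether the strategy Ray actually
# recorded is PACK or SPREAD and which bundles are still unplaced.
for pg_id, info in placement_group_table().items():
    print(pg_id, info["strategy"], info["state"], info["bundles"])

# What each node is currently advertising to the scheduler.
for node in ray.nodes():
    print(node["NodeManagerAddress"],
          "alive" if node["Alive"] else "DEAD",
          node["Resources"].get("GPU", 0))

If the vLLM-created group shows strategy PACK here, that matches the "(PACK)" tag in the demand summary above.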

Steps I’ve tried:

  1. Exported VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}' in both head & worker pods.

  2. Increased CPU/memory requests for the worker’s raylet.

  3. Tuned Ray’s health-check timeouts (health_check_period_ms, health_check_timeout_ms, etc.).

Despite all this, the second TP bundle never schedules (the strategy stays PACK), and the worker raylet eventually dies from missed heartbeats.

Has anyone successfully run vLLM 0.9.2 + Ray 2.x in pure tensor-parallel multi-node mode? Any help is appreciated!