torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to

1. Severity of the issue: High (completely blocks me).

2. Environment:

  • Ray version: 2.46.0
  • Python version: 3.10.16
  • OS: Ubuntu 22.04 LTS
  • Cloud/Infrastructure: Azure VMSS
  • Other libs/tools (if relevant): vLLM 0.8.5.post1

Hi, I am trying to host DeepSeek-R1 as per this tutorial: Serve DeepSeek — Ray 2.46.0.

Below is the config I have used for deploying the model:


http_options:
  host: 0.0.0.0
  port: 22300

applications:
- args:
    llm_configs:
        - model_loading_config:
            model_id: deepseek-ai/DeepSeek-R1
            model_source: /lustrefs/path_to_model/hf_hub/DeepSeek-R1
          deployment_config:
            autoscaling_config:
                min_replicas: 1
                max_replicas: 1
          runtime_env:
            env_vars:
                VLLM_USE_V1: "0"
          engine_kwargs:
                tensor_parallel_size: 8
                pipeline_parallel_size: 2
                gpu_memory_utilization: 0.8
                dtype: "auto"
                max_num_seqs: 20
                max_model_len: 8192
                enable_chunked_prefill: true
                enable_prefix_caching: true
                trust_remote_code: false
  import_path: ray.serve.llm:build_openai_app
  name: deepseek
  route_prefix: "/"

I am using 2 nodes with 8 H100 GPUs on each node. However, if I deploy with the above config, I come across the following error.

 File "/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1714, in init_process_group
 store, rank, world_size = next(rendezvous_iterator)
 File "/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 226, in _tcp_rendezvous_handler
 store = _create_c10d_store(
 File "/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 194, in _create_c10d_store
 return TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (10.199.1.40, 60605).

If I set VLLM_USE_V1 to true within the deployment config, I see the following issue instead; however, I am able to serve models that fit on a single GPU.

This may be caused by a slow __init__ or reconfigure method.
ERROR 2025-05-25 15:39:01,027 controller 1599002 -- Exception in Replica(id='wsxcezc4', deployment='LLMDeployment:DeepSeek-R1', app='deepseek'), the replica will be stopped.
Traceback (most recent call last):
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 694, in check_ready
    ) = ray.get(self._ready_obj_ref)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:deepseek:LLMDeployment:DeepSeek-R1.initialize_and_get_metadata() (pid=1620132, ip=10.xxx.x.40, actor_id=35bac4d9f170d2e4a246035405000000, repr=<ray.serve._private.replica.ServeReplica:deepseek:LLMDeployment:DeepSeek-R1 object at 0x14719acf20e0>)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 984, in initialize_and_get_metadata
    await self._replica_impl.initialize(deployment_config)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 713, in initialize
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 690, in initialize
    self._user_callable_asgi_app = await asyncio.wrap_future(
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 1384, in initialize_callable
    await self._call_func_or_gen(
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 1347, in _call_func_or_gen
    result = await result
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 440, in __init__
    await asyncio.wait_for(self._start_engine(), timeout=ENGINE_START_TIMEOUT_S)
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/llm/_internal/serve/deployments/llm/llm_server.py", line 486, in _start_engine
    await self.engine.start()
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 232, in start
    self.engine = await self._start_engine()
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 272, in _start_engine
    return await self._start_engine_v1()
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 350, in _start_engine_v1
    return self._start_async_llm_engine(
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/ray/llm/_internal/serve/deployments/llm/vllm/vllm_engine.py", line 464, in _start_async_llm_engine
    return vllm.engine.async_llm_engine.AsyncLLMEngine(
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/dnc/arshad.shaikh/miniconda3/envs/rayenv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "

This issue has completely blocked me and my team from using Ray in a multi-node setup. Can someone please help me out with this issue?

Regards,
Arshad

Below are some additional logs:

^^^^^^^^^^^^
File "/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 450, in __init__
self.utils = CudaUtils() # TODO: make static
^^^^^^^^^^^
File "/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/site-packages/triton/runtime/build.py", line 50, in _build
ret = subprocess.check_call(cc_cmd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/subprocess.py", line 415, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpxude5cw6/main.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', '/tmp/tmpxude5cw6/cuda_utils.cpython-312-x86_64-linux-gnu.so', '-lcuda', '-L/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/site-packages/triton/backends/nvidia/lib', '-L/lib/x86_64-linux-gnu', '-I/inc_users/arshad.shaikh/miniconda3/envs/bray/lib/python3.12/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpxude5cw6', '-I/inc_users/arshad.shaikh/miniconda3/envs/bray/include/python3.12']' returned non-zero exit status 1

Hi @arshadshaikh, you need to set VLLM_USE_V1 to 1 in the env_vars.
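
A minimal sketch of that change against the config in the original post (only the VLLM_USE_V1 value changes, everything else stays the same):

runtime_env:
  env_vars:
      # switch from "0" to "1" to enable the vLLM V1 engine
      VLLM_USE_V1: "1"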

Can you show fuller stack traces? What triggered the Triton code paths in the additional logs in your second message? I suspect there is something incorrect in the environment setup.

Hi @arshadshaikh, could you also add the following environment variables (a sketch of where they go in the config is below):

  • NCCL_DEBUG=TRACE
  • NCCL_DEBUG_SUBSYS=ALL
  • VLLM_LOGGING_LEVEL=DEBUG

Also, please share your NCCL version, plus the version of Azure/msccl-executor-nccl (GitHub) if your cluster uses it?
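
For reference, a sketch of where those variables would go in the Serve config from the original post (the values are just the suggested debug settings; adjust as needed):

runtime_env:
  env_vars:
      VLLM_USE_V1: "1"             # per the earlier reply
      NCCL_DEBUG: "TRACE"          # verbose NCCL tracing
      NCCL_DEBUG_SUBSYS: "ALL"     # log all NCCL subsystems
      VLLM_LOGGING_LEVEL: "DEBUG"  # verbose vLLM logging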