ImportError: cannot import name 'Tensor' from 'torch' (unknown location)?

I am unable to start an LLM server on the cluster.

[aaa@rtx10-15 ~]$ python -m vllm.entrypoints.openai.api_server \
    --model /hgm/Qwen2.5-72B-Instruct/ \
    --tokenizer /hgm/Qwen2.5-72B-Instruct/ \
    --served-model-name Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 3000 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 7 \
    --device cuda
INFO 11-23 22:46:23 api_server.py:585] vLLM API server version 0.6.4.post1
INFO 11-23 22:46:23 api_server.py:586] args: Namespace(host='0.0.0.0', port=3000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/hgm/Qwen2.5-72B-Instruct/', task='auto', tokenizer='/hgm/Qwen2.5-72B-Instruct/', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=7, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cuda', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5-72B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 11-23 22:46:30 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
INFO 11-23 22:46:31 config.py:1020] Defaulting to use ray for distributed inference
WARNING 11-23 22:46:31 arg_utils.py:1075] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 11-23 22:46:31 config.py:479] Async output processing can not be enabled with pipeline parallel
2024-11-23 22:46:31,728 INFO worker.py:1634 -- Connecting to existing Ray cluster at address: 10.0.37.151:6379...
2024-11-23 22:46:31,735 INFO worker.py:1810 -- Connected to Ray cluster. View the dashboard at http://10.0.37.151:8265
INFO 11-23 22:46:32 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='/hgm/Qwen2.5-72B-Instruct/', speculative_config=None, tokenizer='/hgm/Qwen2.5-72B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=7, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-72B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False, multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)
INFO 11-23 22:46:32 ray_gpu_executor.py:134] use_ray_spmd_worker: False
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 643, in <module>
    uvloop.run(run_server(args))
  File "/home/pnrusr/.local/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/pnrusr/.local/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 609, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 113, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client_from_engine_args
    engine_client = build_engine()
                    ^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 691, in from_engine_args
    engine = cls(
             ^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 578, in __init__
    self.engine = self._engine_class(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 264, in __init__
    super().__init__(*args, **kwargs)
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 347, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 512, in __init__
    super().__init__(*args, **kwargs)
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 65, in _init_executor
    self._init_workers_ray(placement_group)
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 158, in _init_workers_ray
    worker_ip = ray.get(worker.get_node_ip.remote())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 2753, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::RayWorkerWrapper.__init__() (pid=13223, ip=10.0.62.243, actor_id=c14c84c2029731bc0647280e06000000, repr=<vllm.executor.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7f484412df50>)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The actor with name RayWorkerWrapper failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:

ray::RayWorkerWrapper.__init__() (pid=13223, ip=10.0.62.243, actor_id=c14c84c2029731bc0647280e06000000, repr=<vllm.executor.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7f484412df50>)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 11, in <module>
    from vllm.config import (CacheConfig, ConfigFormat, DecodingConfig,
  File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/config.py", line 10, in <module>
    from transformers import PretrainedConfig
  File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
    from .utils.versions import require_version, require_version_core
  File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/utils/__init__.py", line 27, in <module>
    from .chat_template_utils import DocstringParsingException, TypeHintParsingException, get_json_schema
  File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/utils/chat_template_utils.py", line 39, in <module>
    from torch import Tensor
ImportError: cannot import name 'Tensor' from 'torch' (unknown location)
(TemporaryActor pid=13223, ip=10.0.62.243) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RayWorkerWrapper.__init__() (pid=13223, ip=10.0.62.243, actor_id=c14c84c2029731bc0647280e06000000, repr=<vllm.executor.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7f484412df50>)
(TemporaryActor pid=13223, ip=10.0.62.243)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TemporaryActor pid=13223, ip=10.0.62.243) RuntimeError: The actor with name RayWorkerWrapper failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:
(TemporaryActor pid=13223, ip=10.0.62.243)
(TemporaryActor pid=13223, ip=10.0.62.243) ray::RayWorkerWrapper.__init__() (pid=13223, ip=10.0.62.243, actor_id=c14c84c2029731bc0647280e06000000, repr=<vllm.executor.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7f484412df50>)
(TemporaryActor pid=13223, ip=10.0.62.243)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/__init__.py", line 3, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 11, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from vllm.config import (CacheConfig, ConfigFormat, DecodingConfig,
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/vllm/config.py", line 10, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from transformers import PretrainedConfig
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/__init__.py", line 26, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from . import dependency_versions_check
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from .utils.versions import require_version, require_version_core
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/utils/__init__.py", line 27, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from .chat_template_utils import DocstringParsingException, TypeHintParsingException, get_json_schema
(TemporaryActor pid=13223, ip=10.0.62.243)   File "/home/pnrusr/.local/lib/python3.11/site-packages/transformers/utils/chat_template_utils.py", line 39, in <module>
(TemporaryActor pid=13223, ip=10.0.62.243)     from torch import Tensor
(TemporaryActor pid=13223, ip=10.0.62.243) ImportError: cannot import name 'Tensor' from 'torch' (unknown location)
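The head node imports vllm and torch fine (see the version checks under "Node config" below), but the actor on 10.0.62.243 dies inside "from torch import Tensor", so the worker environments look inconsistent. For reference, here is a minimal probe sketch of my own (the probe function and its output format are not from any vLLM or Ray tool) that pins an import check onto every Ray node to find where torch is broken; it assumes the cluster is already running and reachable via address="auto":

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

@ray.remote(num_cpus=0)
def probe():
    # Report which interpreter this node runs and whether torch imports there.
    import sys
    try:
        import torch
        return sys.executable, torch.__version__, torch.__file__
    except Exception as exc:
        return sys.executable, "import failed", repr(exc)

for node in ray.nodes():
    if not node["Alive"]:
        continue
    # Pin the task to this exact node so every worker gets checked once.
    strategy = NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
    result = ray.get(probe.options(scheduling_strategy=strategy).remote())
    print(node["NodeManagerAddress"], result)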

Cluster config
======= Autoscaler status: 2024-11-23 23:04:49.700460 ========
Node status

Active:
1 node_2f8705a2ca65023410fe8ead08c7f7eb6c834bfd1a649a3c4bc5d452
1 node_cfee29aa39b885919f8e848c792c0e2702cfc550e8fa170af33e75b2
1 node_a0c09a8a432355b663da89e51fd6d554835508afbd1429775edfb9c3
1 node_b876d47c7d685fc4671ae4126e68721dd82a0bc7735c45f335124f52
1 node_a03370ec7e5e9e241f6c85892d29540cda5342f20a19e03d0f08af13
1 node_0a36748d3c2a786d5ad84c6f7bfe5466959ef580ecca5ad6f573f4e0
1 node_7951e7c73eb1b174c007bbd2ee140f40270c2aa3e8b9231d62b42686
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
0.0/112.0 CPU
0.0/14.0 GPU
0B/147.22GiB memory
0B/64.40GiB object_store_memory
Demands:
(no resource demands)

Node config

Python 3.11.10 (main, Sep 9 2024, 00:00:00) [GCC 12.4.1 20240730 (RED SOFT 12.4.0-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.5.1+cu124'
>>> import ray
>>> ray.__version__
'2.39.0'
>>> import vllm
>>> vllm.__version__
'0.6.4.post1'

Sat Nov 23 23:01:46 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T10                      Off |   00000000:01:00.0 Off |                  Off |
| N/A   35C    P0             36W / 150W  |       1MiB / 16384MiB  |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T10                      Off |   00000000:02:00.0 Off |                  Off |
| N/A   36C    P0             37W / 150W  |       1MiB / 16384MiB  |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
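If I understand the error correctly, the "(unknown location)" suffix means Python resolved torch as a package with no usable __init__.py (for example a leftover or half-installed torch directory being picked up as a namespace package), rather than torch being absent entirely. Here is a quick check of my own (just a sketch, not an official diagnostic) that should print a normal site-packages path on a healthy node and None on the broken one (10.0.62.243 in the traceback):

import importlib.util

spec = importlib.util.find_spec("torch")
# A healthy install resolves to .../site-packages/torch/__init__.py;
# a namespace-only torch has spec.origin of None, which matches the
# "(unknown location)" in the ImportError above.
print(spec.origin if spec else "torch not found")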

I tried to repeat the setup from https://youtu.be/C9ObCXHE-Go?si=TwdKSfurUKUP3ZcV