Running a vLLM script on a multi-node cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hey, I have a weird issue running vLLM on two nodes with Ray.
I have the two nodes connected by running `ray start --head` on one and `ray start --address='HOST_IP_ADDRESS'` on the other.

After I connect them, I run `python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2`, as seen here.
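
For completeness, the full sequence looks roughly like this (HOST_IP_ADDRESS is a placeholder for whatever address `ray start --head` prints, typically the head node's IP plus the default port 6379):

# On the first (head) node:
ray start --head

# On the second node, pointing at the head node's address:
ray start --address='HOST_IP_ADDRESS:6379'

# Back on the head node, launch the vLLM API server across both GPUs:
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2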

However, all I see are the following messages:
2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS...
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)

But after that it hangs, and eventually quits.

Also, I know the nodes are connected, because I can see both of them when I run `ray status`:

======== Autoscaler status: 2024-01-24 14:21:29.814499 ========
Node status

Active:
1 node_25992b0acd7ed1a3eb4cef75f2a31569d0dfd958d9bd87b00b9a08e0
1 node_8f56678537c29df2ad7e19d8bfedf530fd4759d17ab729653e16bab3
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
0.0/8.0 CPU
2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
0B/2.96TiB memory
0B/37.25GiB object_store_memory

And CUDA is installed, since I can run `nvidia-smi`.
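
For reference, this is roughly what I mean; the NCCL environment variables are only a guess at how one might get more detail out of the hang, not something taken from the output above:

# Run on each node to confirm the driver and GPU are visible:
nvidia-smi

# Guess: relaunch with NCCL debug logging enabled so cross-node connection attempts show up in the output.
export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=eth0   # guess: pin NCCL to the network interface the nodes actually share
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2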

Update
After some time it crashes and returns:

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
[2024-01-24 14:28:37,702 E 2442671 2445894] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
(RayWorkerVllm pid=2445019) [2024-01-24 14:28:37,723 E 2445019 2445893] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
[2024-01-24 14:28:38,605 E 2442671 2445894] logging.cc:104: Stack trace:
/home/jvinolus/.conda/envs/api_test/lib/python3.10/site-packages/ray/_raylet.so(+0xfebb5a) [0x7f98c2944b5a] ray::operator<<()
/home/jvinolus/.conda/envs/api_test/lib/python3.10/site-packages/ray/_raylet.so(+0xfee298) [0x7f98c2947298] ray::TerminateHandler()
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f994c27435a] __cxxabiv1::__terminate()
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f994c2743c5]
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xb134f) [0x7f994c27434f]
/home/jvinolus/.conda/envs/api_test/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7f99076e3f5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7f994c29ebf4] execute_native_thread_routine
/lib64/libc.so.6(+0x9f802) [0x7f998789f802] start_thread
/lib64/libc.so.6(+0x3f450) [0x7f998783f450] __GI___clone3

*** SIGABRT received at time=1706135318 on cpu 26 ***
PC: @ 0x7f99878a154c (unknown) __pthread_kill_implementation
@ 0x7f9987854db0 (unknown) (unknown)
[2024-01-24 14:28:38,606 E 2442671 2445894] logging.cc:361: *** SIGABRT received at time=1706135318 on cpu 26 ***
[2024-01-24 14:28:38,606 E 2442671 2445894] logging.cc:361: PC: @ 0x7f99878a154c (unknown) __pthread_kill_implementation
[2024-01-24 14:28:38,606 E 2442671 2445894] logging.cc:361: @ 0x7f9987854db0 (unknown) (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pyarrow.lib, pyarrow._hdfsio, pyarrow._json (total: 56)
Aborted (core dumped)

Did you ever resolve this?

I have exactly the same problem, verbatim.

I have also tried `ray up example-full.yaml`, which launches a cluster and appears to work, but when I pass `--tensor-parallel-size=4` to account for the 2 GPUs on each of the two machines, it tells me there are not enough GPUs available.

It does work with `--tensor-parallel-size=2`, but then it would also work without Ray, since that only needs the 2 GPUs on one machine...
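
As a sanity check on what Ray actually sees, a minimal sketch, assuming it is run on a node that is already part of the cluster started by `ray up`:

# Attach to the running cluster and print the totals Ray has registered;
# with 2 GPUs on each of the two machines this should report GPU: 4.0.
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"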

Also, `ray down example-full.yaml` simply won't work: it hangs at "Destroying cluster. Confirm [y/N]:", and after I type "y" it just goes to a new line and does nothing.

Opening a new terminal and running `ray status` clearly shows the Ray cluster still running, and this is 10 minutes after running `ray down`.
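
For reference, a manual fallback (assuming SSH access to each machine) is to stop Ray locally on every node:

# Run on every node in the cluster:
ray stop

# If the processes refuse to exit cleanly, there is a forceful variant:
ray stop --force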

You posted 16 days ago, and there is nothing in response?