Running a vLLM script on a multi-node cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hey, I have a weird issue running vLLM on two nodes with Ray.
I have the two nodes connected by running `ray start --head` on one and `ray start --address='HOST_IP_ADDRESS'` on the other.

After I connect them, I run `python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2`, as seen here.
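
For completeness, the full sequence looks roughly like this (HOST_IP_ADDRESS is a placeholder for whatever address `ray start --head` prints, typically the head node's IP plus the default port 6379):

# On the first (head) node:
ray start --head

# On the second node, pointing at the head node's address:
ray start --address='HOST_IP_ADDRESS:6379'

# Back on the head node, launch the vLLM API server across both GPUs:
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2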

However, all I see are the following messages:
2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS...
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)

But after that it hangs, and eventually quits.

Also, I know the nodes are connected, because I can see both of them when I run `ray status`:

======== Autoscaler status: 2024-01-24 14:21:29.814499 ========
Node status

Active:
1 node_25992b0acd7ed1a3eb4cef75f2a31569d0dfd958d9bd87b00b9a08e0
1 node_8f56678537c29df2ad7e19d8bfedf530fd4759d17ab729653e16bab3
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
0.0/8.0 CPU
2.0/2.0 GPU (2.0 used of 2.0 reserved in placement groups)
0B/2.96TiB memory
0B/37.25GiB object_store_memory

And CUDA is installed, since I can run `nvidia-smi`.
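
For reference, this is roughly what I mean; the NCCL environment variables are only a guess at how one might get more detail out of the hang, not something taken from the output above:

# Run on each node to confirm the driver and GPU are visible:
nvidia-smi

# Guess: relaunch with NCCL debug logging enabled so cross-node connection attempts show up in the output.
export NCCL_DEBUG=INFO
# export NCCL_SOCKET_IFNAME=eth0   # guess: pin NCCL to the network interface the nodes actually share
python -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2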

Update
After some time it crashes and returns:

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
[2024-01-24 14:28:37,702 E 2442671 2445894] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerVllm pid=2445019) [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
(RayWorkerVllm pid=2445019) [2024-01-24 14:28:37,723 E 2445019 2445893] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800371 milliseconds before timing out.
[2024-01-24 14:28:38,605 E 2442671 2445894] logging.cc:104: Stack trace:
/home/jvinolus/.conda/envs/api_test/lib/python3.10/site-packages/ray/_raylet.so(+0xfebb5a) [0x7f98c2944b5a] ray::operator<<()
/home/jvinolus/.conda/envs/api_test/lib/python3.10/site-packages/ray/_raylet.so(+0xfee298) [0x7f98c2947298] ray::TerminateHandler()
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f994c27435a] __cxxabiv1::__terminate()
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7f994c2743c5]
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xb134f) [0x7f994c27434f]
/home/jvinolus/.conda/envs/api_test/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xc86f5b) [0x7f99076e3f5b] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/mnt/home/jvinolus/.conda/envs/api_test/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7f994c29ebf4] execute_native_thread_routine
/lib64/libc.so.6(+0x9f802) [0x7f998789f802] start_thread
/lib64/libc.so.6(+0x3f450) [0x7f998783f450] __GI___clone3

*** SIGABRT received at time=1706135318 on cpu 26 ***
PC: @ 0x7f99878a154c (unknown) __pthread_kill_implementation
@ 0x7f9987854db0 (unknown) (unknown)
[2024-01-24 14:28:38,606 E 2442671 2445894] logging.cc:361: *** SIGABRT received at time=1706135318 on cpu 26 ***
[2024-01-24 14:28:38,606 E 2442671 2445894] logging.cc:361: PC: @ 0x7f99878a154c (unknown) __pthread_kill_implementation
[2024-01-24 14:28:38,606 E 2442671 2445894] logging.cc:361: @ 0x7f9987854db0 (unknown) (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, pyarrow.lib, pyarrow._hdfsio, pyarrow._json (total: 56)
Aborted (core dumped)

Did you ever resolve this?

I have exactly the same problem, verbatim.

I have also tried `ray up example-full.yaml`, which launches a cluster and appears to work, but when I pass `--tensor-parallel-size=4` to account for the 2 GPUs on each of the two machines, it tells me there are not enough GPUs available.

It does work with `--tensor-parallel-size=2`, but then it would also work without Ray, since that only needs the 2 GPUs on one machine...
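
As a sanity check on what Ray actually sees, a minimal sketch, assuming it is run on a node that is already part of the cluster started by `ray up`:

# Attach to the running cluster and print the totals Ray has registered;
# with 2 GPUs on each of the two machines this should report GPU: 4.0.
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"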

Also, `ray down example-full.yaml` simply won't work: it hangs at "Destroying cluster. Confirm [y/N]:", and after I type "y" it just goes to a new line and does nothing.

Opening a new terminal and running `ray status` clearly shows the Ray cluster still running, and this is 10 minutes after running `ray down`.
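
For reference, a manual fallback (assuming SSH access to each machine) is to stop Ray locally on every node:

# Run on every node in the cluster:
ray stop

# If the processes refuse to exit cleanly, there is a forceful variant:
ray stop --force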

You posted 16 days ago, and there is nothing in response?