Hi everyone,
I'm new to Ray Serve. I'm following the Ray Serve LLM example (https://docs.ray.io/en/latest/serve/llm/serving-llms.html), and it fails in several ways. Here is my code (almost identical to the doc):
```python
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        num_replicas=1,
        # autoscaling_config=dict(
        #     min_replicas=1, max_replicas=2,
        # ),
        # ray_actor_options=dict(num_cpus=8, num_gpus=0.3),
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="P100",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        distributed_executor_backend="ray",
        tensor_parallel_size=2,
        # pipeline_parallel_size=2,
    ),
)

deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)
```
My system has a single node with 24 CPUs and 2 P100 GPUs. I start the cluster with:

```shell
ray start --head --num-cpus=24 --num-gpus=2
```
No matter how I tune the config, I always hit one of two errors:

- `AttributeError: 'AsyncEngineArgs' object has no attribute 'parallel_config'`
- The following three lines repeat forever (for over an hour; the exact numbers vary with the config tuning):

```
WARNING 2025-04-02 12:25:52,164 controller 20797 -- Deployment 'vLLM:qwen-0_5b' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"GPU": 0.3, "CPU": 8.0}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}], total resources available: {}. Use `ray status` for more details.
(ServeController pid=20797) WARNING 2025-04-02 12:26:08,948 controller 20797 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=20797) This may be caused by a slow init or reconfigure method.
```
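If I add up the resource bundles from that scheduling warning myself (just plain arithmetic on the logged numbers, not any Ray API), the replica seems to ask for more GPUs than my whole cluster has, which might be why it never gets scheduled:

```python
# Placement-group bundles copied from the scheduling warning above.
bundles = [
    {"GPU": 0.3, "CPU": 8.0},
    {"GPU": 1.0, "accelerator_type:P100": 0.001},
    {"GPU": 1.0, "accelerator_type:P100": 0.001},
    {"GPU": 1.0, "accelerator_type:P100": 0.001},
    {"GPU": 1.0, "accelerator_type:P100": 0.001},
]

gpus_needed = sum(b.get("GPU", 0.0) for b in bundles)
gpus_available = 2.0  # from `ray start --head --num-gpus=2`

print(f"GPUs needed: {gpus_needed:.1f}, available: {gpus_available:.1f}")
# GPUs needed: 4.3, available: 2.0
```

Is that expected? It looks like one bundle for the deployment actor plus four full-GPU bundles for the vLLM workers, even though I only set `tensor_parallel_size=2`.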
I also tried another cluster with two nodes, each with 40 CPUs and one RTX 3090. Running the same code (again tuning the config various ways) gives the same errors. I also followed the recommendation to set the tensor parallel size to the number of GPUs per node and the pipeline parallel size to the number of nodes, and it still does not work.
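Concretely, following that recommendation on the two-node cluster, I swapped in `engine_kwargs` like this (the rest of the `LLMConfig` unchanged from the code above):

```python
# engine_kwargs I tried on the two-node cluster, following the advice:
# tensor_parallel_size = GPUs per node, pipeline_parallel_size = number of nodes.
engine_kwargs = dict(
    distributed_executor_backend="ray",
    tensor_parallel_size=1,    # 1 GPU per node (one RTX 3090 each)
    pipeline_parallel_size=2,  # 2 nodes
)
```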
I just want to get the demo working, but I can't. Please help me figure out what is wrong.
Note: I don't think anything is wrong with the cluster itself. I followed the vLLM distributed serving tutorial (https://docs.vllm.ai/en/latest/serving/distributed_serving.html) on the same cluster and it worked like a charm: I tested with multiple requests, and the GPUs on both nodes ran fine.
Sorry for my bad English, and thank you, guys.