Ray Serve LLM example from the documentation does not work

Hi everyone,
I'm a beginner with Ray Serve. I'm following the Ray Serve LLM example (https://docs.ray.io/en/latest/serve/llm/serving-llms.html).

It fails in several ways. Here is my code (almost identical to the doc):

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        num_replicas=1,
        # autoscaling_config=dict(
        #     min_replicas=1, max_replicas=2,
        # ),
        # ray_actor_options=dict(num_cpus=8, num_gpus=0.3),
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="P100",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        distributed_executor_backend="ray",
        tensor_parallel_size=2,
        # pipeline_parallel_size=2
    ),
)

deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app)

My system has one node with 24 CPUs and 2 P100 GPUs. I start the cluster with:

ray start --head --num-cpus=24 --num-gpus=2

I have tried tuning the config in many ways, but it always hits one of two errors:

  1. AttributeError: 'AsyncEngineArgs' object has no attribute 'parallel_config'
  2. Three lines that repeat forever (> 1 h); the exact numbers differ depending on the config tuning:
  • WARNING 2025-04-02 12:25:52,164 controller 20797 -- Deployment 'vLLM:qwen-0_5b' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: [{"GPU": 0.3, "CPU": 8.0}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}], total resources available: {}. Use ray status for more details.
  • (ServeController pid=20797) WARNING 2025-04-02 12:26:08,948 controller 20797 -- Deployment 'LLMRouter' in application 'default' has 2 replicas that have taken more than 30s to initialize.
  • (ServeController pid=20797) This may be caused by a slow init or reconfigure method.

I also tried another cluster with two nodes, each with 40 CPUs and 1 GTX3090. I ran the same code above with config tuning and got the same errors. I also followed the recommendation to set tensor_parallel_size to the number of GPUs per node and pipeline_parallel_size to the number of nodes, and it still does not work.

I just want to get the demo working, but it doesn't. Please help me solve this.

Note: I don't think anything is wrong with the cluster. I followed the vLLM distributed serving tutorial https://docs.vllm.ai/en/latest/serving/distributed_serving.html
on this cluster and it worked like a charm; I tested with multiple requests and the nodes' GPUs ran fine.

Sorry for my bad English, and thank you, guys.

A few questions:

AttributeError: 'AsyncEngineArgs' object has no attribute 'parallel_config' suggests that there might be a vllm version mismatch. What is your vllm version? Can you try 0.7.2 and ray 2.44.0?

and
Resources required for each replica: [{"GPU": 0.3, "CPU": 8.0}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}, {"GPU": 1.0, "accelerator_type:P100": 0.001}], total resources available: {}. Use ray status for more details.

suggests that the P100 label is not defined properly on your cluster. If you only have this one type of accelerator on a static cluster (without autoscaling), you can skip specifying accelerator_type; it will fall back to the plain GPU label, which will grab any GPU in your cluster. Can you try these changes and let me know how it goes?
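
For example, your config without the accelerator hint could look like this (a sketch reusing your settings from above):

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(num_replicas=1),
    # No accelerator_type: each replica only requests plain GPU resources,
    # so Ray can place it on any GPU in the cluster.
    engine_kwargs=dict(tensor_parallel_size=2),
)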

Also, you don't need to specify distributed_executor_backend="ray"; we will automatically use that since this is launched through Ray Serve LLM.

Thanks for your response. My vllm is 0.8.2 and ray is 2.44.0; I had tried installing the latest versions.

I tried dropping the accelerator type and distributed_executor_backend="ray", and it still does not work.

I am personally puzzled by "total resources available: {}". Does it mean my resources are not recognized? And in "accelerator_type:P100": 0.001, what does the "0.001" mean?

Besides, I have one more question: I don't see the GTX3090 among the supported GPUs, so does that mean I cannot use it? I'm confused, because the vLLM tutorial above works well on a cluster that uses it.

Yeah, so I think 0.8.2 switched to USE_VLLM_V1="1" by default. We will support 0.8.2 in the next release of Ray (2.45 most likely). You should either downgrade to 0.7.2 or set USE_VLLM_V1="0" in the runtime_env:

llm_config = LLMConfig(runtime_env={"env_vars": {"USE_VLLM_V1": "0"}}, ...)

So there are two issues here, the version mismatch above being the first.

The fact that P100 is not recognized is a separate issue. The 0.001 forces Ray's scheduler to look for a resource labeled accelerator_type:P100 during placement. If that resource is not defined on your cluster, then you should not use it; just don't provide accelerator_type in the LLMConfig. Can you show the output of ray status?
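
For reference, if you did want the label on a static cluster, you could define the resource yourself when starting the node. A sketch based on your earlier ray start command (on supported accelerators Ray normally detects this automatically):

# Expose the custom resource that the 0.001 requests are matched against.
ray start --head --num-cpus=24 --num-gpus=2 --resources='{"accelerator_type:P100": 1}'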

Hi, I downgraded to vllm==0.7.2 and ray==2.44.1. I'm running the code below on a two-node cluster with 1 GTX3090 per node. My code looks like this:

import ray
from ray import serve
from ray.runtime_env import RuntimeEnv
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter, ModelLoadingConfig, build_openai_app, LLMServingArgs

env = RuntimeEnv(
    # uv=['vllm','ray[serve,llm]'],
    env_vars=dict(
        # HF_HUB_ENABLE_HF_TRANSFER="1",
        USE_VLLM_V1="0")
)
ray.init("auto")
print(ray.available_resources())

llm_config = LLMConfig(
    model_loading_config=ModelLoadingConfig(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",),
    runtime_env=env,
    deployment_config=dict(
        num_replicas=1,),
    engine_kwargs=dict(
        tensor_parallel_size=1, # gpu per node
        pipeline_parallel_size=2), # n nodes
)
app = build_openai_app(LLMServingArgs(llm_configs=[llm_config]))
serve.run(app)

I summarize the results in order:

  1. The available resources printout: {'GPU': 2.0, 'accelerator_type:G': 2.0, 'memory': 145023374541.0, 'CPU': 78.0, 'object_store_memory': 51465784114.0, 'node:internal_head': 0.999, 'node:172.16.212.217': 1.0, 'node:172.16.205.84': 1.0}
  2. Error: (TemporaryActor pid=19034, ip=172.16.205.84) AttributeError: 'ParallelConfig' object has no attribute 'world_size_across_dp'
  3. Error: (ServeController pid=257858) ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::RayWorkerWrapper.__init__() (pid=19034, ip=172.16.205.84, actor_id=8dcf211b16ddfcbd31d625f601000000, repr=<vllm.executor.ray_utils.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7efdc0333a30>)
  4. (ServeController pid=257858) ERROR 2025-04-03 04:57:26,812 controller 257858 -- Exception in Replica(id='izml7trl', deployment='LLMDeployment:qwen-0_5b', app='default'), the replica will be stopped.
  5. (ServeController pid=257858) AttributeError: 'AsyncEngineArgs' object has no attribute 'parallel_config'
  6. RuntimeError: Deploying application default failed: Failed to update the deployments ['LLMDeployment:qwen-0_5b'].
# pyproject.toml
requires-python = ">=3.10"
dependencies = [
    "botocore>=1.37.23",
    "gymnasium>=1.1.1",
    "hf-transfer>=0.1.9",
    "openapi>=2.0.0",
    "pynvml>=12.0.0",
    "ray[llm,serve]>=2.44.1",
    "vllm==0.7.2",
    "xgrammar>=0.1.16",
]

In the end, it still does not work :frowning:

@Hi_u_Bui_Nguy_n_Trun I cannot reproduce this issue.

  1. Does your setup work on a single node? Just change pipeline_parallel_size=1 and it should use the head node's resources.
  2. Can you print the vllm version on both of your nodes? You can create a Ray task that requires a GPU and prints the vllm version (see the sketch below); both nodes should show the same version. AttributeError: 'AsyncEngineArgs' object has no attribute 'parallel_config' still tells me the vllm version is messed up.
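
Something like this quick sketch (assuming one GPU per node, so the two tasks land on different nodes):

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def report_vllm_version():
    import vllm
    # Each task sees whatever vllm is installed locally on its node.
    return ray.util.get_node_ip_address(), vllm.__version__

# One task per GPU; with 1 GPU per node, each node reports its own version.
print(ray.get([report_vllm_version.remote() for _ in range(2)]))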

@kourosh No, I use a two-node cluster with 1 GTX3090 each. But it doesn't matter now; it was my bad, I had installed two different versions of vllm on the two nodes :smiley: . I downgraded both nodes, ran the script on the head node, and now it WORKS.

But now I'm a little confused. So a Ray node uses its locally installed libraries to run? If so, how do I sync all dependencies on all nodes with the node where I run the code? I want to do something like:
Head node: create a venv, install Ray and all deps, start the Ray head.
Worker node: create a venv, install Ray, start the Ray worker, and have all remaining deps installed automatically.
I also want to set env_vars on the other nodes, e.g. install vllm==0.8.x and set USE_VLLM_V1="0".
In other words, I want the node where I run the script to be able to control the environments of all the others.
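
Something like this sketch is what I imagine, using Ray's runtime_env (the package version is just an example):

import ray

# What I want: ship both packages and env vars from the driver to every node,
# instead of installing them manually on each machine.
ray.init(
    address="auto",
    runtime_env={
        "pip": ["vllm==0.7.2"],            # installed per-job on each worker node
        "env_vars": {"USE_VLLM_V1": "0"},  # propagated to all tasks and actors
    },
)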

Thank you