GPU Actors always pending with Ray Serve and Ray v2.0.0

ts-cfield · September 23, 2022, 9:34pm

We have been using Ray v1.13.0 and the Ray Serve component for our application
and are attempting to migrate to Ray v2. In short, Ray v2 is not working for us.
When we try to start a new actor that uses a GPU, it stays stuck in the
PENDING_CREATION state. With Ray v1.13.0, however, the exact same code spins up
a GPU actor that transitions to the Alive state.

A greatly simplified (hopefully working and reproducible) version of the Ray v1.13 code:

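#!/usr/bin/env python3

# Deploy script Ray v1

import ray
from ray import serve  # makes sure the ray.serve submodule is loaded

import api  # local module that defines the Api deployment (shown below)

# Connect to the running cluster; Serve was already started by the entrypoint script.
ray.init(address="auto", namespace="serve")

# Api.__init__ takes no arguments, so nothing is passed to deploy().
api.Api.deploy()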

#!/usr/bin/env python3

# Driver script, which is executed as `python3 -m driver` from within the docker container for the `CMD`.

import os
import time

# Run the deploy script once, then keep the container alive.
os.system("python3 -m deploy")

while True:
    time.sleep(5)

#!/usr/bin/env python3

# FastAPI app and Serve deployment class

from typing import Dict, List

import ray
from fastapi import FastAPI
from ray import serve

from gpu_actor import GpuActor  # module name assumed; holds the GPU actor shown below

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class Api:
    def __init__(self):
        self._actors: List[Dict] = []
        self._next_id: int = 0

    @property
    def next_id(self) -> int:
        self._next_id = self._next_id + 1
        return self._next_id

    # Called remotely by GpuActor once its constructor runs.
    def initialized(self, id: int):
        self._actors.append(id)

    @app.post("/actor")
    def create_actor(self):
        # Pass a handle to this deployment so the actor can call back into it.
        deployment = serve.get_deployment(self.__class__.__name__)
        this_actor = deployment.get_handle()
        id = self.next_id
        handle = GpuActor.remote(id, this_actor)
        self._actors.append({"id": id, "handle": handle})
        return id

#!/usr/bin/env python3

# GPU Actor

import ray
from ray.actor import ActorHandle

@ray.remote
class GpuActor:
    def __init__(self, id: int, manager: ActorHandle):
        self._id = id
        self._manager = manager
        self._manager.initialized.remote(self._id)
       
    def id(self) -> int:
        return self._id
    
    def run(self):
        pass

We are using the deployment class as an actor manager because we have other
endpoints and functionality to stop/spin down the GPU actors and we wanted to
minimize the number of “support” actors using up CPU resources, i.e. cores, in our resource constrained targets.

With Ray v1.13.0, GpuActor executes the initialized remote method of the Api
deployment class, and the Ray Dashboard shows the GPU Actor as alive and using
a GPU. When we bump to Ray v2.0.0, the initialized remote method is never
executed, and the Ray Dashboard shows the GPU Actor in the PENDING_CREATION
state.
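
For anyone who wants to confirm the Dashboard reading from a script, here is a minimal sketch that tallies actors by state. It is not part of our application. It assumes actor records are available as plain dictionaries with `class_name` and `state` keys, roughly what the alpha state API in Ray 2.x (`ray list actors`) reports; the exact record shape is an assumption and may differ between releases. The helper names `tally_states` and `pending_actors` and the sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def tally_states(records: Iterable[Dict[str, str]]) -> Counter:
    """Count actor records by their "state" field."""
    return Counter(r.get("state", "UNKNOWN") for r in records)


def pending_actors(
    records: Iterable[Dict[str, str]], class_name: str
) -> List[Dict[str, str]]:
    """Return records of the given class that are still waiting to be created."""
    return [
        r
        for r in records
        if r.get("class_name") == class_name
        and r.get("state") == "PENDING_CREATION"
    ]


# Hypothetical records, shaped like the assumption described above.
records = [
    {"actor_id": "a1", "class_name": "SomeCpuActor", "state": "ALIVE"},
    {"actor_id": "a2", "class_name": "GpuActor", "state": "PENDING_CREATION"},
]

print(tally_states(records))
print([r["actor_id"] for r in pending_actors(records, "GpuActor")])
```

Running the same tally against both Ray versions would show whether the GpuActor record ever leaves PENDING_CREATION, independent of what the Dashboard renders.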

There is a little more to the configuration. The above Python code is run inside
a docker container with the following simplified entrypoint script.

# Entrypoint script for CPU-only head node docker container

# Display Version
ray --version

ray start --head --num-gpus=0

serve start --http-port=7777

exec "$@"

The head container/node does not have any GPUs. A separate GPU node is attached
to this Ray cluster with the following entrypoint script.

# Entrypoint script for GPU worker node docker container

# Display Version
ray --version

ray start --address=<HEAD NODE IP ADDRESS>

# python3 -c 'import ray; ray.init(address="auto"); print("Node initialized: {}".format(ray.is_initialized()))'

# Sleep
sleep infinity

With Ray v1.13, we are able to start the head node and “API” Python code. We
initially see no GPUs available. Then, we start up the GPU worker node and we
can see it added to the cluster in the Ray Dashboard. In one such environment,
the GPU worker node has two GPUs and we see both GPUs for the node in the Ray
Dashboard. With Ray v2.0.0, we see an identical configuration and display in the
Ray Dashboard.

We recognized that Ray v2.0.0 and the Ray Serve component have a new API and
deprecated much of the Ray Serve API and CLI that we are using in Ray v1.13. So,
we tried to migrate to Ray v2 following the migration guide. However, we are not
using the default HTTP port of 8000, but 7777, and we had to implement a
workaround based on this comment and issue. So, the Ray v2 deployment script looks like:

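#!/usr/bin/env python3

# Deploy script Ray v2

import ray
from ray import serve

import api  # local module that defines the Api deployment

ray.init(address="auto", namespace="serve")

# Shut down any running Serve instance before redeploying.
serve.shutdown()

deployment = api.Api.options(route_prefix="/api").bind()

# Workaround: pass the non-default HTTP port directly to serve.run().
serve.run(deployment, port=7777)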

We removed serve start --http-port=7777 from the entrypoint script and we
removed route_prefix from the serve.deployment decorator for the Api
class. We left the deprecated get_deployment and get_handle usage because we
could not figure out how to replicate this functionality with the Ray v2 “bind”
API.

Despite migrating as best we could to the Ray v2 API, the initialized remote
method is never executed and a PENDING_CREATION actor is observed in the Ray
Dashboard.

Because of the issue with the HTTP port configuration, we could not use the
procedure recommended in the Ray v2 documentation for deployments with the CLI,
i.e., serve run deploy:api. We are not sure whether the HTTP port workaround
conflicts with our deployment implementation and is what blocks us from
spinning up GPU Actors. CPU Actors appear to work as expected, but they
spin up on the head node. Again, our implementation and architecture work
great and are very stable with Ray v1.13.0.
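
Since CPU actors land on the head node while GPU actors stay pending, a useful sanity check is whether any alive node can satisfy the GPU actor's resource request at all. The sketch below is illustrative only: it assumes node records shaped like the entries returned by `ray.nodes()` (with `NodeID`, `Alive`, and `Resources` keys), and the helper name `feasible_nodes` and the sample cluster are hypothetical. It compares against each node's total resources, not what is currently free, so it only distinguishes a request that can never be placed from one that is waiting for capacity.

```python
from typing import Dict, List


def feasible_nodes(request: Dict[str, float], nodes: List[dict]) -> List[str]:
    """Return IDs of alive nodes whose total resources cover `request`."""
    fits = []
    for node in nodes:
        if not node.get("Alive", False):
            continue
        total = node.get("Resources", {})
        if all(total.get(name, 0.0) >= amount for name, amount in request.items()):
            fits.append(node["NodeID"])
    return fits


# Hypothetical two-node cluster: a CPU-only head and a worker with two GPUs.
nodes = [
    {"NodeID": "head", "Alive": True, "Resources": {"CPU": 4.0}},
    {"NodeID": "gpu-worker", "Alive": True, "Resources": {"CPU": 8.0, "GPU": 2.0}},
]

print(feasible_nodes({"CPU": 1, "GPU": 1}, nodes))  # ['gpu-worker']
print(feasible_nodes({"CPU": 1}, nodes))            # ['head', 'gpu-worker']
```

If the GPU request comes back with an empty list on the real cluster, the actor is infeasible rather than slow to start; if the GPU worker is listed, the problem lies elsewhere.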

Any help and/or information would be greatly appreciated. At the moment, this is blocking us from moving to Ray v2.

Joshuaalbert · January 15, 2023, 8:20pm

Same problem for us. Did you resolve this?

ts-cfield · January 15, 2023, 9:34pm

No, we still have not been able to resolve this issue. We have tried v2.0.1 and v2.1.0, but we cannot get the cluster to recognize that the GPU Actor is running rather than pending. It appears our GPU actor is spun up, but the cluster just does not recognize the state change.

We are still using v1.13.

ts-cfield · February 3, 2023, 10:06pm

A short update: we still cannot get a GPU actor to initialize and be recognized by a cluster with Ray v2.X. We see an actor in the Ray Dashboard, but it is always stuck “PENDING”. CPU-only actors work just fine.

andrew · February 6, 2023, 4:46pm

I am having this problem as well. I have twin services running on our Ray cluster with Ray Serve; this has worked in the first service, but I have not got it working in the second yet.

Ray and Ray Serve v2.1.0
All GPU worker pods are stuck in permanent pending.

ts-cfield · February 6, 2023, 6:52pm

For reference, I have created an issue in Ray’s GitHub issue tracker: [Core|Serve] GPU Actors stuck in pending state · Issue #32222 · ray-project/ray · GitHub.
