[Medium] Using docker image for service deployment

# Dockerfile for head
FROM python:3.10.13-slim

RUN apt-get update && apt-get install -y g++ gcc libsndfile1 git ffmpeg podman curl

RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && apt-get update
RUN apt-get install -y nvidia-container-toolkit

RUN python -m pip install -U pip==23.3.1
RUN python -m pip install "ray[default,serve]==2.8.0"

ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

WORKDIR /root/ray/
COPY . .
ENTRYPOINT ["/root/ray/docker/entrypoint.sh"]

# entrypoint.sh
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
RAY_prestart_worker_first_driver=0.0 ray start --head --dashboard-host=0.0.0.0 --block

# service.py

import asyncio
from io import BytesIO

import numpy as np
import torch
from fastapi import FastAPI
from fastapi.responses import Response
from PIL import Image
from ray import serve
from ray.serve.handle import DeploymentHandle

app = FastAPI()


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngressOD:
    def __init__(self, object_detection_handle) -> None:
        self.handle: DeploymentHandle = object_detection_handle.options(
            use_new_handle_api=True,
        )

    @app.get(
        "/",
        responses={200: {"content": {"image/jpeg": {}}}},
        response_class=Response,
    )
    async def detect(self, image_url: str):
        image = await self.handle.detect.remote(image_url)
        file_stream = BytesIO()
        image.save(file_stream, "jpeg")
        return Response(content=file_stream.getvalue(), media_type="image/jpeg")


@serve.deployment(
    ray_actor_options={"num_gpus": 0.25},
    autoscaling_config={"min_replicas": 2, "max_replicas": 4, "downscale_delay_s": 60},
)
class ObjectDetection:
    def __init__(self):
        self.model = torch.hub.load("ultralytics/yolov5", "yolov5s")
        self.model.cuda()

    async def detect(self, image_url: str):
        loop = asyncio.get_running_loop()
        result_im = await loop.run_in_executor(None, self.model, image_url)
        return Image.fromarray(result_im.render()[0].astype(np.uint8))


# renamed from `app` to avoid shadowing the FastAPI instance above
entrypoint = APIIngressOD.bind(ObjectDetection.bind())
serve.run(entrypoint, name="object_detection", route_prefix="/detect")
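A minimal client sketch for the ingress above. The host and port are assumptions: Serve's default HTTP port 8000 combined with the `/detect` route prefix passed to `serve.run`.

```python
from urllib.parse import urlencode

# Build the request URL for the APIIngressOD endpoint. The image URL is
# just an example input; any publicly reachable image should work.
params = {"image_url": "https://ultralytics.com/images/zidane.jpg"}
url = f"http://localhost:8000/detect?{urlencode(params)}"
print(url)
# The endpoint responds with JPEG bytes, e.g. requests.get(url).content
```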
# Dockerfile for <ray-service>
FROM python:3.10.13-slim

RUN python -m pip install -U pip==23.3.1

COPY . .
RUN python -m pip install Pillow \
    opencv-python \
    "torchvision>=0.16" \
    numpy \
    torch \
    pandas \
    "ray[serve]==2.8.0"

docker build -t <ray-service> .
docker push <ray-service>
RAY_ADDRESS='http://localhost:8265' ray job submit --runtime-env-json '{"container": {"image": "<ray-service>:latest", "run_options": ["--tty", "--privileged", "--cap-drop ALL", "--log-level=debug", "--device nvidia.com/gpu=all", "--security-opt=label=disable",  "--restart unless-stopped"]}, "config": {"eager_install": false}, "env_vars":{"NVIDIA_VISIBLE_DEVICES": "all"}}' -- python service.py
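Hand-writing the `--runtime-env-json` argument in the shell is easy to get wrong (quoting, `false` vs `False`, nesting). A small sketch that builds the same argument in Python and prints a shell-safe string; the image name is the same placeholder as above, and the `run_options` shown are just a subset:

```python
import json
import shlex

# Same runtime_env as the ray job submit command above, as a Python dict.
runtime_env = {
    "container": {
        "image": "<ray-service>:latest",
        "run_options": [
            "--tty",
            "--device nvidia.com/gpu=all",
            "--security-opt=label=disable",
        ],
    },
    "config": {"eager_install": False},
    "env_vars": {"NVIDIA_VISIBLE_DEVICES": "all"},
}

# shlex.quote wraps the JSON so the shell passes it through untouched.
print("--runtime-env-json " + shlex.quote(json.dumps(runtime_env)))
```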

I have had a few problems using this command:

  1. I have to go into the head-node container and pull the image beforehand. Otherwise, no matter how long I wait, the job never completes. Is this because of a timeout on job setup? Can that timeout be adjusted?
  2. After a job is submitted, the image is pulled over and over (see raylet.err), as if there were no limit on attempts. If the image cannot be pulled within some time N, I would expect the job to move to failed status, but it stays pending forever. Is it possible to configure killing jobs that fail to start?
  3. I also tried running my service over gRPC with the image specified. Everything worked: requests went through on port 9000. But as soon as I deployed a second service without an image on port 8000 (specifying only its dependencies via pip), the gRPC service that had been working started returning this response:
    status = StatusCode.NOT_FOUND details = "Application metadata not set. Please ping /ray.serve.RayServeAPIService/ListApplications for available applications."
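For problem 2, one possible workaround (a sketch of an external watchdog, not a built-in Ray feature) is to stop a job from the outside if it stays PENDING too long. `client` stands in for a `ray.job_submission.JobSubmissionClient`; it is duck-typed here so the sketch has no hard Ray dependency, and the comparison uses a plain string because `JobStatus` is a string enum:

```python
import time

def stop_if_stuck(client, job_id, timeout_s=600.0, poll_s=10.0):
    """Stop `job_id` if it is still PENDING after `timeout_s` seconds.

    Returns True if the job was stopped, False if it left PENDING on its own.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # JobStatus inherits from str, so string comparison works with a
        # real JobSubmissionClient as well.
        if client.get_job_status(job_id) != "PENDING":
            return False  # job started (or finished); nothing to do
        time.sleep(poll_s)
    client.stop_job(job_id)
    return True
```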

I tried running the serve start --http-host 0.0.0.0 --grpc-port 9000 --grpc-servicer-functions <test_pb2_grpc.add_TestServicer_to_server> command before starting the services. Now I can’t bring up the second service (the one submitted after the gRPC service that specifies an image):

RAY_ADDRESS='http://localhost:8265' ray job submit \
> --working-dir . \
> --runtime-env-json '{"pip": "requirements.txt", "config": {"eager_install": false}}' \
> -- python service.py

An error is returned:

runtime_env setup failed: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
    await self._waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/utils.py", line 91, in check_output_cmd
    stdout, _ = await proc.communicate()
  File "/usr/local/lib/python3.10/asyncio/subprocess.py", line 195, in communicate
    stdin, stdout, stderr = await tasks.gather(stdin, stdout, stderr)
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 366, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
  File "/usr/local/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/agent/runtime_env_agent.py", line 326, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 254, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/pip.py", line 518, in create
    bytes = await task
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/pip.py", line 498, in _create_for_hash
    await PipProcessor(
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/pip.py", line 400, in _run
    await self._install_pip_packages(
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/pip.py", line 376, in _install_pip_packages
    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
  File "/usr/local/lib/python3.10/site-packages/ray/_private/runtime_env/utils.py", line 93, in check_output_cmd
    raise RuntimeError(f"Run cmd[{cmd_index}] got exception.") from e
RuntimeError: Run cmd[9] got exception.

Hi @psydok, I’m not sure about 1 and 2. But for 3: this simply means you have multiple applications deployed in Serve, and Serve’s gRPC proxy doesn’t know which one to route the request to. You can pass an “application” entry in your client’s metadata. For an example of how to do it, see: Set Up a gRPC Service — Ray 2.8.1
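A sketch of what that client-side metadata looks like. The application name here is an assumption (whatever name was passed to `serve.run`), and the stub, request, and method names are placeholders for your generated gRPC classes:

```python
# Tell Serve's gRPC proxy which application should handle the call by
# sending an "application" metadata entry with every request.
app_metadata = (("application", "object_detection"),)

# With grpcio and your generated test_pb2 / test_pb2_grpc modules, the
# call would look roughly like this (names are placeholders):
#   import grpc
#   channel = grpc.insecure_channel("localhost:9000")
#   stub = test_pb2_grpc.TestStub(channel)
#   response, call = stub.Predict.with_call(request, metadata=app_metadata)
```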

Hi @psydok, some fixes and improvements were made to this runtime environment feature as part of Ray 2.9. Could you try it out and let me know if it fixes your issues? The full docs are here: Run Multiple Applications in Different Containers — Ray 2.9.0. Note that this is still an experimental feature, so if you have feature requests or run into issues, please submit an issue on Github!

I seem to have managed to solve these problems in 2.8.1 by setting RAY_worker_register_timeout_seconds=1200.
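In the entrypoint.sh above, that amounts to exporting the variable before the node starts (a sketch; 1200 seconds is the value that worked here, and the original start commands are kept as comments so only the export is live):

```shell
# Give workers more time to register with the raylet, which also covers
# slow container image pulls.
export RAY_worker_register_timeout_seconds=1200
# then start the head as before:
#   nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
#   ray start --head --dashboard-host=0.0.0.0 --block
```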

Thank you, @Gene . I added specifying the application name to the metadata and it worked.

I am now trying to set up a cluster via a Docker environment across different servers. But it seems that specifying node-ip-address breaks everything, starting with the fact that I can’t view the node logs on the dashboard.

Runtime Env Agent timed out as NotFound in 30000ms. Status: NotFound: on_connect Connection refused, address: x.x.x.x, port: 19124, Suiciding...
or

> docker compose exec node serve start --proxy-location EveryNode \
        --http-host 0.0.0.0 --http-port 8099

2023-12-26 10:34:06,574 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HEAD_IP:6379...
2023-12-26 10:34:06,591 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://HEAD_IP:8265 
[2023-12-26 10:34:15,600 E 252 282] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:CURRENT_EXTERNAL_IP_OF_NODE:35369: Failed to connect to remote host: Connection refused; RPC Error details:  .Please see `raylet.out` for more details.

No, it’s become more of a problem.

The only thing that helped was installing firewalld on top of iptables on the server and restarting Docker; before that it was plain iptables. But I find this solution strange, and I don’t understand why it worked.
If you have any suggestions, could you please share?
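One guess at why the firewall change mattered: by default Ray picks several of its node-to-node ports at random (the Connection refused above was on port 35369), so any firewall that only allows known ports will intermittently break raylet and worker traffic. A sketch of pinning those ports and opening them explicitly; the port numbers below are arbitrary choices for illustration, not Ray defaults, and the actual commands are kept as comments:

```shell
# Arbitrary example ports for the raylet, object manager, and workers.
NODE_MANAGER_PORT=6380
OBJECT_MANAGER_PORT=6381
MIN_WORKER_PORT=10002
MAX_WORKER_PORT=10999

# On each worker node, pin the ports Ray listens on:
#   ray start --address=HEAD_IP:6379 \
#     --node-ip-address=CURRENT_EXTERNAL_IP_OF_NODE \
#     --node-manager-port=$NODE_MANAGER_PORT \
#     --object-manager-port=$OBJECT_MANAGER_PORT \
#     --min-worker-port=$MIN_WORKER_PORT --max-worker-port=$MAX_WORKER_PORT
#
# ...and open exactly those ports in firewalld on every node:
#   firewall-cmd --permanent --add-port=${NODE_MANAGER_PORT}-${OBJECT_MANAGER_PORT}/tcp
#   firewall-cmd --permanent --add-port=${MIN_WORKER_PORT}-${MAX_WORKER_PORT}/tcp
#   firewall-cmd --reload
```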