Worker pod cannot be started

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi Community,
We use the Ray Helm chart (ray/deploy/charts at master · ray-project/ray · GitHub) to deploy a Ray cluster on Azure, but only the ray-operator and head-node pods were created; the worker pod is missing. Could anyone help us with this issue?
The detailed logs of the Ray operator can be found here: ray_operator_log.log - Google Drive
The main error says "failed to connect to all addresses".

Thanks in advance!

Are you seeing the worker pod being created and then terminated?

cc @Dmitri

No, I only saw the operator and head pods.

If the operator is hitting issues, which it looks like it is, it is expected that we would have issues getting a worker.

@Rui
Are you deploying using the default configuration from Ray master?

@Dmitri I have made some changes to the values and added an Ingress yaml file:

# RayCluster settings:

image: ray/customImage

podTypes:
    rayHeadType:
        memory: 2Gi
        rayResources: {"CPU": 0}

    rayWorkerType:
        minWorkers: 1
        maxWorkers: 2
        memory: 8Gi
        CPU: 3

namespacedOperator: true
operatorImage: rayproject/ray:1.13.0

The custom image is built with this Dockerfile:

FROM rayproject/ray:1.13.0

USER root

RUN apt-get update && apt-get install -y \
  libzbar0 libglib2.0-0 libxrender1 libgl1-mesa-glx \
  libxext6 build-essential git tesseract-ocr libpq-dev gcc \
  && apt-get clean

RUN mkdir /app

USER ray

COPY requirements.txt /app/

RUN pip install --upgrade pip \
  && pip install --no-cache-dir -r /app/requirements.txt && sudo rm /app/requirements.txt \
  && pip install --no-cache-dir 'git+https://github.com/facebookresearch/detectron2.git'

COPY . /app/

WORKDIR /app

We also tried the default configuration with only namespacedOperator: true set, and still got the same behaviour: the worker pod is not created.

Thanks for the details. I will take a look.

@Rui I was not able to reproduce the problem with the default configuration on a local kind cluster.

It is possible that the issue is related to network settings in your Kubernetes cluster. The operator needs to make RPC requests to the Ray head node, which has a server listening on port 6379 by default.

Thanks for the info. Same here: it also works for me on a local kind cluster. I will check the network settings and keep you updated tomorrow.

@Dmitri After checking, I found that our Kubernetes cluster's networkPolicy is set to default-deny-all mode, so I think we need to add a network policy to the Ray chart. Are there any docs on which ports and connections need to be open between the operator, head, and worker pods?

Between the workers and head pod: I believe we have a doc reference on ports in use – @Chen_Shen do you recall where that page is?

Between the Ray Operator and head pod: the Ray Operator must have access to the GCS Server port on the head pod (6379 by default).

Port configurations can be found here: Configuring Ray — Ray 1.13.0
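
For a cluster running in default-deny-all mode, the operator-to-head rule described above could look roughly like the NetworkPolicy below. This is only a sketch and not something shipped with the Ray chart: the policy name, the namespace, and both pod labels are assumptions, so check the labels your pods actually carry with kubectl get pods --show-labels before applying it. Only port 6379 comes from this thread.

```yaml
# Sketch only: the name, namespace, and label selectors are assumptions.
# Verify the real pod labels with `kubectl get pods --show-labels`.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-operator-to-ray-head      # hypothetical name
  namespace: ray                        # hypothetical namespace
spec:
  # Selects the pods that receive the traffic (the Ray head pod).
  podSelector:
    matchLabels:
      ray-node-type: head               # assumed label on the head pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Same-namespace selector; fits namespacedOperator: true.
        - podSelector:
            matchLabels:
              cluster.ray.io/component: operator   # assumed label on the operator pod
      ports:
        - protocol: TCP
          port: 6379                    # GCS server port on the head pod (Ray default)
```

If the default-deny policy also blocks egress, the operator pod would need a matching Egress rule towards the head pod on the same port. Traffic between the head and worker pods uses more ports than this one; those are listed on the Configuring Ray page linked above and would need their own rules.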