Worker pod can not be started

Rui · July 26, 2022, 7:54pm

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

Hi Community,
We use the Ray Helm chart (ray/deploy/charts at master · ray-project/ray · GitHub) to deploy a Ray cluster on Azure, but only the ray-operator and head-node pods were created; the worker pod is missing. Could anyone help us with this issue?
The detailed logs of the Ray operator can be found here: ray_operator_log.log - Google Drive
The main error is "failed to connect to all addresses".

Thanks in advance!

ckw017 · July 26, 2022, 8:11pm

Are you seeing the worker pod being created and then terminated?

cc @Dmitri

Rui · July 26, 2022, 8:36pm

No, I only saw the operator and head pods.

Dmitri · July 26, 2022, 9:43pm

If the operator is hitting issues, which it looks like it is, it is expected that we would have issues getting a worker.

@Rui
Are you deploying using the default configuration from Ray master?

Rui · July 27, 2022, 8:05am

@Dmitri I have made some changes to the values and added an Ingress YAML file:

# RayCluster settings:

image: ray/customImage

podTypes:
    rayHeadType:
        memory: 2Gi
        rayResources: {"CPU": 0}

    rayWorkerType:
        minWorkers: 1
        maxWorkers: 2
        memory: 8Gi
        CPU: 3

namespacedOperator: true
operatorImage: rayproject/ray:1.13.0

The custom image is built with this Dockerfile:

FROM rayproject/ray:1.13.0

USER root

RUN apt-get update && apt-get install -y \
  libzbar0 libglib2.0-0 libxrender1 libgl1-mesa-glx \
  libxext6 build-essential git tesseract-ocr libpq-dev gcc \
  && apt-get clean

RUN mkdir /app

USER ray

COPY requirements.txt /app/

RUN pip install --upgrade pip \
  && pip install --no-cache-dir -r /app/requirements.txt && sudo rm /app/requirements.txt \
  && pip install --no-cache-dir 'git+https://github.com/facebookresearch/detectron2.git'

COPY . /app/

WORKDIR /app

Rui · July 27, 2022, 12:10pm

We also tried the default configuration, setting only namespacedOperator: true, and got the same behaviour: the worker pod is not created.

Dmitri · July 27, 2022, 4:50pm

Thanks for the details. I will take a look.

Dmitri · July 27, 2022, 6:03pm

@Rui I was not able to reproduce the problem with the default configuration on a local kind cluster.

It is possible that the issue is related to network settings in your Kubernetes cluster. The operator needs to make RPC requests to the Ray head node, which has a server listening on port 6379 by default.
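One way to test this from inside the operator pod is a plain TCP connection attempt against the head node. The snippet below is only a minimal sketch: `can_connect` is a hypothetical helper name, and the host you pass in (head service name or head pod IP) depends on your deployment. A successful TCP connect does not prove the RPC layer is healthy, but a failure here would point at the network rather than at Ray.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only checks reachability,
    not that the server on the other side speaks the expected protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, from a Python shell in the operator pod, `can_connect("<head-service-or-pod-ip>", 6379)` should return True if the port is reachable; replace the placeholder with the actual head address in your cluster.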

Rui · July 27, 2022, 6:26pm

Thanks for the info. Same here: I also could not reproduce it on a local kind cluster. I will check the network settings and keep you updated tomorrow.

Rui · July 29, 2022, 9:56am

@Dmitri After checking, I found that our Kubernetes cluster's networkPolicy is set to default-deny-all mode, so I think we need to add a network policy to the Ray chart. Are there any docs about which ports and connections between the operator, head and worker pods need to be open?

Dmitri · August 3, 2022, 6:15pm

Between the workers and head pod: I believe we have a doc reference on ports in use – @Chen_Shen do you recall where that page is?

Between the Ray Operator and head pod: the Ray Operator must have access to the GCS Server port on the head pod (6379 by default).
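As a rough sketch of what such a rule could look like under a default-deny-all setup, the script below builds a Kubernetes NetworkPolicy manifest that allows ingress to the head pod from the operator pod on the GCS port. The policy name, the namespace, and both label selectors are assumptions for illustration; check the real labels with `kubectl get pods --show-labels` before using anything like this. It covers only operator-to-head ingress. If egress is also denied by default, matching egress rules (and DNS) would be needed as well, and worker-to-head traffic needs the additional ports from the Ray port documentation.

```python
import json


def head_ingress_policy(namespace, head_labels, operator_labels, port=6379):
    """Build a NetworkPolicy dict allowing operator -> head traffic on one TCP port.

    All arguments are deployment-specific; the values used below are placeholders.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "allow-operator-to-ray-head",  # hypothetical name
            "namespace": namespace,
        },
        "spec": {
            # Pods this policy applies to (the Ray head pod).
            "podSelector": {"matchLabels": head_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Allowed sources: operator pods in the same namespace.
                    "from": [{"podSelector": {"matchLabels": operator_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Placeholder namespace and labels; these are assumptions, not confirmed values.
policy = head_ingress_policy(
    namespace="ray",
    head_labels={"ray-node-type": "head"},
    operator_labels={"cluster.ray.io/component": "operator"},
)
print(json.dumps(policy, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped into `kubectl apply -f -` once the namespace and labels match your cluster.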

rliaw · August 7, 2022, 8:13am

Port configurations can be found here: Configuring Ray — Ray 1.13.0
