Hi
I used Ray 1.8.0 to set up a Kubernetes cluster. Everything seems fine, but in the dashboard I saw an error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_bootstrap_config.yaml'
What is this file? Is there any documentation on it?
Dmitri
November 5, 2021, 12:05am
2
I think this error is likely benign (but we definitely need to get rid of the error message).
What configs did you use to deploy the cluster?
Dmitri
November 5, 2021, 12:09am
3
Also, how did you access this particular log file?
I just opened the dashboard (not the experimental one); in the Machines tab it says there are some logs, and that's where I got the error.
I will follow up later with the CRD details.
In the dashboard, I just clicked this log link.
Here is the CRD I used:
apiVersion: cluster.ray.io/v1
headStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
kind: RayCluster
metadata:
  name: jupyter-xiang--develop-7d8a4bb8
  ownerReferences:
  - apiVersion: batch/v1
    controller: true
    kind: Job
    name: jupyter-xiang--develop-7d8a4bb8-ray-gc
    uid: 56bc3907-2142-434e-9d48-51ec9931dbea
spec:
  headPodType: ray-head
  idleTimeoutMinutes: 5
  maxWorkers: 5
  podTypes:
  - maxWorkers: 0
    minWorkers: 0
    name: ray-head
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: '1'
          image: rayproject/ray:1.8.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
          - containerPort: 10001
          - containerPort: 8265
          - containerPort: 8000
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
    rayResources:
      CPU: 0
  - maxWorkers: 5
    minWorkers: 1
    name: ray-worker-0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-0-
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: '1'
          image: rayproject/ray:1.8.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
          - containerPort: 10001
          - containerPort: 8265
          - containerPort: 8000
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  upscalingSpeed: 1.0
workerStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
Dmitri
November 7, 2021, 4:10am
7
Hmm, I just deployed with the above CR (with headStartRayCommands and workerStartRayCommands moved into spec and the ownerReference removed).
When I accessed the dashboard, the head showed no available logs.
How are you deploying the operator?
Thank you! That is strange, though. I just use the Python Kubernetes API to create it:
def create_cluster(client: _k8st.ClientType, namespace: str, manifest: dict, log=log, **kwargs):
    assert manifest["apiVersion"] == RAY_CLUSTER_API_VERSION
    assert manifest["kind"] == RAY_CLUSTER_KIND
    name = manifest["metadata"]["name"]
    assert namespace
    assert name
    client = _k8st.expand_client(client)
    api = _k8s.client.CustomObjectsApi(client)
    log.info(f"Creating ray cluster {namespace}:{name} ...")
    return api.create_namespaced_custom_object(
        namespace=namespace,
        body=manifest,
        **RAY_CLUSTER_CUSTOM_OBJECT_PARAMS,
        **kwargs,
    )
where the manifest is just a dict parsed from YAML.
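For completeness, the call looks roughly like this (the file name, namespace, and client construction below are placeholders for illustration, and I'm assuming expand_client accepts a plain ApiClient):

import yaml
import kubernetes as _k8s

# Illustrative only: placeholder path, namespace, and client setup.
_k8s.config.load_kube_config()
api_client = _k8s.client.ApiClient()

with open("ray-cluster.yaml") as f:   # placeholder file name
    manifest = yaml.safe_load(f)      # manifest is a plain dict

create_cluster(api_client, namespace="ray", manifest=manifest)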
I first generated the YAML files from the Helm chart (1.7.0) with onlyOperator enabled, then used kubectl to apply them. When I upgraded to 1.8.0, I just changed the operator Deployment, updated the image to 1.8.0, and applied it again. That is the only difference I can imagine; I didn't use Helm to deploy it.
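The image bump itself was just a one-line change to the operator Deployment; for reference, the same change via the Python client would be roughly the following sketch (the deployment name, container name, and namespace here are guesses for illustration, not my real values):

import kubernetes as _k8s

# Illustrative sketch of the 1.7.0 -> 1.8.0 operator upgrade described above:
# patch the operator Deployment's container image.
# "ray-operator", container "ray", and namespace "ray" are placeholders.
_k8s.config.load_kube_config()
apps = _k8s.client.AppsV1Api()
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "ray", "image": "rayproject/ray:1.8.0"}
                ]
            }
        }
    }
}
apps.patch_namespaced_deployment(name="ray-operator", namespace="ray", body=patch)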
Dmitri
November 11, 2021, 6:39pm
9
The mystery is figuring out where the string "ray_bootstrap_config.yaml" came from.
The logs suggest that the Ray head could have been started with a command like
'ray start --head --autoscaling-config="ray_bootstrap_config.yaml"', which would attempt to run the autoscaling monitor on the head node.
When running the K8s operator, the autoscaling monitor runs in the operator pod, not on the head node, so the command is
'ray start --head --no-monitor'.
The head start commands in the config you've posted look valid, though…
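If it helps narrow things down, here is a quick diagnostic sketch (the field names are the ones from your CR; I'm not asserting anything about how the operator treats a top-level placement) to print where the start commands actually ended up in the object you create:

def locate_start_commands(cr: dict) -> None:
    # Report whether the Ray start commands sit at the top level of the
    # RayCluster CR or under .spec -- the one structural difference between
    # the CR posted above and the one I redeployed with.
    for key in ("headStartRayCommands", "workerStartRayCommands"):
        places = [label for label, section in (("top level", cr), ("spec", cr.get("spec", {})))
                  if key in section]
        print(f"{key}: {', '.join(places) if places else 'not found'}")

# e.g. locate_start_commands(manifest) on the dict you pass to create_cluster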