File ray_bootstrap_config.yaml not found

Hi

I'm using Ray 1.8.0 to set up a Kubernetes cluster. Everything seems fine, but in the dashboard I see this error:

	FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_bootstrap_config.yaml'

What is this file? Is there any documentation on it?


I think this error is likely benign (but we definitely need to get rid of the error message)

What configs did you use to deploy the cluster?

Also, how did you access this particular log file?

I just opened the dashboard (not the experimental one); the Machines tab said there were some logs, and that's where I got it.

I will follow up with the CRD details later.

In the dashboard, I just clicked the log link.

Here is the CRD I used:

apiVersion: cluster.ray.io/v1
headStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
kind: RayCluster
metadata:
  name: jupyter-xiang--develop-7d8a4bb8
  ownerReferences:
  - apiVersion: batch/v1
    controller: true
    kind: Job
    name: jupyter-xiang--develop-7d8a4bb8-ray-gc
    uid: 56bc3907-2142-434e-9d48-51ec9931dbea
spec:
  headPodType: ray-head
  idleTimeoutMinutes: 5
  maxWorkers: 5
  podTypes:
  - maxWorkers: 0
    minWorkers: 0
    name: ray-head
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: '1'
          image: rayproject/ray:1.8.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
          - containerPort: 10001
          - containerPort: 8265
          - containerPort: 8000
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
    rayResources:
      CPU: 0
  - maxWorkers: 5
    minWorkers: 1
    name: ray-worker-0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-0-
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: '1'
          image: rayproject/ray:1.8.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
          - containerPort: 10001
          - containerPort: 8265
          - containerPort: 8000
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  upscalingSpeed: 1.0
workerStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379

Hmm, I just deployed with the above CR (with headStartRayCommands + workerStartRayCommands moved into spec and the ownerReference removed; see the sketch below).
When I accessed the dashboard, the head showed no available logs.
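
A minimal sketch of that restructuring, assuming the manifest is handled as a plain Python dict parsed from YAML (as in the snippet further down in the thread); the key names come from the CR above:

# Sketch: relocate the start commands from the top level into spec, and drop
# the ownerReference, before submitting the RayCluster custom resource.
manifest["spec"]["headStartRayCommands"] = manifest.pop("headStartRayCommands")
manifest["spec"]["workerStartRayCommands"] = manifest.pop("workerStartRayCommands")
manifest["metadata"].pop("ownerReferences", None)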

How are you deploying the operator?

Thank you! But that is strange. I just use the Python Kubernetes API to create it:

def create_cluster(client: _k8st.ClientType, namespace: str, manifest: dict, log=log, **kwargs):
    # _k8st / _k8s are internal helper wrappers around the official `kubernetes` client.
    assert manifest["apiVersion"] == RAY_CLUSTER_API_VERSION
    assert manifest["kind"] == RAY_CLUSTER_KIND
    name = manifest["metadata"]["name"]
    assert namespace
    assert name
    client = _k8st.expand_client(client)
    api = _k8s.client.CustomObjectsApi(client)
    log.info(f"Creating ray cluster {namespace}:{name} ...")
    # RAY_CLUSTER_CUSTOM_OBJECT_PARAMS presumably carries the group/version/plural
    # arguments that create_namespaced_custom_object requires.
    return api.create_namespaced_custom_object(
        namespace=namespace,
        body=manifest,
        **RAY_CLUSTER_CUSTOM_OBJECT_PARAMS,
        **kwargs,
    )

where the manifest is just a dict parsed from YAML.
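
A minimal usage sketch for the function above, assuming a local kubeconfig and that _k8st.expand_client accepts a raw ApiClient; the file name here is hypothetical:

import yaml
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api_client = client.ApiClient()

# Hypothetical manifest file; any RayCluster CR parsed into a dict works.
with open("ray_cluster.yaml") as f:
    manifest = yaml.safe_load(f)

create_cluster(api_client, namespace="ray", manifest=manifest)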

I first generated the YAML files from the Helm chart (1.7.0) with onlyOperator enabled, then used kubectl to apply those files. When I upgraded to 1.8.0, I just edited the operator Deployment to update the image to 1.8.0 and applied it again. This is the only difference I can imagine; I didn't use helm to deploy it.

The mystery is figuring out where the string "ray_bootstrap_config.yaml" came from.

The logs suggest that the Ray head could have been started with a command like
'ray start --head --autoscaling-config=ray_bootstrap_config.yaml', which would attempt to run the autoscaling monitor on the head node.

When running the K8s operator, the autoscaling monitor runs in the operator pod rather than on the head node, so the command is
'ray start --head --no-monitor'.

The head start commands in the configs you’ve posted look valid, though…
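
One way to verify what the head was actually started with is to grep its process list. A hedged sketch: the pod name and namespace are placeholders, and the container name comes from the CR above:

from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

# Exec `ps aux` inside the head container and look for the monitor or the
# --autoscaling-config flag. Replace name/namespace with your actual values.
out = stream(
    core.connect_get_namespaced_pod_exec,
    name="ray-head-xxxxx",   # hypothetical head pod name
    namespace="default",
    container="ray-node",
    command=["ps", "aux"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
for line in out.splitlines():
    if "monitor" in line or "autoscaling-config" in line:
        print(line)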