Hi
I used Ray 1.8.0 to set up a Kubernetes cluster. Everything seems fine, but in the dashboard I saw an error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_bootstrap_config.yaml'
What is this file? Is there any documentation on it?
Dmitri
November 5, 2021, 12:05am
2
I think this error is likely benign (but we definitely need to get rid of the error message).
What configs did you use to deploy the cluster?
Dmitri
November 5, 2021, 12:09am
3
Also, how did you access this particular log file?
I just opened the dashboard (not the experimental one); in the Machines tab it says there are some logs, and that's where I got the error.
I will follow up later with the CRD details.
In the dashboard, I just clicked this log link.
Here is the CRD I used:
apiVersion: cluster.ray.io/v1
headStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0
kind: RayCluster
metadata:
  name: jupyter-xiang--develop-7d8a4bb8
  ownerReferences:
  - apiVersion: batch/v1
    controller: true
    kind: Job
    name: jupyter-xiang--develop-7d8a4bb8-ray-gc
    uid: 56bc3907-2142-434e-9d48-51ec9931dbea
spec:
  headPodType: ray-head
  idleTimeoutMinutes: 5
  maxWorkers: 5
  podTypes:
  - maxWorkers: 0
    minWorkers: 0
    name: ray-head
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-head-
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: '1'
          image: rayproject/ray:1.8.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
          - containerPort: 10001
          - containerPort: 8265
          - containerPort: 8000
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
    rayResources:
      CPU: 0
  - maxWorkers: 5
    minWorkers: 1
    name: ray-worker-0
    podConfig:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-0-
      spec:
        containers:
        - args:
          - 'trap : TERM INT; sleep infinity & wait;'
          command:
          - /bin/bash
          - -c
          - --
          env:
          - name: RAY_gcs_server_rpc_server_thread_num
            value: '1'
          image: rayproject/ray:1.8.0
          imagePullPolicy: Always
          name: ray-node
          ports:
          - containerPort: 6379
          - containerPort: 10001
          - containerPort: 8265
          - containerPort: 8000
          resources:
            limits:
              cpu: 1
              memory: 5Gi
            requests:
              cpu: 1
              memory: 5Gi
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        restartPolicy: Never
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  upscalingSpeed: 1.0
workerStartRayCommands:
- ray stop
- ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
Dmitri
November 7, 2021, 4:10am
7
Hmm, I just deployed with the above CR (with headStartRayCommands and workerStartRayCommands moved into spec and the ownerReference removed).
When I accessed the dashboard, the head showed no available logs.
How are you deploying the operator?
Thank you! That is strange, though. I just use the Python Kubernetes API to create it:
def create_cluster(client: _k8st.ClientType, namespace: str, manifest: dict, log=log, **kwargs):
    assert manifest["apiVersion"] == RAY_CLUSTER_API_VERSION
    assert manifest["kind"] == RAY_CLUSTER_KIND
    name = manifest["metadata"]["name"]
    assert namespace
    assert name
    client = _k8st.expand_client(client)
    api = _k8s.client.CustomObjectsApi(client)
    log.info(f"Creating ray cluster {namespace}:{name} ...")
    return api.create_namespaced_custom_object(
        namespace=namespace,
        body=manifest,
        **RAY_CLUSTER_CUSTOM_OBJECT_PARAMS,
        **kwargs,
    )
where the manifest is just a dict parsed from YAML.
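For completeness, the call looks roughly like this (the file name, namespace, and client construction below are placeholders for illustration, and I'm assuming expand_client accepts a plain ApiClient):

import yaml
import kubernetes as _k8s

# Illustrative only: placeholder path, namespace, and client setup.
_k8s.config.load_kube_config()
api_client = _k8s.client.ApiClient()

with open("ray-cluster.yaml") as f:   # placeholder file name
    manifest = yaml.safe_load(f)      # manifest is a plain dict

create_cluster(api_client, namespace="ray", manifest=manifest)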
I first generated the YAML files from the Helm chart (1.7.0) with onlyOperator enabled, then used kubectl to apply them. When I upgraded to 1.8.0, I just changed the operator Deployment, updated the image to 1.8.0, and applied it again. That is the only difference I can imagine; I didn't use Helm to deploy it.
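The image bump itself was just a one-line change to the operator Deployment; for reference, the same change via the Python client would be roughly the following sketch (the deployment name, container name, and namespace here are guesses for illustration, not my real values):

import kubernetes as _k8s

# Illustrative sketch of the 1.7.0 -> 1.8.0 operator upgrade described above:
# patch the operator Deployment's container image.
# "ray-operator", container "ray", and namespace "ray" are placeholders.
_k8s.config.load_kube_config()
apps = _k8s.client.AppsV1Api()
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "ray", "image": "rayproject/ray:1.8.0"}
                ]
            }
        }
    }
}
apps.patch_namespaced_deployment(name="ray-operator", namespace="ray", body=patch)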
Dmitri
November 11, 2021, 6:39pm
9
The mystery is figuring out where the string "ray_bootstrap_config.yaml" came from.
The logs suggest that the Ray head could have been started with a command like
'ray start --head --autoscaling-config="ray_bootstrap_config.yaml"', which would attempt to run the autoscaling monitor on the head node.
When running the K8s operator, the autoscaling monitor runs in the operator pod, not on the head node, so the command is
'ray start --head --no-monitor'.
The head start commands in the config you've posted look valid, though…
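If it helps narrow things down, here is a quick diagnostic sketch (the field names are the ones from your CR; I'm not asserting anything about how the operator treats a top-level placement) to print where the start commands actually ended up in the object you create:

def locate_start_commands(cr: dict) -> None:
    # Report whether the Ray start commands sit at the top level of the
    # RayCluster CR or under .spec -- the one structural difference between
    # the CR posted above and the one I redeployed with.
    for key in ("headStartRayCommands", "workerStartRayCommands"):
        places = [label for label, section in (("top level", cr), ("spec", cr.get("spec", {})))
                  if key in section]
        print(f"{key}: {', '.join(places) if places else 'not found'}")

# e.g. locate_start_commands(manifest) on the dict you pass to create_cluster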