Logging error computing CPU usage of Ray Kubernetes pod

Hi,
I am running a local install of Ray in a minikube environment on Mac. When I deploy to Ray Serve (a simple hello-world endpoint), the deployment works, but I see logging error output that repeats over and over from the raylet, which makes it difficult to find any other logging from the deployments. The repeated logs come prefixed with (raylet) or (raylet, ip=<ip_address>), depending on how many nodes are running.

I believe the two important lines in the logging are:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.usage'
and
Message: 'Error computing CPU usage of Ray Kubernetes pod.'

Here is the full log block:

(raylet, ip=172.17.0.7) --- Logging error ---
(raylet, ip=172.17.0.7) Traceback (most recent call last):
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/k8s_utils.py", line 38, in cpu_percent
(raylet, ip=172.17.0.7)     cpu_usage = _cpu_usage()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/k8s_utils.py", line 64, in _cpu_usage
(raylet, ip=172.17.0.7)     return int(open(CPU_USAGE_PATH).read())
(raylet, ip=172.17.0.7) FileNotFoundError: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.usage'
(raylet, ip=172.17.0.7)
(raylet, ip=172.17.0.7) During handling of the above exception, another exception occurred:
(raylet, ip=172.17.0.7)
(raylet, ip=172.17.0.7) Traceback (most recent call last):
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/handlers.py", line 69, in emit
(raylet, ip=172.17.0.7)     if self.shouldRollover(record):
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/handlers.py", line 185, in shouldRollover
(raylet, ip=172.17.0.7)     msg = "%s\n" % self.format(record)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
(raylet, ip=172.17.0.7)     return fmt.format(record)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
(raylet, ip=172.17.0.7)     record.message = record.getMessage()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 369, in getMessage
(raylet, ip=172.17.0.7)     msg = msg % self.args
(raylet, ip=172.17.0.7) TypeError: not all arguments converted during string formatting
(raylet, ip=172.17.0.7) Call stack:
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py", line 376, in <module>
(raylet, ip=172.17.0.7)     loop.run_until_complete(agent.run())
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 574, in run_until_complete
(raylet, ip=172.17.0.7)     self.run_forever()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
(raylet, ip=172.17.0.7)     self._run_once()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
(raylet, ip=172.17.0.7)     handle._run()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/events.py", line 88, in _run
(raylet, ip=172.17.0.7)     self._context.run(self._callback, *self._args)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 554, in run
(raylet, ip=172.17.0.7)     await self._perform_iteration(aioredis_client)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 539, in _perform_iteration
(raylet, ip=172.17.0.7)     stats = self._get_all_stats()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 334, in _get_all_stats
(raylet, ip=172.17.0.7)     "cpu": self._get_cpu_percent(),
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 193, in _get_cpu_percent
(raylet, ip=172.17.0.7)     return k8s_utils.cpu_percent()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/k8s_utils.py", line 57, in cpu_percent
(raylet, ip=172.17.0.7)     logger.exception("Error computing CPU usage of Ray Kubernetes pod.", e)
(raylet, ip=172.17.0.7) Message: 'Error computing CPU usage of Ray Kubernetes pod.'
(raylet, ip=172.17.0.7) Arguments: (FileNotFoundError(2, 'No such file or directory'),)
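If I am reading the call stack correctly, the secondary TypeError is a separate, smaller bug: k8s_utils.py passes the caught exception to logger.exception() as an extra positional argument, and logging treats that as a %-formatting argument even though the message has no placeholder for it. A minimal sketch that reproduces just the logging failure, with nothing Ray-specific in it:

import logging

logging.basicConfig()
logger = logging.getLogger(__name__)

try:
    open("/no/such/file")
except FileNotFoundError as e:
    # Buggy: the extra positional argument is treated as a %-format value,
    # but the message has no "%s", so the handler fails with
    # "TypeError: not all arguments converted during string formatting"
    # and prints "--- Logging error ---".
    logger.exception("Error computing CPU usage of Ray Kubernetes pod.", e)
    # Correct: logger.exception() already records the active exception,
    # so the message alone is enough:
    # logger.exception("Error computing CPU usage of Ray Kubernetes pod.")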

When I SSH into a node, the file is indeed not there; there is no cpuacct folder at all. Here is the ls output for /sys/fs/cgroup:

(base) ray@ray-cluster-ray-worker-type-dz2mx:/sys/fs/cgroup$ ls -la
total 0
drwxr-xr-x 2 root root 0 Dec 13 11:18 .
drwxr-xr-x 9 root root 0 Dec 13 11:18 ..
-r--r--r-- 1 root root 0 Dec 13 11:18 cgroup.controllers
-r--r--r-- 1 root root 0 Dec 13 11:18 cgroup.events
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.freeze
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.max.depth
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.procs
-r--r--r-- 1 root root 0 Dec 13 11:18 cgroup.stat
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.threads
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.type
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpu.max
-r--r--r-- 1 root root 0 Dec 13 11:18 cpu.stat
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpu.weight
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpu.weight.nice
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpuset.cpus
-r--r--r-- 1 root root 0 Dec 13 11:18 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpuset.mems
-r--r--r-- 1 root root 0 Dec 13 11:18 cpuset.mems.effective
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.current
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.events
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.events.local
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.max
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.rsvd.current
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.rsvd.max
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.current
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.events
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.events.local
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.max
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.rsvd.current
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.rsvd.max
-rw-r--r-- 1 root root 0 Dec 13 11:18 io.bfq.weight
-rw-r--r-- 1 root root 0 Dec 13 11:18 io.latency
-rw-r--r-- 1 root root 0 Dec 13 11:18 io.max
-r--r--r-- 1 root root 0 Dec 13 11:18 io.stat
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.current
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.events
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.events.local
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.high
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.low
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.max
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.min
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.oom.group
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.stat
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.swap.current
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.swap.events
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.swap.high
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.swap.max
-r--r--r-- 1 root root 0 Dec 13 11:18 pids.current
-r--r--r-- 1 root root 0 Dec 13 11:18 pids.events
-rw-r--r-- 1 root root 0 Dec 13 11:18 pids.max
-r--r--r-- 1 root root 0 Dec 13 11:18 rdma.current
-rw-r--r-- 1 root root 0 Dec 13 11:18 rdma.max

Is there any way to correct this logging? Please let me know if you need more info to help out.

System info: macOS 11.6.1
Docker Desktop: 4.3.0
Docker CLI: 20.10.11
Minikube: v1.22.0
Kubernetes: 1.20.7
Ray: 1.9.0 (basic install from the repo, but installed to the namespace ‘ray’)

Hi @aggiemaster, sorry to hear you’re running into this! Do you know if it only happens with Ray Serve, or if it happens with other Ray jobs as well?

@Dmitri any ideas about this?

And thanks for including all the details!

It does look like we’ve made an assumption about the file system that didn’t work out here.

Would you mind posting a repro script and the versions of macOS and minikube?

Thanks for looking at the issue. I did a clean install on a new minikube this morning to validate. My install used the charts as-is from the Ray repo in ray/deploy/charts.

On my machine, the log issue appears even before installing any jobs or deployments into the cluster; I had not noticed this before. I am able to see the logs in the dashboard UI at http://localhost:8265. One interesting thing I have seen is that the dashboard UI stopped working once I did a Ray Serve deployment, so the logs were tougher to get.

Here are the steps I used:

Start minikube:

minikube start \
    --kubernetes-version=v1.20.7 \
    --memory=8192 --cpus=4 --disk-size=20g \
    -p miniray

Install Ray:

cd ray/deploy/charts
helm -n ray install ray-cluster --create-namespace ./ray

Port-forward to the cluster:
kubectl -n ray port-forward service/ray-cluster-ray-head 10001:10001
kubectl -n ray port-forward service/ray-cluster-ray-head 8265:8265
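With those ports forwarded, a quick way to confirm the cluster is reachable from the laptop (a minimal sketch; it assumes the default Ray client server on port 10001):

import ray

# Connect through the forwarded Ray client port.
ray.init("ray://127.0.0.1:10001")

@ray.remote
def ping():
    return "pong"

# Should print "pong" if the cluster is reachable.
print(ray.get(ping.remote()))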

System info: macOS 11.6.1
Docker Desktop: 4.3.0
Docker CLI: 20.10.11
Minikube: v1.22.0
Kubernetes: 1.20.7
Ray: 1.9.0 (basic install from the repo)

Here are some screenshots of the logs from the dashboard:

I went back and re-ran this with upgraded minikube and Kubernetes versions, and I am seeing the same logging issues.

minikube v1.24.0 on Darwin 11.6.1
Kubernetes v1.22.3

Tracking this on GitHub to make sure it isn’t lost.

Not yet able to replicate this.
Will try updating my Docker Desktop.

@aggiemaster
Are you using the Docker driver for minikube?

OK, after updating Docker Desktop, I see the same issue.

@Dmitri Thank you for validating.

I just saw your first question. When I start minikube, I am not specifying the --driver flag, so I assume it’s using Docker as the default. Here’s the start command I am using with minikube:

minikube start \
    --kubernetes-version=v1.20.7 \
    --memory=8192 --cpus=4 --disk-size=20g \
    -p miniray

While we work on adapting the Ray dashboard to cgroups v2, a workaround is to downgrade Docker Desktop to 4.2.0.
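For context on what the adaptation involves: cgroups v1 exposes a container’s total CPU time as a single nanosecond counter at /sys/fs/cgroup/cpuacct/cpuacct.usage, while cgroups v2 (which newer Docker Desktop uses) exposes it as the usage_usec field, in microseconds, inside /sys/fs/cgroup/cpu.stat. A minimal sketch of a version-tolerant reader, not the actual patch:

CGROUP_V1_CPU_USAGE_PATH = "/sys/fs/cgroup/cpuacct/cpuacct.usage"
CGROUP_V2_CPU_STAT_PATH = "/sys/fs/cgroup/cpu.stat"

def cpu_usage_nanoseconds():
    """Total CPU time consumed by this cgroup, in nanoseconds."""
    try:
        # cgroups v1: a single integer, already in nanoseconds.
        with open(CGROUP_V1_CPU_USAGE_PATH) as f:
            return int(f.read())
    except FileNotFoundError:
        # cgroups v2: cpu.stat holds "key value" lines; usage_usec is in
        # microseconds, so convert to nanoseconds for a consistent unit.
        with open(CGROUP_V2_CPU_STAT_PATH) as f:
            for line in f:
                key, value = line.split()
                if key == "usage_usec":
                    return int(value) * 1000
    raise RuntimeError("Could not determine cgroup CPU usage.")

A CPU percentage then falls out of the delta between two such readings over the sampling interval.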

Thanks for the workaround. I can confirm the error logging does not occur with Docker Desktop 4.2.

To follow up, this issue was fixed in ray-project/ray Pull Request #21688 (“[dashboard][kubernetes] Dashboard CPU and memory adjustments.” by DmitriGekhtman). Things should now work with the latest version of Docker Desktop.

However, the first release including the fix is the recent Ray 1.12.0.