Logging error computing CPU usage of Ray Kubernetes pod

Hi,
I am running a local install of Ray in a minikube environment on Mac. When I deploy to Ray Serve (a simple hello-world endpoint), the deployment works, but I see logging error output that repeats over and over from the raylet, which makes it difficult to find any other logging from the deployments. The repeated logs come prefixed with (raylet) or (raylet, ip=<ip_address>), depending on how many nodes are running.

I believe the two important lines in the logging are:
FileNotFoundError: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.usage'
and
Message: 'Error computing CPU usage of Ray Kubernetes pod.'

Here is the full log block:

(raylet, ip=172.17.0.7) --- Logging error ---
(raylet, ip=172.17.0.7) Traceback (most recent call last):
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/k8s_utils.py", line 38, in cpu_percent
(raylet, ip=172.17.0.7)     cpu_usage = _cpu_usage()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/k8s_utils.py", line 64, in _cpu_usage
(raylet, ip=172.17.0.7)     return int(open(CPU_USAGE_PATH).read())
(raylet, ip=172.17.0.7) FileNotFoundError: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.usage'
(raylet, ip=172.17.0.7)
(raylet, ip=172.17.0.7) During handling of the above exception, another exception occurred:
(raylet, ip=172.17.0.7)
(raylet, ip=172.17.0.7) Traceback (most recent call last):
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/handlers.py", line 69, in emit
(raylet, ip=172.17.0.7)     if self.shouldRollover(record):
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/handlers.py", line 185, in shouldRollover
(raylet, ip=172.17.0.7)     msg = "%s\n" % self.format(record)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
(raylet, ip=172.17.0.7)     return fmt.format(record)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
(raylet, ip=172.17.0.7)     record.message = record.getMessage()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/logging/__init__.py", line 369, in getMessage
(raylet, ip=172.17.0.7)     msg = msg % self.args
(raylet, ip=172.17.0.7) TypeError: not all arguments converted during string formatting
(raylet, ip=172.17.0.7) Call stack:
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py", line 376, in <module>
(raylet, ip=172.17.0.7)     loop.run_until_complete(agent.run())
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 574, in run_until_complete
(raylet, ip=172.17.0.7)     self.run_forever()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
(raylet, ip=172.17.0.7)     self._run_once()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
(raylet, ip=172.17.0.7)     handle._run()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/asyncio/events.py", line 88, in _run
(raylet, ip=172.17.0.7)     self._context.run(self._callback, *self._args)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 554, in run
(raylet, ip=172.17.0.7)     await self._perform_iteration(aioredis_client)
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 539, in _perform_iteration
(raylet, ip=172.17.0.7)     stats = self._get_all_stats()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 334, in _get_all_stats
(raylet, ip=172.17.0.7)     "cpu": self._get_cpu_percent(),
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 193, in _get_cpu_percent
(raylet, ip=172.17.0.7)     return k8s_utils.cpu_percent()
(raylet, ip=172.17.0.7)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/k8s_utils.py", line 57, in cpu_percent
(raylet, ip=172.17.0.7)     logger.exception("Error computing CPU usage of Ray Kubernetes pod.", e)
(raylet, ip=172.17.0.7) Message: 'Error computing CPU usage of Ray Kubernetes pod.'
(raylet, ip=172.17.0.7) Arguments: (FileNotFoundError(2, 'No such file or directory'),)
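If I am reading the call stack correctly, the secondary TypeError is a separate, smaller bug: k8s_utils.py passes the caught exception to logger.exception() as an extra positional argument, and logging treats that as a %-formatting argument even though the message has no placeholder for it. A minimal sketch that reproduces just the logging failure, with nothing Ray-specific in it:

import logging

logging.basicConfig()
logger = logging.getLogger(__name__)

try:
    open("/no/such/file")
except FileNotFoundError as e:
    # Buggy: the extra positional argument is treated as a %-format value,
    # but the message has no "%s", so the handler fails with
    # "TypeError: not all arguments converted during string formatting"
    # and prints "--- Logging error ---".
    logger.exception("Error computing CPU usage of Ray Kubernetes pod.", e)
    # Correct: logger.exception() already records the active exception,
    # so the message alone is enough:
    # logger.exception("Error computing CPU usage of Ray Kubernetes pod.")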

When I SSH into a node, the file is indeed not there; there is no cpuacct folder at all. Here is the ls output for /sys/fs/cgroup:

(base) ray@ray-cluster-ray-worker-type-dz2mx:/sys/fs/cgroup$ ls -la
total 0
drwxr-xr-x 2 root root 0 Dec 13 11:18 .
drwxr-xr-x 9 root root 0 Dec 13 11:18 ..
-r--r--r-- 1 root root 0 Dec 13 11:18 cgroup.controllers
-r--r--r-- 1 root root 0 Dec 13 11:18 cgroup.events
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.freeze
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.max.depth
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.procs
-r--r--r-- 1 root root 0 Dec 13 11:18 cgroup.stat
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.threads
-rw-r--r-- 1 root root 0 Dec 13 11:18 cgroup.type
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpu.max
-r--r--r-- 1 root root 0 Dec 13 11:18 cpu.stat
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpu.weight
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpu.weight.nice
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpuset.cpus
-r--r--r-- 1 root root 0 Dec 13 11:18 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 Dec 13 11:18 cpuset.mems
-r--r--r-- 1 root root 0 Dec 13 11:18 cpuset.mems.effective
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.current
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.events
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.events.local
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.max
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.rsvd.current
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.1GB.rsvd.max
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.current
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.events
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.events.local
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.max
-r--r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.rsvd.current
-rw-r--r-- 1 root root 0 Dec 13 11:18 hugetlb.2MB.rsvd.max
-rw-r--r-- 1 root root 0 Dec 13 11:18 io.bfq.weight
-rw-r--r-- 1 root root 0 Dec 13 11:18 io.latency
-rw-r--r-- 1 root root 0 Dec 13 11:18 io.max
-r--r--r-- 1 root root 0 Dec 13 11:18 io.stat
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.current
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.events
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.events.local
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.high
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.low
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.max
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.min
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.oom.group
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.stat
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.swap.current
-r--r--r-- 1 root root 0 Dec 13 11:18 memory.swap.events
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.swap.high
-rw-r--r-- 1 root root 0 Dec 13 11:18 memory.swap.max
-r--r--r-- 1 root root 0 Dec 13 11:18 pids.current
-r--r--r-- 1 root root 0 Dec 13 11:18 pids.events
-rw-r--r-- 1 root root 0 Dec 13 11:18 pids.max
-r--r--r-- 1 root root 0 Dec 13 11:18 rdma.current
-rw-r--r-- 1 root root 0 Dec 13 11:18 rdma.max

Is there any way to correct this logging? Please let me know if you need more info to help out.

System info: macOS 11.6.1
Docker Desktop: 4.3.0
Docker CLI: 20.10.11
Minikube: v1.22.0
Kubernetes: 1.20.7
Ray: 1.9.0 (basic install from the repo, but installed to the namespace ‘ray’)

Hi @aggiemaster, sorry to hear you’re running into this! Do you know if it only happens with Ray Serve, or if it happens with other Ray jobs as well?

@Dmitri any ideas about this?

And thanks for including all the details!

It does look like we’ve made an assumption about the file system that didn’t work out here.

Would you mind posting a repro script and the versions of macOS and minikube?

Thanks for looking at the issue. I did a clean install on a new minikube this morning to validate. My install used the charts as-is from the Ray repo in ray/deploy/charts.

On my machine, the log issue appears even before installing any jobs or deployments into the cluster; I had not noticed this before. I am able to see the logs in the dashboard UI at http://localhost:8265. One interesting thing I have seen is that the dashboard UI stopped working once I did a Ray Serve deployment, so the logs were tougher to get.

Here are the steps I used:

Start minikube:

minikube start \
    --kubernetes-version=v1.20.7 \
    --memory=8192 --cpus=4 --disk-size=20g \
    -p miniray

Install Ray:

cd ray/deploy/charts
helm -n ray install ray-cluster --create-namespace ./ray

Port-forward to the cluster:
kubectl -n ray port-forward service/ray-cluster-ray-head 10001:10001
kubectl -n ray port-forward service/ray-cluster-ray-head 8265:8265
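With those ports forwarded, a quick way to confirm the cluster is reachable from the laptop (a minimal sketch; it assumes the default Ray client server on port 10001):

import ray

# Connect through the forwarded Ray client port.
ray.init("ray://127.0.0.1:10001")

@ray.remote
def ping():
    return "pong"

# Should print "pong" if the cluster is reachable.
print(ray.get(ping.remote()))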

System info: macOS 11.6.1
Docker Desktop: 4.3.0
Docker CLI: 20.10.11
Minikube: v1.22.0
Kubernetes: 1.20.7
Ray: 1.9.0 (basic install from the repo)

Here are some screenshots of the logs from the dashboard:

I went back and re-ran this with upgraded minikube and Kubernetes versions, and I am seeing the same logging issues.

minikube v1.24.0 on Darwin 11.6.1
Kubernetes v1.22.3

Tracking this on GitHub to make sure it isn’t lost.

Not yet able to replicate this.
Will try updating my Docker Desktop.

@aggiemaster
Are you using the Docker driver for minikube?

OK, after updating Docker Desktop, I see the same issue.

@Dmitri Thank you for validating.

I just saw your first question. When I start minikube, I am not specifying the --driver flag, so I assume it’s using Docker as the default. Here’s the start command I am using with minikube:

minikube start \
    --kubernetes-version=v1.20.7 \
    --memory=8192 --cpus=4 --disk-size=20g \
    -p miniray

While we work on adapting the Ray dashboard to cgroups v2, a workaround is to downgrade Docker Desktop to 4.2.0.
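For context on what the adaptation involves: cgroups v1 exposes a container’s total CPU time as a single nanosecond counter at /sys/fs/cgroup/cpuacct/cpuacct.usage, while cgroups v2 (which newer Docker Desktop uses) exposes it as the usage_usec field, in microseconds, inside /sys/fs/cgroup/cpu.stat. A minimal sketch of a version-tolerant reader, not the actual patch:

CGROUP_V1_CPU_USAGE_PATH = "/sys/fs/cgroup/cpuacct/cpuacct.usage"
CGROUP_V2_CPU_STAT_PATH = "/sys/fs/cgroup/cpu.stat"

def cpu_usage_nanoseconds():
    """Total CPU time consumed by this cgroup, in nanoseconds."""
    try:
        # cgroups v1: a single integer, already in nanoseconds.
        with open(CGROUP_V1_CPU_USAGE_PATH) as f:
            return int(f.read())
    except FileNotFoundError:
        # cgroups v2: cpu.stat holds "key value" lines; usage_usec is in
        # microseconds, so convert to nanoseconds for a consistent unit.
        with open(CGROUP_V2_CPU_STAT_PATH) as f:
            for line in f:
                key, value = line.split()
                if key == "usage_usec":
                    return int(value) * 1000
    raise RuntimeError("Could not determine cgroup CPU usage.")

A CPU percentage then falls out of the delta between two such readings over the sampling interval.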

Thanks for the workaround. I can confirm the error logging does not occur with Docker Desktop 4.2.

To follow up, this issue was fixed in ray-project/ray Pull Request #21688 (“[dashboard][kubernetes] Dashboard CPU and memory adjustments.” by DmitriGekhtman). Things should now work with the latest version of Docker Desktop.

However, the first release including the fix is the recent Ray 1.12.0.