Hi there, I recently ran into a situation where the gcs_server process on my head node is taking up a whole core, even though there are no active tasks running at that moment.
And at that point, nothing is being logged to gcs_server.out/err either.
My cluster is set up on k8s with 100 cores (Ray version is 1.9). When the cluster first starts, gcs_server seems to work fine (initially there are only 10 cores). Then I launch my compute tasks, about 100 of them; with the autoscaler the cluster scales to 100 cores and finishes all the tasks. After that, I find that gcs_server has become very busy with something.
I also notice that my tasks are scheduled to worker nodes very slowly. For example, I launch 100 tasks at the same time, and there are 100 cores available. I would expect all cores to be taken up pretty quickly; however, what I observe from the dashboard is that my 100 tasks get scheduled to workers at a rate of roughly 5 per second, which leads to very low utilization of the grid. Not sure whether it's because of the gcs_server.
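For reference, the workload is roughly of this shape (a simplified sketch; the task body and timing here are placeholders, not our real code):

```python
import time
import ray

ray.init(address="auto")

# Each task needs one CPU; the real task body is replaced by a sleep here.
@ray.remote(num_cpus=1)
def compute(i):
    time.sleep(60)  # placeholder for the actual computation
    return i

# Submit ~100 tasks at once; with 100 cores available I'd expect them
# all to start almost immediately, but they trickle out at ~5/s.
refs = [compute.remote(i) for i in range(100)]
ray.get(refs)
```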
I'd appreciate it if anyone could shed some light on this, and I'd be happy to provide more information.
Could you provide a reproduction script? That's a very odd issue and potentially a bug.
I also noticed you have a ServeController.listen_for_changes() actor running with significant CPU utilization; maybe that's related, since Serve monitors the cluster state through the GCS.
Unfortunately, I can't reproduce it in a reasonably small example, and in fact I've only seen this in our test environments. I tried setting up a k8s cluster with Rancher and ran a similar workload, but couldn't reproduce the issue.
Here's the result of ray status I just ran from my head node. At this point there's no active work going on, and the 100 workers that were brought up 20 minutes earlier are still standing by.
One more thing to mention: our head node's resource configuration (CPU & memory) is the same as the workers', which might not be sufficient? What's the recommended config for the head node?
Btw, when you say 100 cores, do you mean each node ("worker" here) has 1 core? Generally we recommend allocating as large a node as possible (e.g., 32 cores per node), since each node adds scheduling overhead.
Is this the only one that shows up when you take a stack sample? If so, then maybe it could be the ServeController polling resources, though I couldn’t confirm this based on the stack trace.
@simon-mo @eoakes on a 100-node cluster, would it be expected that the ServeController could cause a lot of GCS load by polling cluster status or something?
Hi @simon-mo @ericl, I gave it a try by increasing CONTROL_LOOP_PERIOD_S. It does help reduce the CPU utilization of the ServeController. For example, when I set CONTROL_LOOP_PERIOD_S to 1, the ServeController takes less than 10% of a CPU with 100 workers running. However, it doesn't help much with gcs_server, whose CPU consumption is still around 60%-70%. Is that expected? In addition, it still takes a similar amount of time as before (about 90 seconds) to distribute 100 tasks to 100 workers.
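In case it helps anyone else, here's a quick sketch of checking the current value from Python (assuming the constant is defined in ray.serve.constants, which is where I found it):

```python
# Check the Serve control-loop period from the driver
# (assumes the constant lives in ray.serve.constants, as in Ray 1.x).
from ray.serve import constants

print(constants.CONTROL_LOOP_PERIOD_S)

# Note: the ServeController runs as a separate actor process and imports
# this module itself, so reassigning the value here at runtime won't affect
# a controller that is already running; the override has to be in place
# wherever the controller process starts.
```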
Regarding cores, I meant "workers", and in our cluster each node has many cores.
I think there is a misconfiguration here. Each worker node in Ray should be configured with many cores, so ideally in your cluster status you'd see something like "4 workers" with 16-32+ CPUs each. Assigning 1 core per worker node is an extremely inefficient / unusual configuration for Ray, since it basically flattens the two-level scheduling into a single level.
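To double-check what Ray actually sees, something like this (run from the head node) will print the CPU count each node advertises:

```python
import ray

# Connect to the running cluster from the head node.
ray.init(address="auto")

# Print the number of CPUs each alive node advertises to Ray.
# Ideally this shows a handful of nodes with 16-32+ CPUs each,
# rather than ~100 nodes with 1 CPU each.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("CPU", 0))
```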