gcs_server takes almost 100% CPU even though there's no running task

Hi there, I recently ran into a situation where the gcs_server process on my head node is taking up a full core, even though there's no active task running at that moment.


And at this point, there's nothing logged in gcs_server.out/err either.

My cluster is set up on k8s with 100 cores (Ray version 1.9). When the cluster has just started, gcs_server seems to work fine (initially there are only 10 cores). Then I launch my compute tasks, about 100 of them; with the autoscaler, the cluster scales to 100 cores and finishes all the tasks. After that, I find that gcs_server has started being very busy with something.

I also notice that my tasks are scheduled to worker nodes very, very slowly. For example, I launch 100 tasks at the same time, and there are 100 cores. I would expect all cores to be taken up pretty quickly; however, what I observe from the dashboard is that my 100 tasks get scheduled to workers at a rate of roughly 5 per second, which leads to very low utilization of the grid. I'm not sure whether it's because of gcs_server.
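In case it helps, here is roughly the shape of the workload (a simplified sketch I put together for illustration, not the actual code):

# ~100 single-CPU tasks submitted at once against the autoscaled cluster.
import time

import ray

ray.init(address="auto")  # connect to the existing cluster from the head node

@ray.remote(num_cpus=1)
def compute_task(i):
    time.sleep(60)  # stand-in for the real computation
    return i

results = ray.get([compute_task.remote(i) for i in range(100)])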

I'd appreciate it if people could shed some light on this, and I'd be happy to provide more information.

Thanks,
-BS

Could you provide a reproduction script? That's a very odd issue and potentially a bug.

I also noticed you have a ServeController.listen_for_changes() actor running with significant CPU utilization. Maybe that's related, since Serve does monitor the cluster state through the GCS.


Yes, this is really weird. Btw, how many nodes are there in your cluster? Could you also run ray status to show some cluster information?

Thanks @yic and @ericl

Unfortunately, I can't reproduce it in a reasonably small example. In fact, I've only seen this in our test environments. I tried setting up a k8s cluster with Rancher and ran a similar workload, but couldn't reproduce the issue.

Here's the result of ray status I just ran from my head node. At this point, there's no active work going on, and there are 100 workers that were brought up 20 min earlier (still standing by).

It seems that gcs_server's high CPU utilization is proportional to the number of workers.

One more thing to mention is that our head node's resource configuration (CPU & memory) is the same as the workers', which might not be sufficient? What's the recommended config for the head?

You might want to give the head some more resources, but GCS shouldn't use 100% CPU either way. I think to diagnose this we need one of two things:

  1. The stack trace of the ServeController actor, which you can get by running the ray stack command on the head node, or
  2. Some sampled stack traces from the gcs_server binary: Profiling for Ray Developers — Ray 3.0.0.dev0 (one quick way to grab a sample is sketched below)
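For option 2, here is a minimal sketch of grabbing a single native stack sample, assuming gdb and pgrep are available on the head node and you can attach to the process (the profiling guide linked above covers more thorough sampling):

import subprocess

# Find the gcs_server PID and dump all thread backtraces with gdb.
pid = subprocess.check_output(["pgrep", "-f", "gcs_server"]).decode().split()[0]
stacks = subprocess.check_output(
    ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"]
)
print(stacks.decode(errors="replace"))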

100 workers

Btw, when you say 100 cores, do you mean each node ("worker" here) has 1 core? Generally we recommend allocating as large a node as possible (e.g., 32 cores per node), since each node adds scheduling overhead.

Here’s the ray stack result after 100 tasks finish:


Regarding cores, I meant 'workers', and in our cluster each node has many cores.

At the same time, I'm trying to profile gcs_server and will report back later.

Thanks Eric

Eric, here are logs from gcs_server, Serve, and the dashboard when their CPU load is low and high. I uploaded those files to a GitHub repo:

Hmm, the main suspicious stack I see from high_cpu_gcs_server.log is

#12 0x0000563d386d2f0f in ray::gcs::GcsResourceManager::HandleGetResources(ray::rpc::GetResourcesRequest const&, ray::rpc::GetResourcesReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) ()
#13 0x0000563d3861647c in std::_Function_handler<void (), ray::rpc::ServerCallImpl<ray::rpc::NodeResourceInfoGcsServiceHandler, ray::rpc::GetResourcesRequest, ray::rpc::GetResourcesReply>::HandleRequest()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

Is this the only one that shows up when you take a stack sample? If so, then maybe it could be the ServeController polling resources, though I couldn’t confirm this based on the stack trace.

@simon-mo @eoakes on a 100-node cluster, would it be expected that ServeController could cause a lot of GCS load polling cluster status or something?

Ah yes! There’s a call to ray.state.node_ids() every 0.1s right now. Looks very likely to be the root cause.

I would recommend sending a PR to make the constant an env var to start with, so it can be configured.
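Roughly something like this (just a sketch; the constant name comes up later in this thread, and the exact file and default in the Serve codebase are whatever is currently used):

# Sketch only: read the Serve control-loop period from an env var, keeping the
# current 0.1 s polling interval as the default.
import os

CONTROL_LOOP_PERIOD_S = float(os.environ.get("CONTROL_LOOP_PERIOD_S", "0.1"))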

It is indeed the only one that shows up.

Great, I will send a PR later.

One more question: is there any side-effect if I increase this number, say to 1 sec?

Thanks,
-BS

Hi @simon-mo @ericl, I gave it a try by increasing CONTROL_LOOP_PERIOD_S. It indeed helps reduce the CPU utilization of ServeController. For example, when I set CONTROL_LOOP_PERIOD_S to 1, ServeController takes less than 10% of a CPU when there are 100 workers running. However, it doesn't help much with gcs_server, whose CPU consumption is still around 60%-70%. Is that expected? In addition, it still takes a similar amount of time (about 90 seconds) to distribute 100 tasks to 100 workers as before.

Regarding cores, I meant 'workers', and in our cluster each node has many cores.

I think there is a misconfiguration here. Each worker node in Ray should be configured with many cores, so ideally in your cluster status you'd see something like "4 workers", with 16-32+ CPUs each. Assigning 1 core per worker node is an extremely inefficient / unusual configuration for Ray, since it basically flattens the two-level scheduling into a single level.
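As a quick sanity check (a minimal sketch, run from the head node of the live cluster), you can print how many CPUs each node actually reports:

# Print the CPU count reported by each live Ray node.
import ray

ray.init(address="auto")
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("CPU", 0))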


Aha, that makes sense, thanks Eric. I can confirm it solved both the gcs_server and the task scheduling issue.