gcs_server takes almost 100% CPU even though there's no running task

Hi there, I recently ran into a situation where the gcs_server process on my head node is taking up a full core, even though there's no active task running at that moment.


And at this point, there's nothing logged in gcs_server.out/err either.

My cluster is set up on k8s with 100 cores (Ray version 1.9). When the cluster has just started, gcs_server seems to work fine (initially there are only 10 cores). Then I launch my compute tasks, about 100 of them; with the autoscaler, the cluster scales to 100 cores and finishes all the tasks. After that, I find that gcs_server has started being very busy with something.

I also notice that my tasks are scheduled to worker nodes very, very slowly. For example, I launch 100 tasks at the same time, and there are 100 cores. I would expect all cores to be taken up pretty quickly; however, what I observe from the dashboard is that my 100 tasks get scheduled to workers at a rate of roughly 5 per second, which leads to very low utilization of the grid. I'm not sure whether it's because of gcs_server.
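In case it helps, here is roughly the shape of the workload (a simplified sketch I put together for illustration, not the actual code):

# ~100 single-CPU tasks submitted at once against the autoscaled cluster.
import time

import ray

ray.init(address="auto")  # connect to the existing cluster from the head node

@ray.remote(num_cpus=1)
def compute_task(i):
    time.sleep(60)  # stand-in for the real computation
    return i

results = ray.get([compute_task.remote(i) for i in range(100)])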

I'd appreciate it if people could shed some light on this, and I'd be happy to provide more information.

Thanks,
-BS

Could you provide a reproduction script? That's a very odd issue and potentially a bug.

I also noticed you have a ServeController.listen_for_changes() actor running with significant CPU utilization. Maybe that's related, since Serve does monitor the cluster state through the GCS.


Yes, this is really weird. Btw, how many nodes are there in your cluster? Could you also run ray status to show some cluster information?

Thanks @yic and @ericl

Unfortunately, I can't reproduce it in a reasonably small example. In fact, I've only seen this in our test environments. I tried setting up a k8s cluster with Rancher and ran a similar workload, but couldn't reproduce the issue.

Here's the result of ray status I just ran from my head node. At this point, there's no active work going on, and there are 100 workers that were brought up 20 min earlier (still standing by).

It seems that gcs_server's high CPU utilization is proportional to the number of workers.

One more thing to mention is that our head node's resource configuration (CPU & memory) is the same as the workers', which might not be sufficient? What's the recommended config for the head?

You might want to give the head some more resources, but GCS shouldn't use 100% CPU either way. I think to diagnose this we need one of two things:

  1. The stack trace of the ServeController actor, which you can get by running the ray stack command on the head node, or
  2. Some sampled stack traces from the gcs_server binary: Profiling for Ray Developers — Ray 3.0.0.dev0 (one quick way to grab a sample is sketched below)
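For option 2, here is a minimal sketch of grabbing a single native stack sample, assuming gdb and pgrep are available on the head node and you can attach to the process (the profiling guide linked above covers more thorough sampling):

import subprocess

# Find the gcs_server PID and dump all thread backtraces with gdb.
pid = subprocess.check_output(["pgrep", "-f", "gcs_server"]).decode().split()[0]
stacks = subprocess.check_output(
    ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"]
)
print(stacks.decode(errors="replace"))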

100 workers

Btw, when you say 100 cores, do you mean each node ("worker" here) has 1 core? Generally we recommend allocating as large a node as possible (e.g., 32 cores per node), since each node adds scheduling overhead.

Here’s the ray stack result after 100 tasks finish:


Regarding cores, I meant 'workers', and in our cluster each node has many cores.

At the same time, I'm trying to profile gcs_server and will report back later.

Thanks Eric

Eric, here are logs from gcs_server, Serve, and the dashboard when their CPU load is low and high. I uploaded those files to a GitHub repo:

Hmm, the main suspicious stack I see from high_cpu_gcs_server.log is

#12 0x0000563d386d2f0f in ray::gcs::GcsResourceManager::HandleGetResources(ray::rpc::GetResourcesRequest const&, ray::rpc::GetResourcesReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) ()
#13 0x0000563d3861647c in std::_Function_handler<void (), ray::rpc::ServerCallImpl<ray::rpc::NodeResourceInfoGcsServiceHandler, ray::rpc::GetResourcesRequest, ray::rpc::GetResourcesReply>::HandleRequest()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

Is this the only one that shows up when you take a stack sample? If so, then maybe it could be the ServeController polling resources, though I couldn’t confirm this based on the stack trace.

@simon-mo @eoakes on a 100-node cluster, would it be expected that ServeController could cause a lot of GCS load polling cluster status or something?

Ah yes! There’s a call to ray.state.node_ids() every 0.1s right now. Looks very likely to be the root cause.

I would recommend sending a PR to make the constant an env var to start with, so it can be configured.
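Roughly something like this (just a sketch; the constant name comes up later in this thread, and the exact file and default in the Serve codebase are whatever is currently used):

# Sketch only: read the Serve control-loop period from an env var, keeping the
# current 0.1 s polling interval as the default.
import os

CONTROL_LOOP_PERIOD_S = float(os.environ.get("CONTROL_LOOP_PERIOD_S", "0.1"))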

It is indeed the only one that shows up.

Great, I will send a PR later.

One more question: is there any side-effect if I increase this number, say to 1 sec?

Thanks,
-BS

Hi @simon-mo @ericl, I gave it a try by increasing CONTROL_LOOP_PERIOD_S. It indeed helps reduce the CPU utilization of ServeController. For example, when I set CONTROL_LOOP_PERIOD_S to 1, ServeController takes less than 10% of a CPU when there are 100 workers running. However, it doesn't help much with gcs_server, whose CPU consumption is still around 60%-70%. Is that expected? In addition, it still takes a similar amount of time (about 90 seconds) to distribute 100 tasks to 100 workers as before.

Regarding cores, I meant 'workers', and in our cluster each node has many cores.

I think there is a misconfiguration here. Each worker node in Ray should be configured with many cores, so ideally in your cluster status you'd see something like "4 workers", with 16-32+ CPUs each. Assigning 1 core per worker node is an extremely inefficient / unusual configuration for Ray, since it basically flattens the two-level scheduling into a single level.
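As a quick sanity check (a minimal sketch, run from the head node of the live cluster), you can print how many CPUs each node actually reports:

# Print the CPU count reported by each live Ray node.
import ray

ray.init(address="auto")
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"].get("CPU", 0))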


Aha, that makes sense, thanks Eric. I can confirm it solved both the gcs_server and the task scheduling issue.