Task failed with GrpcUnavailable

Hi guys,

We have a Ray cluster running on K8s. Recently, when we launched 300 tasks and waited for their results, many of them failed quickly with the errors below:

[2023-12-10 00:14:44,985 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,117 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,257 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,389 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,550 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,651 I 16660 16671] task_manager.cc:891: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=bct.distributed.ray.pricing.ray_pricing, class_name=, function_name=price_requests_remote, function_hash=814a2dc5d924496ab07a27940914fa0f}, task_id=a438df6b7d970c5effffffffffffffffffffffff01000000, task_name=price_requests_remote, job_id=01000000, num_args=6, num_returns=1, depth=228122, attempt_number=0, max_retries=3, runtime_env_hash=-17597244, eager_install=1, setup_timeout_seconds=600
[2023-12-10 00:14:45,651 I 16660 16671] raylet_client.cc:381: Error returning worker: Invalid: Returned worker does not exist any more
[2023-12-10 00:14:45,688 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,852 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,991 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:46,122 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
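
With 300 tasks in flight, it may help to tally the failures rather than read them one by one. The following is only a rough sketch: the regular expression is derived from the single `Task failed` line shown above, and `summarize_failures` is a hypothetical helper name, not part of Ray.

```python
import re
from collections import Counter

# Pattern based on the one "Task failed" line in the log excerpt above.
# Other Ray versions may format this line differently (assumption).
TASK_FAILED = re.compile(
    r"Task failed: (?P<error>\w+): .*?"
    r"task_name=(?P<task_name>[^,]+), .*?"
    r"attempt_number=(?P<attempt>\d+), max_retries=(?P<max_retries>-?\d+)"
)


def summarize_failures(lines):
    """Count failed attempts per (task_name, error) pair."""
    counts = Counter()
    for line in lines:
        match = TASK_FAILED.search(line)
        if match:
            counts[(match["task_name"], match["error"])] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle for whichever log these lines came from can be passed in directly.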

I found a similar post, but it doesn’t seem to be the issue we encountered.


The error is reproducible in our environment. Is there any recommendation on how to investigate what’s happening here?


cc @jjyao for thoughts

When Ray schedules a task, this is what happens:

  1. Ray asks the raylet for a worker to run the task.
  2. Once a worker is found, Ray sends a gRPC request to that worker to start the task.

In your case, the gRPC request in step 2 fails with “Socket closed”. This error comes from gRPC itself. I personally haven’t seen it before, but I think it can happen on a network failure or something similar.

Do you deploy Ray in a special environment that could trigger network failures more often? Also, can your worker nodes and head node communicate with each other on all ports between --min-worker-port (default 10002) and --max-worker-port (default 19999)?
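
One rough way to check port reachability is a plain TCP probe run from one node against another. This is a minimal sketch using only the standard library. `probe_port` and `probe_range` are hypothetical helper names, and the three-way classification is an assumption about how to read the results: a refused connection means the host answered but nothing is listening, while a timeout or other error hints that traffic may be blocked.

```python
import socket


def probe_port(host, port, timeout=1.0):
    """Return 'open', 'closed', or 'unreachable' for one TCP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no process is listening on this port.
        return "closed"
    except OSError:
        # Timeout, no route, DNS failure: traffic may be filtered.
        return "unreachable"


def probe_range(host, ports, timeout=1.0):
    """Probe several ports and group them by result."""
    results = {"open": [], "closed": [], "unreachable": []}
    for port in ports:
        results[probe_port(host, port, timeout)].append(port)
    return results
```

For example, `probe_range(node_ip, range(10002, 10020))` samples the start of the default range, where `node_ip` is a placeholder for the address of another node. A "closed" result is not necessarily a problem, since a worker port should only accept connections while a worker process is bound to it. Many "unreachable" results would be the more worrying sign.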

Thanks sangcho. It turned out to be an OOM of the Ray Serve process. In my application, I have a Serve API deployed; when it receives a request, it generates many Ray remote tasks and distributes them to the grid. When the above error happened, I also saw something like this in the Ray Serve log:

On the dashboard, I do see the memory of that particular process increase each time it’s called. To find the memory leak, I tried memray following the doc, but didn’t find any suspicious objects in the flamegraph. I’m wondering whether it’s native memory that leaks. Any suggestions on how I should investigate?


Native memory should show up in memray, I believe.

What’s the actual memory usage when you see this OOM error? Also, which process takes up most of the memory?
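
As a cross-check on the Python side, the standard library's `tracemalloc` can compare heap snapshots across repeated calls. This is only a sketch: `top_growth` is a hypothetical helper, and `workload` stands in for whatever one request does in the affected process. `tracemalloc` only sees allocations made through Python's allocator, so if the process memory keeps growing while this reports little, that would point toward native memory.

```python
import tracemalloc


def top_growth(workload, repeats=5, limit=5):
    """Run workload() repeatedly and report where Python heap usage grew."""
    tracemalloc.start()
    try:
        workload()  # warm-up, so one-time caches are not counted as growth
        before = tracemalloc.take_snapshot()
        for _ in range(repeats):
            workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Differences are grouped by source line, largest change first.
    stats = after.compare_to(before, "lineno")
    return [stat for stat in stats if stat.size_diff > 0][:limit]
```

Each returned entry carries `size_diff`, `count_diff`, and a `traceback`, so printing the entries shows which source lines accumulated memory between the two snapshots.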

I did more testing and digging, and it looks similar to this issue: Ray remote task + fastapi memory leak · Issue #41260 · ray-project/ray · GitHub. I see the PR is merged.

So the question is: is this fix included in the latest nightly build?


Yes, it is already in the nightly. I am not sure whether it will be included in the next release (2.9), though. cc @jjyao for confirmation.

Thanks sangcho. I also verified that the latest nightly build solves the OOM issue. It would be great if this fix could be included in a release ASAP, because it’s blocking us from releasing some important features in prod.
