Task failed with GrpcUnavailable

Hi guys,

We have a Ray cluster running on K8s, and recently we saw many tasks fail quickly with the errors below when we launched 300 tasks and waited for their results:

[2023-12-10 00:14:44,985 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,117 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,257 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,389 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,550 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,651 I 16660 16671] task_manager.cc:891: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=bct.distributed.ray.pricing.ray_pricing, class_name=, function_name=price_requests_remote, function_hash=814a2dc5d924496ab07a27940914fa0f}, task_id=a438df6b7d970c5effffffffffffffffffffffff01000000, task_name=price_requests_remote, job_id=01000000, num_args=6, num_returns=1, depth=228122, attempt_number=0, max_retries=3, runtime_env_hash=-17597244, eager_install=1, setup_timeout_seconds=600
[2023-12-10 00:14:45,651 I 16660 16671] raylet_client.cc:381: Error returning worker: Invalid: Returned worker does not exist any more
[2023-12-10 00:14:45,688 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,852 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,991 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:46,122 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000

I found a similar post, but it didn't seem to be the same issue we encountered:

https://discuss.ray.io/t/system-will-be-halted-when-tasks-number-is-large/9754/19

The error is reproducible in our environment. Is there any recommendation on how to investigate what's happening here?

Thanks,
-BS

cc @jjyao for thoughts

When Ray schedules a task, this is what happens:

  1. Ray asks the raylet to find a worker to schedule the task on.
  2. Once a worker is found, Ray sends a gRPC request to that worker to start the task.

Basically, in your case, the gRPC request in step 2 fails with “socket closed”. This error comes from gRPC itself. I personally haven’t seen it before, but I think it can happen upon a network failure or something like that.

Do you deploy Ray in a special environment that could trigger network failures more often? Also, are your worker node and head node able to communicate with each other through all ports between --min-worker-port (default 10002) and --max-worker-port (default 19999)?
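In case it helps, a rough connectivity check could look like the sketch below. This is only an illustration: the node IPs come from ray.nodes(), and sampling just a few ports in the range is an assumption to keep it quick. A refused connection means the network path is open but no worker happens to be listening on that port; a timeout usually points to a firewall or network policy dropping traffic.

```python
# Rough sketch: probe a few ports in the worker port range on each node.
# Run from the head node (or any pod that can reach the worker pods).
import socket
import ray

ray.init(address="auto")

SAMPLE_PORTS = [10002, 12345, 15000, 19999]  # arbitrary sample in [10002, 19999]

for node in ray.nodes():
    if not node["Alive"]:
        continue
    ip = node["NodeManagerAddress"]
    for port in SAMPLE_PORTS:
        try:
            with socket.create_connection((ip, port), timeout=3):
                status = "open (a worker is listening)"
        except ConnectionRefusedError:
            status = "reachable (no listener, but not blocked)"
        except OSError as e:  # includes timeouts
            status = f"possibly blocked: {e}"
        print(f"{ip}:{port} -> {status}")
```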

Thanks sangcho. It turned out to be an OOM of the Ray Serve process. In my application, I have a Serve API deployed, and when it receives a request, it generates many Ray remote tasks and distributes them to the grid. When the error above happened, I also saw related errors in the Ray Serve log.
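For context, the fan-out pattern described above looks roughly like the sketch below. This is a simplified illustration, not the actual application code; only the price_requests_remote name and max_retries=3 come from the task log earlier in the thread, everything else is assumed.

```python
# Simplified sketch of a Serve deployment that fans out Ray remote tasks
# per request. Names other than price_requests_remote are illustrative.
import asyncio
import ray
from ray import serve
from fastapi import FastAPI

app = FastAPI()

@ray.remote(max_retries=3)
def price_requests_remote(batch):
    # Placeholder for the real pricing logic.
    return {"priced": len(batch)}

@serve.deployment
@serve.ingress(app)
class PricingApi:
    @app.post("/price")
    async def price(self, payload: dict):
        batches = payload.get("batches", [])
        # One remote task per batch; ObjectRefs are awaitable, so we can
        # gather them without blocking the replica's event loop.
        refs = [price_requests_remote.remote(b) for b in batches]
        return await asyncio.gather(*refs)

pricing_app = PricingApi.bind()
# serve.run(pricing_app)  # deploy from a driver script
```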

I do see the memory of that particular process increase on the dashboard each time it's called. To find the memory leak, I tried memray according to the doc, but I didn't find any suspicious objects in the flamegraph. I'm wondering whether it's native memory that leaks. Any suggestion on how I should investigate?

Thanks,
-BS

Native memory should show up in memray, I believe.

What's the actual memory usage when you see this OOM error? Also, which process takes up the most memory?
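To double-check the native side, memray can capture native (C/C++) allocation stacks as well. Below is a hedged sketch, assuming you can wrap the suspect code path (e.g. the Serve handler or the remote task body) in a memray Tracker; the paths and the wrapped function are placeholders, not your actual code.

```python
# Sketch: capture native allocation stacks with memray inside the suspect
# code path, then render a flamegraph offline.
import memray

def handle_one_request(payload):
    ...  # placeholder for the code path whose memory growth you want to explain

with memray.Tracker("/tmp/serve_replica.bin", native_traces=True):
    handle_one_request({"batches": []})

# Then, on the same machine (or after copying the file off the pod):
#   memray flamegraph /tmp/serve_replica.bin
# With native_traces=True the flamegraph includes C/C++ frames, which is
# where a leak in a native extension would show up.
# (If you can run the whole script under memray instead, "memray run --native"
# achieves the same thing.)
```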

I did more testing and digging, and it looks like a similar issue to this one: Ray remote task + fastapi memory leak · Issue #41260 · ray-project/ray · GitHub. I see the PR is merged.

So the question is: is this fix included in the latest nightly build?
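For reference, a quick way to check which commit an installed Ray build was cut from (assuming the wheel exposes ray.__commit__, as recent builds do), so it can be compared against the PR's merge commit on GitHub:

```python
# Print the version and the git commit the installed Ray wheel was built from.
import ray

print(ray.__version__)  # e.g. "3.0.0.dev0" for a nightly build
print(ray.__commit__)   # git SHA of the build
```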

Thanks,
-BS

Yes, it is already in the nightly. I am not sure if it will be included in the next release (2.9), though. cc @jjyao for confirmation.

Thanks sangcho. I also verified that the latest nightly build solves the OOM issue. It would be great if this fix could be included in a release ASAP, since it's blocking us from releasing an important feature in prod.

Thanks,
-BS