Hi guys,
We have a Ray cluster running on Kubernetes. Recently, when we launched 300 tasks and waited for their results, many of them failed quickly with the errors below:
[2023-12-10 00:14:44,985 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,117 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,257 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,389 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,550 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,651 I 16660 16671] task_manager.cc:891: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=bct.distributed.ray.pricing.ray_pricing, class_name=, function_name=price_requests_remote, function_hash=814a2dc5d924496ab07a27940914fa0f}, task_id=a438df6b7d970c5effffffffffffffffffffffff01000000, task_name=price_requests_remote, job_id=01000000, num_args=6, num_returns=1, depth=228122, attempt_number=0, max_retries=3, runtime_env_hash=-17597244, eager_install=1, setup_timeout_seconds=600
[2023-12-10 00:14:45,651 I 16660 16671] raylet_client.cc:381: Error returning worker: Invalid: Returned worker does not exist any more
[2023-12-10 00:14:45,688 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,852 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:45,991 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
[2023-12-10 00:14:46,122 I 16660 16680] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor 7b77b5182fcb9b9c0b96d3c001000000
I found a similar post, but it doesn’t seem to describe the issue we are hitting:
https://discuss.ray.io/t/system-will-be-halted-when-tasks-number-is-large/9754/19
The error is reproducible in our environment. Are there any recommendations on how to investigate what is happening here?
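As a possible starting point, here is a minimal sketch that filters a log like the one above down to its `Task failed` entries and tallies them by error type, task name, and attempt number. The regexes only assume the line layout visible in the excerpt; `summarize_failures` and the other names are hypothetical helpers, not part of Ray.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
# [timestamp level pid tid] source_file:line: message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>\w) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"(?P<source>[\w.]+):(?P<lineno>\d+): (?P<msg>.*)$"
)
# The error type is the first token after "Task failed:".
FAILED_RE = re.compile(r"^Task failed: (?P<error>\w+):")
# Pull a few key=value fields out of the task description.
FIELD_RE = re.compile(r"\b(task_name|attempt_number)=([^,}]+)")


def summarize_failures(lines):
    """Count 'Task failed' entries by (error type, task name, attempt)."""
    counts = Counter()
    for raw in lines:
        parsed = LINE_RE.match(raw.strip())
        if not parsed:
            continue  # not a log line in the assumed layout
        failed = FAILED_RE.match(parsed.group("msg"))
        if not failed:
            continue  # some other message, e.g. "Set max pending calls"
        fields = dict(FIELD_RE.findall(parsed.group("msg")))
        key = (
            failed.group("error"),
            fields.get("task_name"),
            fields.get("attempt_number"),
        )
        counts[key] += 1
    return counts
```

Passing an open log file to `summarize_failures` (it accepts any iterable of lines) should show whether the failures are all `GrpcUnavailable` on the first attempt or spread across retries and error types.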
Thanks,
-BS