Hi there,
I recently started using Ray at work to run large parallel computations in Python. I have a Kubernetes cluster (running on EKS), and I installed the Ray Helm chart with the Ray Docker image for Python 3.7 and Ray 1.11.0.
I’ve been intermittently running into the following error when calling ray.get on a generic task that reads a set of JSON files (e.g. 100) from S3 and combines them into a single JSON file:
Traceback (most recent call last):
  ....
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow_on_k8s/tasks/lib/data_transformers.py", line 569, in ray_combine_files
    small_file_mappings = ray.get(small_file_futures)
  File "/home/airflow/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.7/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 359, in get
    res = self._get(to_get, op_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/ray/util/client/worker.py", line 379, in _get
    raise decode_exception(e)
ConnectionError: GRPC connection failed: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.NOT_FOUND
    details = "Failed to serialize response!"
    debug_error_string = "{"created":"@1650490482.895831759","description":"Error received from peer ipv4:10.100.243.149:10001","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Failed to serialize response!","grpc_status":5}"
What’s really bothering me is that this error is non-deterministic. I run my data pipelines in Airflow, which automatically retries a job on failure, and I’ve had the same job fail initially and then succeed on the retry. Nothing in my task or code is randomized, which leads me to believe that the error has something to do with the Ray head’s communication with the workers.
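As a stopgap, the retry that Airflow currently does at the job level could probably be done around the ray.get call itself. Below is a minimal sketch of that idea, not a fix for the root cause. The helper name `retry_on_connection_error` and its parameters are made up for illustration and are not part of Ray. It also assumes the Ray client connection is still usable after the failure, which I haven’t verified.

```python
import time


def retry_on_connection_error(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on ConnectionError, wait and retry with exponential backoff.

    Hypothetical helper, not a Ray API. The last failure is re-raised so the
    caller (e.g. Airflow) still sees the error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

In my case the call would look something like `retry_on_connection_error(lambda: ray.get(small_file_futures))`, since the traceback shows the failure surfacing as a built-in ConnectionError.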
Has anyone run into this issue before, or does anyone have insight into what might be causing it? This post seems to describe the same issue, but it was never resolved: Failure to serialize response.
I would be more than happy to answer any questions about my setup that might help get to the root cause, so please don’t hesitate to ask.
(FWIW, I’m guessing this is something like a timeout, a node or pod dying unexpectedly, or the head’s connection to the service being dropped, but I’m not sure how to even begin debugging it.)
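One thing I could try as a first debugging step is fetching the futures one at a time instead of in a single ray.get call, to see whether the error is tied to specific results or hits at random. A rough sketch, assuming nothing beyond a `get_fn` callable that takes a single future; `get_individually` is a hypothetical helper, not a Ray API:

```python
def get_individually(futures, get_fn):
    """Fetch each future separately, recording which ones fail.

    Returns (results, failures), both dicts keyed by the future's position
    in the input list. Hypothetical helper for debugging only; it is slower
    than a single batched get.
    """
    results = {}
    failures = {}
    for index, future in enumerate(futures):
        try:
            results[index] = get_fn(future)
        except Exception as exc:  # keep going so every failure is recorded
            failures[index] = exc
    return results, failures
```

With my code this would be `results, failures = get_individually(small_file_futures, ray.get)`. If the same indices fail on every run, the problem is likely tied to those results; if the failing indices move around, that would point more toward the connection.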