[Ray K8s cluster] - Entire script exits when one worker hits a MemoryError

Hello everyone,

I’ve been wrestling with a tricky issue while processing large volumes of satellite data. I’m using Ray to launch a pool of worker processes on a Kubernetes cluster. Because the memory footprint varies from job to job, a task occasionally exceeds its memory allocation and dies with a MemoryError.

The problem is that this doesn’t just kill the task on the worker that ran out of memory; it brings down the entire Python run, so every other job is lost as well. That makes the failure much harder to track down and recover from.
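For context, the dispatch pattern is essentially this (heavily simplified; the function and variable names are illustrative, not my real code):

import ray

ray.init()  # in my setup this attaches to the existing KubeRay cluster

@ray.remote
def ingest_scene(scene_path):
    # load and process one satellite scene; memory usage varies a lot per scene
    ...

refs = [ingest_scene.remote(p) for p in scene_paths]

# One ray.get over the whole batch: if any worker dies (e.g. OOM-killed),
# this raises WorkerCrashedError and the unhandled exception exits the script,
# taking every other in-flight job with it.
results = ray.get(refs)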

This is the only clue I have:

Traceback (most recent call last):
  File "<path-to-the-file>.py", line 2130, in <module>
    _ = ray.get(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.

[1]+  Exit 1                  python ./data_ingestion_v0-1.py

I’m looking for a way to handle these MemoryErrors more gracefully, so that a single out-of-memory task doesn’t cost me the whole run. Ideally there is an environment variable on the Kubernetes side, or a parameter within Ray itself, that can be tuned to contain the failure to the offending task.
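The directions I’ve been considering so far are sketched below. This is based on my reading of the Ray docs rather than anything I’ve verified, so the memory figure, retry count, and exception handling are assumptions on my part. I’ve also seen the memory-monitor environment variables mentioned (e.g. RAY_memory_usage_threshold), but I’m not sure whether that’s the right lever here.

import ray
from ray.exceptions import RayTaskError, WorkerCrashedError

# Idea 1: give the scheduler a per-task memory hint and allow retries.
# memory= is a scheduling hint in bytes; max_retries lets Ray re-run a task
# whose worker process died. The 4 GiB value and retry count are placeholders.
@ray.remote(memory=4 * 1024**3, max_retries=3)
def ingest_scene(scene_path):
    ...

# Idea 2: collect results one task at a time so a single crash doesn't
# abort the rest of the batch.
refs = [ingest_scene.remote(p) for p in scene_paths]
results, failed = [], []
while refs:
    done, refs = ray.wait(refs, num_returns=1)
    try:
        results.append(ray.get(done[0]))
    except (RayTaskError, WorkerCrashedError) as exc:
        failed.append((done[0], exc))  # log it and keep going instead of exiting

Would either of these be the recommended approach, or is there a cluster-level setting I should be using instead?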

Has anyone run into something similar, or can anyone offer insight into how to deal with it? Any help or advice would be greatly appreciated; this problem has been a persistent hurdle for quite some time.

Thank you in advance for your assistance!