[Ray K8s cluster] - Entire script exits when one worker hits a MemoryError

Hello everyone,

I’ve been wrestling with a tricky issue while processing large volumes of satellite data. I’m using Ray to launch a pool of worker processes on a Kubernetes cluster. Because the memory footprint varies from job to job, a task occasionally exceeds its memory allocation and dies with a MemoryError.

The problem is that this doesn’t just kill the task on the worker that ran out of memory; it brings down the entire Python run, so every other job is lost as well. That makes the failure much harder to track down and recover from.
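For context, the dispatch pattern is essentially this (heavily simplified; the function and variable names are illustrative, not my real code):

import ray

ray.init()  # in my setup this attaches to the existing KubeRay cluster

@ray.remote
def ingest_scene(scene_path):
    # load and process one satellite scene; memory usage varies a lot per scene
    ...

refs = [ingest_scene.remote(p) for p in scene_paths]

# One ray.get over the whole batch: if any worker dies (e.g. OOM-killed),
# this raises WorkerCrashedError and the unhandled exception exits the script,
# taking every other in-flight job with it.
results = ray.get(refs)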

This is the only clue I have:

Traceback (most recent call last):
  File "<path-to-the-file>.py", line 2130, in <module>
    _ = ray.get(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2542, in get
    raise value
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.

[1]+  Exit 1                  python ./data_ingestion_v0-1.py

I’m looking for a way to handle these MemoryErrors more gracefully, so that a single out-of-memory task doesn’t cost me the whole run. Ideally there is an environment variable on the Kubernetes side, or a parameter within Ray itself, that can be tuned to contain the failure to the offending task.
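The directions I’ve been considering so far are sketched below. This is based on my reading of the Ray docs rather than anything I’ve verified, so the memory figure, retry count, and exception handling are assumptions on my part. I’ve also seen the memory-monitor environment variables mentioned (e.g. RAY_memory_usage_threshold), but I’m not sure whether that’s the right lever here.

import ray
from ray.exceptions import RayTaskError, WorkerCrashedError

# Idea 1: give the scheduler a per-task memory hint and allow retries.
# memory= is a scheduling hint in bytes; max_retries lets Ray re-run a task
# whose worker process died. The 4 GiB value and retry count are placeholders.
@ray.remote(memory=4 * 1024**3, max_retries=3)
def ingest_scene(scene_path):
    ...

# Idea 2: collect results one task at a time so a single crash doesn't
# abort the rest of the batch.
refs = [ingest_scene.remote(p) for p in scene_paths]
results, failed = [], []
while refs:
    done, refs = ray.wait(refs, num_returns=1)
    try:
        results.append(ray.get(done[0]))
    except (RayTaskError, WorkerCrashedError) as exc:
        failed.append((done[0], exc))  # log it and keep going instead of exiting

Would either of these be the recommended approach, or is there a cluster-level setting I should be using instead?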

Has anyone run into something similar, or can anyone offer insight into how to deal with it? Any help or advice would be greatly appreciated; this problem has been a persistent hurdle for quite some time.

Thank you in advance for your assistance!