Local disk is full

Hello, I am using Ray to read a video stream and to share frames and post-processing data between processes. A rough sketch of the setup follows; a few days after starting the run, I get the error shown after the sketch:
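For context, the data flow is roughly the following. This is only a minimal sketch assembled from the traceback below: the queue tuple, class, and file names come from my code, while the detector, tracker, and video-decoding logic are placeholders.

```python
# Minimal sketch of the pipeline (details simplified; placeholder logic).
# A detector task keeps putting full frames plus detections into a
# ray.util.queue.Queue; a TrackerActivator actor consumes them.
# Every put() creates a new object in the Ray object store.
import numpy as np
import ray
from ray.util.queue import Queue

ray.init()

detections_queue = Queue()  # effectively unbounded in my setup


@ray.remote
def detector_process(queue: Queue) -> None:
    frame_number = 0
    while True:
        frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # placeholder decoded frame
        human_detections = []                               # placeholder YOLO-pose output
        queue.put((frame_number, frame, True, human_detections))
        frame_number += 1


@ray.remote
class TrackerActivator:
    def main_func(self, queue: Queue) -> None:
        while True:
            frame_number, frame, is_detection_frame, human_detections = queue.get()
            # ... tracking and post-processing happen here ...
```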



Exception in thread Thread-29 (_process_response):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/__init__.py", line 2056, in _process_response

(raylet) [2023-08-15 16:42:04,877 E 773 856] (raylet) dlmalloc.cc:213: mmap failed with error: Cannot allocate memory

(raylet) [2023-08-15 16:42:04,877 E 773 856] (raylet) object_lifecycle_manager.cc:214: Plasma fallback allocator failed, likely out of disk space.

(raylet) [2023-08-15 16:42:04,930 E 773 856] (raylet) dlmalloc.cc:213: mmap failed with error: Cannot allocate memory

(raylet) [2023-08-15 16:42:04,930 E 773 856] (raylet) object_lifecycle_manager.cc:214: Plasma fallback allocator failed, likely out of disk space.

    self._callback(result=result, error=error)
  File "/app/InsightFace_Pytorch/GPUVideoReader.py", line 226, in completion_callback_yolo_pose
    self.detections_queue.put((frame_number, frame, True, human_detections))
  File "/usr/local/lib/python3.10/dist-packages/ray/util/queue.py", line 105, in put
    ray.get(self.actor.put.remote(item, timeout))
  File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 138, in remote
    return self._remote(args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 425, in _start_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 184, in _remote
    return invocation(args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 171, in invocation
    return actor._actor_method_call(
  File "/usr/local/lib/python3.10/dist-packages/ray/actor.py", line 1169, in _actor_method_call
    object_refs = worker.core_worker.submit_actor_task(
  File "python/ray/_raylet.pyx", line 2164, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 2169, in ray._raylet.CoreWorker.submit_actor_task
  File "python/ray/_raylet.pyx", line 425, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 416, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 509, in ray._raylet.prepare_args_internal
  File "python/ray/_raylet.pyx", line 1780, in ray._raylet.CoreWorker.put_serialized_object_and_increment_local_ref
  File "python/ray/_raylet.pyx", line 1669, in ray._raylet.CoreWorker._create_put_buffer
  File "python/ray/_raylet.pyx", line 197, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full

The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

(raylet) [2023-08-15 16:43:35,751 E 773 856] (raylet) dlmalloc.cc:213: mmap failed with error: Cannot allocate memory

(raylet) [2023-08-15 16:43:35,751 E 773 856] (raylet) object_lifecycle_manager.cc:214: Plasma fallback allocator failed, likely out of disk space.

Local disk is full

The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

Traceback (most recent call last):
 FINALIZE START: 2023-08-15 16:51:01.738
  File "/app/InsightFace_Pytorch/inference.py", line 318, in <module>
 Traceback (most recent call last):
   File "/app/InsightFace_Pytorch/detect.py", line 89, in __call__
     raise Exception
 Exception
    inference()
  File "/app/InsightFace_Pytorch/inference.py", line 271, in __call__
    self.ending()
  File "/app/InsightFace_Pytorch/inference.py", line 260, in ending
    ray.get([self.dp])
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2380, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::detector_process() (pid=1939, ip=10.10.59.36)
Exception


During handling of the above exception, another exception occurred:

ray::detector_process() (pid=1939, ip=10.10.59.36)
  File "/app/InsightFace_Pytorch/inference.py", line 285, in detector_process
    detector()
  File "/app/InsightFace_Pytorch/detect.py", line 92, in __call__
    raise Exception
Exception

(raylet) *** SIGABRT received at time=1692119156 on cpu 2 ***

(raylet) [low_level_alloc.cc : 570] RAW: mmap error: 12

(raylet) [failure_signal_handler.cc : 329] RAW: Signal 6 raised at PC=0x7f80d319da7c while already in AbslFailureSignalHandler()

(the SIGABRT / mmap error block above repeats many more times with the same timestamp)

2023-08-15 17:06:00,232	WARNING worker.py:1866 -- Raylet is terminated: ip=10.10.59.36, id=c3cd030bb6e5f49cf7d5053ea427ef7ad8a84916a047a93de8117091. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:

    - num bytes created total: 79050230353733
    0 pending objects of total size 0MB
    - objects spillable: 50019
    - bytes spillable: 313830615300
    - objects unsealed: 0
    - bytes unsealed: 0
    - objects in use: 50491
    - bytes in use: 316884087695
    - objects evictable: 0
    - bytes evictable: 0
    - objects created by worker: 50019
    - bytes created by worker: 313830615300
    - objects restored: 472
    - bytes restored: 3053472395
    - objects received: 0
    - bytes received: 0
    - objects errored: 0
    - bytes errored: 0

2023-08-15 17:06:11,090	WARNING worker.py:1866 -- The node with node id: c3cd030bb6e5f49cf7d5053ea427ef7ad8a84916a047a93de8117091 and address: 10.10.59.36 and node name: 10.10.59.36 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, preempted node, etc.) 

(2) raylet has lagging heartbeats due to slow network or busy workload.

(TrackerActivator pid=1940) Traceback (most recent call last):

(TrackerActivator pid=1940)   File "/app/InsightFace_Pytorch/tr.py", line 209, in main_func

(TrackerActivator pid=1940)     frame_number,frame,is_detection_frame,human_detections = self.detections_queue.get()

(TrackerActivator pid=1940)   File "/usr/local/lib/python3.10/dist-packages/ray/util/queue.py", line 160, in get

(TrackerActivator pid=1940)     return ray.get(self.actor.get.remote(timeout))

(TrackerActivator pid=1940)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper

(TrackerActivator pid=1940)     return func(*args, **kwargs)

(TrackerActivator pid=1940)   File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2380, in get

(TrackerActivator pid=1940)     raise value.as_instanceof_cause()

(TrackerActivator pid=1940) ray.exceptions.RayTaskError: ray::_QueueActor.get() (pid=1937, ip=10.10.59.36, repr=<ray.util.queue._QueueActor object at 0x7facdc8f1e10>)

(TrackerActivator pid=1940) ray.exceptions.OutOfDiskError: Local disk is full

(TrackerActivator pid=1940) The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

(TrackerActivator pid=1940) /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown

(TrackerActivator pid=1940)   warnings.warn('resource_tracker: There appear to be %d '

Heartbeat session expired, marking coordinator dead

Marking the coordinator dead (node coordinator-1001) for group consumer-group-a: Heartbeat session expired.

I tried to fix this with the suggestion from "python - Ray object store running out of memory using out of core. How can I configure an external object store like s3 bucket? - Stack Overflow".
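To clarify what I tried from that answer: roughly a spilling configuration along the lines below. This is only a sketch; the spill directory and memory cap are placeholders, not my exact values.

```python
# Sketch of the object-spilling configuration I tried (placeholder values).
import json
import ray

ray.init(
    object_store_memory=4 * 1024**3,  # cap the in-memory object store (placeholder size)
    _system_config={
        "object_spilling_config": json.dumps(
            {
                "type": "filesystem",
                "params": {"directory_path": "/mnt/bigdisk/ray_spill"},  # placeholder path
            }
        )
    },
)
```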

That did not work for me. I am now trying the latest Ray version, but I expect to hit the same error. What can I do about this? Do you have any suggestions?

Extra: I am running this inside Docker, but the container runs directly on the host and has all permissions.

Thank you for your interest.