An unexpected internal error occurred while the IO worker was spilling objects

dirtyValera · June 6, 2023, 7:10am

I’m running a cluster on my laptop with ray start --head and launching my program with:

with ray.init(address='auto', ignore_reinit_error=True):
    ...

When the object store is full, object spilling fails with:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2023-05-07_09-24-33_669967_36323/ray_spilled_objects/37539b8a4799451eb12c9c4e96bdcc16-multi-27'
An unexpected internal error occurred while the IO worker was spilling objects: [Errno 2] No such file or directory: '/tmp/ray/session_2023-05-07_09-24-33_669967_36323/ray_spilled_objects/37539b8a4799451eb12c9c4e96bdcc16-multi-27'

Is there any additional configuration I need to set? How do I make spilling work properly?

vitsai · June 6, 2023, 1:47pm

Hi, can you give us more details about the platform you are running on? Which operating system and version of Ray are you using?

dirtyValera · June 6, 2023, 2:03pm

Mac M2, OS: Ventura 13.3.1, Ray 2.4.0

vitsai · June 9, 2023, 11:27pm

By default there shouldn’t be any configuration you need to set, although if you want, you can set the object_spilling_config during init [Object Spilling — Ray 2.5.0].
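For reference, a minimal sketch of what setting object_spilling_config during init can look like. The helper names spilling_system_config and start_local_ray and the /tmp/spill path are made up for illustration; the _system_config argument and the nested JSON string follow the pattern in the Object Spilling docs. As far as I know, _system_config is only accepted when ray.init starts a new local cluster; when connecting to a cluster started with ray start --head, the same JSON would be passed to ray start via --system-config instead.

```python
import json


def spilling_system_config(directory_path, buffer_size=None):
    """Build the dict passed as _system_config for filesystem spilling.

    The value of "object_spilling_config" must itself be a JSON string,
    which is why json.dumps is applied to the inner dict.
    """
    params = {"directory_path": directory_path}
    if buffer_size is not None:
        params["buffer_size"] = buffer_size
    return {
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": params}
        )
    }


def start_local_ray(directory_path="/tmp/spill"):
    """Start a new local Ray instance that spills to directory_path.

    Hypothetical helper: Ray is imported lazily so the config builder
    above can be used without Ray installed.
    """
    import ray

    return ray.init(_system_config=spilling_system_config(directory_path))
```

Calling start_local_ray() starts a fresh local instance with the custom spill directory; it is not meant for address='auto' connections.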

How long were you running this cluster for? Can you check if perhaps /tmp/ray/session_2023-05-07_09-24-33_669967_36323/ was deleted after creation?
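A quick way to check this, sketched under the assumption that the spill files live in a ray_spilled_objects subdirectory of the session directory (as the paths in the error message suggest). The function name spill_dir_status is made up for illustration.

```python
import os
from pathlib import Path


def spill_dir_status(session_dir):
    """Report whether the session directory and its spill subdirectory exist."""
    session = Path(session_dir)
    spill = session / "ray_spilled_objects"
    return {
        "session_exists": session.is_dir(),
        "spill_exists": spill.is_dir(),
        "spill_writable": spill.is_dir() and os.access(spill, os.W_OK),
    }


# Session path taken from the error message earlier in this thread.
print(spill_dir_status("/tmp/ray/session_2023-05-07_09-24-33_669967_36323"))
```

If session_exists comes back False while the cluster is still running, something outside Ray (for example a temp-file cleaner) may have removed the directory.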

Agon · December 29, 2023, 12:04pm

@vitsai - I am getting a similar, possibly related issue when running a Ray cluster on a Kubernetes cluster managed by KubeRay, e.g.

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2116, in ray._raylet.spill_objects_handler
  File "python/ray/_raylet.pyx", line 2119, in ray._raylet.spill_objects_handler
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/external_storage.py", line 668, in spill_objects
    return _external_storage.spill_objects(object_refs, owner_addresses)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/external_storage.py", line 304, in spill_objects
    with open(url, "wb", buffering=self._buffer_size) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/shared-drive/ray/ray_spilled_objects/e4eb5f8f53e3402f9d3e77543074f3af-multi-45'
An unexpected internal error occurred while the IO worker was spilling objects: [Errno 2] No such file or directory: '/mnt/shared-drive/ray/ray_spilled_objects/e4eb5f8f53e3402f9d3e77543074f3af-multi-45'

My RayCluster CR has this in the config:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: ...
spec:
  rayVersion: "2.7.1"
  enableInTreeAutoscaling: true
  autoscalerOptions: {}

  headGroupSpec:
    rayStartParams:
      system-config: >-
        '{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/shared-drive/ray\",\"buffer_size\":67108864}}"}'

Which I put together using help from the docs.

Every pod (the headGroupSpec, and every workerGroupSpec) has got this shared-drive mounted as a volume, and all have read-write access.

The first time I ran a job on this cluster, which produced objects too big for the ObjectStore and required spilling, all I got were errors like the one I pasted above.

I thought I’d be helpful and manually create this ray_spilled_objects subdirectory (full path: /mnt/shared-drive/ray/ray_spilled_objects), and things started working - I was watching the directory and could see that ray was indeed putting files in here!

Later on during the same job, whilst I was watching the contents of this shared drive, I could see that the directory I’d created got deleted.

After it was deleted, the same errors started popping up again (i.e. [Errno 2] No such file or directory).

I don’t know where to look in the ray source code to figure out why this subdirectory is getting deleted (or why it’s not automatically created whenever ray tries to spill an object there), but hopefully someone could take a look, and confirm if my diagnosis is actually correct?

In the meantime I might just run a script that periodically ensures that this subdirectory exists - but it’s definitely not a long-term solution.

I’m running Ray version 2.7.1 on Ubuntu Linux, both on the driver (my laptop) and on the cluster (on Kubernetes).
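The stopgap of periodically recreating the subdirectory could look roughly like the sketch below. The function names ensure_dir and watch, and the polling interval, are arbitrary choices for illustration. It only narrows the window in which the directory is missing: a spill that lands between the deletion and the next poll would still fail.

```python
import os
import time


def ensure_dir(path):
    """Create path if it is missing; return True if it had to be created."""
    existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)
    return not existed


def watch(path, interval_s=1.0, iterations=None):
    """Poll path and recreate it when it disappears.

    Runs forever when iterations is None; returns how many times the
    directory had to be (re)created otherwise.
    """
    recreated = 0
    done = 0
    while iterations is None or done < iterations:
        if ensure_dir(path):
            recreated += 1
        time.sleep(interval_s)
        done += 1
    return recreated
```

For the setup described above this would be run as watch("/mnt/shared-drive/ray/ray_spilled_objects") in a pod that has the shared drive mounted.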

Agon · December 29, 2023, 12:39pm

Welp, tried looking at the source code for some clues and figured out I could avoid the issue by using the smart_open method for spilling.

So changed from:

{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/shared-drive/ray\",\"buffer_size\":67108864}}"}

To:

'{"object_spilling_config":"{\"type\":\"smart_open\",\"params\":{\"uri\":\"/mnt/shared-drive/ray\"},\"buffer_size\":67108864}"}'

And now the spilled objects take the form: /mnt/shared-drive/ray/ray_spilled_objects-{hex}-multi-{some-int} (so no extra subdirectory is needed).

Thought I’d report back the workaround in case future users have the same issue, but would be good if this was fixed for the original filesystem method.
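Since the doubly escaped JSON is easy to get wrong by hand, here is a small sketch that generates both strings shown above by encoding twice with json.dumps. The function name build_system_config is made up; the keys and values are copied from the two configs in this post. The single quotes around the value in the RayCluster YAML are not part of the JSON and would still be added separately.

```python
import json


def build_system_config(spill_config):
    """Return the system-config JSON with the spilling config nested as a string."""
    # Inner encoding: the spilling config becomes a compact JSON string.
    inner = json.dumps(spill_config, separators=(",", ":"))
    # Outer encoding: embedding that string escapes its quotes as \".
    return json.dumps({"object_spilling_config": inner}, separators=(",", ":"))


filesystem = build_system_config(
    {
        "type": "filesystem",
        "params": {
            "directory_path": "/mnt/shared-drive/ray",
            "buffer_size": 67108864,
        },
    }
)

smart_open = build_system_config(
    {
        "type": "smart_open",
        "params": {"uri": "/mnt/shared-drive/ray"},
        "buffer_size": 67108864,
    }
)

print(filesystem)
print(smart_open)
```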
