An unexpected internal error occurred while the IO worker was spilling objects

I’m running a cluster on my laptop with ray start --head and launching my program with

with ray.init(address='auto', ignore_reinit_error=True):
    ...

When the object store is full, object spilling fails with

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2023-05-07_09-24-33_669967_36323/ray_spilled_objects/37539b8a4799451eb12c9c4e96bdcc16-multi-27'
An unexpected internal error occurred while the IO worker was spilling objects: [Errno 2] No such file or directory: '/tmp/ray/session_2023-05-07_09-24-33_669967_36323/ray_spilled_objects/37539b8a4799451eb12c9c4e96bdcc16-multi-27'

Is there any additional configuration I need to set? How do I make spilling work properly?

Hi, can you give us more details about the platform you are running on? Which operating system and version of Ray are you using?

Mac M2, OS: Ventura 13.3.1, Ray 2.4.0

By default there shouldn’t be any configuration you need to set, although if you want, you can set the object_spilling_config during init [Object Spilling — Ray 2.5.0].
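
If you do want to set it, the init-time version looks roughly like this (a sketch based on the linked docs; the directory path is just an example, and if you start the head with ray start you would pass the same JSON via --system-config instead of calling ray.init with it):

import json
import ray

# Sketch: configure filesystem spilling when ray.init() itself starts the cluster
# (not when connecting with address='auto'). /tmp/ray_spill is an example path.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        )
    }
)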

How long were you running this cluster for? Can you check if perhaps /tmp/ray/session_2023-05-07_09-24-33_669967_36323/ was deleted after creation?

@vitsai - I’m getting a similar, possibly related issue when running a Ray cluster on a Kubernetes cluster managed by KubeRay, e.g.

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2116, in ray._raylet.spill_objects_handler
  File "python/ray/_raylet.pyx", line 2119, in ray._raylet.spill_objects_handler
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/external_storage.py", line 668, in spill_objects
    return _external_storage.spill_objects(object_refs, owner_addresses)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/external_storage.py", line 304, in spill_objects
    with open(url, "wb", buffering=self._buffer_size) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/shared-drive/ray/ray_spilled_objects/e4eb5f8f53e3402f9d3e77543074f3af-multi-45'
An unexpected internal error occurred while the IO worker was spilling objects: [Errno 2] No such file or directory: '/mnt/shared-drive/ray/ray_spilled_objects/e4eb5f8f53e3402f9d3e77543074f3af-multi-45'

My RayCluster CR has this in the config:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: ...
spec:
  rayVersion: "2.7.1"
  enableInTreeAutoscaling: true
  autoscalerOptions: {}

  headGroupSpec:
    rayStartParams:
      system-config: >-
        '{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/shared-drive/ray\",\"buffer_size\":67108864}}"}'

I put this together using help from the docs.
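
(For anyone else fighting the escaping: the string can be generated with json.dumps instead of escaping it by hand - roughly like this, just a sketch - since the object_spilling_config value has to itself be a JSON-encoded string.)

import json

# Rough sketch: build the escaped system-config string shown above.
spilling_config = json.dumps(
    {
        "type": "filesystem",
        "params": {
            "directory_path": "/mnt/shared-drive/ray",
            "buffer_size": 67108864,
        },
    }
)
system_config = json.dumps({"object_spilling_config": spilling_config})
print(system_config)  # use the output as the value of rayStartParams.system-config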

Every pod (the headGroupSpec and every workerGroupSpec) has this shared drive mounted as a volume, and all have read-write access.

The first time I ran a job on this cluster that produced objects too big for the object store and required spilling, all I got were errors like the one I pasted above.

I thought I’d be helpful and manually create the ray_spilled_objects subdirectory (full path: /mnt/shared-drive/ray/ray_spilled_objects), and things started working - I was watching the directory and could see that Ray was indeed putting files in there! :white_check_mark:

Later on during the same job, whilst I was watching the contents of this shared drive, I could see that the directory I’d created got deleted :question:

After it was deleted, the same errors started popping up again (i.e. [Errno 2] No such file or directory). :x:

I don’t know where to look in the Ray source code to figure out why this subdirectory is getting deleted (or why it isn’t automatically created whenever Ray tries to spill an object there), but hopefully someone can take a look and confirm whether my diagnosis is correct.

In the meantime I might just run a script that periodically ensures that this subdirectory exists - but it’s definitely not a long-term solution :slight_smile:
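
Something like this, purely as a stopgap sketch (the path matches the directory_path in my spilling config):

import os
import time

# Stopgap: keep re-creating the spill subdirectory in case Ray removes it again.
SPILL_DIR = "/mnt/shared-drive/ray/ray_spilled_objects"

while True:
    os.makedirs(SPILL_DIR, exist_ok=True)  # no-op if the directory already exists
    time.sleep(5)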


I’m running Ray version 2.7.1 on Ubuntu Linux: both on the driver (my laptop) and on the cluster (on Kubernetes).

Welp, I tried looking at the source code for some clues and figured out I could avoid the issue by using the smart_open method for spilling.

So I changed from:

{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/shared-drive/ray\",\"buffer_size\":67108864}}"}

To:

'{"object_spilling_config":"{\"type\":\"smart_open\",\"params\":{\"uri\":\"/mnt/shared-drive/ray\"},\"buffer_size\":67108864}"}'

And now the spilled objects take the form /mnt/shared-drive/ray/ray_spilled_objects-{hex}-multi-{some-int} (so no extra subdirectory is needed).

Thought I’d report back the workaround in case future users have the same issue, but it would be good if this were fixed for the original filesystem method.