@vitsai - I am getting a similar, possibly related issue when running a Ray cluster on a Kubernetes cluster managed by KubeRay, e.g.:
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2116, in ray._raylet.spill_objects_handler
  File "python/ray/_raylet.pyx", line 2119, in ray._raylet.spill_objects_handler
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/external_storage.py", line 668, in spill_objects
    return _external_storage.spill_objects(object_refs, owner_addresses)
  File "/opt/conda/lib/python3.10/site-packages/ray/_private/external_storage.py", line 304, in spill_objects
    with open(url, "wb", buffering=self._buffer_size) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/shared-drive/ray/ray_spilled_objects/e4eb5f8f53e3402f9d3e77543074f3af-multi-45'
An unexpected internal error occurred while the IO worker was spilling objects: [Errno 2] No such file or directory: '/mnt/shared-drive/ray/ray_spilled_objects/e4eb5f8f53e3402f9d3e77543074f3af-multi-45'
My RayCluster CR has this in the config:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: ...
spec:
  rayVersion: "2.7.1"
  enableInTreeAutoscaling: true
  autoscalerOptions: {}
  headGroupSpec:
    rayStartParams:
      system-config: >-
        '{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/shared-drive/ray\",\"buffer_size\":67108864}}"}'
I put this together with help from the docs.
Every pod (the headGroupSpec and every workerGroupSpec) has this shared drive mounted as a volume, and all have read-write access.
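For what it’s worth, the nested escaping in system-config is easy to get wrong by hand, so it can be generated instead. Below is a minimal Python sketch using only the standard json module. The helper name build_system_config is made up for illustration and is not part of Ray; the point is just that object_spilling_config is a JSON string embedded inside another JSON object, so it gets encoded twice.

```python
import json


def build_system_config(directory_path: str, buffer_size: int) -> str:
    """Build the JSON string passed as system-config (hypothetical helper, not a Ray API)."""
    spilling = {
        "type": "filesystem",
        "params": {"directory_path": directory_path, "buffer_size": buffer_size},
    }
    # The spilling config is itself a JSON *string* inside the outer object,
    # so it is serialized first and then serialized again as a string value.
    inner = json.dumps(spilling, separators=(",", ":"))
    return json.dumps({"object_spilling_config": inner}, separators=(",", ":"))


# 64 MiB buffer, same values as in the RayCluster CR above.
system_config = build_system_config("/mnt/shared-drive/ray", 64 * 1024 * 1024)
print(system_config)
```

The printed string is what goes between the single quotes in rayStartParams.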
The first time I ran a job on this cluster that produced objects too big for the object store and required spilling, all I got were errors like the one I pasted above.
I thought I’d be helpful and manually create the ray_spilled_objects subdirectory (full path: /mnt/shared-drive/ray/ray_spilled_objects), and things started working - I was watching the directory and could see that Ray was indeed putting files in there!
Later on during the same job, whilst I was watching the contents of this shared drive, I could see that the directory I’d created got deleted.
After it was deleted, the same errors started popping up again (i.e. [Errno 2] No such file or directory).
I don’t know where to look in the Ray source code to figure out why this subdirectory is getting deleted (or why it’s not automatically created whenever Ray tries to spill an object there), but hopefully someone could take a look and confirm whether my diagnosis is actually correct?
In the meantime I might just run a script that periodically ensures that this subdirectory exists, but it’s definitely not a long-term solution.
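A minimal sketch of that stopgap, in Python with only the standard library. The function names ensure_dir and watch and the 5-second interval are my own choices for illustration; nothing here is a Ray API, and it does not address why the directory gets deleted.

```python
import os
import time
from typing import Optional


def ensure_dir(path: str) -> bool:
    """Create `path` if it is missing; return True if it had to be created."""
    existed = os.path.isdir(path)
    # exist_ok=True avoids an error if another pod creates it at the same time.
    os.makedirs(path, exist_ok=True)
    return not existed


def watch(path: str, interval_s: float = 5.0, iterations: Optional[int] = None) -> int:
    """Check `path` every `interval_s` seconds and re-create it when missing.

    Runs forever when `iterations` is None; otherwise performs that many checks.
    Returns the number of times the directory had to be (re-)created.
    """
    recreated = 0
    checks = 0
    while iterations is None or checks < iterations:
        if ensure_dir(path):
            recreated += 1
            print(f"re-created missing directory: {path}")
        checks += 1
        time.sleep(interval_s)
    return recreated
```

Running watch("/mnt/shared-drive/ray/ray_spilled_objects") in a background process on one pod would cover the shared drive. There is still a window between the deletion and the next check in which a spill can fail with the same Errno 2, so this only reduces the errors rather than removing them.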
I’m running Ray version 2.7.1 on Ubuntu Linux, both on the driver (my laptop) and on the cluster (on Kubernetes).