ray::IDLE_SpillWorker memory consumption and OOM

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello!

I was trying to find some information regarding ray::IDLE_SpillWorker. It seems to be a Ray Actor that spills to disk, but it’s unclear whether the spilled data comes from the Object Store memory or from a Ray Dataset.

I’m running a Ray Job in a cluster that handles large datasets across multiple nodes, and most of the memory is being used by these Actors. Eventually, the node runs out of memory and the job fails. My guess is that moving data in smaller batches with ray.put, or increasing the number of blocks of a dataset, would reduce how much is spilled to disk, but since there is not much information about it, I was hoping someone could help.

Thanks!

Both sound like reasonable approaches; you basically want to limit the size of the data that Ray processes at any one time and let Ray’s object store handle the memory re-use for you (to limit the amount of spilling).

This should be fairly straightforward to test and benchmark.
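A minimal sketch of the first approach (smaller batches with ray.put), in case it helps as a starting point for a benchmark. The names `iter_chunks`, `process_in_chunks`, and `remote_fn` are made up for illustration, and `remote_fn` is assumed to be a function you have already decorated with `@ray.remote`. The idea is to keep only one chunk referenced at a time, so the object store can reclaim it before the next one is put:

```python
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(items, chunk_size, remote_fn):
    """Put one chunk at a time and release it before moving on.

    remote_fn is a hypothetical @ray.remote function that takes one chunk.
    """
    import ray  # imported lazily so the chunking helper works without Ray

    results = []
    for chunk in iter_chunks(items, chunk_size):
        ref = ray.put(chunk)
        # Block on the result so only one chunk is in flight at a time.
        results.append(ray.get(remote_fn.remote(ref)))
        # Drop our reference so the object store can reclaim the chunk.
        del ref
    return results
```

This runs the chunks one after another, which is the simplest way to bound memory. In practice you would probably allow a few chunks in flight and measure where spilling starts.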

RE: your question about whether it’s from the Object Store or a Ray Dataset: Ray Data is implemented on top of (and expects) a Ray cluster, so it uses the Object Store under the covers.

Thanks for the response! I still don’t understand what an adequate number of Ray Dataset batches would be to optimize resources without hindering performance through excessive overhead and spilling. Are there any guidelines for this?

Start here: Advanced: Performance Tips and Tuning (Ray 2.35.0 docs). There are some sections specific to tuning memory consumption and spilling (to avoid OOMs; the two subjects are correlated ofc).

Thanks for the link! Will read it thoroughly. Increasing the number of blocks of the Ray Dataset did the trick (although for read operations, it can trigger a RecursionError if the number of blocks is too large).
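For anyone landing here later, a rough sketch of how the block count could be picked and passed to a read. The helper names and the 128 MiB target are assumptions for illustration, not an official guideline; `override_num_blocks` is the read argument in recent Ray releases such as the 2.35 docs linked above (older releases used `parallelism`). Capping the count with `max_blocks` is one way to stay clear of the RecursionError mentioned above:

```python
import math

# Assumed target block size for illustration; tune it for your workload.
ASSUMED_TARGET_BLOCK_BYTES = 128 * 1024 * 1024


def estimate_num_blocks(total_bytes, target_block_bytes=ASSUMED_TARGET_BLOCK_BYTES,
                        max_blocks=None):
    """Return a block count so that each block is roughly target_block_bytes."""
    n = max(1, math.ceil(total_bytes / target_block_bytes))
    if max_blocks is not None:
        n = min(n, max_blocks)
    return n


def read_with_more_blocks(path, total_bytes, max_blocks=None):
    """Read a Parquet dataset with an explicit block count (hypothetical helper)."""
    import ray  # imported lazily so the estimate works without Ray

    n = estimate_num_blocks(total_bytes, max_blocks=max_blocks)
    return ray.data.read_parquet(path, override_num_blocks=n)
```

For a dataset that is already loaded, `ds.repartition(n)` changes the block count after the fact, at the cost of an extra pass over the data.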

I’m following up on this issue, since I’m seeing some weird behavior. I’m currently not running any jobs or actors in the cluster, but ray::IDLE and ray::IDLE_SpillWorker are using most of the memory.

Any idea how these can be freed up or why this is happening?
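One thing that may help narrow this down is splitting each worker's resident memory into private and shared parts, since tools like htop count shared memory mappings (which is where the Object Store lives) in every process's RSS. Below is a minimal, Linux-only sketch that reads /proc directly. The function names are made up for illustration, and it assumes the worker processes show a title starting with `ray::` in their command line, as they do in the process list quoted above:

```python
import os


def parse_rss_kib(status_text):
    """Extract VmRSS, RssAnon and RssShmem (in KiB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "RssAnon", "RssShmem"):
            fields[key] = int(rest.split()[0])
    return fields


def ray_worker_memory(proc_root="/proc"):
    """Return (pid, title, rss_fields) for processes whose title starts with ray::."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                title = f.read().replace(b"\x00", b" ").decode(errors="replace").strip()
            if not title.startswith("ray::"):
                continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                rows.append((int(entry), title, parse_rss_kib(f.read())))
        except OSError:
            # The process exited or is not readable; skip it.
            continue
    return rows
```

Calling `ray_worker_memory()` on the affected node and comparing `RssAnon` with `RssShmem` for each `ray::IDLE` entry shows whether the memory is private to the worker or shared. The `ray memory` CLI command can then list which object references are still being held.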