I’m using Ray as a backend with Modin’s out-of-core feature. Unfortunately I still see a memory error saying Ray cannot allocate memory to the object store. I realized I’m using a lot of NumPy arrays that consume memory, but for some reason they aren’t being spilled to disk.
So I think what might be happening is that I have a mix of Ray code (Modin) and non-Ray code (NumPy), and the extra NumPy allocations increase memory pressure, so at some point Ray runs out of memory and can’t allocate new objects in the object store.
Is there a way to configure Ray to always spill to external storage or disk so the program doesn’t run out of memory?
Digging into the documentation, I found I could possibly spill to an S3 bucket, but I’d like to know whether this works and whether I’m on the right track:
import json
import ray

ray.init(
    _system_config={
        "max_io_workers": 4,  # More IO workers for remote storage.
        "min_spilling_size": 100 * 1024 * 1024,  # Spill at least 100MB at a time.
        "object_spilling_config": json.dumps(
            {"type": "smart_open", "params": {"uri": "s3://bucket/path"}},
        ),
    },
)
Hi, disk spilling is more recommended than S3 spilling right now (we need more performance improvements in S3 spilling). You can try disk spilling instead; see Memory Management — Ray v2.0.0.dev0.
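For reference, here is a minimal sketch of the local-disk spilling config: it is the same _system_config shape as your S3 snippet, but with the "filesystem" spilling backend. The /tmp/ray_spill directory is an assumption; point it at any local disk with enough free space.

import json
import ray

ray.init(
    _system_config={
        "max_io_workers": 4,  # Parallelism for spill/restore IO.
        "min_spilling_size": 100 * 1024 * 1024,  # Spill in chunks of at least 100MB.
        "object_spilling_config": json.dumps(
            # "filesystem" spills objects to a local directory instead of S3.
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}},
        ),
    },
)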
The current best practice for using Modin (installed from GitHub master) is to initialize Ray with a large plasma store and point the plasma directory at disk. That ensures the object store is larger than memory, and the operating system then pages objects in as needed. It’s not as efficient as it could be, but at worst I’ve observed 50-60% slower than pure in-memory performance (despite the ~10x overhead of going to disk versus memory). Usually this is worth it.
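A minimal sketch of that setup, assuming /tmp sits on a disk with enough free space; the ~100GB store size is a placeholder to tune for your machine (it should exceed physical RAM for the paging behavior described above):

import ray

ray.init(
    object_store_memory=100 * 10**9,  # ~100GB object store, larger than RAM.
    _plasma_directory="/tmp",  # Back the plasma store with a disk file instead of /dev/shm.
)

import modin.pandas as pd

df = pd.read_csv("large_file.csv")  # Modin now works against the disk-backed store.

The key design point is that Ray itself never spills here: the object store is simply a memory-mapped file on disk, so the OS virtual memory system transparently evicts and pages object pages as NumPy and Modin compete for RAM.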