Best practice for processing large amounts of data

Is there any guidance/best practice regarding how much data should be put on Ray's internal object store?

E.g. if I have TBs or PBs of data, is the guidance to stand up as much compute/memory as possible so I can fit all the data into memory, or is there a tipping point where it's advantageous to just leave the data on an external (regular) object storage, and/or even use that as an intermediate store?

When it comes to spilling to disk: that probably wouldn't make sense if object storage was already being used as that "disk" for the original data in the first place?

I have the same problem in my application. I try to manage disk spilling in an intelligent way so that data on the hot path stays in memory. If you have TB/PB of data, it's unlikely you need to compute on all of it at the same time. If you go through your data in some predictable order, you can manage the disk flushing/retrieval intelligently. Or better yet, don't fetch your data from cloud storage until you're going to need it within the next 20 seconds or so.
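Roughly, the pattern I mean looks like this (a minimal sketch; `download_chunk` and `process_chunk` are placeholders for your own storage client and compute, not real APIs): while the current chunk is being processed, the next one is already downloading in the background.

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_order(chunk_keys, download_chunk, process_chunk):
    """Process chunks in a known order, prefetching the next one in the background."""
    if not chunk_keys:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the first download.
        next_future = pool.submit(download_chunk, chunk_keys[0])
        for i in range(len(chunk_keys)):
            chunk = next_future.result()  # wait for the prefetched chunk
            if i + 1 < len(chunk_keys):
                # Start fetching the next chunk before we actually need it.
                next_future = pool.submit(download_chunk, chunk_keys[i + 1])
            process_chunk(chunk)  # compute on the current chunk
```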

Will defer to official Ray folks for their opinion. But happy to chat if you want to PM.

@Chen_Shen What do you think? Thanks!

Hi @mbehrendt:

To get the best performance out of Ray, my suggestion is to try to ensure your working set fits into memory and to avoid spilling whenever possible.

Though Ray has a couple of optimizations to prevent spilling/restoring from blocking the computation, there is always the possibility that Ray spills the wrong object, which then has to be restored for later execution. At a high level, Ray can't make optimal scheduling decisions because of the following:

  1. Ray’s tasks are dynamically created, so it doesn’t know the full computation graph;
  2. Each task’s result size is not known before execution.

If the application knows the above-mentioned information, it's best to let the application itself keep the working set within memory. An alternative is to use existing applications/libraries that already optimize memory usage, such as Ray Dataset.
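For example, a rough sketch of what that could look like with Ray Dataset, assuming the data lives as Parquet files on external object storage (the bucket path and the "value" column below are made up). The dataset is read and transformed block by block, so the whole thing never has to fit in memory at once.

```python
import ray

ray.init()

# Lazy, block-based read of the external data.
ds = ray.data.read_parquet("s3://my-bucket/my-table/")

def double_value(batch):
    # Placeholder per-batch transformation.
    batch["value"] = batch["value"] * 2
    return batch

ds = ds.map_batches(double_value)

# Consume the result batch by batch instead of materializing everything.
for batch in ds.iter_batches(batch_size=4096):
    ...  # e.g. write out or aggregate each batch
```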

On the flip side, when spilling is inevitable, a common practice is to allocate sufficient memory so that objects might be spilled but never restored. (The number of bytes spilled or restored can be queried with ray memory --stats-only.) Our experience is that restoring objects dramatically slows down overall performance compared to spilling only.
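For instance, one way to provision the object store explicitly when starting Ray on a node (the 64 GB figure is only illustrative, not a recommendation):

```python
import ray

# Size the local node's object store up front (value is in bytes); afterwards,
# running `ray memory --stats-only` on the CLI shows how many bytes were
# spilled vs. restored.
ray.init(object_store_memory=64 * 1024**3)
```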

Just my 2 cents. @Stephanie_Wang @Clark_Zinzow might also provide some insight for this question.

Yes, it’s a hard question, and I agree with @Chen_Shen on his reply.

To add on, Ray's memory hierarchy for objects is like a caching system, and similar to other caching systems, there is always going to be a tradeoff between cost and capacity for each cache tier. So wherever possible, you want to try to structure your application to keep objects in the higher tiers, but that may not always be possible. The tiers are like this, along with the working set size you need to think about for each:

  1. A Python worker’s heap memory → The working set size for a single object or task.
  2. A Ray node’s object store → The working set size for a group of parallel tasks.
  3. A Ray cluster’s disk capacity → The working set size for the whole application.
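A toy sketch of thinking in those tiers (the chunk counts and sizes are made up): each task only needs its own chunk in heap memory (tier 1), a node's object store only needs the chunks of the tasks currently running on it (tier 2), and the rest can sit spilled on disk (tier 3) until needed.

```python
import numpy as np
import ray

ray.init()

@ray.remote
def chunk_sum(chunk):
    # Operates on a single chunk, so the per-task heap footprint (tier 1) stays small.
    return float(np.sum(chunk))

# Store many small chunks instead of one huge object; only the chunks needed by
# currently running tasks must be resident in a node's object store (tier 2),
# the remainder can spill to disk (tier 3) without ever being restored early.
chunk_refs = [ray.put(np.random.rand(1_000_000)) for _ in range(100)]
total = sum(ray.get([chunk_sum.remote(ref) for ref in chunk_refs]))
```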

As @Chen_Shen said, Ray does not know the entirety of the computation graph. Because of this, it's essential to avoid overwhelming the runtime with a huge number of tasks. It helps to specify the resources needed (CPU, memory) per task and also to control the number of created and in-flight tasks.
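A minimal sketch of declaring per-task resources so the scheduler doesn't oversubscribe a node (the numbers are illustrative only):

```python
import ray

ray.init()

# Each task claims 2 CPUs and roughly 2 GiB of memory (in bytes), so the
# scheduler only packs as many of these tasks onto a node as it can hold.
@ray.remote(num_cpus=2, memory=2 * 1024**3)
def heavy_task(partition):
    ...  # memory-hungry work on one partition
    return len(partition)
```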

For larger datasets, my main concern is usually to control the resources needed by tasks and actors (by specifying the resources they require), and, in large loops of tasks (5k+), to control the number of submitted and in-flight tasks, using ray.get to bound the number of submitted tasks and ray.wait to bound the number of in-flight tasks.
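A sketch of that throttling pattern (MAX_IN_FLIGHT and the task body are placeholders): cap the number of in-flight tasks with ray.wait, and only submit more once earlier ones have finished.

```python
import ray

ray.init()

@ray.remote
def process(item):
    return item * 2  # placeholder work

MAX_IN_FLIGHT = 100
in_flight, results = [], []

for item in range(50_000):
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Block until at least one task finishes before submitting more.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(done))
    in_flight.append(process.remote(item))

results.extend(ray.get(in_flight))  # drain the remaining tasks
```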