Maximize available capacity for embarrassingly parallel workloads?

If I’d like to use a Ray cluster for a purely embarrassingly parallel workload, i.e. one that doesn’t need the global object store and similar components, is there a way to exclude or drastically minimize them in a particular Ray deployment?

You can use `object_store_memory=0` in this case.
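For example, something like this when starting Ray (note that recent Ray versions may enforce a minimum object store size, so a literal 0 can be rejected; the exact floor is version-dependent):

```python
import ray

# Shrink the object store allocation for this node. Ray may enforce a
# minimum (a literal 0 can be rejected on some versions), so a small
# nonzero value is used here instead.
ray.init(object_store_memory=78 * 1024 * 1024)  # ~78 MiB
```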

@mbehrendt How large is the data in this embarrassingly parallel workload? There’s a good chance that you’ll still want to use Ray’s dataplane (the distributed in-memory object store) unless the data is very small, in which case Ray will bypass the object store automatically. If you observe very low object store usage, you can always adjust the allocation as @sangcho suggested, giving less RAM to the object store and more to the worker heap.
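One rough way to check how much of the object store your workload actually uses from the driver is via the resource report (a sketch, assuming the `object_store_memory` entry is reported in bytes, which may vary by Ray version):

```python
import ray

ray.init()

# Compare total vs. currently available object store capacity to see
# how much of the allocation the workload actually touches.
total = ray.cluster_resources().get("object_store_memory", 0)
free = ray.available_resources().get("object_store_memory", 0)
print(f"object store: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB total")
```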

@Clark_Zinzow I don’t have a single specific case – my question is more about a certain class of workloads. The data might be as little as one URL per task invocation – so for cases like these, it felt like there might be a benefit in taking the object store out of the loop. I also asked because I wasn’t sure whether adding and removing capacity might carry an additional performance penalty with the object store enabled, since adding and removing nodes might trigger some syncing/registering/deregistering/… to happen.

> The data might be as little as one URL per task invocation – so for cases like these, it felt like there might be a benefit in taking the object store out of the loop.

In this case, the URLs won’t be stored in the object store; they’ll be automatically inlined into the task specification that’s sent to a Ray worker for execution, so the object store is completely bypassed. We do this for all task arguments under a configurable threshold (the default is 100KiB).
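To make that concrete, here’s a minimal sketch of such a workload (the `process` task body and the URLs are made up for illustration); each string argument is far below the inline threshold, so it ships inside the task specification rather than through the object store:

```python
import ray

ray.init()

@ray.remote
def process(url: str) -> int:
    # Placeholder work; imagine fetching and processing the URL here.
    return len(url)

# Small string arguments like these are inlined into the task spec,
# bypassing the object store entirely.
urls = [f"https://example.com/item/{i}" for i in range(1000)]
results = ray.get([process.remote(u) for u in urls])
```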

> I also asked because I wasn’t sure whether adding and removing capacity might carry an additional performance penalty with the object store enabled, since adding and removing nodes might trigger some syncing/registering/deregistering/… to happen.

There will be a small amount of time spent initializing the object store at node startup, but it should be negligible relative to the overall node startup time. On the critical path of executing tasks, if the task arguments and return values are under 100KiB, the object store is completely bypassed and should therefore add no overhead.
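If you ever want to tune that threshold, something like the following might work; this is a hedged sketch assuming the `max_direct_call_object_size` key, which is an internal, undocumented `_system_config` knob that may change between Ray versions:

```python
import ray

# Raise the inline threshold from the default (~100KiB) to ~200KiB.
# `_system_config` and `max_direct_call_object_size` are internal,
# version-dependent settings; treat this as unstable.
ray.init(_system_config={"max_direct_call_object_size": 200 * 1024})
```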