Ray distributed memory parallelism


When it comes to creating a Ray cluster, is there a way to limit the number of workers per node? My remote function generates large amounts of data, and I want to avoid having many workers on the same node.

Hi @wrios - great question!

In general, each Ray node is started by `ray start` or `ray.init()`, and you can pass `num_cpus` to specify the number of logical CPUs that node advertises. This is equivalent to the maximum number of worker processes Ray will start on that node. (Do note that each Ray worker process may itself use multiple threads.)

And how are you creating the cluster? I could probably provide more specific help if I knew that.

Thank you for your interest in helping. To set up the cluster I am using a script similar to the one depicted in this example. Because a remote function can generate a large amount of data before it returns (as happens in wave simulation, where the whole history may need to be saved), it can fill up a node's memory very easily if several Ray workers run on the same node. I was just wondering whether there is a way to force Ray workers to be distributed across nodes so as to avoid this.

I’m just an amateur Ray user, but have you considered a centrally available spill location for workers, and/or enabling Dask on Ray? That way, if a worker’s memory fills up, the data is spilled to disk and can be picked up again by the same worker (or by another worker if the node or original worker fails).