Fastest way to share numpy arrays between Ray actors and the main process

I have a use case where I have to pass huge numpy arrays to and from actors, and speed is critical. Most of the communication is between a single main process (which launches the actors) and N actors.

I believe Ray has optimizations for handling numpy arrays because it uses Apache Arrow under the hood (correct me if I am wrong).
Right now I just pass np.ndarray objects directly in and out of actor method calls.
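
For reference, this is roughly what I'm doing right now (the actor and method names are just placeholders):

```python
import numpy as np
import ray

ray.init()

# Placeholder actor/method names, just to illustrate the current pattern:
# arrays go in as method arguments and come back as return values.
@ray.remote
class Processor:
    def process(self, arr: np.ndarray) -> np.ndarray:
        return arr * 2.0

proc = Processor.remote()
data = np.random.rand(1_000_000)
result = ray.get(proc.process.remote(data))  # array comes back to the main process
```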

  1. Do I need to do anything special to make sure Ray uses the fastest method to pass numpy arrays around? Or does it recognize that I am passing numpy arrays and optimize accordingly?

  2. Note that for some communication channels I know the size of the arrays and for some I don’t. Is it possible to preallocate memory for the arrays that I know the size of?

Thanks!

Bump.

I’m still not sure whether Ray/Arrow automatically recognizes numpy arrays or whether I need to explicitly tell Ray that my data is a numpy array.

@mk96 Sorry for the delay.

Yes, Ray optimizes for numpy arrays. If you store a numpy array in the Ray object store, then actors and tasks scheduled on the node where that object resides will read it zero-copy from shared memory.

So you don’t have to do anything special.
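
For example, here is a minimal sketch (the actor, method, and array names are just placeholders) of putting an array into the object store once and letting several actors on the same node read it zero-copy:

```python
import numpy as np
import ray

ray.init()

@ray.remote
class Worker:
    def mean(self, arr: np.ndarray) -> float:
        # On the same node, `arr` is a read-only, zero-copy view backed by
        # shared memory in the object store, not a per-actor copy.
        return float(arr.mean())

big = np.random.rand(10_000_000)
big_ref = ray.put(big)  # store the array in the object store once

# Pass the ObjectRef; Ray dereferences it inside each actor call.
workers = [Worker.remote() for _ in range(4)]
print(ray.get([w.mean.remote(big_ref) for w in workers]))
```

If you passed `big` itself instead of `big_ref`, Ray would still put the array in the object store for you, but reusing the ObjectRef avoids serializing the array again for every call.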

@mk96 Also make sure to tune your Ray cluster's object spilling configuration so that spilling to disk uses multiple threads: quokka/leader_start_ray.sh at master · marsupialtail/quokka · GitHub
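
A minimal sketch of that kind of spilling configuration, assuming Ray's experimental `_system_config` options (these names and defaults may change between Ray versions):

```python
import json
import ray

ray.init(
    _system_config={
        # More I/O workers so spilling/restoring runs on multiple threads.
        "max_io_workers": 4,
        # Spread spilled objects across several directories (ideally on
        # different disks) to increase disk throughput.
        "object_spilling_config": json.dumps(
            {
                "type": "filesystem",
                "params": {"directory_path": ["/tmp/spill0", "/tmp/spill1"]},
            }
        ),
    }
)
```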