Fastest way to share numpy arrays between Ray actors and the main process

I have a use case where I have to pass huge numpy arrays to and from actors, and speed is critical. Most of the communication is between a single main process (which launches the actors) and N actors.

I believe Ray has optimizations for handling numpy arrays because it uses Apache Arrow under the hood (correct me if I am wrong).
Right now I just pass np.ndarray objects directly in and out of actor method calls.
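
For reference, this is roughly what I'm doing right now (the actor and method names are just placeholders):

```python
import numpy as np
import ray

ray.init()

# Placeholder actor/method names, just to illustrate the current pattern:
# arrays go in as method arguments and come back as return values.
@ray.remote
class Processor:
    def process(self, arr: np.ndarray) -> np.ndarray:
        return arr * 2.0

proc = Processor.remote()
data = np.random.rand(1_000_000)
result = ray.get(proc.process.remote(data))  # array comes back to the main process
```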

  1. Do I need to do anything special to make sure Ray uses the fastest method to pass numpy arrays around? Or does it recognize that I am passing numpy arrays and optimize accordingly?

  2. Note that for some communication channels I know the size of the arrays and for some I don’t. Is it possible to preallocate memory for the arrays that I know the size of?

Thanks!

Bump.

I’m still not sure whether Ray/Arrow automatically recognizes numpy arrays or whether I need to explicitly tell Ray that my data is a numpy array.

@mk96 Sorry for the delay.

Yes, Ray optimizes for numpy arrays. If you store a numpy array in the Ray object store, then actors and tasks scheduled on the node where that object resides will read it zero-copy from shared memory.

So you don’t have to do anything special.
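
For example, here is a minimal sketch (the actor, method, and array names are just placeholders) of putting an array into the object store once and letting several actors on the same node read it zero-copy:

```python
import numpy as np
import ray

ray.init()

@ray.remote
class Worker:
    def mean(self, arr: np.ndarray) -> float:
        # On the same node, `arr` is a read-only, zero-copy view backed by
        # shared memory in the object store, not a per-actor copy.
        return float(arr.mean())

big = np.random.rand(10_000_000)
big_ref = ray.put(big)  # store the array in the object store once

# Pass the ObjectRef; Ray dereferences it inside each actor call.
workers = [Worker.remote() for _ in range(4)]
print(ray.get([w.mean.remote(big_ref) for w in workers]))
```

If you passed `big` itself instead of `big_ref`, Ray would still put the array in the object store for you, but reusing the ObjectRef avoids serializing the array again for every call.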

@mk96 Also make sure to tune your Ray cluster's object spilling configuration so that spilling to disk uses multiple threads: quokka/leader_start_ray.sh at master · marsupialtail/quokka · GitHub
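
A minimal sketch of that kind of spilling configuration, assuming Ray's experimental `_system_config` options (these names and defaults may change between Ray versions):

```python
import json
import ray

ray.init(
    _system_config={
        # More I/O workers so spilling/restoring runs on multiple threads.
        "max_io_workers": 4,
        # Spread spilled objects across several directories (ideally on
        # different disks) to increase disk throughput.
        "object_spilling_config": json.dumps(
            {
                "type": "filesystem",
                "params": {"directory_path": ["/tmp/spill0", "/tmp/spill1"]},
            }
        ),
    }
)
```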