Simultaneous numpy matrix vector multiplication

- Low: It annoys or frustrates me for a moment.

What I want to do

I have an m x n numpy array A, where m << n, that I want to load on a node where all 20 CPUs can share memory. On each CPU, I want to multiply A by an n x 1 vector v, where v differs on each CPU but A stays the same.

Constraint

Matrix A is sufficiently large so that I cannot load A on each CPU, so I would like to put A in shared node memory. And since A*v is just m x 1, I think I never need to store matrices of size m x n on each CPU (just one copy of A in shared memory).

My question

If I have 1 worker per CPU, can each worker simultaneously compute A x v (where v is different for each worker) using Ray?

I am concerned that since I am simultaneously accessing the same shared memory by each worker, Ray would copy matrix A into each CPU, which would cause memory problems for me.

Note
I had previously asked this question on StackOverflow. I have amended the question on StackOverflow, removing the part about Ray, because I just discovered that Ray has its own forum here.

If A is an integer-based numpy array, it is passed to each worker with zero copies. See the "Serialization" page of the Ray documentation (Ray 3.0.0.dev0).

Also note that A is immutable once you put it into shared memory using ray.put!

Thanks!

i. Just to be clear, zero copy means that I don’t need to store n_CPU x size of A in memory over the whole node (I just need to store 1 copy of A)?

ii. Could you explain the integer part a bit more? I did not see a reference to integers in the link you mentioned. Additionally, the numpy array I will be using consists of floats (not just integers) – would that work as well?

iii. Could you explain the part about serialization? My goal is to simultaneously access A across all workers (at the same time); serialization makes me think that one worker uses A for matrix vector multiplication, then the next worker uses A, etc., which is what I am hoping to avoid with Ray. Can I access a numpy float array A and use it for matrix vector multiplication simultaneously across all workers?

Thanks a lot for all the developments on Ray! It seems like an awesome package.

That’s right! If you find that this is not the case, i.e. you’re running out of memory or memory usage is much higher than size of A + (number of workers x size of v), please report back here with your code, as this is not expected.
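To make that estimate concrete, here is a rough back-of-the-envelope sketch in plain Python. The sizes (m, n, 20 workers, 8-byte floats) are hypothetical, the helper name is made up for illustration, and the calculation ignores Ray's own overhead; it also counts the small m x 1 result per worker.

```python
def node_memory_bytes(m, n, n_workers, itemsize=8, shared=True):
    """Rough bytes needed for A plus one v and one result per worker."""
    a_bytes = m * n * itemsize      # the m x n matrix A
    v_bytes = n * itemsize          # one n x 1 vector v
    out_bytes = m * itemsize        # one m x 1 result A @ v
    copies_of_a = 1 if shared else n_workers
    return copies_of_a * a_bytes + n_workers * (v_bytes + out_bytes)

# Hypothetical sizes, chosen only for illustration.
m, n, n_workers = 1_000, 5_000_000, 20

shared_total = node_memory_bytes(m, n, n_workers, shared=True)
copied_total = node_memory_bytes(m, n, n_workers, shared=False)

print(f"one shared copy of A: {shared_total / 1e9:.1f} GB")
print(f"one copy per worker:  {copied_total / 1e9:.1f} GB")
```

If observed usage on the node looks closer to the second figure than the first, that would be the unexpected case worth reporting.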

Yes, to clarify, zero-copy serialization should work for numpy arrays in general, floats included, as long as you are using primitive dtypes (ints, floats, bytes, etc.). It does not work for the 'O' (object) dtype, since those arrays hold arbitrary Python objects.
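A quick way to check which side of that line an array falls on is to look at its dtype before calling ray.put. The sketch below uses only numpy; the helper name has_primitive_dtype is hypothetical, and the check is a heuristic based on the rule stated above, not something Ray itself exposes.

```python
import numpy as np

def has_primitive_dtype(arr):
    """Return True if the array stores no Python objects (no 'O' dtype or object fields)."""
    return not arr.dtype.hasobject

floats = np.zeros((3, 4), dtype=np.float64)
ints = np.arange(5, dtype=np.int64)
objects = np.array([{"a": 1}, None], dtype=object)  # 'O' dtype

for name, arr in [("floats", floats), ("ints", ints), ("objects", objects)]:
    print(name, arr.dtype, has_primitive_dtype(arr))
```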

Yes, in the following code, the workers will not need to copy A when they receive it as a task argument because they receive a pointer to the numpy array stored in shared memory.

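import numpy as np
import ray

@ray.remote
def multiply(A, v):
    # A arrives as a read-only, zero-copy view of the array in the object store.
    return A @ v  # Matrix-vector product; A * v would multiply elementwise.

# Small placeholder data; substitute your own A and vs.
A = np.random.rand(4, 1000)                     # m x n with m << n
vs = [np.random.rand(1000) for _ in range(20)]  # one vector per task

A_ref = ray.put(A)  # Put A in Ray's shared-memory object store.
refs = [multiply.remote(A_ref, v) for v in vs]
results = ray.get(refs)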

The one caveat is that the copy of A is immutable, so if you need to make a fine-grained update, this will produce multiple distinct copies of A, one per task:

@ray.remote
def update(A, i, x):
  A = A.copy()  # The shared array is read-only, so copy it before writing.
  A[i] = x
  return A  # This is a distinct object from the original A.
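The read-only behaviour can be tried out without a cluster. The sketch below does not use Ray at all: it marks a small numpy array as non-writable to stand in for the array a task would receive (an assumption made for illustration), then shows that reads such as A @ v still work while an in-place write needs a private copy.

```python
import numpy as np

A = np.arange(6, dtype=np.float64).reshape(2, 3)
A.setflags(write=False)  # stand-in for the read-only array a Ray task receives

try:
    A[0] = 10.0
    wrote_in_place = True
except ValueError:
    wrote_in_place = False  # numpy refuses to write to a read-only array

B = A.copy()  # private, writable copy; costs a full extra copy of A
B[0] = 10.0

v = np.ones(3)
print("in-place write succeeded:", wrote_in_place)
print("A @ v:", A @ v)  # reading the read-only array is fine
print("B @ v:", B @ v)
```

For the matrix-vector use case in this thread only reads are needed, so no such copy is required.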

Thank you! And thank you all for the awesome functionalities of Ray!
