How does Ray do simultaneous matrix-vector multiplications?

Just asking a question out of curiosity.

Context

This is a continuation of Simultaneous numpy matrix vector multiplication.

Basically, I have one node with a large matrix stored in shared memory. This node has 20 CPUs, each CPU has a different set of vectors, and each of them multiplies the large matrix by its own vectors.

Question

The answer to the linked question mentioned that all the CPUs can access the large matrix simultaneously (rather than having one CPU use the matrix, then the next one, and so on). Out of curiosity, how does this happen?

More context

For example, does Ray tell each CPU to work on different parts of the matrix, so that each part of the large matrix is only accessed by one CPU at any point in time (e.g., CPU 1 first multiplies the first row by the vector in CPU 1 while CPU 2 multiplies the second row by the vector in CPU 2, then CPU 1 multiplies the second row by the vector in CPU 1 while CPU 2 multiplies the third row by the vector in CPU 2, and so on)?

Or do all CPUs basically access the same element at the same time (i.e., all CPUs multiply the first row by their vector, then the second row by their vector, etc.)?

As long as there are no writes to a memory region, any number of CPUs can read it concurrently. In Ray's case, objects in the object store are immutable, so writes aren't possible. If the input matrices are stored in a format that Ray can read zero-copy (e.g., numpy arrays), then your Ray workers can all access any part of the input matrices concurrently, as if each were the only worker.
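A minimal sketch of that pattern (the sizes and names are illustrative, not from the original question): the matrix is put into the object store once with ray.put, and every task receives a read-only, zero-copy numpy view backed by the same shared memory.

```python
import numpy as np
import ray

ray.init()

@ray.remote
def multiply(matrix, vector):
    # `matrix` arrives as a read-only, zero-copy view of the shared object;
    # no per-worker copy of the large matrix is made on this node.
    return matrix @ vector

matrix = np.random.rand(5_000, 5_000)     # a "large" matrix, for illustration
matrix_ref = ray.put(matrix)              # one copy in the shared object store

vectors = [np.random.rand(5_000) for _ in range(20)]  # one vector per CPU
# All 20 tasks read the same shared matrix concurrently.
results = ray.get([multiply.remote(matrix_ref, v) for v in vectors])
```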

Writing is much more complicated, since the ordering of writes is usually nondeterministic and therefore must be synchronized. Luckily, in matrix multiplication the set of outputs can be sharded over the set of CPUs so that each output is written by exactly one CPU. This means synchronization only needs to happen at the reduce stage (after each output is computed), which is very manageable.
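As a hypothetical illustration of sharding the outputs, here is a sketch of a single large matrix-vector product where each task owns a disjoint block of output rows, so no two tasks ever write to the same result; the only synchronization is the final concatenation (the reduce step). The block boundaries and function name are assumptions, not from the post.

```python
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def multiply_block(matrix, vector, start, stop):
    # Reads rows [start, stop) of the shared matrix and writes only its own
    # freshly allocated output block, so write conflicts cannot occur.
    return matrix[start:stop] @ vector

n, num_cpus = 5_000, 20
matrix_ref = ray.put(np.random.rand(n, n))
vector = np.random.rand(n)

bounds = np.linspace(0, n, num_cpus + 1, dtype=int)
blocks = [multiply_block.remote(matrix_ref, vector, lo, hi)
          for lo, hi in zip(bounds[:-1], bounds[1:])]
result = np.concatenate(ray.get(blocks))   # the reduce step
```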

This answer assumes a single node. It generalizes to multiple nodes, but additional copies of the input matrices are needed so that workers on the other nodes can read them.
