PS>> I just started reading about HDF5, which seems to answer most of the questions …
How do I create a LARGE 2D numpy array that has the following specs (sketched in code after the list):
1. Can do a DOT product, e.g. np.dot(vector, ary2d)
2. CAN use ary2d[rows, cols] syntax to update values
3. CAN resize the array
4. CAN be accessed by multiple Actors/tasks
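Concretely, this is what I want to be able to do, written against a plain in-memory array (purely illustrative; the whole question is how to get the same behaviour on a huge, shared, growable one):

```python
import numpy as np

ary2d = np.zeros((1000, 500))
vector = np.random.rand(1000)

result = np.dot(vector, ary2d)                          # 1. dot product
ary2d[45, :] = np.ones(500)                             # 2. in-place update
ary2d = np.append(ary2d, np.zeros((500, 500)), axis=0)  # 3. resize (grow rows)
# 4. ...and all of the above from multiple Actors/tasks at once
```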
My idea so far is:
Have some sort of server/daemon app that forks multiple processes.
Split the array into chunks, so that resizing the array is simply adding a new chunk.
Applying a DOT product, for example, then means applying it to every chunk and combining the results (see the sketch below).
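A minimal single-process sketch of that idea (sizes and names are made up; the chunks are row-blocks, so the full array would be np.vstack(chunks)):

```python
import numpy as np

chunks = [np.random.rand(1000, 500) for _ in range(4)]  # 4 row-blocks
vector = np.random.rand(4 * 1000)

def chunked_dot(vector, chunks):
    """np.dot(vector, np.vstack(chunks)) without materializing the big array."""
    out = np.zeros(chunks[0].shape[1])
    row = 0
    for chunk in chunks:
        out += np.dot(vector[row:row + chunk.shape[0]], chunk)
        row += chunk.shape[0]
    return out

def set_row(chunks, i, vec):
    """ary2d[i, :] = vec, routed to the chunk that owns row i."""
    for chunk in chunks:
        if i < chunk.shape[0]:
            chunk[i, :] = vec
            return
        i -= chunk.shape[0]
    raise IndexError(i)

chunks.append(np.zeros((1000, 500)))  # resizing = just adding a chunk
```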
I couldn't find a way in the Ray Datasets or Apache Arrow docs to UPDATE the numpy array, e.g. chunk3[45,:] = vec
Does Ray handle locking of access to the chunks, or do I have to do it manually?
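On the locking question: from what I've read in the Actor docs, a Ray actor executes its method calls one at a time by default, so giving each chunk its own actor would serialize updates to that chunk without manual locks. An untested sketch (the ChunkActor class and its methods are mine, not a Ray API):

```python
import numpy as np
import ray

ray.init()

@ray.remote
class ChunkActor:
    """Owns one row-block; calls on a single actor run serially."""
    def __init__(self, rows, cols):
        self.chunk = np.zeros((rows, cols))

    def set_row(self, i, vec):
        self.chunk[i, :] = vec      # the chunk3[45, :] = vec case

    def dot(self, vec_part):
        return np.dot(vec_part, self.chunk)

actors = [ChunkActor.remote(1000, 500) for _ in range(4)]

actors[3].set_row.remote(45, np.ones(500))   # update goes to chunk 3's owner

vector = np.random.rand(4000)
parts = [a.dot.remote(vector[i * 1000:(i + 1) * 1000])
         for i, a in enumerate(actors)]
result = sum(ray.get(parts))                 # combine the per-chunk results
```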
Should I use something like HDF5 instead of Arrow (Arrow has its own array type, not np.array … I need np.dot/fft and Cython)?
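For reference, this is what the HDF5 route looks like with h5py, going by its docs (a sketch I haven't benchmarked): datasets created with maxshape=(None, ...) support in-place slice assignment and resizing.

```python
import numpy as np
import h5py

with h5py.File("big.h5", "w") as f:
    dset = f.create_dataset("ary2d", shape=(1000, 500),
                            maxshape=(None, 500),   # rows are growable
                            chunks=(100, 500), dtype="f8")
    dset[45, :] = np.ones(500)        # in-place row update
    dset.resize((1500, 500))          # grow by 500 rows, no np.append copy
    vector = np.random.rand(1500)
    result = np.dot(vector, dset[:])  # dset[:] reads back a real np.ndarray
```

The caveat, as far as I understand, is that plain HDF5 allows only one writer at a time (SWMR mode is single-writer/multiple-reader), so the multi-Actor requirement would still need coordination on top.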
Sorry for the multi-directional questions; to put it succinctly, what I need is a NUMPY DATABASE.
All projects I've checked so far (Dask, Vaex, PyTables, Arrow, and possibly Ray Datasets) seem to be NON-UPDATABLE, NON-RESIZABLE, SINGLE-CLIENT-ACCESS projects.
If you can comment on any of the topics with an example or a link to docs to read, that would help.
I've read most of the Ray Core and Datasets docs and have done some non-trivial experiments, but the bottleneck is the serial access to a numpy array (and a Python dict).
The multi-Actor app was ~3 times slower.
My hope is that by chunking the array I can allow multi-access and implement resizing (I currently use np.append(); see the note below).
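For context on why I want to drop np.append(): it returns a freshly allocated copy on every call, so growing row-by-row costs O(n) per append, while appending a block to a list of chunks copies nothing:

```python
import numpy as np

ary2d = np.zeros((1000, 500))
ary2d = np.append(ary2d, np.zeros((1, 500)), axis=0)  # copies all 1001 rows

chunks = [np.zeros((1000, 500))]
chunks.append(np.zeros((1000, 500)))  # O(1): no existing data is copied
```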