Memory sharing with nested list

In our ML project, we pass a large non-standard Python object (a nested list) to processes running in parallel. The problem is that this nested list (e.g. 3 GB) is first loaded into the Ray object store and then copied into each child process, i.e. memory is not shared between the object store and the child processes. Below you find a toy example. Our users work on Windows machines. Can you please help me find a solution to this problem? It often causes the program to abort because we run out of memory.

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 10 15:25:36 2021.

@author: HBodory
"""

import numpy as np
import ray
import time
from scipy.sparse import lil_matrix


@ray.remote
def toy_function(x, nested_list, sparse_matrix):
    """Test."""
    idx = np.array(list(range(30)))
    data = np.array([i / 1 for i in range(30)], dtype="float32")
    sparse_matrix[0, idx] = data
    sparse_matrix[1, 30:60] = sparse_matrix[0, :30]
    for i in range(1 * 10**2):
        m1 = np.random.random((100, 100))
        matrix = m1 @ np.transpose(m1)
        sol = np.linalg.inv(matrix)
        c_sparse = sparse_matrix[0, 2]
        result = x + nested_list[0][0]
    return result, sol, c_sparse


if __name__ == "__main__":

    N = 10**7

    ray.init(num_cpus=4, include_dashboard=False)

    # Create a nested list
    nested_list = [list(range(N)), list(range(2*N))]
    nested_list_id = ray.put(nested_list)

    sparse_matrix = lil_matrix((N, N), dtype=np.float32)
    sparse_matrix[0, :30] = range(30)
    sparse_matrix[1, 30:60] = sparse_matrix[0, :30]
    sparse_matrix = ray.put(sparse_matrix)

    t0 = time.time()
    result = ray.get([toy_function.remote(i, nested_list_id, sparse_matrix)
                      for i in range(10)])
    t1 = time.time() - t0
    print(f"Execution time: {t1} sec.")
    ray.shutdown()

Hi @hbodory, thanks for the question! I’m not sure what’s causing this issue, as your code sample seems to be following the recommendation at Tips for first-time users — Ray v2.0.0.dev0 to the letter. Is it possible for you to use a numpy array or collection of numpy arrays instead of a nested list? That may save on deserialization costs in the workers: Ray Core Walkthrough — Ray v2.0.0.dev0
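For example, if one of the inner lists were replaced by a single numpy array, it could be put into the object store roughly like this (a minimal sketch based on the toy example above; your real data may not map onto arrays this directly):

import numpy as np
import ray

ray.init(num_cpus=4, include_dashboard=False)

N = 10**7
# One contiguous int64 buffer instead of a list of N boxed Python ints;
# Ray can keep this buffer in the object store and let workers read it zero-copy.
big_array_id = ray.put(np.arange(N, dtype=np.int64))


@ray.remote
def read_first(arr):
    """Touch the shared array without copying it."""
    return arr[0]


print(ray.get(read_first.remote(big_array_id)))
ray.shutdown()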

@Stephanie_Wang any ideas about this?

If you are talking about zero-copy reads, they only work on data structures that support pickle protocol 5 (e.g., numpy arrays). I believe the scipy matrix probably uses numpy under the hood, but for a Python list it doesn't work (the list is copied into the process's memory when it is used).
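A minimal sketch of the difference, assuming a recent Ray version (sizes are made up for illustration):

import numpy as np
import ray

ray.init(num_cpus=2, include_dashboard=False)

# A numpy array of primitives is serialized with pickle protocol 5
# out-of-band buffers, so workers can map it straight from shared memory.
array_ref = ray.put(np.arange(10**6, dtype=np.int64))

# A plain Python list of ints has no such buffer; every worker that
# uses it deserializes its own private copy.
list_ref = ray.put(list(range(10**6)))


@ray.remote
def first_element(obj):
    return obj[0]


print(ray.get([first_element.remote(array_ref), first_element.remote(list_ref)]))

# The zero-copy array comes back read-only; writing to it would raise an error.
shared_array = ray.get(array_ref)
print(shared_array.flags.writeable)  # False for the shared-memory view
ray.shutdown()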

Dear @architkulkarni, thanks for your recommendation regarding the use of numpy arrays. We thought about using objects that support pickle protocol 5 for zero-copy reads. But if we used, for example, numpy arrays instead of the single nested list object, we would have to generate tens of thousands or even hundreds of thousands of numpy arrays, depending on the data size, the number of trees, the leaf sizes, and so on. Is there a way to handle such a large collection of numpy arrays in a similarly simple fashion to a single nested list?

Dear @sangcho, thank you very much for your reply. Please see my answer sent to @architkulkarni.

I think if you store the numpy arrays in a nested list, only the list part is copied (and the numpy buffers themselves would be read zero-copy). @suquark can you confirm this?

Yes, only the list part is copied
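Applied to the toy example above, that change would amount to something like the following sketch (assuming the inner lists can be represented as integer arrays):

import numpy as np
import ray

ray.init(num_cpus=4, include_dashboard=False)

N = 10**7
# The outer list is still pickled and copied into each worker, but it now
# holds only two array references; the large buffers stay in shared memory
# and are read zero-copy.
nested_list = [np.arange(N, dtype=np.int64), np.arange(2 * N, dtype=np.int64)]
nested_list_id = ray.put(nested_list)


@ray.remote
def toy_function(x, nested_list):
    # nested_list[0] is a read-only view into the object store, not a copy
    return x + nested_list[0][0]


print(ray.get([toy_function.remote(i, nested_list_id) for i in range(10)]))
ray.shutdown()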

Dear @sangcho, I ran some small tests, and you are right: storing numpy arrays in a nested list can lead to improvements in several dimensions, e.g. (i) requiring less memory for the nested list itself, (ii) saving RAM in the child processes because of zero-copy reads, and (iii) reducing the run time. Thank you very much for your valuable and helpful support.

Glad it helped!! And thanks for the verification :)!!