Dask object size vs ray.experimental.array

I’m not sure whether this is the right place to ask. I have a matrix multiplication program with two implementations: one uses Dask on Ray and the other uses ray.experimental.array (I modified its DistArray definition to support a more flexible block size, see ray/core.py at master · photoszzt/ray · GitHub). The chunk size is the same in both. When using ray memory to view the object sizes, I find that Dask’s objects are always twice as large as the ray.experimental.array ones. Does anyone know why?

Dask on Ray version:

from timeit import default_timer as dtimer
import ray
from ray.util.dask import ray_dask_get
import dask
import dask.array as da
import numpy as np


def main():
    _ = ray.init()
    dask.config.set(scheduler=ray_dask_get)
    chunks = (2500, 2500)
    x = da.random.random((5000, 5000), chunks=chunks)
    y = da.random.random((5000, 5000), chunks=chunks)
    start = dtimer()
    z = da.matmul(x, y)
    z_val = z.compute()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()

ray.experimental.array version:

from timeit import default_timer as dtimer
import ray
import ray.experimental.array.distributed as rda
import numpy as np


def main():
    _ = ray.init()
    chunks = (2500, 2500)
    npx = np.random.rand(5000, 5000)
    npy = np.random.rand(5000, 5000)
    # the chunks kwarg comes from the modified DistArray in my fork linked above
    x = rda.numpy_to_dist.remote(npx, chunks=chunks)
    y = rda.numpy_to_dist.remote(npy, chunks=chunks)
    start = dtimer()
    z = rda.dot.remote(x, y)
    z_val = ray.get(z)
    z_val = z_val.assemble()
    end = dtimer()
    print(z_val)
    print(f"time: {end-start} s")


if __name__ == "__main__":
    main()

Hmm, not sure.

On a meta-note, I think the Ray distributed array API is not really well-maintained. You may want to try dask-on-ray instead? Dask on Ray — Ray v2.0.0.dev0

cc @pcmoritz @Clark_Zinzow


I am using Dask on Ray and comparing it with the array implementation that comes with Ray. My question is about the object size. For a 2500x2500 float64 matrix, the size should be 2500 * 2500 * 8 / 1024 / 1024 ≈ 47 MB. I see a similar number for the ray.experimental.array objects, but with Dask on Ray the object size is 95 MB. I’m not sure whether the object contains two matrices or whether the matrix gets replicated. Dask on Ray uses twice the memory of ray.experimental.array, and I can’t tell what’s inside the object from the ray memory output.
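
Just to make the arithmetic explicit, here is the back-of-the-envelope check I’m doing (plain Python, nothing Ray-specific):

chunk_rows, chunk_cols = 2500, 2500
bytes_per_float64 = 8
chunk_mb = chunk_rows * chunk_cols * bytes_per_float64 / 1024 / 1024
# ~47.7 MB per chunk, which matches the ray.experimental.array objects;
# the Dask-on-Ray objects show up at ~95 MB, i.e. roughly 2x this.
print(f"expected chunk size: {chunk_mb:.1f} MB")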

It could be that Dask is doing some fancy chunk coalescing or other memory optimizations that result in larger individual Ray objects than expected (e.g. they have an optimization to avoid blockwise contractions). Are all Ray objects comprising the Dask-on-Ray workload also taking up 2x more memory than the ray.experimental.array version, in total?

Also, if you’re willing to try out Dask-on-Ray on master, you could call z.persist(), which will inline the Ray objects into the Dask collection and allow you to inspect the individual objects within the Dask graph with ray.get().
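
For example, something like this (a rough sketch, assuming that after persisting on master the values in the Dask graph are Ray ObjectRefs; on older Ray releases the class is ray.ObjectID):

zp = z.persist()
# After the inlining on master, the task graph of the persisted collection
# should hold Ray object references for the computed chunks.
graph = dict(zp.__dask_graph__())
for key, value in graph.items():
    if isinstance(value, ray.ObjectRef):
        block = ray.get(value)
        print(key, block.shape, f"{block.nbytes / 1024 / 1024:.1f} MB")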

By parsing the raylog.out, the peak Dask-on-Ray memory usage is about twice that of the ray.experimental.array version. I’m looking at the “num bytes in use” line for this information.
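
Roughly what I’m doing to get the peak value (a quick sketch; the log path and the exact format of the “num bytes in use” line are assumptions, adjust them to whatever your raylet log actually contains):

import re

peak = 0
# NOTE: the path is an assumption; point this at the raylet log you are parsing.
with open("/tmp/ray/session_latest/logs/raylet.out") as f:
    for line in f:
        m = re.search(r"num bytes in use:?\s*(\d+)", line)
        if m:
            peak = max(peak, int(m.group(1)))
print(f"peak bytes in use: {peak / 1024 / 1024:.1f} MB")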

Hmm, I ran the code one more time with Ray master. Now it’s not 2x; it’s much closer: Dask uses 810 MB and ray.experimental.array uses 762 MB. I’m not sure what changed.

I tried the z.persist() method and printed out the objects. It seems they only contain the final chunks, and their sizes match my expectation. The objects that are 2x as large might be from an earlier stage.
