Ray dataset creating 2 objects per file read, leading to double memory consumption

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.
    High: It blocks me from completing my task.

I tried reading multiple text files using Ray's Dataset API and noticed that for every file read, Ray created 2 objects, each roughly the size of the input file. This does not happen when I put a NumPy array into the object store using ray.put. My code and the corresponding ray memory output are below:

import ray
from ray.internal.internal_api import memory_summary

ray.init(address="auto")

# currently has 3 files of 850 MiB each
dir_path = "input/files/path"

ds = ray.data.read_text(dir_path)
print(ds.count())
print(memory_summary())

ray.shutdown()

The output of the above code is as follows:

1532898
Grouping by node address...        Sorting by object size...        Display allentries per group...


--- Summary for node address: 127.0.0.1 ---
Mem Used by Objects  Local References  Pinned        Pending Tasks  Captured in Objects  Actor Handles
5365140705.0 B       6, (5365140705.0 B)  0, (0.0 B)    0, (0.0 B)     0, (0.0 B)           2, (0.0 B)

--- Object references for node address: 127.0.0.1 ---
IP Address       PID    Type    Call Site               Size    Reference Type      Object Ref
127.0.0.1        61351  Worker  disabled                ?       ACTOR_HANDLE        ffffffffffffffff54c910c96952127076bc5c6d0100000001000000

127.0.0.1        61347  Driver  (actor call)  | /opt/a  ?       ACTOR_HANDLE        ffffffffffffffff54c910c96952127076bc5c6d0100000001000000
                                naconda3/envs/snappy-f
                                ix/lib/python3.7/site-
                                packages/ray/data/impl
                                /stats.py:get_or_creat
                                e_stats_actor:122 | /o
                                pt/anaconda3/envs/snap
                                py-fix/lib/python3.7/s
                                ite-packages/ray/data/
                                read_api.py:read_datas
                                ource:226 | /opt/anaco
                                nda3/envs/snappy-fix/l
                                ib/python3.7/site-pack
                                ages/ray/data/read_api
                                .py:read_binary_files:
                                607

127.0.0.1        61347  Driver  (task call)  | /opt/an  889484831.0 B  LOCAL_REFERENCE     f4402ec78d3a2607ffffffffffffffffffffffff0100000001000000
                                aconda3/envs/snappy-fi
                                x/lib/python3.7/site-p
                                ackages/ray/data/read_
                                api.py:<lambda>:275 |
                                /opt/anaconda3/envs/sn
                                appy-fix/lib/python3.7
                                /site-packages/ray/dat
                                a/impl/lazy_block_list
                                .py:_get_or_compute:14
                                9 | /opt/anaconda3/env
                                s/snappy-fix/lib/pytho
                                n3.7/site-packages/ray
                                /data/impl/lazy_block_
                                list.py:__next__:133

127.0.0.1        61347  Driver  (task call)  | /opt/an  890501944.0 B  LOCAL_REFERENCE     8849b62d89cb30f9ffffffffffffffffffffffff0100000001000000
                                aconda3/envs/snappy-fi
                                x/lib/python3.7/site-p
                                ackages/ray/data/impl/
                                compute.py:<listcomp>:
                                85 | /opt/anaconda3/en
                                vs/snappy-fix/lib/pyth
                                on3.7/site-packages/ra
                                y/data/impl/compute.py
                                :apply:85 | /opt/anaco
                                nda3/envs/snappy-fix/l
                                ib/python3.7/site-pack
                                ages/ray/data/impl/pla
                                n.py:__call__:285

127.0.0.1        61347  Driver  (task call)  | /opt/an  893689159.0 B  LOCAL_REFERENCE     e0dc174c83599034ffffffffffffffffffffffff0100000001000000
                                aconda3/envs/snappy-fi
                                x/lib/python3.7/site-p
                                ackages/ray/data/read_
                                api.py:<lambda>:275 |
                                /opt/anaconda3/envs/sn
                                appy-fix/lib/python3.7
                                /site-packages/ray/dat
                                a/impl/lazy_block_list
                                .py:_get_or_compute:14
                                9 | /opt/anaconda3/env
                                s/snappy-fix/lib/pytho
                                n3.7/site-packages/ray
                                /data/impl/lazy_block_
                                list.py:__next__:133

127.0.0.1        61347  Driver  (task call)  | /opt/an  894711144.0 B  LOCAL_REFERENCE     82891771158d68c1ffffffffffffffffffffffff0100000001000000
                                aconda3/envs/snappy-fi
                                x/lib/python3.7/site-p
                                ackages/ray/data/impl/
                                compute.py:<listcomp>:
                                85 | /opt/anaconda3/en
                                vs/snappy-fix/lib/pyth
                                on3.7/site-packages/ra
                                y/data/impl/compute.py
                                :apply:85 | /opt/anaco
                                nda3/envs/snappy-fix/l
                                ib/python3.7/site-pack
                                ages/ray/data/impl/pla
                                n.py:__call__:285

127.0.0.1        61347  Driver  (task call)  | /opt/an  897863466.0 B  LOCAL_REFERENCE     32d950ec0ccf9d2affffffffffffffffffffffff0100000001000000
                                aconda3/envs/snappy-fi
                                x/lib/python3.7/site-p
                                ackages/ray/data/read_
                                api.py:<lambda>:275 |
                                /opt/anaconda3/envs/sn
                                appy-fix/lib/python3.7
                                /site-packages/ray/dat
                                a/impl/lazy_block_list
                                .py:__init__:41 | /opt
                                /anaconda3/envs/snappy
                                -fix/lib/python3.7/sit
                                e-packages/ray/data/re
                                ad_api.py:read_datasou
                                rce:279

127.0.0.1        61347  Driver  (task call)  | /opt/an  898890161.0 B  LOCAL_REFERENCE     f91b78d7db9a6593ffffffffffffffffffffffff0100000001000000
                                aconda3/envs/snappy-fi
                                x/lib/python3.7/site-p
                                ackages/ray/data/impl/
                                compute.py:<listcomp>:
                                85 | /opt/anaconda3/en
                                vs/snappy-fix/lib/pyth
                                on3.7/site-packages/ra
                                y/data/impl/compute.py
                                :apply:85 | /opt/anaco
                                nda3/envs/snappy-fix/l
                                ib/python3.7/site-pack
                                ages/ray/data/impl/pla
                                n.py:__call__:285

To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 2550 MiB, 3 objects, 124.55% full, 41.47% needed
Plasma filesystem mmap usage: 1701 MiB
Spilled 4267 MiB, 5 objects, avg write throughput 644 MiB/s
Objects consumed by Ray tasks: 2556 MiB.

The Ray version is 2.0.0 (nightly), but I observe the same behavior in Ray 1.9.0 as well.
The object store memory is set to 2 GiB (the maximum on Mac). Each input file is approximately 850 MiB, and there are 3 input files in the scenario above.
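As a quick sanity check on the numbers, the six LOCAL_REFERENCE sizes from the ray memory output can be summed and compared with the object store limit. This is only arithmetic on the values pasted above. Grouping the sizes by call site and pairing them in sorted order is my own assumption about which objects belong to the same file.

```python
# Sizes in bytes of the six LOCAL_REFERENCE objects, copied from the output above.
read_task_sizes = [889484831, 893689159, 897863466]  # call site: read_api.py:<lambda>
compute_sizes = [890501944, 894711144, 898890161]    # call site: compute.py:apply

MIB = 1024 ** 2
GIB = 1024 ** 3

total_bytes = sum(read_task_sizes) + sum(compute_sizes)
object_store_bytes = 2 * GIB  # the 2 GiB object store limit mentioned above

print(total_bytes)  # 5365140705, matching "Mem Used by Objects"
print(f"total: {total_bytes / MIB:.1f} MiB")
print(f"ratio to object store: {total_bytes / object_store_bytes:.2f}x")

# Assumed pairing: the i-th smallest object of each call site comes from the same file.
for read_size, compute_size in zip(sorted(read_task_sizes), sorted(compute_sizes)):
    print(f"{read_size / MIB:.1f} MiB vs {compute_size / MIB:.1f} MiB")
```

The total is well above the 2 GiB limit, which would be consistent with the spilling reported in the aggregate stats.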

Any idea why this is happening? Is it due to some sort of replication within Ray? If so, how can we disable it?

I believe this happened because we would initially read the first file of a dataset to inspect its schema, and subsequent reads would then read the full dataset.

I think this is fixed in master: when I tried again with ds = ray.data.read_parquet(); ds.fully_executed(), ray memory showed no duplicate objects. cc @Clark_Zinzow
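For anyone who wants to repeat this check on their own build, one option is to count the LOCAL_REFERENCE rows in the memory summary text. Below is a minimal sketch that assumes the row layout shown in the output earlier in this thread, with the size printed as "<number> B" right before the reference type. The helper local_reference_sizes is hypothetical and not part of Ray.

```python
import re
from typing import List

# Matches e.g. "889484831.0 B  LOCAL_REFERENCE" inside a memory summary row.
_ROW = re.compile(r"(\d+(?:\.\d+)?) B\s+LOCAL_REFERENCE")


def local_reference_sizes(summary: str) -> List[int]:
    """Return the size in bytes of every LOCAL_REFERENCE row in the summary text."""
    return [int(float(match.group(1))) for match in _ROW.finditer(summary)]


# Two object rows and one actor-handle row, taken from the output in the question.
sample = """\
127.0.0.1  61351  Worker  disabled  ?  ACTOR_HANDLE  ffffffffffffffff54c910c96952127076bc5c6d0100000001000000
127.0.0.1  61347  Driver  (task call)  | /opt/an  889484831.0 B  LOCAL_REFERENCE  f4402ec78d3a2607ffffffffffffffffffffffff0100000001000000
127.0.0.1  61347  Driver  (task call)  | /opt/an  890501944.0 B  LOCAL_REFERENCE  8849b62d89cb30f9ffffffffffffffffffffffff0100000001000000
"""

sizes = local_reference_sizes(sample)
print(len(sizes), sizes)
```

On a live cluster the input would be the string returned by memory_summary() after ds.fully_executed(). Noticeably more rows than expected blocks, with sizes that come in near-identical pairs, would suggest the duplication is still present.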