DataSource blocks from Arrow RecordBatchReader?

I have a backend that is capable of returning an Arrow RecordBatchReader.

Can I create Dataset blocks using the read_next_batch method so that batches are processed in a distributed fashion across the Ray cluster?

I think the RecordBatchReader would need to be a singleton across the cluster, with each call to read_next_batch() distributed across the workers?
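
Roughly what I have in mind is something like the sketch below (make_reader and process_batch are just placeholders for opening the backend query and doing the per-batch work; the reader itself would still live in one process):

import ray

@ray.remote
class ReaderActor:
    """Holds the single RecordBatchReader; workers pull batches through it."""

    def __init__(self, make_reader):
        # make_reader: zero-arg callable that opens the backend query and returns
        # a pyarrow.RecordBatchReader. It is constructed inside the actor because
        # a live reader generally can't be pickled across processes.
        self._reader = make_reader()

    def next_batch(self):
        try:
            return self._reader.read_next_batch()
        except StopIteration:
            return None  # stream exhausted

@ray.remote
def process_batch(batch):
    # placeholder for the actual per-batch work
    return batch.num_rows

actor = ReaderActor.remote(make_reader)  # make_reader is hypothetical
pending = []
while (batch := ray.get(actor.next_batch.remote())) is not None:
    pending.append(process_batch.remote(batch))
print(sum(ray.get(pending)))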

Hi @chris_snow - what are you trying to do exactly?

Depending on the throughput you need, you could easily farm this out via ray.remote calls.
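
For example, a minimal sketch (the driver still does the reading; process_batch stands in for whatever per-batch work you need):

import ray

@ray.remote
def process_batch(batch):
    # hypothetical per-batch work
    return batch.num_rows

# A RecordBatchReader is iterable, so the driver can fan batches out to the cluster:
refs = [process_batch.remote(batch) for batch in reader]
print(sum(ray.get(refs)))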

I have an API to query my VAST Data lakehouse database:

with session.transaction() as tx:
    try:
        schema = tx.bucket(DATABASE_NAME).schema(name=DATABASE_SCHEMA, fail_if_missing=False)
        if schema:
            try:
                table = schema.table(name=TABLE_NAME)
                reader = table.select(predicate=predicate)  # pyarrow.RecordBatchReader
            except Exception:
                raise  # (error handling trimmed for brevity)
    except Exception:
        raise  # (error handling trimmed for brevity)

I would like the data processing to scale across multiple Ray nodes.

Currently, the VAST Python API doesn't support sharding the reads across multiple clients.

A bit more info on the Vast interaction with RecordBatchReader:

https://vast-data.github.io/data-platform-field-docs/vast_database/sdk_ref/15_recordbatchreader.html

Any thoughts on this @rliaw? Should I just bug my engineering team to update the client with a shard id?

You should implement your own custom datasource.

See: Loading Data — Ray 2.42.1

Here is an example datasource: ray/python/ray/data/_internal/datasource/mongo_datasource.py at master · ray-project/ray · GitHub

And then you can do something like:

read_datasource(VastDatasource(…))
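
A skeleton of what that datasource could look like is below. This is a sketch only: it assumes the Ray 2.42 Datasource / ReadTask / BlockMetadata interfaces used by the mongo example, and the make_session and shard_predicates arguments are made up — shard_predicates stands in for however you end up splitting the query (e.g. non-overlapping ranges of some column), since the client has to be able to partition the read somehow:

from ray.data.block import BlockMetadata
from ray.data.datasource import Datasource, ReadTask


class VastDatasource(Datasource):
    def __init__(self, make_session, bucket, schema_name, table_name, shard_predicates):
        # make_session: zero-arg callable that opens a vastdb session on a worker.
        self._make_session = make_session
        self._bucket = bucket
        self._schema = schema_name
        self._table = table_name
        # One predicate per shard (hypothetical -- this is the piece the
        # vastdb client doesn't help with yet).
        self._shard_predicates = shard_predicates

    def estimate_inmemory_data_size(self):
        return None  # size not known up front

    def get_read_tasks(self, parallelism):
        # The parallelism hint is ignored here; we create one task per predicate.
        tasks = []
        for predicate in self._shard_predicates:
            # Exact BlockMetadata fields can vary slightly across Ray versions.
            metadata = BlockMetadata(
                num_rows=None, size_bytes=None, schema=None,
                input_files=None, exec_stats=None,
            )

            def read_fn(predicate=predicate, ds=self):
                # Runs on a worker: open a fresh session and read only this shard.
                session = ds._make_session()
                with session.transaction() as tx:
                    table = (
                        tx.bucket(ds._bucket)
                        .schema(name=ds._schema)
                        .table(name=ds._table)
                    )
                    reader = table.select(predicate=predicate)
                    return [reader.read_all()]  # one pyarrow.Table per block

            tasks.append(ReadTask(read_fn, metadata))
        return tasks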

Thanks @rliaw. I had tried the datasource approach (here), but without vastdb client support for partitioning the query response, the only option seems to be to read the whole query response onto a single node.
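
One stopgap I might try is keeping the single-node read but putting each batch into the object store as it arrives, so at least the downstream processing is distributed. A sketch, using the reader from the snippet above:

import pyarrow as pa
import ray

# Single-node read, but each batch is pushed straight to the object store so
# the resulting Dataset is processed across the cluster.
refs = [ray.put(pa.Table.from_batches([batch])) for batch in reader]
ds = ray.data.from_arrow_refs(refs)
print(ds.count())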