Hi,
I am getting the exception below when calling ray.data.read_json, due to an inconsistent number of fields across my JSON files. Is there a way to set a predefined schema covering all possible fields in advance when calling read_json?
Thanks,
Exception:
pyarrow.lib.ArrowInvalid: Unable to merge: Field content has incompatible types:
Something like:
import pyarrow as pa

schema = pa.schema([
    ("some_int", pa.int32()),
    ("some_string", pa.string()),
])
ds = ray.data.read_json("some_dir/", schema=schema)
Hi @James_Liu - thanks for the question. It seems we don't currently support defining a schema when reading JSON files.
After a little more searching, I found that PyArrow supports explicitly defining a schema when reading JSON - pyarrow.json.ParseOptions — Apache Arrow v9.0.0
https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html
Could you try reading your JSON file with PyArrow's read_json() and its ParseOptions.explicit_schema?
If PyArrow works for you, we can add a parse_options argument to Ray Datasets' read_json() that passes the parse_options through to the underlying PyArrow reader. Thanks.
Hey @James_Liu, it should also be noted that you can pass any pyarrow.json.read_json() arguments as keyword arguments to ray.data.read_json(), and Datasets will propagate those arguments to the Arrow read_json() call:
import pyarrow as pa
import pyarrow.json as json

schema = pa.schema([
    ("some_int", pa.int32()),
    ("some_string", pa.string()),
])
ds = ray.data.read_json("some_dir/", parse_options=json.ParseOptions(explicit_schema=schema))
Ah, but there will probably be an issue here with Pickle serialization: pa.json.ParseOptions() wasn't made picklable until Arrow 8.0.0, and we have only added a workaround for pa.json.ReadOptions(), not pa.json.ParseOptions(), so I would expect use of pa.json.ParseOptions() to fail with a Pickle error.
We should definitely expand the workaround to cover pa.json.ParseOptions().
Hi @James_Liu - just FYI, the issue with Arrow JSON ParseOptions serialization is fixed in Ray master: https://github.com/ray-project/ray/pull/27911