Can I define a schema when calling ray.data.read_json?

Hi,
I am getting the exception below when calling ray.data.read_json, because the JSON files have an inconsistent number of fields. Is there a way to set a predefined schema covering all possible fields in advance when calling read_json?
Thanks,

Exception:
pyarrow.lib.ArrowInvalid: Unable to merge: Field content has incompatible types:

Something like:
import pyarrow as pa
import ray

schema = pa.schema([
    ('some_int', pa.int32()),
    ('some_string', pa.string())
])
# Desired usage: a schema argument on read_json (this is what I am asking for).
ds = ray.data.read_json('some_dir/', schema=schema)

Hi @James_Liu - thanks for the question. It seems we don’t support defining a schema when reading JSON files.

After a little more searching, I found that PyArrow supports explicitly defining a schema when reading JSON: see pyarrow.json.ParseOptions (Apache Arrow v9.0.0 documentation).

https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html

Could you try reading your JSON file with PyArrow’s read_json() and its ParseOptions.explicit_schema?

If PyArrow works for you, we can add a parse_options argument to Ray Datasets’ read_json() and pass it through to the underlying PyArrow reader. Thanks.


Hey @James_Liu, it should also be noted that you can pass any pyarrow.json.read_json() arguments as keyword arguments to ray.data.read_json(), and Datasets will propagate those arguments to the Arrow read_json() call:

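import pyarrow as pa
import pyarrow.json as pajson  # aliased to avoid shadowing the stdlib json module
import ray

schema = pa.schema([
    ('some_int', pa.int32()),
    ('some_string', pa.string())
])

# parse_options is forwarded to the underlying pyarrow.json.read_json() call.
ds = ray.data.read_json("some_dir/", parse_options=pajson.ParseOptions(explicit_schema=schema))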

Ah, but there will probably be an issue here with pickle serialization: pa.json.ParseOptions() wasn’t made picklable until Arrow 8.0.0, and we have only added a workaround for pa.json.ReadOptions(), not pa.json.ParseOptions(). So with older Arrow versions I would expect passing pa.json.ParseOptions() to fail with a pickle error.

We should definitely expand the workaround to include pa.json.ParseOptions().

Hi @James_Liu - just FYI, the issue with Arrow JSON ParseOptions is fixed in Ray master: https://github.com/ray-project/ray/pull/27911
