Hi,
I am getting the exception below when calling ray.data.read_json, due to an inconsistent number of fields across my JSON files. Is there a way to set a predefined schema covering all possible fields in advance when calling read_json?
Thanks,
Exception:
pyarrow.lib.ArrowInvalid: Unable to merge: Field content has incompatible types:
Something like:
import pyarrow as pa

schema = pa.schema([
    ("some_int", pa.int32()),
    ("some_string", pa.string()),
])
ds = ray.data.read_json("some_dir/", schema=schema)
Hi @James_Liu - thanks for the question. It seems we don't currently support defining a schema when reading JSON files.
After a little more searching, I found that PyArrow supports explicitly defining a schema when reading JSON - pyarrow.json.ParseOptions — Apache Arrow v9.0.0
https://arrow.apache.org/docs/python/generated/pyarrow.json.read_json.html
Could you try reading your JSON file with PyArrow's read_json() and its ParseOptions.explicit_schema?
If PyArrow works for you, we can add a parse_options argument to Ray Datasets' read_json() that passes the parse_options through to the underlying PyArrow reader. Thanks.
Hey @James_Liu, it should also be noted that you can pass any pyarrow.json.read_json() arguments as keyword arguments to ray.data.read_json(), and Datasets will propagate those arguments to the Arrow read_json() call:
import pyarrow as pa
import pyarrow.json as json

schema = pa.schema([
    ("some_int", pa.int32()),
    ("some_string", pa.string()),
])
ds = ray.data.read_json("some_dir/", parse_options=json.ParseOptions(explicit_schema=schema))
Ah, but there will probably be an issue here with Pickle serialization: pa.json.ParseOptions() wasn't made picklable until Arrow 8.0.0, and we have only added a workaround for pa.json.ReadOptions(), not pa.json.ParseOptions(), so I would expect use of pa.json.ParseOptions() to fail with a Pickle error.
We should definitely expand the workaround to cover pa.json.ParseOptions().
Hi @James_Liu - just FYI, the issue with Arrow JSON ParseOptions serialization is fixed in Ray master: https://github.com/ray-project/ray/pull/27911