Dear community,
I am trying to load a Parquet file into a BigQuery table. The Parquet file contains a string ID and an array of floats. When it is loaded into BigQuery, I see a nested column that is not what I would expect:
[{
    "ID": "655782",
    "embedding": {
        "list": [{
            "element": "0.05823131650686264"
        }, {
            "element": "0.061382777988910675"
        }, {
            "element": "0.15196369588375092"
        }, {
            "element": "0.14905926585197449"
        }, {
            "element": "0.094718299806118011"
        }, {
I would instead expect a list of floats. This behaviour is connected to pyarrow's compliant nested LIST type (see the `use_compliant_nested_type` parameter of pyarrow.parquet.write_table — Apache Arrow v15.0.2).
How can I avoid this behaviour? With the BigQuery Python client I would use this flag:
parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True
job_config.parquet_options = parquet_options
and this code:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    schema=[
        bigquery.SchemaField("ID", "STRING"),
        bigquery.SchemaField(
            "embedding",
            "FLOAT",
            mode="REPEATED",
        ),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True
job_config.parquet_options = parquet_options

load_job = client.load_table_from_uri(
    [URI], table_id, job_config=job_config
)
load_job.result()

destination_table = client.get_table(table_id)
Thanks!