Dear community,
I am trying to load a Parquet file into a BigQuery table. The Parquet file contains a string ID and an array of floats. When it is loaded into BigQuery, I see a nested column that is not what I would expect:
[{
    "ID": "655782",
    "embedding": {
        "list": [{
            "element": "0.05823131650686264"
        }, {
            "element": "0.061382777988910675"
        }, {
            "element": "0.15196369588375092"
        }, {
            "element": "0.14905926585197449"
        }, {
            "element": "0.094718299806118011"
        }, {
I would instead expect a list of floats. This behaviour is connected to pyarrow's compliant nested LIST type (see the `use_compliant_nested_type` parameter of pyarrow.parquet.write_table — Apache Arrow v15.0.2).
How can I avoid this behaviour? With the BigQuery Python client I would use this flag:
parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True
job_config.parquet_options = parquet_options
and this code:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    schema=[
        bigquery.SchemaField("ID", "STRING"),
        bigquery.SchemaField(
            "embedding",
            "FLOAT",
            mode="REPEATED",
        ),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True
job_config.parquet_options = parquet_options

load_job = client.load_table_from_uri(
    [URI], table_id, job_config=job_config
)
load_job.result()

destination_table = client.get_table(table_id)
Thanks!