arrow-user mailing list archives

From Ted Gooch <tgo...@netflix.com>
Subject [C++][python] Arrow Parquet metadata issues with round trip read/write table
Date Thu, 29 Apr 2021 11:56:33 GMT
Hi,

I'm running into an issue where I read in some Parquet data and write it
back, and the field_ids in the written file don't match the schema I
provided to pyarrow.parquet.write_table. I browsed through the PR that
added support for field_id metadata, and it looks like this is known
behavior, tracked in this currently open issue:
https://issues.apache.org/jira/browse/PARQUET-1798

Is there any way in the current API to get write_table to use the
field_ids from the provided schema? Or is the depth-first (DFS)
reassignment of field_ids the only behavior pending the issue referenced
above?

*Basic Example here:*

import pyarrow.parquet as pq

# arrow_tbl is an existing pyarrow.Table whose schema carries
# PARQUET:field_id metadata (read earlier from Parquet data)
print("------------ORIGINAL------------")
print(arrow_tbl.schema)
pq.write_table(arrow_tbl, 'example.parquet')
read_back = pq.ParquetFile('example.parquet')
print("------------READ BACK------------")
print(read_back.schema_arrow)

*Output*
------------ORIGINAL------------
tester_flags: list<element: string>
  child 0, element: string
    -- field metadata --
    PARQUET:field_id: '36'
  -- field metadata --
  PARQUET:field_id: '16'
signup_country_iso_code: string
  -- field metadata --
  PARQUET:field_id: '17'
-- schema metadata --
iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
+ 5286
------------READ BACK------------
tester_flags: list<element: string>
  child 0, element: string
    -- field metadata --
    PARQUET:field_id: '3'
  -- field metadata --
  PARQUET:field_id: '1'
signup_country_iso_code: string
  -- field metadata --
  PARQUET:field_id: '4'
-- schema metadata --
iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
+ 5286
