arrow-user mailing list archives

From Ted Gooch <tgo...@netflix.com>
Subject Re: [C++][python] Arrow Parquet metadata issues with round trip read/write table
Date Thu, 29 Apr 2021 16:33:34 GMT
Thanks Weston, unfortunately this would be consumed downstream by Spark and
Trino.  I do actually have the Iceberg schema saved into the table metadata,
but I'm pretty sure (I haven't browsed the code of either yet, but still...)
that neither engine will leverage that info.



On Thu, Apr 29, 2021, 8:50 AM Weston Pace <weston@ursacomputing.com> wrote:

> You could copy the parquet field ids when you originally read in the data
> and write them out to a custom metadata field.  This will get saved
> (unmodified) into the parquet file.  Then, after reading the parquet file,
> you could copy your custom metadata back into the field_id field (replacing
> the made up field IDs).
>
> This won't help if your workflow is (external tool -> arrow -> parquet
> file -> external tool) but it may help if your workflow is (external tool
> -> arrow -> parquet file -> arrow -> external tool)
>
> On Thu, Apr 29, 2021 at 1:56 AM Ted Gooch <tgooch@netflix.com> wrote:
>
>> Hi,
>>
>> I'm having an issue where I'm reading in some parquet data and writing
>> it back: when I write, the field_ids don't match the schema that I
>> provided to pyarrow.parquet.write_table. I browsed through the PR that
>> added support for field_id metadata, and it looks like this is known
>> behavior with a currently open issue:
>> https://issues.apache.org/jira/browse/PARQUET-1798
>>
>> Is there any way in the current API to get write_table to use the
>> metadata from the provided schema? Or is the DFS assignment of field_ids
>> the only behavior, pending the issue referenced above?
>>
>> *Basic Example here:*
>>
>> import pyarrow.parquet as pq
>> # arrow_tbl is a pyarrow.Table read earlier from an Iceberg Parquet file
>> print("------------ORIGINAL------------")
>> print(arrow_tbl.schema)
>> pq.write_table(arrow_tbl, 'example.parquet')
>> read_back = pq.ParquetFile('example.parquet')
>> print("------------READ BACK------------")
>> print(read_back.schema_arrow)
>>
>> *Output*
>> ------------ORIGINAL------------
>> tester_flags: list<element: string>
>>   child 0, element: string
>>     -- field metadata --
>>     PARQUET:field_id: '36'
>>   -- field metadata --
>>   PARQUET:field_id: '16'
>> signup_country_iso_code: string
>>   -- field metadata --
>>   PARQUET:field_id: '17'
>> -- schema metadata --
>> iceberg.schema:
>> '{"type":"struct","fields":[{"id":1,"name":"account_id","' + 5286
>> ------------READ BACK------------
>> tester_flags: list<element: string>
>>   child 0, element: string
>>     -- field metadata --
>>     PARQUET:field_id: '3'
>>   -- field metadata --
>>   PARQUET:field_id: '1'
>> signup_country_iso_code: string
>>   -- field metadata --
>>   PARQUET:field_id: '4'
>> -- schema metadata --
>> iceberg.schema:
>> '{"type":"struct","fields":[{"id":1,"name":"account_id","' + 5286
>>
>>
