arrow-dev mailing list archives

From "Uwe L. Korn (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-571) [Python] Add APIs to build Parquet files incrementally from Arrow tables
Date Mon, 20 Feb 2017 17:59:44 GMT

    [ https://issues.apache.org/jira/browse/ARROW-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874884#comment-15874884
] 

Uwe L. Korn commented on ARROW-571:
-----------------------------------

Writing RowGroup-wise is already supported by {{parquet_arrow}}, it is just not exposed in Python.
One thing we are still missing in C++ is incrementally building up the schema.

For example, we have a use case where we already know 20 columns that shall be written into
the Parquet file. We can serialise these columns and free the memory associated with them.
But several other columns (of the same length, of course) will be generated later in a pipeline,
and the first part of the pipeline is unaware of how many there are and of which type. Currently
we build up a Pandas DataFrame until we have reached the end of the pipeline. Subsequent jobs
also only read a subset of the columns (but different combinations thereof). Directly writing out
these columns as they are computed would help us save a lot of RAM. Related issue for that: https://issues.apache.org/jira/browse/PARQUET-749

> [Python] Add APIs to build Parquet files incrementally from Arrow tables
> ------------------------------------------------------------------------
>
>                 Key: ARROW-571
>                 URL: https://issues.apache.org/jira/browse/ARROW-571
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
