arrow-dev mailing list archives

From "Wes McKinney (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (ARROW-432) [Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs
Date Sat, 24 Dec 2016 15:04:58 GMT

     [ https://issues.apache.org/jira/browse/ARROW-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-432.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 251
[https://github.com/apache/arrow/pull/251]

> [Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-432
>                 URL: https://issues.apache.org/jira/browse/ARROW-432
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> I'll take this one on. 
> Although we already construct the individual NumPy arrays for pandas efficiently, pandas.DataFrame still performs an extra memory copy and consolidation step internally at the end, even in the zero-copy case.
> This is particular to the pandas 0.x/1.x memory layout and will change in the future with pandas 2.0, but that's quite a ways off from wide use.
> We can avoid this overhead for now by
> * computing the exact internal "block" structure of the DataFrame. Since we know the null counts of the Arrow data, we can determine up front whether type casts to accommodate nulls are necessary
> * pre-allocating empty column-major blocks
> * writing out into the block slices
> * constructing the DataFrame from the blocks with zero copy
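
The first step in the list above (planning the block structure from the null counts) can be sketched roughly as follows. This is only an illustration, not the code from pull request 251: ColumnInfo, pandas_dtype_for and plan_blocks are hypothetical names, the type names are simplified strings, and the casting rules (integers with nulls become float64, booleans with nulls become object) are an assumption about how nulls are typically accommodated in NumPy-backed columns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnInfo:
    """Hypothetical summary of one Arrow column (not a pyarrow class)."""
    name: str
    arrow_type: str   # simplified type name: "int64", "double", "bool", "string"
    null_count: int


def pandas_dtype_for(col):
    """Pick the NumPy dtype a column needs once its nulls are accounted for.

    Assumed rules: integers with nulls are cast to float64 (NaN marks nulls),
    booleans with nulls are cast to object, strings are always object.
    """
    has_nulls = col.null_count > 0
    if col.arrow_type in ("int8", "int16", "int32", "int64"):
        return "float64" if has_nulls else col.arrow_type
    if col.arrow_type == "double":
        return "float64"
    if col.arrow_type == "bool":
        return "object" if has_nulls else "bool"
    return "object"


def plan_blocks(columns):
    """Group column positions by target dtype, one 2-D block per dtype."""
    placements = defaultdict(list)
    for position, col in enumerate(columns):
        placements[pandas_dtype_for(col)].append(position)
    return dict(placements)


schema = [
    ColumnInfo("a", "int64", 0),   # no nulls: stays int64
    ColumnInfo("b", "int64", 3),   # nulls: cast to float64
    ColumnInfo("c", "double", 0),
    ColumnInfo("d", "bool", 1),    # nulls: cast to object
    ColumnInfo("e", "string", 0),
]
plan = plan_blocks(schema)
print(plan)
```

With the plan in hand, each dtype would get one pre-allocated block of shape (number of columns in the block, number of rows), and every column would be written into its own row slice of that block.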



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
