arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wes McKinney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-39) C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
Date Fri, 08 Apr 2016 14:57:25 GMT

    [ https://issues.apache.org/jira/browse/ARROW-39?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232285#comment-15232285
] 

Wes McKinney commented on ARROW-39:
-----------------------------------

This JIRA isn't about IPC; this is about writing in-memory algorithms for manipulating tables
consisting of multiple chunks (multiple record batches). For example: performing a merge-sort
in-memory on a table consisting of many row batches. The only point here was that, computationally,
there's probably not much benefit in implementing algorithms that deal with tables with irregular
chunk sizes (where random access in general requires a binary search) or chunk sizes that
are not a power of two (since arithmetic with non-powers-of-2 requires a lot more CPU cycles).

The context for this JIRA originally was this data structure: https://github.com/apache/arrow/blob/master/cpp/src/arrow/column.h#L37

It's fine as a data container, but as soon as you start writing analytics (think: pandas,
dplyr, and things like that) these things become a problem. 

> C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
> ----------------------------------------------------------------------
>
>                 Key: ARROW-39
>                 URL: https://issues.apache.org/jira/browse/ARROW-39
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>
> Implementing algorithms on large arrays assembled in physical chunks is problematic if:
> - The chunks are not all the same size (except possibly the last chunk, which can be
less). Otherwise, retrieving a particular element is in general a O(log num_chunks) operation
> - The chunk size is not a power of 2. Computing integer modulus with a non-multiple of
2 requires more clock cycles (in other words, {{i % p}} is much more expensive to compute
than {{i & (p - 1)}}, but the latter only works if p is a power of 2)
> Most of the Arrow data adapters will either feature contiguous data (1 chunk, so chunking
is not an issue) or a regular chunk size, so this isn't as much of an immediate concern, but
we should consider making it a contract of any data structures dealing in multiple arrays.

> In general, it would be preferable to reorganize memory into either a regular chunksize
(like 64K values per chunk) or a contiguous memory region. I would prefer for the moment to
not to invest significant energy in writing algorithms for data with irregular chunk sizes.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message