arrow-dev mailing list archives

From "Jacques Nadeau (JIRA)" <>
Subject [jira] [Commented] (ARROW-39) C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
Date Fri, 08 Apr 2016 05:47:25 GMT


Jacques Nadeau commented on ARROW-39:

Can you expound here? I'm not sure what you mean by "chunk". If you're speaking about batches
of records, I don't think fixed record batch sizes should be a requirement.

> C++: Logical chunked arrays / columns: conforming to fixed chunk sizes
> ----------------------------------------------------------------------
>                 Key: ARROW-39
>                 URL:
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
> Implementing algorithms on large arrays assembled in physical chunks is problematic if:
> - The chunks are not all the same size (except possibly the last chunk, which can be
> smaller). If chunk sizes are irregular, retrieving a particular element is in general an
> O(log num_chunks) operation, since it requires a binary search over the chunk offsets
> (see the sketch after the quoted description).
> - The chunk size is not a power of 2. Computing an integer modulus with a non-power-of-2
> divisor requires more clock cycles (in other words, {{i % p}} is much more expensive to
> compute than {{i & (p - 1)}}, but the latter only works if p is a power of 2).
> Most of the Arrow data adapters will either feature contiguous data (1 chunk, so chunking
> is not an issue) or a regular chunk size, so this isn't as much of an immediate concern,
> but we should consider making it a contract of any data structures dealing in multiple
> arrays.

> In general, it would be preferable to reorganize memory into either a regular chunk size
> (like 64K values per chunk) or a contiguous memory region. I would prefer for the moment
> not to invest significant energy in writing algorithms for data with irregular chunk
> sizes.
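
The trade-off the description points to can be made concrete. Below is a minimal,
self-contained C++ sketch (hypothetical structs and member names, not code from the Arrow
codebase) contrasting element lookup for irregular chunk sizes, which needs a binary search
over chunk offsets, with lookup for a fixed power-of-2 chunk size, which reduces to a shift
and a mask:

#include <algorithm>
#include <cstdint>
#include <vector>

// Irregular chunk sizes: finding the chunk that holds logical index i
// requires a binary search over cumulative offsets, O(log num_chunks).
struct IrregularChunkedArray {
  std::vector<std::vector<int32_t>> chunks;
  std::vector<int64_t> offsets;  // offsets[k] = logical index of chunk k's first element

  int32_t Get(int64_t i) const {
    // First offset strictly greater than i; the chunk holding i is the one before it.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
    auto k = static_cast<size_t>(it - offsets.begin()) - 1;
    return chunks[k][static_cast<size_t>(i - offsets[k])];
  }
};

// Fixed power-of-2 chunk size p = 2^log2_chunk_size: O(1) lookup, since
// i / p becomes i >> log2(p) and i % p becomes i & (p - 1).
struct FixedChunkedArray {
  std::vector<std::vector<int32_t>> chunks;
  int log2_chunk_size;  // e.g. 16 for 64K values per chunk

  int32_t Get(int64_t i) const {
    int64_t k = i >> log2_chunk_size;                         // chunk index: i / p
    int64_t off = i & ((int64_t{1} << log2_chunk_size) - 1);  // within-chunk offset: i % p
    return chunks[static_cast<size_t>(k)][static_cast<size_t>(off)];
  }
};

With irregular chunks, every random access pays the O(log num_chunks) search; with a fixed
power-of-2 size, the division and modulus reduce to the {{i & (p - 1)}} form mentioned in
the description.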
