arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: compute::Take & ChunkedArrays
Date Fri, 15 Jan 2021 15:37:53 GMT
You can do that, but note that the implementation is currently not
efficient, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L1909

Rather than pre-concatenating the chunks (which can easily fail) and
then invoking Take on the resulting concatenated Array, it would be
better to do an O(N log K) take on the chunks directly, where N is the
number of take indices and K is the number of chunks.

For example, if you have chunks of size

10
50
100
20

then the algorithm computes the following offset table:

0
10
60
160
180
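
As a rough illustration only (plain Python, not the Arrow C++ code), the
offset table is a running sum of the chunk lengths with a leading 0. The
name chunk_offsets is made up for this sketch:

```python
from itertools import accumulate


def chunk_offsets(chunk_lengths):
    """Return the start offset of every chunk, plus the total length last.

    Hypothetical helper for illustration; not part of the Arrow API.
    """
    return [0, *accumulate(chunk_lengths)]


print(chunk_offsets([10, 50, 100, 20]))  # [0, 10, 60, 160, 180]
```

Keeping the total length as the final entry means the table has K + 1
entries for K chunks, and the last entry doubles as the bounds limit.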

Indices relative to the whole ChunkedArray are translated to (chunk
number, intrachunk index), for example:

take with [5, 40, 100, 170] is translated by doing binary searches in
the offset table to:

(chunk=0, relative_index=5)
(1, 30)
(2, 40)
(3, 10)
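
A minimal sketch of that translation step, assuming the offset table
above and using Python's bisect module for the binary search. The
function name resolve_index is hypothetical, and this is not how the
Arrow C++ kernel is written:

```python
from bisect import bisect_right


def resolve_index(offsets, index):
    """Map a logical index to (chunk number, index within that chunk).

    Hypothetical helper for illustration; not part of the Arrow API.
    """
    if not 0 <= index < offsets[-1]:
        raise IndexError(f"index {index} out of bounds for length {offsets[-1]}")
    # bisect_right returns the position of the first offset strictly greater
    # than index, so the chunk that contains index is the one just before it.
    chunk = bisect_right(offsets, index) - 1
    return chunk, index - offsets[chunk]


offsets = [0, 10, 60, 160, 180]
print([resolve_index(offsets, i) for i in [5, 40, 100, 170]])
# [(0, 5), (1, 30), (2, 40), (3, 10)]
```

Each lookup costs O(log K), which gives the O(N log K) total for N
indices. Using bisect_right rather than bisect_left sends an index that
sits exactly on a boundary (for example 10) to the start of the next
chunk, and also skips over empty chunks.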

Consecutive indices that fall in the same chunk are batched together,
and Take is then invoked on that chunk (with bounds checking disabled)
to produce one chunk of the resulting output ChunkedArray.
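
The batching step might look roughly like the sketch below. It uses
plain Python lists as stand-ins for Arrow chunks and a list
comprehension as a stand-in for the per-chunk Take. The name
take_batched is invented for illustration, and the input is assumed to
be already-translated (chunk, relative_index) pairs:

```python
from itertools import groupby


def take_batched(chunks, resolved):
    """Take from a list of chunks given (chunk, relative_index) pairs.

    Each run of consecutive pairs pointing into the same chunk becomes
    one output chunk. Hypothetical helper; not part of the Arrow API.
    """
    out = []
    for chunk_id, run in groupby(resolved, key=lambda pair: pair[0]):
        source = chunks[chunk_id]
        # Stand-in for a per-chunk Take. The indices are assumed to have been
        # validated during translation, so no extra bounds check is done here.
        out.append([source[rel] for _, rel in run])
    return out


# Four chunks of sizes 10, 50, 100, 20 whose values equal their logical index.
chunks = [list(range(0, 10)), list(range(10, 60)),
          list(range(60, 160)), list(range(160, 180))]
resolved = [(0, 5), (0, 7), (1, 30), (3, 10), (1, 0)]
print(take_batched(chunks, resolved))  # [[5, 7], [40], [170], [10]]
```

Note that only adjacent indices are merged: the two visits to chunk 1
above are not consecutive, so they produce two separate output chunks,
which preserves the order of the take indices.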

Might be helpful to copy this to the appropriate Jira (I'm sure there
is one already) to assist the person who implements this.

Thanks,
Wes

On Mon, Jan 11, 2021 at 10:01 AM Niranda Perera
<niranda.perera@gmail.com> wrote:
>
> Hi all,
>
> I was wondering how the Take API works with ChunkedArrays?
> ex: If we have a ChunkedArray[100] with Array1[50] and Array2[50]
> so, if I want an element from each array, can I pass something like [10, 60] as the indices?
>
> --
> Niranda Perera
> @n1r44
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
