arrow-user mailing list archives

From Niranda Perera <niranda.per...@gmail.com>
Subject Re: compute::Take & ChunkedArrays
Date Sun, 17 Jan 2021 14:59:32 GMT
Hi Wes,

Thanks. Off the top of my head, that is similar to the algorithm I had in mind
as well.
Is this the JIRA you were referring to? [1]
I see that some improvements have already been made here [2].

I guess bug reports like this [3] are also related to the same scenario.

Is there anyone working on this?

Best

[1] https://issues.apache.org/jira/browse/ARROW-5454
[2] https://github.com/apache/arrow/pull/8823
[3] https://issues.apache.org/jira/browse/ARROW-10799

On Fri, Jan 15, 2021 at 10:38 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> You can do that, but note that the implementation is currently not
> efficient, see
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L1909
>
> Rather than pre-concatenating the chunks (which can easily fail) and
> then invoking Take on the resulting concatenated Array, it would be
> better to do an O(N log K) take on the chunks directly, where N is the
> number of take indices and K is the number of chunks.
>
> For example, if you have chunks of size
>
> 10
> 50
> 100
> 20
>
> then the algorithm computes the following offset table:
>
> 0
> 10
> 60
> 160
> 180
>
> Indices relative to the whole ChunkedArray are translated to (chunk
> number, intrachunk index), for example:
>
> take with [5, 40, 100, 170] is translated by doing binary searches in
> the offset table to:
>
> (chunk=0, relative_index=5)
> (1, 30)
> (2, 40)
> (3, 10)
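
The translation above can be sketched in a few lines of Python. This is only an illustration of the offset table and the binary search, not the Arrow implementation; `build_offsets` and `resolve` are hypothetical helper names, and chunks are represented only by their lengths.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(chunk_lengths):
    """Return the start offset of every chunk, plus the total length at the end."""
    return [0] + list(accumulate(chunk_lengths))


def resolve(offsets, index):
    """Translate a logical index into (chunk number, index within that chunk)."""
    if not 0 <= index < offsets[-1]:
        raise IndexError(f"index {index} out of bounds for length {offsets[-1]}")
    # bisect_right finds the first offset greater than index; the chunk that
    # contains index is the one starting just before that offset.
    chunk = bisect_right(offsets, index) - 1
    return chunk, index - offsets[chunk]


offsets = build_offsets([10, 50, 100, 20])
pairs = [resolve(offsets, i) for i in [5, 40, 100, 170]]
print(offsets)
print(pairs)
```

Each lookup is one binary search over K + 1 offsets, which is where the O(N log K) bound for N indices comes from. Using `bisect_right` rather than `bisect_left` means an index equal to a chunk boundary (for example 10) lands at position 0 of the following chunk, and zero-length chunks are skipped.
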
>
> Consecutive indices from the same chunk are batched together and then
> Take is invoked on the respective chunk (with bounds checking disabled)
> to select a chunk for the resulting output ChunkedArray.
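
A rough sketch of the batching step, again in plain Python and under stated assumptions: ordinary lists stand in for Arrow arrays, a list comprehension stands in for the per-chunk Take kernel, and `take_chunked` is a hypothetical name. Bounds are validated once while resolving indices, so the per-chunk selection does not need to check them again.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def take_chunked(chunks, indices):
    """Select `indices` from a list of chunks and return a list of output chunks."""
    offsets = [0] + list(accumulate(len(c) for c in chunks))
    total = offsets[-1]

    # Resolve every logical index to (chunk number, index within chunk).
    resolved = []
    for i in indices:
        if not 0 <= i < total:
            raise IndexError(f"index {i} out of bounds for length {total}")
        k = bisect_right(offsets, i) - 1
        resolved.append((k, i - offsets[k]))

    # Batch consecutive indices that fall in the same chunk; each batch
    # becomes one chunk of the output.
    out = []
    for k, run in groupby(resolved, key=lambda pair: pair[0]):
        chunk = chunks[k]
        out.append([chunk[j] for _, j in run])
    return out


# Four chunks of sizes 10, 50, 100 and 20 whose values equal their logical index.
chunks = [list(range(0, 10)), list(range(10, 60)),
          list(range(60, 160)), list(range(160, 180))]
result = take_chunked(chunks, [5, 7, 40, 100, 170, 3])
print(result)
```

Because `groupby` only merges adjacent entries, the output preserves the order of the take indices: an index that returns to an earlier chunk (3 in the example) starts a new output chunk rather than being merged into the first one.
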
>
> Might be helpful to copy this to the appropriate Jira (I'm sure there
> is one already) to assist the person who implements this.
>
> Thanks,
> Wes
>
> On Mon, Jan 11, 2021 at 10:01 AM Niranda Perera
> <niranda.perera@gmail.com> wrote:
> >
> > Hi all,
> >
> > I was wondering how the Take API works with ChunkedArrays.
> > For example, if we have a ChunkedArray[100] made of Array1[50] and
> > Array2[50] and I want an element from each array, can I pass something
> > like [10, 60] as the indices?
> >
> > --
> > Niranda Perera
> > @n1r44
> > +1 812 558 8884 / +94 71 554 8430
> > https://www.linkedin.com/in/niranda
>


-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+1 812 558 8884 / +94 71 554 8430
https://www.linkedin.com/in/niranda
