arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: compute::Take & ChunkedArrays
Date Sun, 17 Jan 2021 16:29:02 GMT
On Sun, Jan 17, 2021 at 8:59 AM Niranda Perera <niranda.perera@gmail.com> wrote:
>
> Hi Wes,
>
> Thanks. Off the top of my head, that was a similar algorithm I had in mind as well.
> Is this the JIRA you were referring to? [1]
> I see that there are some improvements that have been done here [2].
>
> I guess bug reports like this [3] are also related to the same scenario.
>
> Is there anyone working on this?

If open Jira issues are not assigned to anyone you can assume that no
one is working on them.

>
> Best
>
> [1] https://issues.apache.org/jira/browse/ARROW-5454
> [2] https://github.com/apache/arrow/pull/8823
> [3] https://issues.apache.org/jira/browse/ARROW-10799
>
> On Fri, Jan 15, 2021 at 10:38 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> You can do that, but note that the implementation is currently not
>> efficient, see
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L1909
>>
>> Rather than pre-concatenating the chunks (which can easily fail) and
>> then invoking Take on the resulting concatenated Array, it would be
>> better to do an O(N log K) take on the chunks directly, where N is the
>> number of take indices and K is the number of chunks.
>>
>> For example, if you have chunks of size
>>
>> 10
>> 50
>> 100
>> 20
>>
>> then the algorithm computes the following offset table:
>>
>> 0
>> 10
>> 60
>> 160
>> 180
>>
>> Indices relative to the whole ChunkedArray are translated to (chunk
>> number, intrachunk index), for example:
>>
>> take with [5, 40, 100, 170] is translated by doing binary searches in
>> the offset table to:
>>
>> (chunk=0, relative_index=5)
>> (1, 30)
>> (2, 40)
>> (3, 10)
>>
>> Consecutive indices from the same chunk are batched together and then
>> Take is invoked on the respective chunk (with bounds checking disabled)
>> to select a chunk for the resulting output ChunkedArray.
>>
>> Might be helpful to copy this to the appropriate Jira (I'm sure there
>> is one already) to assist the person who implements this.
>>
>> Thanks,
>> Wes
>>
>> On Mon, Jan 11, 2021 at 10:01 AM Niranda Perera
>> <niranda.perera@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I was wondering how the Take API works with ChunkedArrays?
>> > e.g., if we have a ChunkedArray[100] with Array1[50] and Array2[50],
>> > and I want an element from each array, can I pass something like [10, 60]
>> > as the indices?
>> >
>> > --
>> > Niranda Perera
>> > @n1r44
>> > +1 812 558 8884 / +94 71 554 8430
>> > https://www.linkedin.com/in/niranda
>
>
>
> --
> Niranda Perera
> @n1r44
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
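
Illustrative sketch (not part of the original messages): the offset-table
translation and batching described in the quoted message of Jan 15 can be
tried out in plain Python. This is only a sketch under stated assumptions,
not the Arrow C++ kernel: chunks are modelled as Python lists, and the names
build_offsets, resolve, and take_chunked are hypothetical helpers made up
for this example.

```python
from bisect import bisect_right
from itertools import accumulate, groupby


def build_offsets(chunk_lengths):
    """Cumulative offsets, e.g. [10, 50, 100, 20] -> [0, 10, 60, 160, 180]."""
    return [0] + list(accumulate(chunk_lengths))


def resolve(offsets, index):
    """Translate a logical index into (chunk number, intrachunk index)."""
    if not 0 <= index < offsets[-1]:
        raise IndexError(f"index {index} out of bounds for length {offsets[-1]}")
    # Binary search: O(log K) per index. bisect_right also steps past
    # empty chunks, which show up as repeated values in the offset table.
    chunk = bisect_right(offsets, index) - 1
    return chunk, index - offsets[chunk]


def take_chunked(chunks, indices):
    """Return one output chunk per run of consecutive same-chunk indices."""
    offsets = build_offsets([len(c) for c in chunks])
    # Bounds are checked once here, so the per-chunk selection below
    # does not need to check them again.
    resolved = [resolve(offsets, i) for i in indices]
    out = []
    for chunk_no, run in groupby(resolved, key=lambda pair: pair[0]):
        source = chunks[chunk_no]
        out.append([source[rel] for _, rel in run])
    return out


# Chunk sizes from the example in the thread.
offsets = build_offsets([10, 50, 100, 20])
translated = [resolve(offsets, i) for i in [5, 40, 100, 170]]
# translated == [(0, 5), (1, 30), (2, 40), (3, 10)]
```

With N take indices and K chunks, the translation step costs O(N log K),
which matches the bound mentioned in the thread. Emitting one output chunk
per run keeps the result chunked instead of concatenating the inputs first.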
