arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?
Date Tue, 16 Mar 2021 17:54:13 GMT
Is there a Jira tracking this performance improvement? At minimum
getting to O(log k) indexing time where k is the number of chunks
would be a good goal

On Mon, Mar 15, 2021 at 8:05 PM Micah Kornfield <emkornfield@gmail.com> wrote:
>
> One more micro optimization would be to use interpolation search instead of binary search
(haven't checked if this is what the compute code does)
>
> On Monday, March 15, 2021, Weston Pace <weston.pace@gmail.com> wrote:
>>
>> Since most of the time is probably spent loading each length from RAM
>> (the lengths aren't contiguous and given the size of the chunks they
>> are probably pretty far apart) you can even get significant speedup
>> just by using a contiguous array of lengths in your index.  Note, this
>> can be combined with Antoine's suggestion.  I did a quick
>> micro-benchmark and by the time I got to 32 chunks it was already 5x
>> slower.  With a contiguous array of lengths it was still within 20% of
>> the original time.
>>
>> On Mon, Mar 15, 2021 at 8:14 AM Antoine Pitrou <antoine@python.org> wrote:
>> >
>> > On Mon, 15 Mar 2021 11:02:09 -0700
>> > Micah Kornfield <emkornfield@gmail.com> wrote:
>> > > >
>> > > > Do you know if the iteration is done on the python side, or on the
C++
>> > > > side ?
>> > >
>> > > It appears to be in cython [1] which looking at the definition, I would
>> > > expect to compile down to pretty straightforward C code.
>> > >
>> > > May I create a post on JIRA about adding an indexing structure for
>> > > > ChunkedArray ?
>> > >
>> > > Yes, please do, I'm sure others will have thoughts on the best way to
>> > > incorporate this (it might also be a good first contribution to the project
>> > > if you are interested).  I think also providing some context from this
>> > > thread on the relative slowness would be good (the bottleneck still might
>> > > be something else, that others more familiar with the code could point
to).
>> >
>> > Just for the record, there is already something like this in the
>> > compute layer.  The general idea can probably be reused.
>> >
>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >

Mime
View raw message