arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?
Date Tue, 16 Mar 2021 02:05:10 GMT
One more micro optimization would be to use interpolation search instead of
binary search (haven't checked if this is what the compute code does)

On Monday, March 15, 2021, Weston Pace <weston.pace@gmail.com> wrote:

> Since most of the time is probably spent loading each length from RAM
> (the lengths aren't contiguous and given the size of the chunks they
> are probably pretty far apart) you can even get significant speedup
> just by using a contiguous array of lengths in your index.  Note, this
> can be combined with Antoine's suggestion.  I did a quick
> micro-benchmark and by the time I got to 32 chunks it was already 5x
> slower.  With a contiguous array of lengths it was still within 20% of
> the original time.
>
> On Mon, Mar 15, 2021 at 8:14 AM Antoine Pitrou <antoine@python.org> wrote:
> >
> > On Mon, 15 Mar 2021 11:02:09 -0700
> > Micah Kornfield <emkornfield@gmail.com> wrote:
> > > >
> > > > Do you know if the iteration is done on the python side, or on the
> C++
> > > > side ?
> > >
> > > It appears to be in cython [1] which looking at the definition, I would
> > > expect to compile down to pretty straightforward C code.
> > >
> > > May I create a post on JIRA about adding an indexing structure for
> > > > ChunkedArray ?
> > >
> > > Yes, please do, I'm sure others will have thoughts on the best way to
> > > incorporate this (it might also be a good first contribution to the
> project
> > > if you are interested).  I think also providing some context from this
> > > thread on the relative slowness would be good (the bottleneck still
> might
> > > be something else, that others more familiar with the code could point
> to).
> >
> > Just for the record, there is already something like this in the
> > compute layer.  The general idea can probably be reused.
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/
> arrow/compute/kernels/vector_sort.cc#L94
> >
> > Regards
> >
> > Antoine.
> >
> >
>

Mime
View raw message