Can you give us some data on what the current performance looks like vs.
what you would expect? Are we spending most of the time in the sort, or in
the window function operator?
On Thu, Jun 11, 2015 at 10:55 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Speed in many such loops depends a lot on how the loops are ordered so that
> cache and registers can be reused. I have no idea what will make your
> windowing functions fast, but I can say some things about what makes matrix
> math fast.
>
> The key with matrix multiplication is that there are n^3/2 operations to do
> on n^2 elements. The minimum number of memory operations is n^2, which
> sounds good because modern CPUs can often do hundreds of arithmetic
> operations per main memory access. This also means that if we code a naive
> implementation, we will generally be memory bound, because the naive loop
> structure increases the number of memory operations to k n^3.
>
> To avoid that, the loops involved can be restructured so that larger and
> larger blocks of data are used. At the lowest levels, small blocks of 2 x
> 4 values or so are used to code the multiplication since all of these
> values can be kept in registers. At one step up, the computation is
> structured to only operate on elements that fit in the fastest level of
> cache, which is typically tens of kB in size.
>
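A minimal Java sketch of the blocked scheme described above. The block size, the row-major `double[]` layout, and the class/method names are illustrative choices, not Drill code; the point is only that the inner loops touch tiles small enough to stay in cache and keep `a[i][k]` in a register.

```java
// Cache-blocked matrix multiplication sketch (illustrative, not Drill code).
public class BlockedMatMul {
    static final int BLOCK = 64; // tune to the L1/L2 cache size

    // c += a * b, where a, b, c are n x n matrices in row-major double[] form
    public static void multiply(double[] a, double[] b, double[] c, int n) {
        for (int ii = 0; ii < n; ii += BLOCK) {
            for (int kk = 0; kk < n; kk += BLOCK) {
                for (int jj = 0; jj < n; jj += BLOCK) {
                    int iMax = Math.min(ii + BLOCK, n);
                    int kMax = Math.min(kk + BLOCK, n);
                    int jMax = Math.min(jj + BLOCK, n);
                    // inner kernel: only touches one tile of each matrix
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i * n + k]; // reused from a register
                            for (int j = jj; j < jMax; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        }
    }
}
```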
> Your loop looks like this:
>
> for (start = 0 ... end-n) {
>   initialize()
>   for (offset = 0 ... n-1) {
>     aggregate(start + offset)
>   }
>   finalize()
> }
>
> This arrangement is pretty cache friendly if n is small enough, but it
> seems that it could be even more friendly if you kept all of the
> aggregators at the ready and handed each sample to all of the aggregators
> before moving to the next position.
>
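A hypothetical sketch of the inversion Ted suggests, using a sliding sum as the aggregate: make one sequential pass over the data, handing each sample to every window ("aggregator") that covers it, instead of re-reading the data once per window position. The names and the fixed-width-window framing are illustrative, not Drill's API.

```java
// Loop-inversion sketch (illustrative, not Drill code): one sequential
// pass over data, each sample pushed to all aggregators that cover it.
public class SlidingSum {
    // Sum over every window of length w; out[i] covers data[i .. i+w-1].
    public static double[] slidingSums(double[] data, int w) {
        double[] out = new double[data.length - w + 1];
        for (int pos = 0; pos < data.length; pos++) {
            // windows (aggregators) that contain position pos
            int firstWin = Math.max(0, pos - w + 1);
            int lastWin = Math.min(pos, out.length - 1);
            for (int win = firstWin; win <= lastWin; win++) {
                out[win] += data[pos]; // sample visits all open aggregators
            }
        }
        return out;
    }
}
```

The arithmetic count is unchanged, but `data` is now read front to back exactly once, which is the cache-friendly access pattern the paragraph above is arguing for.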
> On Thu, Jun 11, 2015 at 3:55 PM, Abdel Hakim Deneche <
> adeneche@maprtech.com>
> wrote:
>
> > Hi all,
> >
> > The purpose of this email is to describe how window functions are
> > computed and to try to come up with "better" ways to do it.
> >
> > DRILL-3200 <https://issues.apache.org/jira/browse/DRILL-3200> added
> > support for RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK and CUME_DIST,
> > but also made some significant improvements to the way Drill computes
> > window functions.
> >
> > The general idea was to update the code to only support the default
> > frame, which makes it run faster and use less memory.
> >
> > WindowFrameRecordBatch works similarly to StreamingAggregate: it requires
> > the data to be sorted on the partition and order-by columns, and it only
> > computes one frame at a time. With the default frame we only need to
> > aggregate every row once.
> > Memory consumption depends on the data, but in general each record batch
> > is kept in memory until we are ready to process all of its rows (which is
> > possible once we find the last peer row of the batch's last row). Drill's
> > external sort can spill to disk if the data is too big, and we only need
> > to keep at most one partition's worth of data in memory for the window
> > functions to be computed (when the over clause doesn't contain an order
> > by).
> >
> > Each time a batch is ready to be processed we do the following:
> >
> > 1. we start with its first row (the current row),
> > 2. we compute the length of the current row's frame (in this case we find
> > the number of peer rows for the current row),
> > 3. we aggregate all rows of the current frame (this includes computing
> > the window function values),
> > 4. we write the aggregated value in each row of the current frame,
> > 5. we then move to the first non-peer row, which becomes the current row,
> > 6. if we didn't reach the end of the current batch, go back to step 2.
> >
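The six steps above can be sketched as a single loop. This is a minimal stand-in, not Drill's WindowFrameRecordBatch: it uses a plain array of sort keys in place of record batches, and counts the rows in each frame in place of an arbitrary aggregate.

```java
// Frame-at-a-time processing sketch (illustrative, not Drill code).
public class DefaultFrame {
    // For each row, returns the size of its peer frame, standing in for
    // whatever aggregate the window function would compute over the frame.
    public static int[] frameCounts(int[] sortKeys) {
        int[] out = new int[sortKeys.length];
        int current = 0;
        while (current < sortKeys.length) {            // steps 1/5: current row
            int frameEnd = current;                    // step 2: frame length
            while (frameEnd < sortKeys.length
                   && sortKeys[frameEnd] == sortKeys[current]) {
                frameEnd++;
            }
            int aggregated = frameEnd - current;       // step 3: aggregate frame
            for (int i = current; i < frameEnd; i++) { // step 4: write each row
                out[i] = aggregated;
            }
            current = frameEnd;                        // steps 5/6: next non-peer
        }
        return out;
    }
}
```

Note that each row is read once for the aggregate and written once for the output, which is what makes the default frame cheaper than a general sliding frame.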
> > With all this in mind, how can we improve the performance of window
> > functions?
> >
> > Thanks!
> > 
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> > <http://www.mapr.com/>
> >
> >
> >
>

Steven Phillips
Software Engineer
mapr.com
