mahout-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: OnlineSummarizer mods
Date Mon, 23 May 2011 01:13:02 GMT
Thanks!

On Sun, May 22, 2011 at 12:45 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The simplest implementation for this would be to keep a tree of values.
>  While you have fewer than N (about 1000) values in the tree, you add each
> incoming value to the tree.  When you have more than N values after adding a
> new sample, pick values at random to delete until you are down to N values
> in the tree.  Any time you have a value that is in the top or bottom 2% you
> can log it.  This will get very nearly the result you want at the cost of a
> bit of work for the insertion and deletion.
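
A minimal Java sketch of the bounded random-eviction sample described above,
using a plain list instead of a tree for brevity; the class name, the size of
1000, and the 2%/98% thresholds come from this thread, not from any Mahout API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class BoundedSample {
      private static final int N = 1000;                // target sample size
      private final List<Double> values = new ArrayList<Double>();
      private final Random rand = new Random();

      /** Add a value; returns true if it lands in the top or bottom 2% of the sample. */
      public boolean add(double x) {
        values.add(x);
        // evict random elements until we are back down to N
        while (values.size() > N) {
          values.remove(rand.nextInt(values.size()));
        }
        // rank by linear scan; cheap enough for a sample of about 1000
        int below = 0;
        for (double v : values) {
          if (v < x) {
            below++;
          }
        }
        double rank = (double) below / values.size();
        return rank <= 0.02 || rank >= 0.98;
      }
    }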
>
> The data structure to use is a little tricky.  You want to be able to
> efficiently insert an element, of course, but you also want to find the n-th
> largest element quickly and to determine what the rank of a newly inserted
> element is reasonably quickly.  For most applications with time scales on
> the order of milliseconds, simply keeping a sorted array will be a fine
> solution.  This obviously costs a fair bit to shift things when you do an
> insertion.  One fun trick to improve this by a substantial factor is to size
> the array larger than necessary and keep "tombstone" markers in the array
> wherever you have deleted an element.  Then when shifting to make room for
> a new element, you will have to shift data only until you find a tombstone
> or the end of the array and you won't have to shift at all on deletion.
>  Finding the current rank or finding the n-th largest element will require a
> scan, but because of caching and fast processors, this will probably be
> nearly as fast as any fancy data structure you can build especially since
> you will only be scanning from one end or the other.
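
A sketch of that tombstone trick in Java, assuming Double.NaN marks an empty or
deleted slot; growth when the array fills up is left out, and none of this is
Mahout code:

    import java.util.Arrays;

    // Sorted array with tombstones: a delete just leaves a hole, and an insert
    // only shifts values until the next hole instead of shifting the whole tail.
    public class TombstoneArray {
      private final double[] slots;

      public TombstoneArray(int capacity) {
        slots = new double[capacity];
        Arrays.fill(slots, Double.NaN);       // NaN == tombstone / empty slot
      }

      /** Insert x in sorted order among the live values. */
      public void insert(double x) {
        // insertion point: first live value >= x, or just past the last live value
        int insertAt = -1;
        int lastLive = -1;
        for (int i = 0; i < slots.length; i++) {
          if (!Double.isNaN(slots[i])) {
            lastLive = i;
            if (insertAt < 0 && slots[i] >= x) {
              insertAt = i;
            }
          }
        }
        if (insertAt < 0) {
          insertAt = lastLive + 1;
        }
        if (insertAt >= slots.length) {
          return;                             // no room at the tail; a real version would grow or compact
        }
        // shift right only until the first tombstone (or the end of the array)
        double carry = x;
        int i = insertAt;
        while (i < slots.length && !Double.isNaN(slots[i])) {
          double displaced = slots[i];
          slots[i] = carry;
          carry = displaced;
          i++;
        }
        if (i < slots.length) {
          slots[i] = carry;                   // the displaced run ends on the tombstone
        }                                     // else the largest value fell off; a real version would grow
      }

      /** Delete is just dropping a tombstone; nothing shifts. */
      public void delete(int index) {
        slots[index] = Double.NaN;
      }

      /** The rank scan mentioned above: count live values strictly below x. */
      public int countBelow(double x) {
        int below = 0;
        for (double v : slots) {
          if (!Double.isNaN(v) && v < x) {
            below++;
          }
        }
        return below;
      }
    }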
>
> This won't be terribly accurate, nor will it be as tight on memory, but it
> will be very simple to get right and for the slow query logging problem, it
> should be plenty good enough.
>
> On Sun, May 22, 2011 at 12:08 AM, Lance Norskog <goksron@gmail.com> wrote:
>
>> Here is the use case: quantiles for search query times. The 2% quantile gives
>> the fastest possible queries, and the 98% quantile gives the slowest normal
>> queries while ignoring the odd garbage collection colliding with a cosmic ray.
>>
>> The next step is to maintain a reservoir sample of search queries, and
>> calculate the quantiles on demand. This allows drilling down into
>> "what queries are consistently slow?". This definitely wants the
>> online algorithm, because it allows two-pass classification: the first
>> pass through the queries establishes the quantiles, and the second
>> pass classifies the query time among the quantiles. This would be very
>> useful.
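
The reservoir half of that could be as simple as classic Algorithm R over the
raw query strings; a hedged Java sketch (the QueryReservoir name is invented,
and the second pass that buckets each query's time against the computed
quantiles is left out):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class QueryReservoir {
      private final int capacity;
      private final List<String> sample = new ArrayList<String>();
      private final Random rand = new Random();
      private long seen;

      public QueryReservoir(int capacity) {
        this.capacity = capacity;
      }

      /** Offer every query; each one stays in the sample with probability capacity/seen. */
      public void offer(String query) {
        seen++;
        if (sample.size() < capacity) {
          sample.add(query);
        } else {
          long j = (long) (rand.nextDouble() * seen);   // uniform in [0, seen)
          if (j < capacity) {
            sample.set((int) j, query);                 // replace a random slot
          }
        }
      }

      public List<String> snapshot() {
        return new ArrayList<String>(sample);
      }
    }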
>>
>> Lance
>>
>> On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > This is not the same algorithm at all.  In the first place, there is a
>> > requirement that you know the size of your data-set in advance so that
>> > you can set \tau.  It might be possible to adaptively set \tau as you
>> > go, but that isn't obvious to me in the few minutes I have looked at
>> > this.  My worry is that if you do this, you may have already thrown
>> > away important data by the time that you realize that \tau needs to be
>> > larger than you expected.
>> >
>> > Secondly, this appears to be most useful for computing moderately extreme
>> > quantiles.  That may be what you intend to do.  Or not.
>> >
>> > I think that the other algorithm is better in that it requires much less
>> > memory and is completely on-line.
>> >
>> > The basic idea in the on-line algorithm is that you maintain an estimate
>> > of the quantile that you want and an estimate of the PDF at that point.
>> > You estimate the PDF by keeping a count of the points that you have seen
>> > within bounds that you tighten over time.  The current estimate is
>> > adjusted up or down by an amount based on the quantile that you are
>> > estimating times the current PDF estimate.  For each quantile that you
>> > are estimating, you only need to keep a few values.  In the case of the
>> > current Mahout implementation, I re-use data stored for one quantile for
>> > a different purpose in the estimation of other quantiles so that we
>> > pretty much just need to keep the 5 current quartile estimates and
>> > possibly a total count.  This compares to keeping a thousand values plus
>> > counts per quantile in the IBM paper you reference.
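
For reference, a hedged Java sketch of a stochastic-approximation estimator in
the spirit of that description (one tracked quantile, a slowly tightening
window for the density count); it is not the Mahout OnlineSummarizer, and the
constants and the density prior are invented:

    // Nudge the estimate up or down by a step proportional to 1 / (n * density),
    // weighted by q when the sample is above the estimate and by (1 - q) below it.
    public class OnlineQuantile {
      private final double q;        // target quantile, e.g. 0.98
      private double estimate;       // current quantile estimate
      private double density;        // running PDF estimate at the current estimate
      private double window;         // half-width of the density-counting window
      private long n;

      public OnlineQuantile(double q, double initialWindow) {
        this.q = q;
        this.window = initialWindow;
        this.density = 1.0 / (2 * initialWindow);   // crude prior: uniform over the window
      }

      public void add(double x) {
        n++;
        if (n == 1) {
          estimate = x;              // seed with the first observation
          return;
        }
        // PDF estimate: exponentially weighted fraction of samples landing
        // inside the window around the estimate, divided by the window width
        double inWindow = Math.abs(x - estimate) <= window ? 1.0 : 0.0;
        double alpha = 1.0 / Math.min(n, 1000);
        density = (1 - alpha) * density + alpha * inWindow / (2 * window);
        window = Math.max(window * 0.999, 1e-6);    // tighten the bounds over time
        // adjustment based on the quantile being estimated and the PDF estimate
        double step = 1.0 / (n * Math.max(density, 1e-12));
        if (x > estimate) {
          estimate += step * q;
        } else {
          estimate -= step * (1 - q);
        }
      }

      public double getEstimate() {
        return estimate;
      }
    }

For the 2%/98% use case in this thread, two of these would run side by side,
one per tail.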
>> >
>> >
>> > On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <goksron@gmail.com> wrote:
>> >
>> >> This seems to be the same thing, from '95. And simple enough that even
>> >> I can follow it.
>> >>
>> >> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>> >>
>> >> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <ted.dunning@gmail.com>
>> >> wrote:
>> >> > By definition, quaRtiles are quartiles.
>> >> >
>> >> > You can adapt the code to generate other quaNtiles by adjusting the
>> >> > parameters that adjust the estimates up and down.  Right now, I use
>> >> > the inter-quartile range to estimate the probability density.  You
>> >> > would need to do something else.
>> >> >
>> >> > The original paper has some help in this regard.  See the javadoc for
>> >> > the link.
>> >> >
>> >> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <goksron@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> How would one change OnlineSummarizer to allow setting the 1st and
>> >> >> 3rd quartiles? They are hard-coded at 25% and 75%.
>> >> >>
>> >> >> Q1 and Q3 at 2%/98% gives a way to ignore outliers.
>> >> >>
>> >> >> --
>> >> >> Lance Norskog
>> >> >> goksron@gmail.com
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com
