Thanks!
On Sun, May 22, 2011 at 12:45 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The simplest implementation for this would be to keep a tree of values.
> While you have fewer than N (about 1000) values in the tree, you add each
> incoming value to the tree. When you have more than N values after adding a
> new sample, pick values at random to delete until you are down to N values
> in the tree. Any time you have a value that is in the top or bottom 2% you
> can log it. This will get very nearly the result you want at the cost of a
> bit of work for the insertion and deletion.
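In code, the fixed-size sample described above might look roughly like this. This is a hypothetical sketch, not Mahout API: it uses a TreeMap as the order-statistic tree, evicts a uniformly random element once the sample exceeds N, and reports whether a new value lands in the top or bottom 2% of the current sample. All names are invented for illustration.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class ExtremeValueLogger {
    private static final int N = 1000;        // target sample size
    private final TreeMap<Double, Integer> sample = new TreeMap<>(); // value -> count
    private int size = 0;                     // total values, counting duplicates
    private final Random random = new Random(42);

    // Add a value; returns true when it falls in the top or bottom 2%
    // of the current sample, i.e. when it is worth logging.
    boolean add(double x) {
        sample.merge(x, 1, Integer::sum);
        size++;
        while (size > N) evictRandom();       // trim back down to N values
        int rank = 0;                         // count of sampled values below x
        for (int c : sample.headMap(x).values()) rank += c;
        double q = (double) rank / size;
        return q <= 0.02 || q >= 0.98;
    }

    // Delete one value chosen uniformly at random from the sample.
    private void evictRandom() {
        int target = random.nextInt(size);
        Iterator<Map.Entry<Double, Integer>> it = sample.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Double, Integer> e = it.next();
            target -= e.getValue();
            if (target < 0) {
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
                size--;
                return;
            }
        }
    }
}
```

The random eviction keeps the surviving values an unbiased sample of everything seen so far, which is what makes the 2%/98% thresholds meaningful.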
>
> The data structure to use is a little tricky. You want to be able to
> efficiently insert an element, of course, but you also want to find the nth
> largest element quickly and to determine what the rank of a newly inserted
> element is reasonably quickly. For most applications with time scales on
> the order of milliseconds, simply keeping a sorted array will be a fine
> solution. This obviously costs a fair bit to shift things when you do an
> insertion. One fun trick to improve this by a substantial factor is to size
> the array larger than necessary and keep "tombstone" markers in the array
> wherever you have deleted an element. Then when shifting to make room for
> a new element, you will have to shift data only until you find a tombstone
> or the end of the array and you won't have to shift at all on deletion.
> Finding the current rank or finding the nth largest element will require a
> scan, but because of caching and fast processors, this will probably be
> nearly as fast as any fancy data structure you can build, especially since
> you will only be scanning from one end or the other.
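A minimal sketch of the tombstone trick, with Double.NaN standing in for a deleted slot (the class and its details are illustrative, not from Mahout). Insertion shifts live values right only until the next tombstone absorbs the displacement; deletion just writes a tombstone and shifts nothing.

```java
import java.util.Arrays;

class TombstoneArray {
    private final double[] a;   // sorted live values interleaved with NaN tombstones

    TombstoneArray(int capacity) {
        a = new double[capacity];
        Arrays.fill(a, Double.NaN);          // every slot starts as a tombstone
    }

    void insert(double x) {
        // find the first live value >= x, skipping tombstones
        int pos = -1;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && a[i] >= x) { pos = i; break; }
        }
        if (pos < 0) {
            // x exceeds every live value; the slot after the last live one
            // is a tombstone by construction
            int last = -1;
            for (int i = 0; i < a.length; i++) if (!Double.isNaN(a[i])) last = i;
            if (last + 1 >= a.length) throw new IllegalStateException("array full");
            a[last + 1] = x;
            return;
        }
        // shift right only until the first tombstone (or fail if none remain)
        int t = pos;
        while (t < a.length && !Double.isNaN(a[t])) t++;
        if (t == a.length) throw new IllegalStateException("array full");
        System.arraycopy(a, pos, a, pos + 1, t - pos);
        a[pos] = x;
    }

    void delete(double x) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == x) { a[i] = Double.NaN; return; }  // leave a tombstone
        }
    }

    // rank of x: count of live values strictly below it, found by a scan
    int rank(double x) {
        int r = 0;
        for (double v : a) if (!Double.isNaN(v) && v < x) r++;
        return r;
    }
}
```

As the mail notes, rank queries are a linear scan, but at ~1000 elements that scan is cache-friendly and cheap.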
>
> This won't be terribly accurate, nor will it be as tight on memory, but it
> will be very simple to get right, and for the slow query logging problem, it
> should be plenty good enough.
>
> On Sun, May 22, 2011 at 12:08 AM, Lance Norskog <goksron@gmail.com> wrote:
>
>> Here is the use case: quantiles for search query times. 2% gives the
>> fastest possible queries, and 98% gives the slowest normal queries and
>> ignores a garbage collection colliding with a cosmic ray.
>>
>> The next step is to maintain a reservoir sample of search queries, and
>> calculate the quantiles on demand. This allows drilling down into
>> "what queries are consistently slow?". This definitely wants the
>> online algorithm, because it allows two-pass classification: the first
>> pass through the queries establishes the quantiles, and the second
>> pass classifies the query time among the quantiles. This would be very
>> useful.
>>
>> Lance
>>
>> On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > This is not the same algorithm at all. In the first place, there is a
>> > requirement that you know the size of your dataset in advance so that
>> > you can set \tau. It might be possible to adaptively set \tau as you
>> > go, but that isn't obvious to me in the few minutes I have looked at
>> > this. My worry is that if you do this, you may have already thrown away
>> > important data by the time that you realize that \tau needs to be
>> > larger than you expected.
>> >
>> > Secondly, this appears to be most useful for computing moderately extreme
>> > quantiles. That may be what you intend to do. Or not.
>> >
>> > I think that the other algorithm is better in that it requires much less
>> > memory and is completely online.
>> >
>> > The basic idea in the online algorithm is that you maintain an estimate
>> > of the quantile that you want and an estimate of the PDF at that point.
>> > You estimate the PDF by keeping a count of the points that you have
>> > seen within bounds that you tighten over time. The current estimate is
>> > adjusted up or down by an amount based on the quantile that you are
>> > estimating times the current PDF estimate. For each quantile that you
>> > are estimating, you only need to keep a few values. In the case of the
>> > current Mahout implementation, I reuse data stored for one quantile for
>> > a different purpose in the estimation of other quantiles, so that we
>> > pretty much just need to keep the 5 current quartile estimates and
>> > possibly a total count. This compares to keeping a thousand values plus
>> > counts per quantile for the IBM paper you reference.
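The update rule described above can be sketched roughly as follows. This is an illustration of the general stochastic-approximation idea, not the actual OnlineSummarizer code: the estimate is nudged up with weight p when a sample lands above it and down with weight (1 - p) otherwise, scaled by a crude density estimate from counts within a shrinking band. The warm-up buffer used to seed the estimate and the band width is an added assumption for stability, not something the mail specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class OnlineQuantile {
    private static final int WARMUP = 100;   // assumed: small buffer to seed the estimate
    private final double p;                  // target quantile, e.g. 0.98
    private final List<Double> warm = new ArrayList<>();
    private double estimate;
    private double window;                   // half-width of the band used to estimate the PDF
    private long near = 0;                   // samples that landed inside the band
    private long count = 0;

    OnlineQuantile(double p) { this.p = p; }

    void add(double x) {
        count++;
        if (count <= WARMUP) {
            warm.add(x);
            if (count == WARMUP) {
                Collections.sort(warm);
                estimate = warm.get((int) (p * (WARMUP - 1)));
                window = (warm.get(WARMUP - 1) - warm.get(0)) / 10 + 1e-9;
                for (double v : warm) if (Math.abs(v - estimate) <= window) near++;
            }
            return;
        }
        if (Math.abs(x - estimate) <= window) near++;
        // fraction of points near the estimate, scaled to a density; floored
        // to avoid dividing by zero before any point has landed in the band
        double density = Math.max(near / (2.0 * window * count), 1e-9);
        double step = 1.0 / (density * count);
        if (x > estimate) estimate += step * p;       // nudge up with weight p
        else estimate -= step * (1 - p);              // nudge down with weight 1 - p
        window *= 0.9999;                             // tighten the band over time
    }

    double get() { return estimate; }
}
```

Because the up and down nudges balance exactly when a fraction p of the data lies below the estimate, the estimate drifts toward the p-th quantile while storing only a handful of values.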
>> >
>> >
>> > On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <goksron@gmail.com>
>> wrote:
>> >
>> >> This seems to be the same thing, from '95. And simple enough that even
>> >> I can follow it.
>> >>
>> >> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>> >>
>> >> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <ted.dunning@gmail.com>
>> >> wrote:
>> >> > By definition, quaRtiles are quartiles.
>> >> >
>> >> > You can adapt the code to generate other quaNtiles by adjusting the
>> >> > parameters that adjust the estimates up and down. Right now, I use
>> >> > the interquartile range to estimate the probability density. You
>> >> > would need to do something else.
>> >> >
>> >> > The original paper has some help in this regard. See the javadoc
>> >> > for the link.
>> >> >
>> >> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <goksron@gmail.com>
>> >> wrote:
>> >> >
>> >> >> How would one change OnlineSummarizer to allow setting the 1st and
>> >> >> 3rd quartiles? They are hardcoded at 25% and 75%.
>> >> >>
>> >> >> Q1 and Q3 at 2%/98% give a way to ignore outliers.
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>

Lance Norskog
goksron@gmail.com
