Here is the use case: quantiles for search query times. 2% gives the
fastest possible queries, and 98% gives the slowest normal queries and
ignores a garbage collection colliding with a cosmic ray.
The next step is to maintain a reservoir sample of search queries, and
calculate the quantiles on demand. This allows drilling down into
"what queries are consistently slow?". This definitely wants the
online algorithm, because it allows twopass classification: the first
pass through the queries establishes the quantiles, and the second
pass classifies the query time among the quantiles. This would be very
useful.
Lance
On Sat, May 21, 2011 at 10:14 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> This is not the same algorithm at all. In the first place, there is a
> requirement that you know the size of your dataset in advance so that you
> can set \tau. It might be possible to adaptively set \tau as you go, but
> that isn't obvious to me in the few minutes I have looked at this. My worry
> is that if you do this, you may have already thrown away important data by
> the time that you realize that \tau needs to be larger than you expected.
>
> Secondly, this appears to be most useful for computing moderately extreme
> quantiles. That may be what you intend to do. Or not.
>
> I think that the other algorithm is better in that it requires much less
> memory and is completely online.
>
> The basic idea in the online algorithm is that you maintain an estimate of
> the quantile that you want and an estimate of the PDF at that point. You
> estimate the PDF by keeping a count of the points that you have seen within
> bounds that you tighten over time. The current estimate is adjust up or
> down by an amount based on the quantile that you are estimating times the
> current PDF estimate. For each quantile that you are estimating, you only
> need to keep a few values. In the case of the current Mahout
> implementation, I reuse data stored for one quantile for a different
> purpose in the estimation of other quantiles so that we pretty much just
> need to keep the 5 current quartile estimates and possibly a total count.
> This compares to keeping a thousand values plus counts per quantile with
> the IBM paper you reference.
>
>
> On Sat, May 21, 2011 at 4:33 PM, Lance Norskog <goksron@gmail.com> wrote:
>
>> This seems to be the same thing, from '95. And simple enough that even
>> I can follow it.
>>
>>
>> http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/comad95.pdf
>>
>> On Sat, May 21, 2011 at 3:21 PM, Ted Dunning <ted.dunning@gmail.com>
>> wrote:
>> > By definition, quaRtiles are quartiles.
>> >
>> > You can adapt the code to generate other quaNtiles by adjusting the
>> > parameters that adjust the estimates up and down. Right now, I use the
>> > interquartile range to estimate the probability density. You would need
>> to
>> > do something else.
>> >
>> > The original paper has some help in this regard. See the javadoc for the
>> > link.
>> >
>> > On Sat, May 21, 2011 at 3:03 PM, Lance Norskog <goksron@gmail.com>
>> wrote:
>> >
>> >> How would one change OnlineSummarizer to allow setting the 1 and 3
>> >> quartiles? They are hardcoded at 25% and 75%.
>> >>
>> >> Q1 and Q3 at 02%/98% gives a way to ignore outliers.
>> >>
>> >> 
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >
>>
>>
>>
>> 
>> Lance Norskog
>> goksron@gmail.com
>>
>

Lance Norskog
goksron@gmail.com
