Le 18/09/2010 17:39, Luc Maisonobe a écrit :
> Le 18/09/2010 17:36, Phil Steitz a écrit :
>> On 9/18/10 10:24 AM, Luc Maisonobe wrote:
>>> Le 18/09/2010 16:16, Phil Steitz a écrit :
>>>> On 9/17/10 3:19 PM, Luc Maisonobe wrote:
>>>>> Le 17/09/2010 19:55, Ted Dunning a écrit :
>>>>>> There are also online percentile estimation methods that require
>>>>>> only a
>>>>>> single pass over the data in order to get good estimates of several
>>>>>> quantile
>>>>>> at the same time.
>>>>>>
>>>>>> If you are interested in getting an estimate of some quantile of
the
>>>>>> underlying distribution that generated your data then these online
>>>>>> methods
>>>>>> will give you an estimate that is very nearly as accurate as sorting
>>>>>> your
>>>>>> sample. Sorting gives you the exact quantiles of your sample, but
>>>>>> only an
>>>>>> estimate of the quantiles of your underlying distribution.
>>>>>>
>>>>>> There is a simplified implementation of this in Mahout along with
>>>>>> test cases
>>>>>> that demonstrate reasonable accuracy.
>>>>>>
>>>>>> These techniques are well described in the article: *Incremental
>>>>>> Quantile
>>>>>> Estimation for Massive
>>>>>> Tracking<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580>
>>>>>>
>>>>>>
>>>>>> *
>>>>>> *
>>>>>> *
>>>>>> *(available at
>>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580
)*
>>>>>
>>>>> Thanks for the pointer, it looks interesting.
>>>>>
>>>>>> *
>>>>>> *
>>>>>> *Regarding the incremental selection method to find multiple
>>>>>> quantiles, I
>>>>>> think that you save a little bit if you are looking for a few
>>>>>> quantiles, but
>>>>>> the added complexity and special cases are going to make testing
>>>>>> difficult.
>>>>>> Wouldn't it be better to use one of the following strategies:*
>>>>>> *
>>>>>> *
>>>>>> * keep a copy of the sorted data and if that copy is available,
just
>>>>>> use
>>>>>> it. This cuts the cost of 100 quantiles probed incrementally to
a
>>>>>> single
>>>>>> sort.
>>>>>
>>>>> This is exactly what this user did: put the sort out of the loop.
>>>>>
>>>>>> Moreover, since sort is n log n without a radix sort, then the
>>>>>> increment selection algorithm can only win if 100< log n which
is
>>>>>> pretty
>>>>>> unlikely.*
>>>>>
>>>>> I don't understand. Partitionbased selection algorithms are basically
>>>>> partitionbased sort algorithm where we recurse in only one of the
>>>>> partition once the pivot has been chosen. Subsequent calls therefore
>>>>> don't restart from an array of size n but from smaller subarrays, has
>>>>> the pivot can be saved (at least the top level ones). If at the end the
>>>>> number of selections is so high the array ends up to be completely
>>>>> sorted, the total cost is probably not much higher than what an initial
>>>>> sort would have done. It will be higher since their are some
>>>>> bookkeeping
>>>>> to do, but not so much higher I think. Doing one call corresponds to
>>>>> resuming the partial sort from the state resulting from previous calls.
>>>>>
>>>>> Did I miss something ?
>>>>>
>>>>>
>>>>>> *
>>>>>> *
>>>>>> * add a method to probe multiple quantiles at the same time. This
>>>>>> potentially uses less memory than the first approach, but is
>>>>>> dependent on
>>>>>> the user calling the right method.*
>>>>>
>>>>> Since there are different use case, having several methods seems
>>>>> fair. A
>>>>> multiple quantiles method is a good idea.
>>>>>
>>>>>> *
>>>>>> *
>>>>>> * use the online algorithm mentioned above with two passes, one
to
>>>>>> estimate the quantiles of interest, a second to refine the estimate
>>>>>> using
>>>>>> the actual data. This allows the multiple probe method to be linear
>>>>>> in time
>>>>>> and should give you exact sample quantiles. It doesn't help the
>>>>>> repeated
>>>>>> probe problem.*
>>>>>
>>>>> I'm not sure I understand well. However, the online method by itself
>>>>> would be an interesting addition. It would allow quantiles computation
>>>>> with the Storeless versions of our classes.
>>>>
>>>> +1  definitely valuable for the storeless case.
>>>>
>>>> I need to think about this some more / see some patches; but my initial
>>>> reaction is that we can get a big bang from just doing what the user did
>>>> (caching the sorted array and inserting into it when addValue is called)
>>>> for the datainmemory impls. I would suggest implementing that for for
>>>> Percentile itself (taking care to handle addValue and rolling windows
>>>> correctly) and DescriptiveStatistics;
>>>
>>> This would however handle only the multiple quantiles use case, not the
>>> case where only one quantile is computed that could be provided with a
>>> selection in O(n) rather than a sort in O(n ln(n)).
>>>
>>
>> Correct. Ideal would be to do the sort incrementally as you describe
>> for the datainmemory impl. Would be tricky to implement in
>> Percentile, given array index parameters and the need to handle addValue
>> and window in DescriptiveStatistics; but this should be doable.
>>
>> Lets open 2 JIRAs for this  one straight performance improvement for
>> Percentile and the other for a storeless percentile estimator.
>
> OK. I'll open them in a few minutes.
The issues are <https://issues.apache.org/jira/browse/MATH417> and
<https://issues.apache.org/jira/browse/MATH418>. I have set the target
version to 2.2 for the first one to to 3.0 for the second one.
I'll give a look at the first one after finishing my current work
(implementing some new random generators).
Luc
>
> Luc
>
>>
>> Phil
>>
>>
>>> Luc
>>>
>>>> but add a new (storelesss)
>>>> statistic based on an incremental quantile estimation algorithm and make
>>>> this accessible via the storeless aggregate, SummaryStatistics. It
>>>> might be interesting to include this optionally with
>>>> DescriptiveStatistics as well, so that in the rolling window case you
>>>> could compare the current window distribution to the full sample.
>>>>
>>>> Phil
>>>>
>>>>
>>>>>
>>>>> Luc
>>>>>
>>>>>> *
>>>>>> *
>>>>>> On Fri, Sep 17, 2010 at 10:34 AM, Luc
>>>>>> Maisonobe<Luc.Maisonobe@free.fr>wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> During a recent face to face discussion with some commonsmath
users
>>>>>>> from a large project, it appeared one implementation choice for
>>>>>>> Percentile statistics was a huge performance bottleneck.
>>>>>>>
>>>>>>> When the Percentile.evaluate method is called, the sample array
is
>>>>>>> copied and sorted. In one use case, 100 calls were made for each
>>>>>>> sample,
>>>>>>> so the same array was sorted 100 times. In another use case,
only one
>>>>>>> call is made (typically to compute the median) but the user wondered
>>>>>>> why
>>>>>>> the array should be completely sorted. In fact, using a selection
>>>>>>> algorithm instead of a sorting algorithm would be sufficient.
>>>>>>>
>>>>>>> I would like to have you opinion about providing a different
evaluate
>>>>>>> method that would process an array provided previously and use
only
>>>>>>> selections to provide the percentile. Consider for example the
>>>>>>> following
>>>>>>> input array:
>>>>>>>
>>>>>>> [ 6, 8, 4, 5, 0, 2, 7 ]
>>>>>>>
>>>>>>> If we first as for the median, a selection algorithm may
>>>>>>> reorganize the
>>>>>>> array as:
>>>>>>>
>>>>>>> [ 6, 0, 5, 2, 8, 4, 7 ]
>>>>>>>
>>>>>>> were the left part is smaller than the central element, the right
>>>>>>> part
>>>>>>> is larger and hence the central element 2 is known to be the
median.
>>>>>>>
>>>>>>> Then we ask for another value, the 25% percentile. Since the
object
>>>>>>> already knows the smaller elements are in the left part, it can
use a
>>>>>>> select algorithm on the 3 elements left part and extract the
value 5
>>>>>>> without even trying to sort the upper part of the array.
>>>>>>>
>>>>>>> What do you think ?
>>>>>>>
>>>>>>> Luc
>>>>>>>
>>>>>>> 
>>>>>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>>>>>> For additional commands, email: devhelp@commons.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> 
>>>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>>>> For additional commands, email: devhelp@commons.apache.org
>>>>>
>>>>
>>>>
>>>> 
>>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>>> For additional commands, email: devhelp@commons.apache.org
>>>>
>>>
>>>
>>> 
>>> To unsubscribe, email: devunsubscribe@commons.apache.org
>>> For additional commands, email: devhelp@commons.apache.org
>>>
>>
>>
>> 
>> To unsubscribe, email: devunsubscribe@commons.apache.org
>> For additional commands, email: devhelp@commons.apache.org
>>
>
>
> 
> To unsubscribe, email: devunsubscribe@commons.apache.org
> For additional commands, email: devhelp@commons.apache.org
>
>

To unsubscribe, email: devunsubscribe@commons.apache.org
For additional commands, email: devhelp@commons.apache.org
