commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikkel Meyer Andersen <>
Subject Re: [math] speeding up percentile based statistics
Date Sun, 26 Sep 2010 15:39:20 GMT

Why not simply make 3) the default, but also supply a
*IMSureWhatIMDoing-method only implementing 2) (it is so cheap that we
might as well do it instead of not doing any check at all). This means
that users who read the documentation can gain something when they
explicitly ask for it, while users who don't read the docs are still

Cheers, Mikkel.

2010/9/26 Luc Maisonobe <>:
> Le 23/09/2010 20:57, Luc Maisonobe a écrit :
>> Hi,
>> Resuming this thread, as I will be soon able to work on this (I hope).
>> I propose to start by solving
>> <> using a simple
>> partition-based selection algorithm and preserving pivot information.
>> This would help both the single call use-case (we don't sort everything)
>> and the multiple call case (we don't redo on subsequent calls work
>> already done in the first calls.
> The current API of the Percentile method is derived from
> AbstractUnivariateStatistic and provides several evaluate methods that
> have the data array as a parameter.
> This is compatible with the one-call use case without any semantic
> change, but does not fit the multiple-calls use case seamlessly.
> If we want to cache some work already done on the array between calls
> with different values of p (the percentile value), we have to check if
> the array is still the same as the cached one on which we have done some
> partitioning and saved some pivots. This can be done by storing a
> reference to the original array in addition to a copy of its content
> (which we reorganize) and with an "if (values == originalValues)"
> statement. This would however not protect against changes in the array
> elements themselves done by the user between calls. A better protection
> could be to compute a hash code of the array content and compare it to
> the original one at each call. However, this would add a linear cost.
> The user who encountered the performances problems added an addValues
> method to set up the array once and simply ignored the values passed as
> an argument to the evaluate methods. This breaks the semantics of the API.
> So I see four choices about partitioning caching and would like to have
> people express which one they prefer. A clearCache or similar public
> method would be made available to users to ensure clean computation if
> the caching mechanism gets on their way.
>  1) do nothing to check the array is the same between calls and blindly
>    assumes it IS the same. Users would really need to call clearCache
>    when they provide a new array
>    pros: very simple
>    cons: error-prone for the user as it relies only on reading and
>    understanding a documentation that would change with new version
>  2) check only array reference equality (with ==) to check the array
>    is the same as the previous call or not. Users would need to call
>    clearCache only if they use the same array but have changed its
>    content.
>    pros: trade-off between efficiency and reliability,
>          handles most cases properly
>    cons: may be wrong in corner cases
>  3) check array content using an hash code. Users would generally don't
>    need to call clearCache at all so it would not be provided
>    pros: works in all cases
>    cons: add linear cost for array checking
>  4) remove the double[] values parameter from the API and use a separate
>    addValues method, hence the class will reset its cache only when
>    addValues is called
>    pros: works in all cases
>    cons: Percentile could not implement UnivariateStatistic anymore
> My preference is choice 2.
> What do you think ?
> Luc
>> Partition-based algorithms are O(n) in expected time but not in worst
>> case time. If end users ask for it later, we could improve again by
>> adding the Musser's introselect that guarantees O(n) behavior even in
>> worst case.
>> Another possibility would be to have an additional method that would
>> compute percentiles for several thresholds at once, but this would
>> require returning an array of results and I think it would be difficult
>> to mix with other statistics (mainly the DescriptiveStatistics.apply
>> method, which is used by the users I talked to).
>> Would this meet everyone concerns ?
>> Luc
>> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit :
>>> Ted,
>>>> O(n), not n
>>>> Expected case is n + n/2 + n/4 ... < 2n
>>> Yes, that I am already more incline to buy.
>>> Dim.
>>> ----------------------------------------------------------------------------
>>> Dimitri Pourbaix                         *
>>> Institut d'Astronomie et d'Astrophysique *      Don't worry, be happy
>>> CP 226, office 2.N4.211, building NO     *         and CARPE DIEM.
>>> Universite Libre de Bruxelles            *
>>> Boulevard du Triomphe                    *      Tel : +32-2-650.35.71
>>>  B-1050 Bruxelles                        *      Fax : +32-2-650.42.26
>>>     *
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message