commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <Luc.Maison...@free.fr>
Subject Re: [math] speeding up percentile based statistics
Date Sun, 26 Sep 2010 15:01:58 GMT
Le 23/09/2010 20:57, Luc Maisonobe a écrit :
> Hi,
> 
> Resuming this thread, as I will be soon able to work on this (I hope).
> I propose to start by solving
> <https://issues.apache.org/jira/browse/MATH-417> using a simple
> partition-based selection algorithm and preserving pivot information.
> This would help both the single call use-case (we don't sort everything)
> and the multiple call case (we don't redo on subsequent calls work
> already done in the first calls.

The current API of the Percentile method is derived from
AbstractUnivariateStatistic and provides several evaluate methods that
have the data array as a parameter.

This is compatible with the one-call use case without any semantic
change, but does not fit the multiple-calls use case seamlessly.

If we want to cache some work already done on the array between calls
with different values of p (the percentile value), we have to check if
the array is still the same as the cached one on which we have done some
partitioning and saved some pivots. This can be done by storing a
reference to the original array in addition to a copy of its content
(which we reorganize) and with an "if (values == originalValues)"
statement. This would however not protect against changes in the array
elements themselves done by the user between calls. A better protection
could be to compute a hash code of the array content and compare it to
the original one at each call. However, this would add a linear cost.
The user who encountered the performances problems added an addValues
method to set up the array once and simply ignored the values passed as
an argument to the evaluate methods. This breaks the semantics of the API.

So I see four choices about partitioning caching and would like to have
people express which one they prefer. A clearCache or similar public
method would be made available to users to ensure clean computation if
the caching mechanism gets on their way.

 1) do nothing to check the array is the same between calls and blindly
    assumes it IS the same. Users would really need to call clearCache
    when they provide a new array
    pros: very simple
    cons: error-prone for the user as it relies only on reading and
    understanding a documentation that would change with new version

 2) check only array reference equality (with ==) to check the array
    is the same as the previous call or not. Users would need to call
    clearCache only if they use the same array but have changed its
    content.
    pros: trade-off between efficiency and reliability,
          handles most cases properly
    cons: may be wrong in corner cases

 3) check array content using an hash code. Users would generally don't
    need to call clearCache at all so it would not be provided
    pros: works in all cases
    cons: add linear cost for array checking

 4) remove the double[] values parameter from the API and use a separate
    addValues method, hence the class will reset its cache only when
    addValues is called
    pros: works in all cases
    cons: Percentile could not implement UnivariateStatistic anymore

My preference is choice 2.

What do you think ?

Luc

> 
> Partition-based algorithms are O(n) in expected time but not in worst
> case time. If end users ask for it later, we could improve again by
> adding the Musser's introselect that guarantees O(n) behavior even in
> worst case.
> 
> Another possibility would be to have an additional method that would
> compute percentiles for several thresholds at once, but this would
> require returning an array of results and I think it would be difficult
> to mix with other statistics (mainly the DescriptiveStatistics.apply
> method, which is used by the users I talked to).
> 
> Would this meet everyone concerns ?
> Luc
> 
> 
> Le 18/09/2010 21:50, Dimitri Pourbaix a écrit :
>> Ted,
>>
>>> O(n), not n
>>>
>>> Expected case is n + n/2 + n/4 ... < 2n
>>
>> Yes, that I am already more incline to buy.
>>
>> Dim.
>> ----------------------------------------------------------------------------
>>
>> Dimitri Pourbaix                         *
>> Institut d'Astronomie et d'Astrophysique *      Don't worry, be happy
>> CP 226, office 2.N4.211, building NO     *         and CARPE DIEM.
>> Universite Libre de Bruxelles            *
>> Boulevard du Triomphe                    *      Tel : +32-2-650.35.71
>>  B-1050 Bruxelles                        *      Fax : +32-2-650.42.26
>> http://sb9.astro.ulb.ac.be/~pourbaix     * mailto:pourbaix@astro.ulb.ac.be
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message