commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <Luc.Maison...@free.fr>
Subject Re: [math] speeding up percentile based statistics
Date Sun, 03 Oct 2010 19:55:13 GMT
Le 26/09/2010 21:34, Luc Maisonobe a écrit :
> Le 26/09/2010 21:05, Phil Steitz a écrit :
>> On 9/26/10 2:30 PM, Luc Maisonobe wrote:
>>> Le 26/09/2010 19:27, Mikkel Meyer Andersen a écrit :
>>>> 2010/9/26 Gilles Sadowski<gilles@harfang.homelinux.org>:
>>>>>> [...]
>>>>>>
>>>>>>   1) do nothing to check the array is the same between calls and
>>>>>> blindly
>>>>>>      assumes it IS the same. Users would really need to call
>>>>>> clearCache
>>>>>>      when they provide a new array
>>>>>>      pros: very simple
>>>>>>      cons: error-prone for the user as it relies only on reading
and
>>>>>>      understanding a documentation that would change with new version
>>>>>>
>>>>>>   2) check only array reference equality (with ==) to check the array
>>>>>>      is the same as the previous call or not. Users would need to
call
>>>>>>      clearCache only if they use the same array but have changed
its
>>>>>>      content.
>>>>>>      pros: trade-off between efficiency and reliability,
>>>>>>            handles most cases properly
>>>>>>      cons: may be wrong in corner cases
>>>>>>
>>>>>>   3) check array content using an hash code. Users would generally
>>>>>> don't
>>>>>>      need to call clearCache at all so it would not be provided
>>>>>>      pros: works in all cases
>>>>>>      cons: add linear cost for array checking
>>>>>>
>>>>>>   4) remove the double[] values parameter from the API and use a
>>>>>> separate
>>>>>>      addValues method, hence the class will reset its cache only
when
>>>>>>      addValues is called
>>>>>>      pros: works in all cases
>>>>>>      cons: Percentile could not implement UnivariateStatistic anymore
>>>>>>
>>>>>> My preference is choice 2.
>>>>>>
>>>>>> What do you think ?
>>>>>
>>>>> IIUC, the interface method ("evaluate") was designed around the
>>>>> assumption
>>>>> that successive calls are independent: the array argument is not
>>>>> encapsulated for further processing (e.g. caching).
>>>>> IMO, the equality check and the "clearCache" method look like
>>>>> workarounds,
>>>>> not a clean solution.
>>>>> As an example, what if a user calls "evaluate" several times
>>>>> alternating on
>>>>> two different arrays:
>>>>>
>>>>> ---CUT---
>>>>>     double[] a = new double[] {1, 2, 3};
>>>>>     double[] b = new double[] {1, 2, 3, 4, 5};
>>>>>     Percentile p = new Percentile();
>>>>>
>>>>>     dourble r;
>>>>>     r = p.evaluate(a, 50);
>>>>>     r = p.evaluate(b, 50);
>>>>>     r = p.evaluate(a, 50);
>>>>>     r = p.evaluate(b, 50);
>>>>>
>>>>>     // etc.
>>>>> ---CUT---
>>>>>
>>>>> Doesn't this kind of use-case nullify the expected optimization?
>>>>>
>>>>> If the array is going to be reused, the user call should reflect
>>>>> that fact;
>>>>> e.g. by passing the array in a constructor:
>>>>>
>>>>> ---CUT---
>>>>>     double[] a = new double[] {1, 2, 3};
>>>>>     double[] b = new double[] {1, 2, 3, 4, 5};
>>>>>     Percentile pA = new Percentile(a);
>>>>>     Percentile pB = new Percentile(b);
>>>>>
>>>>>     double r;
>>>>>     r = pA.evaluate(50);
>>>>>     r = pB.evaluate(50);
>>>>>     r = pA.evaluate(50);
>>>>>     r = pB.evaluate(50);
>>>>> ---CUT---
>>>>>
>>>>> That way, later calls can benefit from whatever preprocessing was
>>>>> done in
>>>>> previous calls.
>>>>> The instance will always control all the information needed (e.g.
>>>>> after a
>>>>> call to an "addValues" method) for the processing without the need
>>>>> to rely
>>>>> on the user for calling "clearCache" whenever necessary.
>>>>>
>>>>>
>>>>> Gilles
>>>>
>>>> +1
>>>> I think that is a really good idea and I agree on the points made.
>>>
>>> This proposal is point 4. It breaks the UnivariateStatistic API and it
>>> breaks what the user found interesting in this API, i.e. have a general
>>> statistics framework where one statistic can be replaced by another one.
>>>
>>> If you read again one of my earlier messages from today, we will combine
>>> this method (i.e. evaluate without values) and the
>>> UnivariateStatistics API.
>>>
>>> Perhaps we could add these new methods (i.e. addValues and evaluate
>>> without values) to UnivariateStatistics, but this can only be done on
>>> 3.0.
>>>
>>
>> We could add to AbstractUnivariateStatistics in 2.x and update the
>> interface in 3.0, if we can work out and agree on the setData() idea in
>> my last post.  While only some statistics can benefit internally from
>> caching, I think the setData idea is consistent with
>> UnivariateStatistics, as this is the interface for stats that require
>> the full set of data to be stored in memory.

I am trying to set up the setData(double[] values) and setData(double[]
values, int begin, int length) methods at AbstractUnivariateStatistics
level. This adds a field to the abstract class, and this field must be
copied by the various copy methods since it belongs to the instance
state. Each statistic implements two copy method, an instance method
which create a new instance and is declared by the top level
UnivariateStatistics interface and a static one that overwrite an
existing instance. They are all implemented by similar code, as shonw
here for the Percentile case:

public Percentile copy() {
    Percentile result = new Percentile();
    copy(this, result);
    return result;
}

public static void copy(Percentile source, Percentile dest) {
    dest.quantile = source.quantile;
}

How can I implement simply all these methods and make sure the stored
data array in the abstract base class is copied too ?

One method would be to add a static copy at abstract class level and
change the static methods at class hierarchy leaf call this method. For
Percentile, this would mean:

public static void copy(Percentile source, Percentile dest) {
    AbstractUnivariateStatistics.copy(source, dest);
    dest.quantile = source.quantile;
}

I don't like this method much, it is too easy to forget calling the base
class copy part. Preserving the static copy methods is important and if
theses static methods work well, neither the instance copy methods nor
the copy constructors need to be changed.

So despite I don't like adding a static
AbstractUnivariateStatistics.copy method, I think I will be forced to do it.

Does anybody have a better solution ?

Luc

> 
> This is fine for me.
> 
> Luc
> 
>>
>> Phil
>>> Luc
>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message