commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Phil Steitz <phil.ste...@gmail.com>
Subject Re: [math] speeding up percentile based statistics
Date Mon, 04 Oct 2010 00:52:34 GMT
On 10/3/10 3:55 PM, Luc Maisonobe wrote:
> Le 26/09/2010 21:34, Luc Maisonobe a écrit :
>> Le 26/09/2010 21:05, Phil Steitz a écrit :
>>> On 9/26/10 2:30 PM, Luc Maisonobe wrote:
>>>> Le 26/09/2010 19:27, Mikkel Meyer Andersen a écrit :
>>>>> 2010/9/26 Gilles Sadowski<gilles@harfang.homelinux.org>:
>>>>>>> [...]
>>>>>>>
>>>>>>>    1) do nothing to check the array is the same between calls
and
>>>>>>> blindly
>>>>>>>       assumes it IS the same. Users would really need to call
>>>>>>> clearCache
>>>>>>>       when they provide a new array
>>>>>>>       pros: very simple
>>>>>>>       cons: error-prone for the user as it relies only on reading
and
>>>>>>>       understanding a documentation that would change with new
version
>>>>>>>
>>>>>>>    2) check only array reference equality (with ==) to check
the array
>>>>>>>       is the same as the previous call or not. Users would need
to call
>>>>>>>       clearCache only if they use the same array but have changed
its
>>>>>>>       content.
>>>>>>>       pros: trade-off between efficiency and reliability,
>>>>>>>             handles most cases properly
>>>>>>>       cons: may be wrong in corner cases
>>>>>>>
>>>>>>>    3) check array content using an hash code. Users would generally
>>>>>>> don't
>>>>>>>       need to call clearCache at all so it would not be provided
>>>>>>>       pros: works in all cases
>>>>>>>       cons: add linear cost for array checking
>>>>>>>
>>>>>>>    4) remove the double[] values parameter from the API and use
a
>>>>>>> separate
>>>>>>>       addValues method, hence the class will reset its cache
only when
>>>>>>>       addValues is called
>>>>>>>       pros: works in all cases
>>>>>>>       cons: Percentile could not implement UnivariateStatistic
anymore
>>>>>>>
>>>>>>> My preference is choice 2.
>>>>>>>
>>>>>>> What do you think ?
>>>>>>
>>>>>> IIUC, the interface method ("evaluate") was designed around the
>>>>>> assumption
>>>>>> that successive calls are independent: the array argument is not
>>>>>> encapsulated for further processing (e.g. caching).
>>>>>> IMO, the equality check and the "clearCache" method look like
>>>>>> workarounds,
>>>>>> not a clean solution.
>>>>>> As an example, what if a user calls "evaluate" several times
>>>>>> alternating on
>>>>>> two different arrays:
>>>>>>
>>>>>> ---CUT---
>>>>>>      double[] a = new double[] {1, 2, 3};
>>>>>>      double[] b = new double[] {1, 2, 3, 4, 5};
>>>>>>      Percentile p = new Percentile();
>>>>>>
>>>>>>      dourble r;
>>>>>>      r = p.evaluate(a, 50);
>>>>>>      r = p.evaluate(b, 50);
>>>>>>      r = p.evaluate(a, 50);
>>>>>>      r = p.evaluate(b, 50);
>>>>>>
>>>>>>      // etc.
>>>>>> ---CUT---
>>>>>>
>>>>>> Doesn't this kind of use-case nullify the expected optimization?
>>>>>>
>>>>>> If the array is going to be reused, the user call should reflect
>>>>>> that fact;
>>>>>> e.g. by passing the array in a constructor:
>>>>>>
>>>>>> ---CUT---
>>>>>>      double[] a = new double[] {1, 2, 3};
>>>>>>      double[] b = new double[] {1, 2, 3, 4, 5};
>>>>>>      Percentile pA = new Percentile(a);
>>>>>>      Percentile pB = new Percentile(b);
>>>>>>
>>>>>>      double r;
>>>>>>      r = pA.evaluate(50);
>>>>>>      r = pB.evaluate(50);
>>>>>>      r = pA.evaluate(50);
>>>>>>      r = pB.evaluate(50);
>>>>>> ---CUT---
>>>>>>
>>>>>> That way, later calls can benefit from whatever preprocessing was
>>>>>> done in
>>>>>> previous calls.
>>>>>> The instance will always control all the information needed (e.g.
>>>>>> after a
>>>>>> call to an "addValues" method) for the processing without the need
>>>>>> to rely
>>>>>> on the user for calling "clearCache" whenever necessary.
>>>>>>
>>>>>>
>>>>>> Gilles
>>>>>
>>>>> +1
>>>>> I think that is a really good idea and I agree on the points made.
>>>>
>>>> This proposal is point 4. It breaks the UnivariateStatistic API and it
>>>> breaks what the user found interesting in this API, i.e. have a general
>>>> statistics framework where one statistic can be replaced by another one.
>>>>
>>>> If you read again one of my earlier messages from today, we will combine
>>>> this method (i.e. evaluate without values) and the
>>>> UnivariateStatistics API.
>>>>
>>>> Perhaps we could add these new methods (i.e. addValues and evaluate
>>>> without values) to UnivariateStatistics, but this can only be done on
>>>> 3.0.
>>>>
>>>
>>> We could add to AbstractUnivariateStatistics in 2.x and update the
>>> interface in 3.0, if we can work out and agree on the setData() idea in
>>> my last post.  While only some statistics can benefit internally from
>>> caching, I think the setData idea is consistent with
>>> UnivariateStatistics, as this is the interface for stats that require
>>> the full set of data to be stored in memory.
>
> I am trying to set up the setData(double[] values) and setData(double[]
> values, int begin, int length) methods at AbstractUnivariateStatistics
> level. This adds a field to the abstract class, and this field must be
> copied by the various copy methods since it belongs to the instance
> state. Each statistic implements two copy method, an instance method
> which create a new instance and is declared by the top level
> UnivariateStatistics interface and a static one that overwrite an
> existing instance. They are all implemented by similar code, as shonw
> here for the Percentile case:
>
> public Percentile copy() {
>      Percentile result = new Percentile();
>      copy(this, result);
>      return result;
> }
>
> public static void copy(Percentile source, Percentile dest) {
>      dest.quantile = source.quantile;
> }
>
> How can I implement simply all these methods and make sure the stored
> data array in the abstract base class is copied too ?
>
> One method would be to add a static copy at abstract class level and
> change the static methods at class hierarchy leaf call this method. For
> Percentile, this would mean:
>
> public static void copy(Percentile source, Percentile dest) {
>      AbstractUnivariateStatistics.copy(source, dest);
>      dest.quantile = source.quantile;
> }
>
> I don't like this method much, it is too easy to forget calling the base
> class copy part. Preserving the static copy methods is important and if
> theses static methods work well, neither the instance copy methods nor
> the copy constructors need to be changed.
>
> So despite I don't like adding a static
> AbstractUnivariateStatistics.copy method, I think I will be forced to do it.
>
> Does anybody have a better solution ?

I don't know if this is any better, but it seems to me that you are 
in any case going to need a getData method (unless the new data 
field is protected), so you could provide a default impl of setData 
and getData in AbstractUnivariateStatistics and then just use these 
directly in copy

public static void copy(Percentile source, Percentile dest) {
       dest.setData(source.getData());
       dest.quantile = source.quantile;
}

I guess the first line above is your 
AbstractUnivariateStatistics.copy method, so this is really no 
different.

We do want to keep the copy stuff working.  Maybe someone else has a 
better idea...

Phil
>
> Luc
>
>>
>> This is fine for me.
>>
>> Luc
>>
>>>
>>> Phil
>>>> Luc
>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message