commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <Luc.Maison...@free.fr>
Subject Re: [math] speeding up percentile based statistics
Date Sun, 26 Sep 2010 19:34:29 GMT
Le 26/09/2010 21:05, Phil Steitz a écrit :
> On 9/26/10 2:30 PM, Luc Maisonobe wrote:
>> Le 26/09/2010 19:27, Mikkel Meyer Andersen a écrit :
>>> 2010/9/26 Gilles Sadowski<gilles@harfang.homelinux.org>:
>>>>> [...]
>>>>>
>>>>>   1) do nothing to check the array is the same between calls and
>>>>> blindly
>>>>>      assumes it IS the same. Users would really need to call
>>>>> clearCache
>>>>>      when they provide a new array
>>>>>      pros: very simple
>>>>>      cons: error-prone for the user as it relies only on reading and
>>>>>      understanding a documentation that would change with new version
>>>>>
>>>>>   2) check only array reference equality (with ==) to check the array
>>>>>      is the same as the previous call or not. Users would need to call
>>>>>      clearCache only if they use the same array but have changed its
>>>>>      content.
>>>>>      pros: trade-off between efficiency and reliability,
>>>>>            handles most cases properly
>>>>>      cons: may be wrong in corner cases
>>>>>
>>>>>   3) check array content using an hash code. Users would generally
>>>>> don't
>>>>>      need to call clearCache at all so it would not be provided
>>>>>      pros: works in all cases
>>>>>      cons: add linear cost for array checking
>>>>>
>>>>>   4) remove the double[] values parameter from the API and use a
>>>>> separate
>>>>>      addValues method, hence the class will reset its cache only when
>>>>>      addValues is called
>>>>>      pros: works in all cases
>>>>>      cons: Percentile could not implement UnivariateStatistic anymore
>>>>>
>>>>> My preference is choice 2.
>>>>>
>>>>> What do you think ?
>>>>
>>>> IIUC, the interface method ("evaluate") was designed around the
>>>> assumption
>>>> that successive calls are independent: the array argument is not
>>>> encapsulated for further processing (e.g. caching).
>>>> IMO, the equality check and the "clearCache" method look like
>>>> workarounds,
>>>> not a clean solution.
>>>> As an example, what if a user calls "evaluate" several times
>>>> alternating on
>>>> two different arrays:
>>>>
>>>> ---CUT---
>>>>     double[] a = new double[] {1, 2, 3};
>>>>     double[] b = new double[] {1, 2, 3, 4, 5};
>>>>     Percentile p = new Percentile();
>>>>
>>>>     dourble r;
>>>>     r = p.evaluate(a, 50);
>>>>     r = p.evaluate(b, 50);
>>>>     r = p.evaluate(a, 50);
>>>>     r = p.evaluate(b, 50);
>>>>
>>>>     // etc.
>>>> ---CUT---
>>>>
>>>> Doesn't this kind of use-case nullify the expected optimization?
>>>>
>>>> If the array is going to be reused, the user call should reflect
>>>> that fact;
>>>> e.g. by passing the array in a constructor:
>>>>
>>>> ---CUT---
>>>>     double[] a = new double[] {1, 2, 3};
>>>>     double[] b = new double[] {1, 2, 3, 4, 5};
>>>>     Percentile pA = new Percentile(a);
>>>>     Percentile pB = new Percentile(b);
>>>>
>>>>     double r;
>>>>     r = pA.evaluate(50);
>>>>     r = pB.evaluate(50);
>>>>     r = pA.evaluate(50);
>>>>     r = pB.evaluate(50);
>>>> ---CUT---
>>>>
>>>> That way, later calls can benefit from whatever preprocessing was
>>>> done in
>>>> previous calls.
>>>> The instance will always control all the information needed (e.g.
>>>> after a
>>>> call to an "addValues" method) for the processing without the need
>>>> to rely
>>>> on the user for calling "clearCache" whenever necessary.
>>>>
>>>>
>>>> Gilles
>>>
>>> +1
>>> I think that is a really good idea and I agree on the points made.
>>
>> This proposal is point 4. It breaks the UnivariateStatistic API and it
>> breaks what the user found interesting in this API, i.e. have a general
>> statistics framework where one statistic can be replaced by another one.
>>
>> If you read again one of my earlier messages from today, we will combine
>> this method (i.e. evaluate without values) and the
>> UnivariateStatistics API.
>>
>> Perhaps we could add these new methods (i.e. addValues and evaluate
>> without values) to UnivariateStatistics, but this can only be done on
>> 3.0.
>>
> 
> We could add to AbstractUnivariateStatistics in 2.x and update the
> interface in 3.0, if we can work out and agree on the setData() idea in
> my last post.  While only some statistics can benefit internally from
> caching, I think the setData idea is consistent with
> UnivariateStatistics, as this is the interface for stats that require
> the full set of data to be stored in memory.

This is fine for me.

Luc

> 
> Phil
>> Luc
>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message