commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <...@spaceroots.org>
Subject Re: [math] use case for NaNStrategy.FIXED?
Date Wed, 25 Jun 2014 07:12:52 GMT
Le 25/06/2014 06:05, venkatesha murthy a écrit :
> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe <luc@spaceroots.org> wrote:
> 
>> Hi Venkat,
>>
>> Le 23/06/2014 21:08, venkatesha murthy a écrit :
>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc Maisonobe <luc@spaceroots.org>
>> wrote:
>>>> Hi all,
>>>>
>>>> While looking further in Percentile class for MATH-1120, I have found
>>>> another problem in the current implementation. NaNStrategy.FIXED should
>>>> leave the NaNs in place, but at the end of the KthSelector.select
>>>> method, a call to Arrays.sort moves the NaNs to the end of the small
>>>> sub-array. What is really implemented here is therefore closer to
>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED. This always occur in the
>>>> test cases because they use very short arrays, and we directly switch to
>>>> this part of the select method.
>>> Are NaNs considered higher than +Inf ?
>>> If MAXIMAL represent replacing for +inf ; you need something to
>>> indicate beyond this for NaN.
>>
>> Well, we can just keep the NaN themselves and handled them
>> appropriately, hoping not to trash performances too much.
>>
> Agreed.
> 
>>
>>> What is the test input you see an issue and what is the actual error
>>> you are seeing. Please share the test case.
>>
>> Just look at PercentileTest.testReplaceNanInRange(). The first check in
>> the test corresponds to a Percentile configuration at 50% percentile,
>> and NaNStrategy.FIXED. The array has an odd number of entries, so the
>> 50% percentile is exactly one of the entries: the one at index 5 in the
>> final array.
>>
>> The initial ordering is { 0, 1, 2, 3, 4, NaN, NaN, 5, 7, NaN, 8 }. So
>> for the NaNStrategy.FIXED setting, it should not be modified at all in
>> the selection algorithm and the result for 50% should be the first NaN
>> of the array, at index 5. In fact, due to the Arrays.sort, we *do*
>> reorder the array into  { 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so
>> the result is 5.
>>
>> Agreed. just verified by putting back the earlier insertionSort function.
> 
> 
>> If we use NaNStrategy.MAXIMAL and any quantile above 67%, we get as a
>> result Double.POSITIVE_INFINITY instead of Double.NaN.
>>
>> If we agree to leave FIXED as unchanged behaviour with your insertionSort
> code; then atleast MAXIMAL/MINIMAL should be allowed for transformation of
> NaN to +/-Inf

I'm fine with it, as long as we clearly state it in the NaNStrategy
documentation, saying somethig like:

 When MAXIMAL/MINIMAL are used, the NaNs are effecively *replaced* by
 +/- infinity, hence the results will be computed as if infinities were
 there in the first place.

> 
>>>>
>>>> When I implemented the method, I explicitly avoided calling Arrays.sort
>>>> because it is a general purpose method that is know to be efficient only
>>>> for arrays of at least medium size. In most cases, when the array is
>>>> small one falls back to a non-recursive method like a very simple
>>>> insertion sort, which is faster for smaller arrays.
>>>
>>> Please help me understand here; even java primitive Arrays.sort does
>>> an insertion sort for less than 7 elements
>>> (Refer sort1(double x[], int off, int len))
>>> So what is it that the custom insertion sort that does differently or
>>> is advantageous. Is it the value 15 elements?
>>
>> I don't see a reference to 7 elements, neither in the Java6 nor in the
>> Java 7 doc
> 
> Please take a look at the sort1 method where there is a first block in the
> code which clearly mentions len < 7
>     /**
>      * Sorts the specified sub-array of doubles into ascending order.
>      */
>     private static void sort1(double x[], int off, int len) {
>     // Insertion sort on smallest arrays
>     if (len < 7) {
>         for (int i=off; i<len+off; i++)
>         for (int j=i; j>off && x[j-1]>x[j]; j--)
>             swap(x, j, j-1);
>         return;
>     }
>     :
>     :
>     : code continues for the else part
> 
> Also the grepcode url
> <http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/Arrays.java#Arrays.sort1%28double[]%2Cint%2Cint%29>
> indicates the same

OK, you were refering to a specific implementation. I was thinking in
the general case.

> 
> (and in any case the doc explicitly states the algorithms
>> explained are only implementation notes and are not part of the
>> specification).
>>
> Yes its a part of comments anyways.
> 
>>
>> However, the most important part for now is the fact that we control it
>> and may be able to implement different NaN strategies. What we have
>> currently fails.
>>
>> I agree on this  and hence here is my take:
> Leave FIXED as-is and use the earlier insertionSort code (just change the
> name to sort rather than hardcoding it as insertionsort) to handle the case
> you were mentioning

If we leave FIXED it as it is done know, even with insertionSort we do
not really control what happens. It is deterministic as the pivoting and
if/else structure is well defined (both in the selection part and in the
final sort for sub-arrays), but it is untrackable so we can't document it.

> Continue to use MAXIMAL/MINIMAL for +/-Inf transformation and that way we
> have covered both Inf and nan cases.

OK.

> Use REMOVED as default for all Percentile Estimation Types. (mostly
> influenced by R here perhaps)

This would be fine with me. I have concerns with FIXED only, the other
strategies all seem quite realistic.

Does anybody else has an advice for FIXED? What are the use case?

best regards,
Luc

> 
> best regards,
>> Luc
>>
>>>
>>>> In the select
>>>> operation, we know we have small sub-arrays at the call point. Going
>>>> back to the former insertionSort would recover the good behavior for
>>>> small arrays, but would in fact not be sufficient to really implement a
>>>> NaNStrategy.FIXED. I guess it would be simple to make it behave like
>>>> NaNStrategy.MAXIMAL but I did not try yet.
>>>>
>>>> My point here is rather: can we really and should we really implement
>>>> NaNStrategy.FIXED? Looking at how it is used elsewhere, it needs to
>>>> store the original position of the NaNs. It is quite cumbersome.
>>>>
>>>> I wonder what is the use case for NaNStrategy.FIXED at all.
>>>>
>>>> Going further and looking at the use cases for other NaNStrategy values,
>>>> the NaNs are replaced by +/- infinity before sorting, which is OK for
>>>> ranking as we only rely on the indices, but we use the values themselves
>>>> in Percentile. So sometimes, we end up with computing interpolations
>>>> like 0.5 * (x[k] + x[k+1]) or similar. If x[k] is a finite number and
>>>> x[k+1] has been changed to +infinity, we get +infinity, instead of the
>>>> NaN we should have retrieved without replacement. So here again, I'm not
>>>> sure we are doing the right thing.
>>>>
>>>> What do you think?
>>>>
>>>> best regards,
>>>> Luc
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message