commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <...@spaceroots.org>
Subject Re: [math] use case for NaNStrategy.FIXED?
Date Wed, 25 Jun 2014 07:15:03 GMT
Le 25/06/2014 06:10, venkatesha murthy a écrit :
> On Wed, Jun 25, 2014 at 9:35 AM, venkatesha murthy <
> venkateshamurthyts@gmail.com> wrote:
> Can i put a patch for this change?

I think we should still discuss about FIXED.

If you want to set up u patch for the documentation of NaNStrategy
explaining the replacement done in MAXIMAL/MINIMAL, please go ahead.
You should put it in a new JIRA issue for clarity.

best regards,
Luc

> 
>>
>> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe <luc@spaceroots.org>
>> wrote:
>>
>>> Hi Venkat,
>>>
>>> Le 23/06/2014 21:08, venkatesha murthy a écrit :
>>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc Maisonobe <luc@spaceroots.org>
>>> wrote:
>>>>> Hi all,
>>>>>
>>>>> While looking further in Percentile class for MATH-1120, I have found
>>>>> another problem in the current implementation. NaNStrategy.FIXED should
>>>>> leave the NaNs in place, but at the end of the KthSelector.select
>>>>> method, a call to Arrays.sort moves the NaNs to the end of the small
>>>>> sub-array. What is really implemented here is therefore closer to
>>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED. This always occur in the
>>>>> test cases because they use very short arrays, and we directly switch
>>> to
>>>>> this part of the select method.
>>>> Are NaNs considered higher than +Inf ?
>>>> If MAXIMAL represent replacing for +inf ; you need something to
>>>> indicate beyond this for NaN.
>>>
>>> Well, we can just keep the NaN themselves and handled them
>>> appropriately, hoping not to trash performances too much.
>>>
>> Agreed.
>>
>>>
>>>> What is the test input you see an issue and what is the actual error
>>>> you are seeing. Please share the test case.
>>>
>>> Just look at PercentileTest.testReplaceNanInRange(). The first check in
>>> the test corresponds to a Percentile configuration at 50% percentile,
>>> and NaNStrategy.FIXED. The array has an odd number of entries, so the
>>> 50% percentile is exactly one of the entries: the one at index 5 in the
>>> final array.
>>>
>>> The initial ordering is { 0, 1, 2, 3, 4, NaN, NaN, 5, 7, NaN, 8 }. So
>>> for the NaNStrategy.FIXED setting, it should not be modified at all in
>>> the selection algorithm and the result for 50% should be the first NaN
>>> of the array, at index 5. In fact, due to the Arrays.sort, we *do*
>>> reorder the array into  { 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so
>>> the result is 5.
>>>
>>> Agreed. just verified by putting back the earlier insertionSort function.
>>
>>
>>> If we use NaNStrategy.MAXIMAL and any quantile above 67%, we get as a
>>> result Double.POSITIVE_INFINITY instead of Double.NaN.
>>>
>>> If we agree to leave FIXED as unchanged behaviour with your insertionSort
>> code; then atleast MAXIMAL/MINIMAL should be allowed for transformation of
>> NaN to +/-Inf
>>
>>>  >>
>>>>> When I implemented the method, I explicitly avoided calling Arrays.sort
>>>>> because it is a general purpose method that is know to be efficient
>>> only
>>>>> for arrays of at least medium size. In most cases, when the array is
>>>>> small one falls back to a non-recursive method like a very simple
>>>>> insertion sort, which is faster for smaller arrays.
>>>>
>>>> Please help me understand here; even java primitive Arrays.sort does
>>>> an insertion sort for less than 7 elements
>>>> (Refer sort1(double x[], int off, int len))
>>>> So what is it that the custom insertion sort that does differently or
>>>> is advantageous. Is it the value 15 elements?
>>>
>>> I don't see a reference to 7 elements, neither in the Java6 nor in the
>>> Java 7 doc
>>
>> Please take a look at the sort1 method where there is a first block in the
>> code which clearly mentions len < 7
>>     /**
>>      * Sorts the specified sub-array of doubles into ascending order.
>>      */
>>     private static void sort1(double x[], int off, int len) {
>>     // Insertion sort on smallest arrays
>>     if (len < 7) {
>>         for (int i=off; i<len+off; i++)
>>         for (int j=i; j>off && x[j-1]>x[j]; j--)
>>             swap(x, j, j-1);
>>         return;
>>     }
>>     :
>>     :
>>     : code continues for the else part
>>
>> Also the grepcode url
>> <http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/Arrays.java#Arrays.sort1%28double[]%2Cint%2Cint%29>
>> indicates the same
>>
>> (and in any case the doc explicitly states the algorithms
>>> explained are only implementation notes and are not part of the
>>> specification).
>>>
>> Yes its a part of comments anyways.
>>
>>>
>>> However, the most important part for now is the fact that we control it
>>> and may be able to implement different NaN strategies. What we have
>>> currently fails.
>>>
>>> I agree on this  and hence here is my take:
>> Leave FIXED as-is and use the earlier insertionSort code (just change the
>> name to sort rather than hardcoding it as insertionsort) to handle the case
>> you were mentioning
>> Continue to use MAXIMAL/MINIMAL for +/-Inf transformation and that way we
>> have covered both Inf and nan cases.
>> Use REMOVED as default for all Percentile Estimation Types. (mostly
>> influenced by R here perhaps)
>>
>> best regards,
>>> Luc
>>>
>>>>
>>>>> In the select
>>>>> operation, we know we have small sub-arrays at the call point. Going
>>>>> back to the former insertionSort would recover the good behavior for
>>>>> small arrays, but would in fact not be sufficient to really implement
a
>>>>> NaNStrategy.FIXED. I guess it would be simple to make it behave like
>>>>> NaNStrategy.MAXIMAL but I did not try yet.
>>>>>
>>>>> My point here is rather: can we really and should we really implement
>>>>> NaNStrategy.FIXED? Looking at how it is used elsewhere, it needs to
>>>>> store the original position of the NaNs. It is quite cumbersome.
>>>>>
>>>>> I wonder what is the use case for NaNStrategy.FIXED at all.
>>>>>
>>>>> Going further and looking at the use cases for other NaNStrategy
>>> values,
>>>>> the NaNs are replaced by +/- infinity before sorting, which is OK for
>>>>> ranking as we only rely on the indices, but we use the values
>>> themselves
>>>>> in Percentile. So sometimes, we end up with computing interpolations
>>>>> like 0.5 * (x[k] + x[k+1]) or similar. If x[k] is a finite number and
>>>>> x[k+1] has been changed to +infinity, we get +infinity, instead of the
>>>>> NaN we should have retrieved without replacement. So here again, I'm
>>> not
>>>>> sure we are doing the right thing.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> best regards,
>>>>> Luc
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>>
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message