Mailing-List: contact dev-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Commons Developers List" <dev@commons.apache.org>
Received-SPF: pass (nike.apache.org: domain of venkateshamurthyts@gmail.com
 designates 209.85.192.44 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <53AF1DA1.3050504@gmail.com>
References: <53A87415.7060001@spaceroots.org>
	<CAE0LGD3UcODLT9fM0v2ZYwquPUcgC0_Jy1nJ=PS+kKDk4gtjOA@mail.gmail.com>
	<53A9C8BA.9000204@spaceroots.org>
	<CAE0LGD27esw+KfTz5fYB+1=qLMahMZnUkSdLujP9br7hoxaU6g@mail.gmail.com>
	<53AA7674.6080307@spaceroots.org>
	<53AF1DA1.3050504@gmail.com>
Date: Sun, 29 Jun 2014 22:18:59 +0530
Message-ID: 
 <CAE0LGD3=18q-sqF9EKcHzvy-1XmT17ikySX+mMWM6bd3fQwUPw@mail.gmail.com>
Subject: Re: [math] use case for NaNStrategy.FIXED?
From: venkatesha murthy <venkateshamurthyts@gmail.com>
To: Commons Developers List <dev@commons.apache.org>
Content-Type: multipart/alternative; boundary=047d7bdc8e20b907c504fcfc5195

--047d7bdc8e20b907c504fcfc5195
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Sun, Jun 29, 2014 at 1:25 AM, Phil Steitz <phil.steitz@gmail.com> wrote:

> On 6/25/14, 12:12 AM, Luc Maisonobe wrote:
> > Le 25/06/2014 06:05, venkatesha murthy a =C3=A9crit :
> >> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe <luc@spaceroots.org>
> wrote:
> >>
> >>> Hi Venkat,
> >>>
> >>> Le 23/06/2014 21:08, venkatesha murthy a =C3=A9crit :
> >>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc Maisonobe <luc@spaceroots.org>
> >>> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> While looking further in Percentile class for MATH-1120, I have fou=
nd
> >>>>> another problem in the current implementation. NaNStrategy.FIXED
> should
> >>>>> leave the NaNs in place, but at the end of the KthSelector.select
> >>>>> method, a call to Arrays.sort moves the NaNs to the end of the smal=
l
> >>>>> sub-array. What is really implemented here is therefore closer to
> >>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED. This always occur in th=
e
> >>>>> test cases because they use very short arrays, and we directly
> switch to
> >>>>> this part of the select method.
> >>>> Are NaNs considered higher than +Inf ?
> >>>> If MAXIMAL represent replacing for +inf ; you need something to
> >>>> indicate beyond this for NaN.
> >>> Well, we can just keep the NaN themselves and handled them
> >>> appropriately, hoping not to trash performances too much.
> >>>
> >> Agreed.
> >>
> >>>> What is the test input you see an issue and what is the actual error
> >>>> you are seeing. Please share the test case.
> >>> Just look at PercentileTest.testReplaceNanInRange(). The first check =
in
> >>> the test corresponds to a Percentile configuration at 50% percentile,
> >>> and NaNStrategy.FIXED. The array has an odd number of entries, so the
> >>> 50% percentile is exactly one of the entries: the one at index 5 in t=
he
> >>> final array.
> >>>
> >>> The initial ordering is { 0, 1, 2, 3, 4, NaN, NaN, 5, 7, NaN, 8 }. So
> >>> for the NaNStrategy.FIXED setting, it should not be modified at all i=
n
> >>> the selection algorithm and the result for 50% should be the first Na=
N
> >>> of the array, at index 5. In fact, due to the Arrays.sort, we *do*
> >>> reorder the array into  { 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so
> >>> the result is 5.
> >>>
> >>> Agreed. just verified by putting back the earlier insertionSort
> function.
> >>
> >>> If we use NaNStrategy.MAXIMAL and any quantile above 67%, we get as a
> >>> result Double.POSITIVE_INFINITY instead of Double.NaN.
> >>>
> >>> If we agree to leave FIXED as unchanged behaviour with your
> insertionSort
> >> code; then atleast MAXIMAL/MINIMAL should be allowed for transformatio=
n
> of
> >> NaN to +/-Inf
> > I'm fine with it, as long as we clearly state it in the NaNStrategy
> > documentation, saying somethig like:
> >
> >  When MAXIMAL/MINIMAL are used, the NaNs are effecively *replaced* by
> >  +/- infinity, hence the results will be computed as if infinities were
> >  there in the first place.
>
> -1 - this is mixing concerns.  NaNStrategy exists for one specific
> purpose - to turn extend the ordering on doubles a total ordering.
> Strategies prescribe only how NaNs are to be treated in the
> ordering.  Side effects on the input array don't make sense in this
> context.  The use case for which this class was created was rank
> transformations.  Returning infinities in rank transformations would
> blow up computations in these classes.  If a floating point
> computations need to transform NaNs, the specific stats / other
> classes that do this transformation should perform and document the
> behavior.
>
> Phil
>
OK. Agreed realized it later.  Hence i have not touched NaNStrategy in my
patch(MATH-1132) . Now i have added a separate transformer to transform NaN=
s
but it uses NanStrategy.  Please let me know on this as this trasnformation
itself
can be used in different places.

>
> >>>>> When I implemented the method, I explicitly avoided calling
> Arrays.sort
> >>>>> because it is a general purpose method that is know to be efficient
> only
> >>>>> for arrays of at least medium size. In most cases, when the array i=
s
> >>>>> small one falls back to a non-recursive method like a very simple
> >>>>> insertion sort, which is faster for smaller arrays.
> >>>> Please help me understand here; even java primitive Arrays.sort does
> >>>> an insertion sort for less than 7 elements
> >>>> (Refer sort1(double x[], int off, int len))
> >>>> So what is it that the custom insertion sort that does differently o=
r
> >>>> is advantageous. Is it the value 15 elements?
> >>> I don't see a reference to 7 elements, neither in the Java6 nor in th=
e
> >>> Java 7 doc
> >> Please take a look at the sort1 method where there is a first block in
> the
> >> code which clearly mentions len < 7
> >>     /**
> >>      * Sorts the specified sub-array of doubles into ascending order.
> >>      */
> >>     private static void sort1(double x[], int off, int len) {
> >>     // Insertion sort on smallest arrays
> >>     if (len < 7) {
> >>         for (int i=3Doff; i<len+off; i++)
> >>         for (int j=3Di; j>off && x[j-1]>x[j]; j--)
> >>             swap(x, j, j-1);
> >>         return;
> >>     }
> >>     :
> >>     :
> >>     : code continues for the else part
> >>
> >> Also the grepcode url
> >> <
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-=
b14/java/util/Arrays.java#Arrays.sort1%28double[]%2Cint%2Cint%29
> >
> >> indicates the same
> > OK, you were refering to a specific implementation. I was thinking in
> > the general case.
> >
> >> (and in any case the doc explicitly states the algorithms
> >>> explained are only implementation notes and are not part of the
> >>> specification).
> >>>
> >> Yes its a part of comments anyways.
> >>
> >>> However, the most important part for now is the fact that we control =
it
> >>> and may be able to implement different NaN strategies. What we have
> >>> currently fails.
> >>>
> >>> I agree on this  and hence here is my take:
> >> Leave FIXED as-is and use the earlier insertionSort code (just change
> the
> >> name to sort rather than hardcoding it as insertionsort) to handle the
> case
> >> you were mentioning
> > If we leave FIXED it as it is done know, even with insertionSort we do
> > not really control what happens. It is deterministic as the pivoting an=
d
> > if/else structure is well defined (both in the selection part and in th=
e
> > final sort for sub-arrays), but it is untrackable so we can't document
> it.
> >
> >> Continue to use MAXIMAL/MINIMAL for +/-Inf transformation and that way
> we
> >> have covered both Inf and nan cases.
> > OK.
> >
> >> Use REMOVED as default for all Percentile Estimation Types. (mostly
> >> influenced by R here perhaps)
> > This would be fine with me. I have concerns with FIXED only, the other
> > strategies all seem quite realistic.
> >
> > Does anybody else has an advice for FIXED? What are the use case?
> >
> > best regards,
> > Luc
> >
> >> best regards,
> >>> Luc
> >>>
> >>>>> In the select
> >>>>> operation, we know we have small sub-arrays at the call point. Goin=
g
> >>>>> back to the former insertionSort would recover the good behavior fo=
r
> >>>>> small arrays, but would in fact not be sufficient to really
> implement a
> >>>>> NaNStrategy.FIXED. I guess it would be simple to make it behave lik=
e
> >>>>> NaNStrategy.MAXIMAL but I did not try yet.
> >>>>>
> >>>>> My point here is rather: can we really and should we really impleme=
nt
> >>>>> NaNStrategy.FIXED? Looking at how it is used elsewhere, it needs to
> >>>>> store the original position of the NaNs. It is quite cumbersome.
> >>>>>
> >>>>> I wonder what is the use case for NaNStrategy.FIXED at all.
> >>>>>
> >>>>> Going further and looking at the use cases for other NaNStrategy
> values,
> >>>>> the NaNs are replaced by +/- infinity before sorting, which is OK f=
or
> >>>>> ranking as we only rely on the indices, but we use the values
> themselves
> >>>>> in Percentile. So sometimes, we end up with computing interpolation=
s
> >>>>> like 0.5 * (x[k] + x[k+1]) or similar. If x[k] is a finite number a=
nd
> >>>>> x[k+1] has been changed to +infinity, we get +infinity, instead of
> the
> >>>>> NaN we should have retrieved without replacement. So here again, I'=
m
> not
> >>>>> sure we are doing the right thing.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> best regards,
> >>>>> Luc
> >>>>>
> >>>>> -------------------------------------------------------------------=
--
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>>>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>>>
> >>>> --------------------------------------------------------------------=
-
> >>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>>
> >>>>
> >>>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> >>> For additional commands, e-mail: dev-help@commons.apache.org
> >>>
> >>>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> > For additional commands, e-mail: dev-help@commons.apache.org
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

--047d7bdc8e20b907c504fcfc5195--