Return-Path: X-Original-To: apmail-commons-dev-archive@www.apache.org Delivered-To: apmail-commons-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6E16C11E27 for ; Sun, 29 Jun 2014 16:49:33 +0000 (UTC) Received: (qmail 91168 invoked by uid 500); 29 Jun 2014 16:49:33 -0000 Delivered-To: apmail-commons-dev-archive@commons.apache.org Received: (qmail 91047 invoked by uid 500); 29 Jun 2014 16:49:33 -0000 Mailing-List: contact dev-help@commons.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Commons Developers List" Delivered-To: mailing list dev@commons.apache.org Received: (qmail 91027 invoked by uid 99); 29 Jun 2014 16:49:27 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 29 Jun 2014 16:49:27 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of venkateshamurthyts@gmail.com designates 209.85.192.44 as permitted sender) Received: from [209.85.192.44] (HELO mail-qg0-f44.google.com) (209.85.192.44) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 29 Jun 2014 16:49:24 +0000 Received: by mail-qg0-f44.google.com with SMTP id j107so1179842qga.3 for ; Sun, 29 Jun 2014 09:49:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=PJ0A59tBFOGPQgW7sLIuyptYtsIbQ/6TQH41FTBICsE=; b=x8KlYX7ZLOBrsn1i3OOe/lsXDLvP0453Uk/S245X//S4j2Hj3Gios14RhELy2kygDh LAP9mATNkhgkVKRKIvjX6UhLh4Ho0sT04yQUoy+S2AURapqybIbhVo3iYznwKa30R2VU VSDwSEUYdC8OPx7Aqnmt8oeHq57fWAea22V4tzKX1k7p1Yqnz/L+/J9p3PnvRxR2BAQN +ffdtCbG1M0MVZbyu0/bZDuYHPIbl85Vm3v9ojRj/jXUZ/u0kazUJ6v1DI69tBglT6ia EPFPfbqwickRrcmVyXOlQuvGh13KwTrxPVTsa02dPw/KovDUGgyLW97XaT/R/M1Sq42B 1/gA== MIME-Version: 1.0 X-Received: by 10.224.123.202 with SMTP id q10mr52137934qar.79.1404060539879; Sun, 29 Jun 2014 09:48:59 -0700 (PDT) Received: by 10.140.37.73 with HTTP; Sun, 29 Jun 2014 09:48:59 -0700 (PDT) In-Reply-To: <53AF1DA1.3050504@gmail.com> References: <53A87415.7060001@spaceroots.org> <53A9C8BA.9000204@spaceroots.org> <53AA7674.6080307@spaceroots.org> <53AF1DA1.3050504@gmail.com> Date: Sun, 29 Jun 2014 22:18:59 +0530 Message-ID: Subject: Re: [math] use case for NaNStrategy.FIXED? From: venkatesha murthy To: Commons Developers List Content-Type: multipart/alternative; boundary=047d7bdc8e20b907c504fcfc5195 X-Virus-Checked: Checked by ClamAV on apache.org --047d7bdc8e20b907c504fcfc5195 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Sun, Jun 29, 2014 at 1:25 AM, Phil Steitz wrote: > On 6/25/14, 12:12 AM, Luc Maisonobe wrote: > > Le 25/06/2014 06:05, venkatesha murthy a =C3=A9crit : > >> On Wed, Jun 25, 2014 at 12:21 AM, Luc Maisonobe > wrote: > >> > >>> Hi Venkat, > >>> > >>> Le 23/06/2014 21:08, venkatesha murthy a =C3=A9crit : > >>>> On Tue, Jun 24, 2014 at 12:08 AM, Luc Maisonobe > >>> wrote: > >>>>> Hi all, > >>>>> > >>>>> While looking further in Percentile class for MATH-1120, I have fou= nd > >>>>> another problem in the current implementation. NaNStrategy.FIXED > should > >>>>> leave the NaNs in place, but at the end of the KthSelector.select > >>>>> method, a call to Arrays.sort moves the NaNs to the end of the smal= l > >>>>> sub-array. What is really implemented here is therefore closer to > >>>>> NaNStrategy.MAXIMAL than NaNStrategy.FIXED. This always occur in th= e > >>>>> test cases because they use very short arrays, and we directly > switch to > >>>>> this part of the select method. > >>>> Are NaNs considered higher than +Inf ? > >>>> If MAXIMAL represent replacing for +inf ; you need something to > >>>> indicate beyond this for NaN. > >>> Well, we can just keep the NaN themselves and handled them > >>> appropriately, hoping not to trash performances too much. > >>> > >> Agreed. > >> > >>>> What is the test input you see an issue and what is the actual error > >>>> you are seeing. Please share the test case. > >>> Just look at PercentileTest.testReplaceNanInRange(). The first check = in > >>> the test corresponds to a Percentile configuration at 50% percentile, > >>> and NaNStrategy.FIXED. The array has an odd number of entries, so the > >>> 50% percentile is exactly one of the entries: the one at index 5 in t= he > >>> final array. > >>> > >>> The initial ordering is { 0, 1, 2, 3, 4, NaN, NaN, 5, 7, NaN, 8 }. So > >>> for the NaNStrategy.FIXED setting, it should not be modified at all i= n > >>> the selection algorithm and the result for 50% should be the first Na= N > >>> of the array, at index 5. In fact, due to the Arrays.sort, we *do* > >>> reorder the array into { 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN, NaN }, so > >>> the result is 5. > >>> > >>> Agreed. just verified by putting back the earlier insertionSort > function. > >> > >>> If we use NaNStrategy.MAXIMAL and any quantile above 67%, we get as a > >>> result Double.POSITIVE_INFINITY instead of Double.NaN. > >>> > >>> If we agree to leave FIXED as unchanged behaviour with your > insertionSort > >> code; then atleast MAXIMAL/MINIMAL should be allowed for transformatio= n > of > >> NaN to +/-Inf > > I'm fine with it, as long as we clearly state it in the NaNStrategy > > documentation, saying somethig like: > > > > When MAXIMAL/MINIMAL are used, the NaNs are effecively *replaced* by > > +/- infinity, hence the results will be computed as if infinities were > > there in the first place. > > -1 - this is mixing concerns. NaNStrategy exists for one specific > purpose - to turn extend the ordering on doubles a total ordering. > Strategies prescribe only how NaNs are to be treated in the > ordering. Side effects on the input array don't make sense in this > context. The use case for which this class was created was rank > transformations. Returning infinities in rank transformations would > blow up computations in these classes. If a floating point > computations need to transform NaNs, the specific stats / other > classes that do this transformation should perform and document the > behavior. > > Phil > OK. Agreed realized it later. Hence i have not touched NaNStrategy in my patch(MATH-1132) . Now i have added a separate transformer to transform NaN= s but it uses NanStrategy. Please let me know on this as this trasnformation itself can be used in different places. > > >>>>> When I implemented the method, I explicitly avoided calling > Arrays.sort > >>>>> because it is a general purpose method that is know to be efficient > only > >>>>> for arrays of at least medium size. In most cases, when the array i= s > >>>>> small one falls back to a non-recursive method like a very simple > >>>>> insertion sort, which is faster for smaller arrays. > >>>> Please help me understand here; even java primitive Arrays.sort does > >>>> an insertion sort for less than 7 elements > >>>> (Refer sort1(double x[], int off, int len)) > >>>> So what is it that the custom insertion sort that does differently o= r > >>>> is advantageous. Is it the value 15 elements? > >>> I don't see a reference to 7 elements, neither in the Java6 nor in th= e > >>> Java 7 doc > >> Please take a look at the sort1 method where there is a first block in > the > >> code which clearly mentions len < 7 > >> /** > >> * Sorts the specified sub-array of doubles into ascending order. > >> */ > >> private static void sort1(double x[], int off, int len) { > >> // Insertion sort on smallest arrays > >> if (len < 7) { > >> for (int i=3Doff; i >> for (int j=3Di; j>off && x[j-1]>x[j]; j--) > >> swap(x, j, j-1); > >> return; > >> } > >> : > >> : > >> : code continues for the else part > >> > >> Also the grepcode url > >> < > http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-= b14/java/util/Arrays.java#Arrays.sort1%28double[]%2Cint%2Cint%29 > > > >> indicates the same > > OK, you were refering to a specific implementation. I was thinking in > > the general case. > > > >> (and in any case the doc explicitly states the algorithms > >>> explained are only implementation notes and are not part of the > >>> specification). > >>> > >> Yes its a part of comments anyways. > >> > >>> However, the most important part for now is the fact that we control = it > >>> and may be able to implement different NaN strategies. What we have > >>> currently fails. > >>> > >>> I agree on this and hence here is my take: > >> Leave FIXED as-is and use the earlier insertionSort code (just change > the > >> name to sort rather than hardcoding it as insertionsort) to handle the > case > >> you were mentioning > > If we leave FIXED it as it is done know, even with insertionSort we do > > not really control what happens. It is deterministic as the pivoting an= d > > if/else structure is well defined (both in the selection part and in th= e > > final sort for sub-arrays), but it is untrackable so we can't document > it. > > > >> Continue to use MAXIMAL/MINIMAL for +/-Inf transformation and that way > we > >> have covered both Inf and nan cases. > > OK. > > > >> Use REMOVED as default for all Percentile Estimation Types. (mostly > >> influenced by R here perhaps) > > This would be fine with me. I have concerns with FIXED only, the other > > strategies all seem quite realistic. > > > > Does anybody else has an advice for FIXED? What are the use case? > > > > best regards, > > Luc > > > >> best regards, > >>> Luc > >>> > >>>>> In the select > >>>>> operation, we know we have small sub-arrays at the call point. Goin= g > >>>>> back to the former insertionSort would recover the good behavior fo= r > >>>>> small arrays, but would in fact not be sufficient to really > implement a > >>>>> NaNStrategy.FIXED. I guess it would be simple to make it behave lik= e > >>>>> NaNStrategy.MAXIMAL but I did not try yet. > >>>>> > >>>>> My point here is rather: can we really and should we really impleme= nt > >>>>> NaNStrategy.FIXED? Looking at how it is used elsewhere, it needs to > >>>>> store the original position of the NaNs. It is quite cumbersome. > >>>>> > >>>>> I wonder what is the use case for NaNStrategy.FIXED at all. > >>>>> > >>>>> Going further and looking at the use cases for other NaNStrategy > values, > >>>>> the NaNs are replaced by +/- infinity before sorting, which is OK f= or > >>>>> ranking as we only rely on the indices, but we use the values > themselves > >>>>> in Percentile. So sometimes, we end up with computing interpolation= s > >>>>> like 0.5 * (x[k] + x[k+1]) or similar. If x[k] is a finite number a= nd > >>>>> x[k+1] has been changed to +infinity, we get +infinity, instead of > the > >>>>> NaN we should have retrieved without replacement. So here again, I'= m > not > >>>>> sure we are doing the right thing. > >>>>> > >>>>> What do you think? > >>>>> > >>>>> best regards, > >>>>> Luc > >>>>> > >>>>> -------------------------------------------------------------------= -- > >>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org > >>>>> For additional commands, e-mail: dev-help@commons.apache.org > >>>>> > >>>> --------------------------------------------------------------------= - > >>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org > >>>> For additional commands, e-mail: dev-help@commons.apache.org > >>>> > >>>> > >>>> > >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org > >>> For additional commands, e-mail: dev-help@commons.apache.org > >>> > >>> > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org > > For additional commands, e-mail: dev-help@commons.apache.org > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org > For additional commands, e-mail: dev-help@commons.apache.org > > --047d7bdc8e20b907c504fcfc5195--