commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Venkatesha Murthy TS (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MATH-1120) Need Percentile computations that can be matched with standard spreadsheet formula
Date Wed, 18 Jun 2014 18:20:24 GMT

     [ https://issues.apache.org/jira/browse/MATH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Venkatesha Murthy TS updated MATH-1120:
---------------------------------------

    Attachment: 18-jun-percentile-with-estimation-patch

Hi Luc, Gilles

First of, Iam immensly thankful to all your comments on this patch. Next, i am attaching my
new patch with today's date(18-jun). However please advise if i need to remove the old patch
file if it confuses.

Please find my response below. The new patch has the suggested changes in the switch case
for nan handling; But; However i have my view points on the different default nan strategies
associated to types. Please permit me to explain (sorry for long summary)

First, 
i would like to leave the Default implementation of Percentile as-is (Meaning in my MATH-1120
patch it is mapped to Type.CM)since otherwise we will break user old expectation even for
non nan and non inf entries as well. The existing tests does fail if we change the default
types (please refer to PercentileTest.java code as well to see the finer variations that is
being looked at for different types)

Secondly,
Percentile.java header comment states somewhere to an effect that NaNs would be (left as-is
and) handled by java's default sort behavior and no removal being done. So for me to map this
behaviour to new implementation; it was NaNStrategy.FIXED that came close and didnt require
any of the existing test cases for the existing Percentile behavior to change. What i am re-iterating
here is the existing behavior tests have completely passed with new Type.CM and FIXED. (And
now i have added several more tests including different types as well).

Thirdly,
While all the R_x (where x :[1-9])  types as run and verified by R tool; seemed to clearly
convey the NaNs needed to be removed and hence you see that i have used different strategy
NaNStrategy#REMOVED. 
I agree while multiple defaults are not wise to have ; however; if we are forced to have Apache
CM as supported type (which is not one of R_x types) and we have the need to support multiple
variants (R1- R9) ; then it is inevitable to have type sepcific NaNStrategy as per the need.
I also feel ; NaN handling should be allowed for overriding atleast in a controlled manner
as different use cases may exist for needing this variation in nan handling. Therefore IMO
while we could avoid the public access to change these defaults; it is relevant to support
these variations of nan handling on a per type and allow atleast sub classes to override if
a rare need arises.
While the very name NaNStrategy reminds me of  different ways to look at that; i feel we will
be much restricted  if we just said that we stick to one way of NaNHandling for all types.
Please let me know your thoughts.

Next,
Regarding the PivotingStrategy; At first, i wanted to convey here that to have all the partitioning,
pivoting and selection in separated classes/enums than inside main class. I have made it as
static due to the fact that; it is more of a non-functional requirement and felt that it need
not be set for every instance (more of a global setting that doesnt vary across types). 
Please correct me here and let know if it still needs to be per instance. 
I also made it package accessable/settable solely because medianOf3 method had been package
level for the sole intent of possible overriding of the same within that package. Meaning;
if some one really needed to tinker around pivoting they need a way to do it which i have
provided it using a strategy.

Next,
In the current patch that i am going to attach as new dated patch (since you have already
started looking at the old one ; which i would leave it as is). I find many utility type methods;
replaceNaN, removeNaN( Predicated Lists ) and copyOf(values, begin.length) and as well as
KthSelector with PivotingStrategy etc all of which can perhaps make its way to MathArrays
and MathUtils. Please let me know.If so i will once again re-factor these changes and submit
the patch.

Thanks for reading this through and for your time in reviewing . Please let me know your opinion
on all of these.

thanks
venkat.


> Need Percentile computations that can be matched with standard spreadsheet formula
> ----------------------------------------------------------------------------------
>
>                 Key: MATH-1120
>                 URL: https://issues.apache.org/jira/browse/MATH-1120
>             Project: Commons Math
>          Issue Type: Improvement
>    Affects Versions: 3.2
>            Reporter: Venkatesha Murthy TS
>              Labels: Percentile
>             Fix For: 4.0
>
>         Attachments: 18-jun-percentile-with-estimation-patch, excel-percentile-patch,
percentile-with-estimation-patch, r-output.txt
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The current Percentile implementation assumes and hard-codes the quantile pth position
as 
> p * (N+1)/100 and provides a kth selected value.
> However if we need to verify compare/contrast with standard statistical tools such as
say MS Excel; it would be good to provide an extensible way of morphing this selection of
position than hard code.
> For example in order to generate the percentile closely matching with MS Excel the position
required may be [p*(N-1)/100]+1.
> Please let me know if i could submit this as a patch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message