commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luc Maisonobe <...@spaceroots.org>
Subject Re: [math] use case for NaNStrategy.FIXED?
Date Tue, 01 Jul 2014 06:52:43 GMT
Le 30/06/2014 22:00, Phil Steitz a écrit :
> On 6/30/14, 12:33 PM, Luc Maisonobe wrote:
>> Le 30/06/2014 19:15, Phil Steitz a écrit :
>>>
>>>> On Jun 30, 2014, at 9:59 AM, Gilles <gilles@harfang.homelinux.org>
>>>> wrote:
>>>>
>>>> On Mon, 30 Jun 2014 09:08:00 -0700, Phil Steitz wrote:
>>>>>> On Jun 29, 2014, at 3:41 PM, Gilles
>>>>>> <gilles@harfang.homelinux.org> wrote:
>>>>>>
>>>>>>>> On Sun, 29 Jun 2014 14:39:51 -0700, Phil Steitz wrote:
>>>>>>>>> On 6/29/14, 2:30 PM, Gilles wrote:
>>>>>>>>>> On Sun, 29 Jun 2014 10:25:58 -0700, Phil Steitz wrote:

>>>>>>>>>> On 6/29/14, 9:48 AM, venkatesha murthy wrote: On
Sun,
>>>>>>>>>> Jun 29, 2014 at 1:25 AM, Phil Steitz 
>>>>>>>>>> <phil.steitz@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> On 6/25/14, 12:12 AM, Luc Maisonobe wrote:
Le
>>>>>>>>>>>> 25/06/2014 06:05, venkatesha murthy a écrit
:
>>>>>>>>>>>>> On Wed, Jun 25, 2014 at 12:21 AM, Luc
Maisonobe 
>>>>>>>>>>>>> <luc@spaceroots.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi Venkat,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Le 23/06/2014 21:08, venkatesha murthy
a écrit
>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>> On Tue, Jun 24, 2014 at 12:08
AM, Luc
>>>>>>>>>>>>>>> Maisonobe <luc@spaceroots.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> While looking further in
Percentile class
>>>>>>>>>>>>>>>> for MATH-1120, I have found
another problem
>>>>>>>>>>>>>>>> in the current implementation.

>>>>>>>>>>>>>>>> NaNStrategy.FIXED
>>>>>>>>>>> should
>>>>>>>>>>>>>>>> leave the NaNs in place,
but at the end of
>>>>>>>>>>>>>>>> the KthSelector.select method,
a call to
>>>>>>>>>>>>>>>> Arrays.sort moves the NaNs
to the end of 
>>>>>>>>>>>>>>>> the small sub-array. What
is really
>>>>>>>>>>>>>>>> implemented here is therefore
closer to 
>>>>>>>>>>>>>>>> NaNStrategy.MAXIMAL than
NaNStrategy.FIXED.
>>>>>>>>>>>>>>>> This always occur in the
test cases because
>>>>>>>>>>>>>>>> they use very short arrays,
and we 
>>>>>>>>>>>>>>>> directly
>>>>>>>>>>> switch to
>>>>>>>>>>>>>>>> this part of the select method.
>>>>>>>>>>>>>>> Are NaNs considered higher than
+Inf ? If
>>>>>>>>>>>>>>> MAXIMAL represent replacing for
+inf ; you
>>>>>>>>>>>>>>> need something to indicate beyond
this for
>>>>>>>>>>>>>>> NaN.
>>>>>>>>>>>>>> Well, we can just keep the NaN themselves
and
>>>>>>>>>>>>>> handled them appropriately, hoping
not to trash
>>>>>>>>>>>>>> performances too much.
>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What is the test input you see
an issue and
>>>>>>>>>>>>>>> what is the actual error you
are seeing.
>>>>>>>>>>>>>>> Please share the test case.
>>>>>>>>>>>>>> Just look at
>>>>>>>>>>>>>> PercentileTest.testReplaceNanInRange().
The 
>>>>>>>>>>>>>> first check in the test corresponds
to a
>>>>>>>>>>>>>> Percentile configuration at 50% percentile,
and
>>>>>>>>>>>>>> NaNStrategy.FIXED. The array has
an odd number
>>>>>>>>>>>>>> of entries, so the 50% percentile
is exactly
>>>>>>>>>>>>>> one of the entries: the one at index
5 in the 
>>>>>>>>>>>>>> final array.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The initial ordering is { 0, 1, 2,
3, 4, NaN,
>>>>>>>>>>>>>> NaN, 5, 7, NaN, 8 }. So for the
>>>>>>>>>>>>>> NaNStrategy.FIXED setting, it should
not be
>>>>>>>>>>>>>> modified at all in the selection
algorithm and
>>>>>>>>>>>>>> the result for 50% should be the
first NaN of
>>>>>>>>>>>>>> the array, at index 5. In fact, due
to the
>>>>>>>>>>>>>> Arrays.sort, we *do* reorder the
array into  {
>>>>>>>>>>>>>> 0, 1, 2, 3, 4, 5, 7, 8, NaN, NaN,
NaN }, so the
>>>>>>>>>>>>>> result is 5.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Agreed. just verified by putting
back the
>>>>>>>>>>>>>> earlier insertionSort
>>>>>>>>>>> function.
>>>>>>>>>>>>>> If we use NaNStrategy.MAXIMAL and
any quantile
>>>>>>>>>>>>>> above 67%, we get as a result
>>>>>>>>>>>>>> Double.POSITIVE_INFINITY instead
of
>>>>>>>>>>>>>> Double.NaN.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we agree to leave FIXED as unchanged
>>>>>>>>>>>>>> behaviour with your
>>>>>>>>>>> insertionSort
>>>>>>>>>>>>> code; then atleast MAXIMAL/MINIMAL should
be
>>>>>>>>>>>>> allowed for transformation
>>>>>>>>>>> of
>>>>>>>>>>>>> NaN to +/-Inf
>>>>>>>>>>>> I'm fine with it, as long as we clearly state
it in
>>>>>>>>>>>> the NaNStrategy documentation, saying somethig
>>>>>>>>>>>> like:
>>>>>>>>>>>>
>>>>>>>>>>>> When MAXIMAL/MINIMAL are used, the NaNs are
>>>>>>>>>>>> effecively *replaced* by +/- infinity, hence
the
>>>>>>>>>>>> results will be computed as if infinities
were 
>>>>>>>>>>>> there in the first place.
>>>>>>>>>>> -1 - this is mixing concerns.  NaNStrategy exists
for
>>>>>>>>>>> one specific purpose - to turn extend the ordering
on
>>>>>>>>>>> doubles a total ordering. Strategies prescribe
only
>>>>>>>>>>> how NaNs are to be treated in the ordering. 
Side
>>>>>>>>>>> effects on the input array don't make sense in
this 
>>>>>>>>>>> context.  The use case for which this class was
>>>>>>>>>>> created was rank transformations.  Returning
>>>>>>>>>>> infinities in rank transformations would blow
up
>>>>>>>>>>> computations in these classes.  If a floating
point 
>>>>>>>>>>> computations need to transform NaNs, the specific
>>>>>>>>>>> stats / other classes that do this transformation
>>>>>>>>>>> should perform and document the behavior.
>>>>>>>>>>>
>>>>>>>>>>> Phil
>>>>>>>>>> OK. Agreed realized it later.  Hence i have not
>>>>>>>>>> touched NaNStrategy in my patch(MATH-1132) . Now
i have
>>>>>>>>>> added a separate transformer to transform NaNs but
it
>>>>>>>>>> uses NanStrategy.  Please let me know on this as
this 
>>>>>>>>>> trasnformation itself can be used in different places.
>>>>>>>>> Honestly, I don't see the value of this.  I would be
more
>>>>>>>>> favorable to an addition to MathArrays supporting NaN
(or
>>>>>>>>> infinity) filtering / transformation.  Several years
back
>>>>>>>>> we made the decision to just let NaNs propagate in
>>>>>>>>> computations, which basically means that if you feed
NaNs
>>>>>>>>> to [math] you are going to get NaNs out in results. If
we
>>>>>>>>> do get into the business of filtering NaNs (or infinities
>>>>>>>>> for that matter), I think it is probably best to do that
>>>>>>>>> in MathArrays or somesuch - i.e., don't decorate APIs
>>>>>>>>> with NaN or infinity-handling handling descriptions.
>>>>>>>>> That would be a huge task and would likely end up
>>>>>>>>> bleeding implementation detail into the APIs.  As I said
>>>>>>>>> above, NaNStrategy has to exist for rank transformations
>>>>>>>>> to be well-defined.  NaN-handling in floating point 
>>>>>>>>> computations is defined and as long as we fully document
>>>>>>>>> what our algorithms do, I think it is fair enough to
"let
>>>>>>>>> the NaNs fall where they may" - which basically means
>>>>>>>>> users need to do the filtering / transformation
>>>>>>>>> themselves.
>>>>>>>> So, do I understand correctly that it's OK to add the
>>>>>>>> "remove" and "replace" utility methods (cf. MATH-1130) in
>>>>>>>> "MathArrays"? Or does this thread conclude that we don't
>>>>>>>> have a use for this within Commons Math?
>>>>>>> You raise a good point here.  Personally, I don't see an
>>>>>>> internal use and if we are sticking with MathArrays including
>>>>>>> only things we have internal use for, I would say no, don't
>>>>>>> include them.
>>>>>> A third way to look at it would be that since some algorithms
>>>>>> require being passed "clean" data (e.g. without NaNs), we could
>>>>>> provide some simple means to do so. A user could then write,
>>>>>> e.g. --- double[] data = { 1, Double.NaN, 3, 6, 5 };
>>>>>>
>>>>>> Percentile p = new Percentile(); double q =
>>>>>> p.evaluate(MathArrays.replace(data, Double.NaN,
>>>>>> Double.POSITIVE_INFINITY), 0.1); ---
>>>>>>
>>>>>> Then, if we provide that utility, is it still necessary to
>>>>>> complicate the Percentile code with a NaNStrategy parameter?
>>>>> I may be missing something, but it seems to me that percentile
>>>>> does need a NaNStrategy in the presence if NaNs like the other
>>>>> order statistics do (basically to define their position In the
>>>>> ordering).  It doesn't logically need more than that.  I would
>>>>> be ok defaulting the strategy or even returning NaN in the
>>>>> presence of NaNs (or whenever interpolation involves a NaN). All
>>>>> of this can be documented.
>>>> I am _surely_ missing something because I don't figure out how NaNs
>>>> can be useful when computing those statistics... :-}
>>> Good point.  Basically the different strategies allow you to
>>> effectively ignore or work around them so stats can be well-defined.
>>> Without NaNStrategy, order statistics including NaNs are undefined.
>> The problem with Percentile is that the elements are both reordered
>> *and* used when interpolating between two values. Tor the rank methods,
>> only reordering was needed so the various NaNStrategies could be
>> implemented by replacing at will with +/- infinity and a classical sort
>> would do the ordering for us.
>>
>> Here, we do interpolate on the values, so if we first replace NaN with
>> +infinity to put them at the end, we may interpolate between a finite
>> value and +inf and have +inf as a result, whereas if the NaN "value" was
>> preserved and some magic used to sort the NaN, we would end up
>> interpolating with a NaN and get a NaN as a result. Setting up the magic
>> for MINIMAL and MAXIMAL seems achievable withing the sort, as we already
>> have a specific sort (in fact not a sort but a selection algorithm,
>> which is the core of the class). Setting up the magic for FIXED seems
>> more difficult.
>>
>> So I guess implementing NaNStrategies may be done in different ways
>> depending on how they will be used by the calling algorithm. I don't yet
>> know if we can set up a general method to be used for all statistics.
>> Looking only at Percentile, replacement with +/- infinity already seems
>> to not be an appropriate solution.
> 
> I don't think we should try to redefine algorithms so they can
> handle NaNs.  When you do floating point computations with NaNs, you
> get NaNs.  That should be understood to be the contract.  For rank
> statistics, such as Percentile, we need to extend the ordering on
> doubles to include them in ordering for the algorithm to be
> well-defined.  If we subsequently do floating point computations
> with the NaNs, the result should be NaN.  I think we should either
> just simplify things and say we *always* return NaN whenever NaNs
> are present for Percentile, or do the following:
> 
> 0) use the prescribed NaN strategy to determine the ordering - so
> NaNs end up either at the bottom, at the top or in their original
> position (FIXED).
> 1) if interpolation involves a NaN element, the result returned is NaN.

This is exactly what I say. If NaNs are in, NaNs should be out. As for
now we do replace them, we get infinities instead of NaNs. I would
prefer getting NaNs there as it is more consistent with the purpose of
NaNs in floating point encoding.

So I think we agree: NaNs should not be changed, but they should be
reordered. Perhaps we could have something simple like NaNStrategy
having a method:

  boolean shouldSwap(int i1, double d1, int i2, double i2)

that would use both the index and values of the element to check if they
should be swapped or not. This method would be used both for traditional
sorting for ranking and in selection algorithms for percentile. I first
thought we should simply have NaN implement Comparator<Double> but this
would not work for FIXED, whereas the above method would work (simply
returning false as soon as there is a NaN, regardless of it being at a
lower or upper index).

I will take a look at this option.

best regards,
Luc

> 
> Phil
>>
>> best regards,
>> Luc
>>
>>>
>>> Phil
>>>
>>>>> I still don't see a within-math use for recoding methods; though
>>>>> I do see the value from a user perspective.  The problem is if we
>>>>> open this door, we risk MathArrays becoming a mini commons lang.
>>>> If Commons Lang already provides a feature, then we indeed don't
>>>> need to offer the same within CM (if there is no internal use). Can
>>>> someone confirm?
>>>>
>>>>
>>>> Otherwise we could define a new inner interface in MathArrays,
>>>> similar to "MathArrays.Function": --- public class MathArrays {
>>>>
>>>> // ...
>>>>
>>>> public interface Transformer { /** Transform the whole array */ 
>>>> double[] transform(double[] input); /** Transform the array in the
>>>> interval [from, to) and return the whole array */ double[]
>>>> transform(double[] input, int from, int to); /** Transform the
>>>> array in the interval [from, to) and return the transformed
>>>> sub-array */ double[] transformAndSplice(double[] input, int from,
>>>> int to); } } ---
>>>>
>>>> Then we can provide "replace" and "remove" transformers through
>>>> factory methods (example/untested code): --- public class
>>>> MathArrays {
>>>>
>>>> // ...
>>>>
>>>> public static Transformer createReplaceTransformer(final double
>>>> old, final double replace) { return new Transformer() { public
>>>> double[] transform(double[] input) { final int len = input.length; 
>>>> final double[] output = new double[len]; for (int i = 0; i < len;
>>>> i++) { output[i] = Precision.equalsIncludingNaN(input[i], old) ? 
>>>> replace : input[i]; } return output; }
>>>>
>>>> // etc. }; } } ---
>>>>
>>>>
>>>> Gilles
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>>
>>>>
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>> ---------------------------------------------------------------------
>>>
>>>
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message