commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Radoslav Tsvetkov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MATH-984) Incorrect (bugged) generating function getNextValue() in .random.EmpiricalDistribution
Date Mon, 03 Jun 2013 06:58:20 GMT

    [ https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672859#comment-13672859
] 

Radoslav Tsvetkov edited comment on MATH-984 at 6/3/13 6:57 AM:
----------------------------------------------------------------

To generate some Exception on attempt to calculate St.Dev. from single observation is correct.
It’s just that the _Exception Text_ generated by getKernel() is not clearly pointing the
cause. An average user should dig and search to get to to the "real" cause. But as getKernel()
should deliver the "kernel" and not proliferate his chosen internal method restrictions outside,
better would be _for single observation_ to deliver uniformly distributed value. (and perhaps
warning).

generate() within EmpiricalDistribution could be "fixed" easily if it uses truncated Gaussian
on the end bins. 

*So would propose 2 code changes:*

1. Add generate(double min, double max) function. Where *min* and *max* should indicate the
hard limits of generation. They should be checked if the conform to input data. It means *min*
should not be greater than the minimal empirically observed value, and *max* should be no
less then the biggest empirical value. Then the new generate function should use truncated
Gaussian on the edges. (open if one implements it through additional getKernel(limits.. )
function )

2. Change getKernel() to deliver uniformly distributed value on single observation. (and perhaps
warning)
 

The first change does not touch old code. So, it's a practically no risk and we retain 100%
code compatibility.
The second change is also low risk and is compatible with old code.

As a result we have smooth and correct generation and positively fixed  inverse CDF.
                
      was (Author: rtsvet):
    To generate a Exception on attempt to calculate St.Dev. from single observation is correct.
It’s just that the _Exception Text_ generated by getKernel() is not clearly pointing the
cause. An average user should dig and search to get to to the "real" cause. But as getKernel()
should deliver the "kernel" and not proliferate his chosen internal method restrictions outside,
better would be _for single observation_ to deliver uniformly distributed value. (and perhaps
warning).

generate() within EmpiricalDistribution could be "fixed" easily if it uses truncated Gaussian
on the end bins. 

*So would propose 2 changes:*

1. Add generate(double min, double max) function Where *min* and *max* should indicate the
hard limits of generation. They should be checked if the conform to input data. It means *min*
should not be greater than the minimal empirically observed value, and *max* should be no
less then the biggest value of the input for the EmpiricalDistribution. Then the new generate
function should use truncated Gaussian on the edges.

2. Change getKernel() to deliver uniformly distributed value on single observation. (and perhaps
warning)
 

The first change does not touch old code. So, it's a practically no risk and we retain 100%
code compatibility.
The second change is also low risk and is compatible with old code.

As a result we have smooth and correct generation and positively fixed  inverse CDF.
                  
> Incorrect (bugged) generating function getNextValue() in .random.EmpiricalDistribution
> --------------------------------------------------------------------------------------
>
>                 Key: MATH-984
>                 URL: https://issues.apache.org/jira/browse/MATH-984
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.2, 3.1.1
>         Environment: all
>            Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in org.apache.commons.math3.random.EmpiricalDistribution
> will generate wrong values for all Distributions that are single tailed or limited. For
example Data which are resembling Exponential or Lognormal distributions.
> The problem could be easily seen in code and tested.
> In last version code
> ...
> 490               return getKernel(stats).sample();
> ...
> it samples from Gaussian distribution to "smooth" in_the_bin. Obviously Gaussian Distribution
is not limited and sometimes it does generates numbers outside the bin. In the case when it
is the last bin it will generate wrong numbers. 
> For example for empirical non-negative data it will generate negative rubbish.
>   Additionally the proposed algorithm boldly returns only the mean value of the bin in
case of one value! This last makes the generating function unusable for heavy tailed distributions
with small number of values. (for example computer network traffic)
> On the last place usage of Gaussian soothing in the bin will change greatly some empirical
distribution properties.
> The proposed method should be reworked to be applicable for real data which have often
limited ranges. (either non-negative or both sides limited)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message