[ https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669588#comment-13669588
]
Phil Steitz edited comment on MATH-984 at 5/29/13 7:12 PM:

Thanks, I get the second problem now. To really address that issue, I think we would need
to depart from the current simple equal-sized bins model. I have thought before about either
introducing a new class or a config option for EmpiricalDistribution that supports alternative
binning structures, such as:
1. equiprobable bins (so bin size is not constant)
2. variable bin sizes (break the total range into subranges and allow a number of fixed-size
bins to be specified for each subrange, so the grid could become very fine in densely packed subranges).
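
A minimal sketch of option 1, assuming bin edges are taken at evenly spaced quantiles of the
sorted sample with linear interpolation. The class EquiprobableBins and its method edges are
hypothetical names used only for illustration; nothing like this exists in EmpiricalDistribution.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Commons Math): edges of bins holding roughly equal counts. */
final class EquiprobableBins {

    /**
     * Returns binCount + 1 edges such that each bin [edges[i], edges[i + 1]]
     * covers about the same share of the observations.
     */
    static double[] edges(double[] data, int binCount) {
        if (data.length == 0 || binCount < 1) {
            throw new IllegalArgumentException("need data and at least one bin");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] edges = new double[binCount + 1];
        for (int i = 0; i <= binCount; i++) {
            // Fractional index of the i-th quantile in the sorted sample.
            double pos = (double) i / binCount * (sorted.length - 1);
            int lo = (int) Math.floor(pos);
            int hi = Math.min(lo + 1, sorted.length - 1);
            // Linear interpolation between the two neighbouring order statistics.
            edges[i] = sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
        }
        return edges;
    }
}
```

With skewed data such as {1, 2, 3, 4, 100} and two bins, the edges come out as 1, 3, 100, so the
bin covering the densely packed values is much narrower than the one covering the tail.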
Regarding the default kernel choice, in the absence of any information about the within-bin
distributions, I would expect the (correctly truncated) Gaussian smoother to perform better
than uniform or triangular. I don't have a proof of this statement either, but I will see
if I can hunt down some references, starting with [1], which is referenced in the class javadoc
as the basis for the current implementation.
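
One way to read "correctly truncated" is to redraw until the Gaussian sample lands inside the
bin. The sketch below assumes that reading. The class name, the attempt cap, and the uniform
fallback are illustrative assumptions, not the EmpiricalDistribution implementation.

```java
import java.util.Random;

/** Hypothetical sketch (not Commons Math code): Gaussian kernel restricted to one bin. */
final class TruncatedGaussianKernel {

    // Assumed cap so the loop always terminates, even for badly placed parameters.
    private static final int MAX_ATTEMPTS = 1000;

    /** Draws from N(mean, sd^2) conditioned on the result lying in [lower, upper]. */
    static double sample(Random rng, double mean, double sd, double lower, double upper) {
        if (!(lower < upper)) {
            throw new IllegalArgumentException("lower must be below upper");
        }
        if (sd <= 0) {
            // Degenerate bin (for example a single observation): return the mean, kept inside the bin.
            return Math.min(upper, Math.max(lower, mean));
        }
        for (int i = 0; i < MAX_ATTEMPTS; i++) {
            double x = mean + sd * rng.nextGaussian();
            if (x >= lower && x <= upper) {
                return x; // accepted: the draw lies inside the bin
            }
        }
        // Assumed fallback: a uniform draw within the bin if every attempt was rejected.
        return lower + (upper - lower) * rng.nextDouble();
    }
}
```

Because every returned value lies in [lower, upper], a bin starting at 0 can never produce a
negative sample, whatever the within-bin mean and standard deviation are.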
As [1] states, heavy tails create problems for this approach and in some cases an alternative
to the Gaussian kernel may work better. This could actually be tested case by case by
comparing probabilities computed using the Distribution methods now implemented in the class
with "true" empirical probabilities from the raw data. Some experiments doing this with different
kernels and data would be interesting to look at.
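
A minimal sketch of such a comparison, assuming the model is supplied as a plain function from x
to P(X <= x) (for instance a wrapper around a distribution's cumulativeProbability method). The
class CdfComparison and its methods are hypothetical names; the measure is the largest gap at the
observed data points, not a full Kolmogorov-Smirnov statistic.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch: compare a model CDF with the empirical CDF of the raw data. */
final class CdfComparison {

    /** Fraction of observations that are <= x; sorted must be in ascending order. */
    static double empiricalCdf(double[] sorted, double x) {
        int lo = 0;
        int hi = sorted.length;
        // Binary search for the number of elements <= x.
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }

    /** Largest absolute gap between the model CDF and the empirical CDF at the data points. */
    static double maxAbsDifference(double[] data, DoubleUnaryOperator modelCdf) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double worst = 0.0;
        for (double x : sorted) {
            double gap = Math.abs(modelCdf.applyAsDouble(x) - empiricalCdf(sorted, x));
            worst = Math.max(worst, gap);
        }
        return worst;
    }
}
```

Running maxAbsDifference over the same data with models built from different kernels would give
one number per kernel, with smaller values meaning closer agreement with the raw data.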
[1] http://ned.ipac.caltech.edu/level5/March02/Silverman/Silver2_4.html
> Incorrect (bugged) generating function getNextValue() in .random.EmpiricalDistribution
> 
>
> Key: MATH-984
> URL: https://issues.apache.org/jira/browse/MATH-984
> Project: Commons Math
> Issue Type: Bug
> Affects Versions: 3.2, 3.1.1
> Environment: all
> Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in org.apache.commons.math3.random.EmpiricalDistribution
> will generate wrong values for all distributions that are single-tailed or bounded, for
> example data resembling Exponential or Lognormal distributions.
> The problem can easily be seen in the code and tested.
> In the latest version of the code
> ...
> 490 return getKernel(stats).sample();
> ...
> it samples from a Gaussian distribution to "smooth" within the bin. A Gaussian distribution
> is not bounded, so it sometimes generates numbers outside the bin. When this happens in
> the last bin, it generates wrong numbers.
> For example, for nonnegative empirical data it will generate negative rubbish.
> Additionally, the proposed algorithm boldly returns only the mean value of the bin when
> the bin contains a single value! This makes the generating function unusable for heavy-tailed
> distributions with a small number of values (for example, computer network traffic).
> Finally, the use of Gaussian smoothing within the bin will greatly change some properties
> of the empirical distribution.
> The proposed method should be reworked to be applicable to real data, which often have
> limited ranges (either nonnegative or bounded on both sides).

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
