Date: Fri, 31 May 2013 19:51:21 +0000 (UTC)
From: "Phil Steitz (JIRA)"
To: issues@commons.apache.org
Subject: [jira] [Comment Edited] (MATH-984) Incorrect (bugged) generating function getNextValue() in .random.EmpiricalDistribution

    [ https://issues.apache.org/jira/browse/MATH-984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671773#comment-13671773 ]

Phil Steitz edited comment on MATH-984 at 5/31/13 7:50 PM:
-----------------------------------------------------------

That should work. Looks like you have hit a new bug, which should be opened as a separate issue if you don't mind doing that. What I suspect is going on is that your data has singleton bins, which results in zero variance within bin. The getKernel method tries to create a NormalDistribution instance using the bin stats.
This throws NotStrictlyPositiveException if the standard deviation parameter is not strictly positive. This is part of the reason that the singleton check is there in getNextValue. I forgot to account for this case in inverseCumulativeProbability (added in 3.2). A unit test demonstrating the bug would be most appreciated.

I think it would probably be a little more efficient though to keep the direct implementation of getNextValue as it is now, but just fix the bug.

  was (Author: psteitz):
    That should work. Looks like you have hit a new bug, which should be opened as a separate issue if you don't mind doing that. What I suspect is going on is that your data has singleton bins, which results in zero variance within bin. The getKernel method tries to create a NormalDistribution instance using the bin stats. This throws NotStrictlyPositiveException if the standard deviation parameter is not strictly positive. This is part of the reason that the singleton check is there in getNextValue. I forgot to account for this case in inverseCumulativeProbability (added in 3.2). A unit test demonstrating the bug would be most appreciated.

    I think it would probably be a little more efficient though to keep the direct implementation of getNextValue as it is now, but just fix the bug. Arguably, the bug is in getKernel, which should return a distribution object with support equal to the bin. On the other hand, that makes it a little harder for those wanting to supply a custom kernel.
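The singleton-bin failure described in the comment can be sketched without the library. This is only an illustration under stated assumptions: `SingletonBinSketch`, `sampleGaussian`, and `sampleWithinBin` are hypothetical names, `IllegalArgumentException` stands in for NotStrictlyPositiveException, and the guard shown (return the bin mean when the bin has no spread) mirrors the singleton check the comment says exists in getNextValue. The actual fix in Commons Math may look different.

```java
import java.util.Random;

/** Hypothetical stand-in for within-bin sampling; not the Commons Math API. */
final class SingletonBinSketch {

    /**
     * Mimics a Gaussian kernel that rejects a non-positive standard deviation,
     * the way the comment describes NormalDistribution behaving.
     */
    static double sampleGaussian(double mean, double sd, Random rng) {
        if (sd <= 0) {
            throw new IllegalArgumentException("standard deviation must be > 0, got " + sd);
        }
        return mean + sd * rng.nextGaussian();
    }

    /**
     * Guarded sampling: a bin holding a single value (or otherwise having
     * zero spread) returns its mean instead of building a Gaussian kernel.
     */
    static double sampleWithinBin(long count, double mean, double sd, Random rng) {
        if (count == 1 || sd == 0.0) {
            return mean;
        }
        return sampleGaussian(mean, sd, rng);
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);

        // Guarded path: the singleton bin yields its only value.
        System.out.println(sampleWithinBin(1, 7.5, 0.0, rng));

        // Unguarded path: the zero standard deviation is rejected.
        try {
            sampleGaussian(7.5, 0.0, rng);
        } catch (IllegalArgumentException e) {
            System.out.println("unguarded path failed: " + e.getMessage());
        }
    }
}
```

Any code path that builds the kernel from bin statistics would presumably need the same guard, which is the gap the comment reports for inverseCumulativeProbability.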
> Incorrect (bugged) generating function getNextValue() in .random.EmpiricalDistribution
> --------------------------------------------------------------------------------------
>
>                 Key: MATH-984
>                 URL: https://issues.apache.org/jira/browse/MATH-984
>             Project: Commons Math
>          Issue Type: Bug
>    Affects Versions: 3.2, 3.1.1
>         Environment: all
>            Reporter: Radoslav Tsvetkov
>
> The generating function getNextValue() in org.apache.commons.math3.random.EmpiricalDistribution
> will generate wrong values for all distributions that are single-tailed or bounded, for example data resembling Exponential or Lognormal distributions.
> The problem can easily be seen in the code and tested.
> In the latest version of the code
> ...
> 490 return getKernel(stats).sample();
> ...
> it samples from a Gaussian distribution to "smooth" within the bin. A Gaussian distribution is unbounded, so it sometimes generates numbers outside the bin. When that bin is the last one, it generates wrong numbers.
> For example, for empirical non-negative data it will generate negative rubbish.
> Additionally, the proposed algorithm simply returns the mean value of the bin when the bin holds only one value! This makes the generating function unusable for heavy-tailed distributions with a small number of values (for example, computer network traffic).
> Finally, Gaussian smoothing within the bin will greatly change some properties of the empirical distribution.
> The proposed method should be reworked to be applicable to real data, which often have limited ranges (either non-negative or bounded on both sides).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira