Return-Path:
Delivered-To: apmail-jakarta-commons-dev-archive@apache.org
Received: (qmail 57428 invoked from network); 14 Jun 2003 20:51:37 -0000
Received: from exchange.sun.com (192.18.33.10) by daedalus.apache.org with SMTP; 14 Jun 2003 20:51:37 -0000
Received: (qmail 15873 invoked by uid 97); 14 Jun 2003 20:53:59 -0000
Delivered-To: qmlist-jakarta-archive-commons-dev@nagoya.betaversion.org
Received: (qmail 15866 invoked from network); 14 Jun 2003 20:53:58 -0000
Received: from daedalus.apache.org (HELO apache.org) (208.185.179.12) by nagoya.betaversion.org with SMTP; 14 Jun 2003 20:53:58 -0000
Received: (qmail 57187 invoked by uid 500); 14 Jun 2003 20:51:35 -0000
Mailing-List: contact commons-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
List-Unsubscribe:
List-Subscribe:
List-Help:
List-Post:
List-Id: "Jakarta Commons Developers List"
Reply-To: "Jakarta Commons Developers List"
Delivered-To: mailing list commons-dev@jakarta.apache.org
Received: (qmail 57176 invoked from network); 14 Jun 2003 20:51:35 -0000
Received: from unknown (HELO hume.tsdinc.steitz.com) (209.249.229.10) by daedalus.apache.org with SMTP; 14 Jun 2003 20:51:35 -0000
Content-Class: urn:content-classes:message
Received: from Lavoie.tsdinc.steitz.com ([209.249.229.4]) by hume.tsdinc.steitz.com with Microsoft SMTPSVC(5.0.2195.5329); Sat, 14 Jun 2003 16:51:40 -0400
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Received: from steitz.com ([130.13.162.175]) by Lavoie.tsdinc.steitz.com with Microsoft SMTPSVC(5.0.2195.5329); Sat, 14 Jun 2003 16:51:34 -0400
Message-ID: <3EEB8ACB.6050704@steitz.com>
Date: Sat, 14 Jun 2003 13:51:23 -0700
From: "Phil Steitz"
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020408
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: "Jakarta Commons Developers List"
Subject: Re: [math] more improvement to storage free mean, variance computation
References: <20030614204432.5304.qmail@web41303.mail.yahoo.com>
Content-Type: text/plain;
 format=flowed; charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-OriginalArrivalTime: 14 Jun 2003 20:51:34.0420 (UTC) FILETIME=[BDAD6540:01C332B6]
X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N

Al Chou wrote:
> --- Phil Steitz wrote:
>
>>Al Chou wrote:
>>
>>>>Date: Wed, 04 Jun 2003 21:05:14 -0700
>>>>From: Phil Steitz
>>>>Subject: [math] more improvement to storage free mean, variance computation
>>>>
>>>>Check out procedure sum.2 and var.2 in
>>>>
>>>>http://www.stanford.edu/~glynn/PDF/0208.pdf
>>>>
>>>>The first looks like Brent's suggestion for a corrected mean
>>>>computation, with no memory required. The additional computational cost
>>>>that I complained about is documented to be 3x the flops cost of the
>>>>direct computation, but the computation is claimed to be more stable. So
>>>>the question is: do we pay the flops cost to get the numerical
>>>>stability? The example in the paper is compelling; but it uses small
>>>>words (err, numbers I mean -- sorry, slipped into my native Fortran for
>>>>a moment there ;-)). So how do we go about deciding whether the
>>>>stability in the mean computation is worth the increased computational
>>>>effort? I would prefer not to answer "let the user decide". To make
>>>>the decision harder, we should note that it is actually worse than 3x,
>>>>since in the no storage version, the user may request the mean only
>>>>rarely (if at all) and the 3x comparison is against computing the mean
>>>>for each value added.
>>>>
>>>>The variance formula looks better than what we have now, still requiring
>>>>no memory. Should we implement this for the no storage case?
>>>
>>>
>>>After implementing var.2 from the Stanford paper in UnivariateImpl and
>>>scratching my head for some time over why the variance calculation failed
>>>its JUnit test case, I realized there's a flaw in var.2 that I can't
>>>understand no one talks about. To update the variance (called S in the
>>>paper), the formula calculates
>>>
>>>z = y / i
>>>S = S + (i-1) * y * z
>>>
>>>where i is the number of data values (including the value just being
>>>added to the collection). It doesn't really matter how y is defined,
>>>because you will notice that
>>>
>>>S = S + (i-1) * y * y / i
>>>  = S + (i-1) * y**2 / i
>>>
>>>which means that S can never decrease in magnitude (for real data, which
>>>is what we're talking about). But for the simple case of three data
>>>values {1, 2, 2} in the JUnit test case, the variance decreases between
>>>the addition of the second and third data values.
>>>
>>>Can anyone point out what I'm missing here?
>>>
>>>
>>
>>I think that is OK, since if you look at the definition of S earlier in
>>the paper, S is not the variance, it is the sum of the squared
>>deviations from the mean. This should be always increasing.
>
>
> Where is that definition? I'm looking at equations 3 and 4, which define
> S_{1,q} (in LaTeX notation), and the return statement in algorithm Procedure
> var.2, which says S_{1,q} = S.

Equation 3 defines S to be the sum of squared differences between L_i
and L-bar, which are defined on p. 209 to be the observed values and
their mean.

> Anyway, I think the resolution is contained in messages to follow shortly.
>
>
> Al
>
> =====
> Albert Davidson Chou
>
>     Get answers to Mac questions at http://www.Mac-Mgrs.org/ .
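
To make the S-versus-variance distinction concrete, here is a minimal Java
sketch of the update quoted above, run on the {1, 2, 2} test data. Several
things in it are assumptions rather than anything stated in this thread:
y is taken to be the deviation of the new value from the previous running
mean, the variance is taken to be S / (n - 1), and the class and method
names are hypothetical (this is not the UnivariateImpl API).

```java
// Hypothetical illustration only; not the Commons Math UnivariateImpl API.
class RunningVariance {
    private long n = 0;        // number of values seen so far (i in the thread)
    private double mean = 0.0; // running mean of the values seen so far
    private double s = 0.0;    // S: running sum of squared deviations from the mean

    /** Adds one value using z = y / i and S = S + (i-1) * y * z. */
    void add(double x) {
        n++;
        double y = x - mean;   // assumption: deviation from the previous mean
        double z = y / n;
        mean += z;             // updated running mean
        s += (n - 1) * y * z;  // increment is (i-1) * y^2 / i, never negative
    }

    double getMean() {
        return mean;
    }

    /** Returns S, the sum of squared deviations (not the variance). */
    double getSumSq() {
        return s;
    }

    /** Returns S / (n - 1), or NaN when fewer than two values were added. */
    double getVariance() {
        return n < 2 ? Double.NaN : s / (n - 1);
    }

    public static void main(String[] args) {
        RunningVariance rv = new RunningVariance();
        for (double x : new double[] {1.0, 2.0, 2.0}) {
            rv.add(x);
            System.out.println("S = " + rv.getSumSq()
                    + ", variance = " + rv.getVariance());
        }
    }
}
```

Under these assumptions S goes 0, 0.5, then about 0.667 as the three values
are added, while S / (n - 1) drops from 0.5 to about 0.333 between the
second and third values. So S never decreases even though the variance
does. Dividing by n instead of n - 1 gives 0.25 and then about 0.222, which
also decreases.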

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org