From commons-dev-return-30769-qmlist-jakarta-archive-commons-dev=nagoya.apache.org@jakarta.apache.org Sat Jun 14 20:44:29 2003
Return-Path:
Delivered-To: apmail-jakarta-commons-dev-archive@apache.org
Received: (qmail 53791 invoked from network); 14 Jun 2003 20:44:29 -0000
Received: from exchange.sun.com (192.18.33.10)
by daedalus.apache.org with SMTP; 14 Jun 2003 20:44:29 -0000
Received: (qmail 15785 invoked by uid 97); 14 Jun 2003 20:46:51 -0000
Delivered-To: qmlist-jakarta-archive-commons-dev@nagoya.betaversion.org
Received: (qmail 15778 invoked from network); 14 Jun 2003 20:46:50 -0000
Received: from daedalus.apache.org (HELO apache.org) (208.185.179.12)
by nagoya.betaversion.org with SMTP; 14 Jun 2003 20:46:50 -0000
Received: (qmail 53496 invoked by uid 500); 14 Jun 2003 20:44:26 -0000
Mailing-List: contact commons-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
List-Unsubscribe:
List-Subscribe:
List-Help:
List-Post:
List-Id: "Jakarta Commons Developers List"
Reply-To: "Jakarta Commons Developers List"
Delivered-To: mailing list commons-dev@jakarta.apache.org
Received: (qmail 53469 invoked from network); 14 Jun 2003 20:44:26 -0000
Received: from web41303.mail.yahoo.com (66.218.93.52)
by daedalus.apache.org with SMTP; 14 Jun 2003 20:44:26 -0000
Message-ID: <20030614204432.5304.qmail@web41303.mail.yahoo.com>
Received: from [68.164.16.54] by web41303.mail.yahoo.com via HTTP; Sat, 14 Jun 2003 13:44:32 PDT
Date: Sat, 14 Jun 2003 13:44:32 -0700 (PDT)
From: Al Chou
Subject: Re: [math] more improvement to storage free mean, variance computation
To: Jakarta Commons Developers List
In-Reply-To: <3EEB86DD.3020008@steitz.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N
X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N
--- Phil Steitz wrote:
> Al Chou wrote:
> >>Date: Wed, 04 Jun 2003 21:05:14 -0700
> >>From: Phil Steitz
> >>Subject: [math] more improvement to storage free mean, variance computation
> >>
> >>Check out procedure sum.2 and var.2 in
> >>
> >>http://www.stanford.edu/~glynn/PDF/0208.pdf
> >>
> >>The first looks like Brent's suggestion for a corrected mean
> >>computation, with no memory required. The additional computational cost
> >>that I complained about is docuemented to be 3x the flops cost of the
> >>direct computation, but the computation is claimed to be more stable. So
> >>the question is: do we pay the flops cost to get the numerical
> >>stability? The example in the paper is compelling; but it uses small
> >>words (err, numbers I mean -- sorry, slipped in to my native Fortran for
> >>a moment there ;-)). So how do we go about deciding whether the
> >>stability in the mean computation is worth the increased computational
> >>effort? I would prefer not to answer "let the user decide". To make
> >>the decision harder, we should note that it is actually worse than 3x,
> >>since in the no storage version, the user may request the mean only
> >>rarely (if at all) and the 3x comparison is against computiing the mean
> >>for each value added.
> >>
> >>The variance formula looks better than what we have now, still requiring
> >>no memory. Should we implement this for the no storage case?
> >
> >
> > After implementing var.2 from the Stanford paper in UnivariateImpl and
> > scratching my head for some time over why the variance calculation failed
> its
> > JUnit test case, I realized there's a flaw in var.2 that I can't understand
> no
> > one talks about. To update the variance (called S in the paper), the
> formula
> > calculates
> >
> > z = y / i
> > S = S + (i-1) * y * z
> >
> > where i is the number of data values (including the value just being added
> to
> > the collection). It doesn't really matter how y is defined, because you
> will
> > notice that
> >
> > S = S + (i-1) * y * y / i
> > = S + (i-1) * y**2 / i
> >
> > which means that S can never decrease in magnitude (for real data, which is
> > what we're talking about). But for the simple case of three data values
> {1, 2,
> > 2} in the JUnit test case, the variance decreases between the addition of
> the
> > second and third data values.
> >
> > Can anyone point out what I'm missing here?
> >
> >
> I think that is OK, since if you look at the definition of S earlier in
> the paper, S is not the variance, it is the sum of the squared
> deviations from the mean. This should be always increasing.
Where is that definition? I'm looking at equations 3 and 4, which define
S_{1,q} (in LaTeX notation), and the return statement in algorithm Procedure
var.2, which says S_{1,q} = S.
Anyway, I think the resolution is contained in messages to follow shortly.
Al
=====
Albert Davidson Chou
Get answers to Mac questions at http://www.Mac-Mgrs.org/ .
__________________________________
Do you Yahoo!?
SBC Yahoo! DSL - Now only $29.95 per month!
http://sbc.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org