Date: Fri, 30 Jan 2009 00:57:50 -0500
Subject: Re: Statistics Module
From: Paul Davis <paul.joseph.davis@gmail.com>
To: dev@couchdb.apache.org

On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey wrote:
>
> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>
>> The way that stats are currently calculated, with time as the
>> dependent variable, could cause some issues in implementing more
>> statistics. With my extremely limited knowledge of stats, I think
>> making them dependent on the number of requests might be better.
>> This is something that hopefully someone out there knows more about.
>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>> requests", the latter making stddev-type stats calculable on the
>> fly in constant memory.)
>
> The problem with using # of requests is that, depending on your data,
> each request may take a long time. I have this problem at the moment:
> 1008 documents in a 3.5G media database. During a compact, the status in
> _active_tasks updates every 1000 documents, so you can imagine how
> useful that is :/ I thought it had hung (and neither the beam.smp CPU
> time nor the IO requests were a good indicator). I spent some time
> chasing this down as a bug before realising the problem was in the
> status granularity!

Actually, I don't think that affects my question at all. It may change how
we report things, though.
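For what it's worth, the "avg for last 100 requests" windowing I mentioned can be sketched with a ring buffer plus running sums, giving O(1) updates and memory bounded by the window size N. This is just an illustrative sketch in Python (CouchDB itself is Erlang, and the class/method names here are made up, not anything in the tree):

```python
from collections import deque


class WindowStats:
    """Mean/variance over the last N data points, O(1) per update.

    When the window is full, the oldest value is evicted and subtracted
    from the running sums -- which is exactly why the window has to be
    bounded by a count of data points rather than by a time interval.
    """

    def __init__(self, n):
        self.n = n
        self.window = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def record(self, x):
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.n:
            # Evict value number 0 so the sums cover only the last N values.
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.window)

    def variance(self):
        # E[X^2] - (E[X])^2 over the current window.
        m = self.mean()
        return self.total_sq / len(self.window) - m * m
```

Note this still keeps the last N raw values around (to know what to evict), so memory is O(N), but constant for a fixed window, which is the point.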
As in, it may be important to be able to report things that are not single
increment/decrement events, but instead allow adding arbitrary floating
point numbers to the set of recorded data points.

IMO, your specific use case only strengthens my argument for having
requests (or, more specifically, data points) be the dependent variable.
To explain the case more clearly: if we don't treat the number of
collected data points as the dependent variable, we are unable to
calculate extended statistics like variance/stddev. This is because if
the dependent variable is time, then the number of data points is
unbounded, and in that case we have unbounded memory usage. (I know of no
incremental algorithms for calculating these statistics without knowledge
of past values; I could be wrong.)

In other words, if we're doing stats over the last N values, then when we
store value number N+1, we must know value 0 so it can be removed from
the calculations. If the dependent variable is time, N can be arbitrarily
large, thus causing memory problems. Your use case just changes each data
point from an inc/dec op to a "store arbitrary number" op.

In any case, I'm not at all comfortable relying solely on my own
knowledge of calculating statistics incrementally, so hopefully there's a
stats buff out there who will feel compelled to weigh in.

HTH,
Paul Davis

P.S. For those of you wondering why standard deviation is important, I
reference the ever so eloquent Zed Shaw [1]: "Programmers Need To Learn
Statistics Or I Will Kill Them All." Also, he is right.

[1] http://www.zedshaw.com/rants/programmer_stats.html

> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> The ultimate measure of a man is not where he stands in moments of
> comfort and convenience, but where he stands at times of challenge and
> controversy.
>   -- Martin Luther King