couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: [jira] Created: (COUCHDB-396) Fixing weirdness in couch_stats_aggregator.erl
Date Sun, 28 Jun 2009 16:50:56 GMT
On Sun, Jun 28, 2009 at 10:28 AM, Robert
Dionne<dionne@dionne-associates.com> wrote:
> This seems like good improvement. I've only superficially reviewed the new
> version, but have a couple of trivial and one non-trivial comments:
>
> 1. The ini file is growing. This might be a good candidate for it's own ini
> file. It's somewhat orthogonal and could be nicely boxed out and ignored by
> folks with no interest in such things. The downside is the complexity of
> multiple files
>

Definitely agree. Another thing I noticed is that the startup message
with debug logging is also a bit longish. Noah's already got it setup
to allow for multiple files using the default.d/ directories, but I
wasn't quite certain on how to setup the build system

> 2. Would it be helpful to be able to enable/disable stats completely. These
> calculations must add some overhead.
>

That definitely seems reasonable though I'm not entirely certain how
best to implement this.

> 3. The use of moving averages is great, but as you comment there be quite a
> lot of variability within a given time interval. Moving averages are
> generally useful only over time, for example in making short term trading
> decisions a moving average can help guess the direction of the next
> reversion to a mean. In this scenario I would think peak usages would also
> be of value. One could maintain min/max stats with respect to these moving
> averages along with a time interval in order to identify hot spots.
>

Sounds reasonable. I'm not sure if min/max is more or less proper than
quartiles. Or maybe just different? My stats-fu is less than stellar.

> I'll have a closer look and write some tests
>
>
>
>
> On Jun 27, 2009, at 9:32 PM, Paul Joseph Davis (JIRA) wrote:
>
>> Fixing weirdness in couch_stats_aggregator.erl
>> ----------------------------------------------
>>
>>                Key: COUCHDB-396
>>                URL: https://issues.apache.org/jira/browse/COUCHDB-396
>>            Project: CouchDB
>>         Issue Type: Improvement
>>         Components: Database Core, HTTP Interface
>>   Affects Versions: 0.10
>>        Environment: trunk
>>           Reporter: Paul Joseph Davis
>>           Assignee: Paul Joseph Davis
>>            Fix For: 0.10
>>        Attachments: couchdb_stats_aggregator.patch
>>
>> Looking at adding unit tests to the couchdb_stats_aggregator module the
>> other day I realized it was doing some odd calculations. This is a fairly
>> non-trivial patch so I figured that I'd put in JIRA and get feed back before
>> applying. This patch does everything the old version does afaict, but I'll
>> be adding tests before I consider it complete.
>>
>> List of major changes:
>>
>> * The old behavior for stats was to integrate incoming values for a time
>> period and then reset the values and start integrating again. That seemed a
>> bit odd so I rewrote things to keep the average and standard deviation for
>> the last N seconds with approximately 1 sample per second.
>> * Changed request timing calculations [note below]
>> * Sample periods are configurable in the .ini file. Sample periods of 0
>> are a special case and integrate all values from couchdb boot up.
>> * Sample descriptions are in the configuration files now.
>> * You can request different time periods for the root stats end point.
>> * Added a sum to the list of statistics
>> * Simplified some of the external API
>>
>> The biggest change is in how time for requests are calculated. AFAICT, the
>> old way was accumulating request timings in the stats collector and just
>> adding new values as clock ticks went by as everything else does which makes
>> sense in the case of resetting counters every time period. In the new way
>> I'm keeping a list of the samples in the last time period and when I get a
>> clock tick part of the update is to remove the samples that have passed out
>> of the time period. For a variable like request_time this would lead to
>> unbounded storage.
>>
>> The new method is calculating the average time of all requests in a single
>> clock tick (1s). One thing this loses is when you start having lots of
>> variability in a single clock tick. Ie, your average request time is 100ms,
>> but 10% of your requests are taking 500ms. I've read of people doing the
>> averaging trick but also storing quantile information as well [1]. There are
>> also algorithms for doing single pass quantile estimation and the like so
>> its possible to do those things in O(N) time. The issue with quantiles is
>> that it'd start breaking the logic of how the collector and aggregators are
>> setup. As it is now, there's basically a one event -> one stat constraint.
>> For the time being I went without quartiles to minimize the impact of the
>> patch.
>>
>> This code will also be on github [3] as I add patches.
>>
>>
>> [1] http://code.flickr.com/blog/2008/10/27/counting-timing/
>> [2]
>> http://www.slamb.org/svn/repos/trunk/projects/loadtest/benchtools/stats.py (See
>> the QuantileEstimator class)
>> [3] http://github.com/davisp/couchdb/tree/stats-patch
>>
>>
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>
>

Mime
View raw message