From: Mike Kimber
To: user@couchdb.apache.org
Date: Thu, 24 May 2012 02:32:33 -0400
Subject: FW: Am I doing something fundamentally wrong?

Didn't seem to get there the first time, so having another go.

Mike

From: Mike Kimber
Sent: 23 May 2012 12:08
To: user@couchdb.apache.org
Subject: Am I doing something fundamentally wrong??!!

I have been working with CouchDB for a short while now (I'm traditionally a DBA and inherited this CouchDB project, and yes, I know it's not SQL!).

We use CouchDB to store Maven build statistics. Every time a build runs, a statistics report is generated and uploaded to CouchDB. Our builds are big and we are aiming to bring them down in size, hence the collection of statistics: to identify areas to focus on, to demonstrate improvement, and to confirm that developers are adopting new practices as we roll them out. Now, I've enjoyed working with Couch: JavaScript is powerful, replication is magic, and the schemaless datastore, RESTful API, incremental map/reduce, etc. are all great.
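For context, each uploaded report ends up as a single JSON document along roughly these lines (a heavily cut-down, illustrative sketch; the field names here are made up for readability, and the real structure is in the Gist linked further down):

    {
      "_id": "build-20120523-1204",
      "project": "example-module",
      "durationMs": 5400000,
      "statistics": [
        { "type": "compile", "module": "module-a", "value": 312000 },
        { "type": "test",    "module": "module-a", "value": 98000 }
      ]
    }

The bulk of each document is long arrays of small name/value-style entries like the "statistics" array above.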
However, I am increasingly thinking CouchDB does not fit our use case, and I've been asking myself the following set of questions:

* Are we doing something wrong?
* Is CouchDB the correct data store for our use case?
* Is this really big data? It seems relatively small to me.
* Are our documents bigger and more complex than the average CouchDB use case?
* Would BigCouch make a difference?
* Are people really prepared to keep throwing hardware at a problem like this? Is that cheaper than developer time or software licenses?

A few statistics (last 6 months) that put our CouchDB implementation into perspective:

* Number of documents: 96,848
* Total size of documents: 52GB (627 docs over 10MB, largest 16MB; compressed it's 8.5GB)
* Average size of documents: 0.5MB
* Total number of array elements across all docs: 256 million
* Number of array element types: 37 (i.e. each has a different structure which we have to handle)
* Example document structure (cut down, as Gist could not cope with the full thing): https://gist.github.com/2774454
* Views (no reduces, just maps): https://gist.github.com/2774491 and https://gist.github.com/2774485
* Analytics server: 4 CPUs and 8GB of RAM, running on a VMware farm

So what's the issue that's making me question our choice of CouchDB? Well, building a single name/value-pair (NVP), null-key map with no reduce takes 6 hours and burns a full CPU for all of that time, i.e. it does not seem to be IO-bound or short of memory (it only seems to be able to use a single CPU/core, which is odd, Erlang and all). The "Build Profile Detail" map referenced above takes up to 15 hours to build. Once I know what I want, that's not necessarily a major issue, but it is when I need to discover/explore the data I need to analyse: the feedback loop for ad-hoc analysis is not practical. I know we live in the world of the clever compromise/workaround, so people will say "use a smaller subset". I have; it's 19 documents and they are not representative. So I create a map, think I have what I need, apply it to the main data set, wait 16 hours, and then find I've missed something. Also, if I want to change the ordering (key) or the type of grouping (reduce), I have to change the view and wait 16 hours again.

To shorten the feedback loop I've hooked up LucidDB via its CouchDB connector and loaded the data into it. This gives me a dramatically shorter loop: 51 seconds to change a grouping (reduce) over 256 million rows, versus 16 hours to rebuild a CouchDB view (which is then instant to query).

However, this also highlighted how much disk space CouchDB takes. The two views take up 480MB and 5.6GB respectively, but when I load the same data (minus the name part of each pair) into LucidDB (column-oriented) it takes up 655MB, indexes included. What is in a Couch view? We're on CouchDB 1.2, so the views should be compressed, and the data can't be that big. Which leads me back to my set of questions above.

This isn't meant to be a CouchDB-bashing post; in fact I'll be continuing to use it. I'm just looking to see if I'm doing something fundamentally wrong, have picked the wrong horse for our course, or just need to throw some hardware at it. CouchDB/LucidDB is a pretty decent combo, so if I could bring down the view build time in Couch I'd be happy; on the flip side, it seems a bit of an anti-pattern if I have to throw a load of hardware at it.

Thanks

Mike
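P.S. For anyone wondering what I mean by a "single NVP and null-key map", it's along these lines (simplified; the real maps are in the Gists above, and the field names match my illustrative sketch earlier rather than our actual documents):

    function (doc) {
      // One row per array element, with a null key:
      // no ordering and no grouping, just a flat dump of the pairs.
      (doc.statistics || []).forEach(function (entry) {
        emit(null, { name: entry.module, value: entry.value });
      });
    }

And this is what I mean about rebuilds: if I later decide I want rows keyed by type with a sum per group, the map itself has to change, e.g.

    function (doc) {
      // Keyed by [type, module] so rows sort and group by type first.
      (doc.statistics || []).forEach(function (entry) {
        emit([entry.type, entry.module], entry.value);
      });
    }

with the built-in "_sum" as the reduce. That edit gives the view a new signature, so CouchDB rebuilds the index from scratch over all 96,848 documents, hence the 16-hour wait every time.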