From: Mike Kimber
To: user@couchdb.apache.org
Date: Thu, 24 May 2012 02:32:33 -0400
Subject: FW: Am I doing something fundamentally wrong?

Didn't seem to get there the first time, so having another go.

Mike

From: Mike Kimber
Sent: 23 May 2012 12:08
To: user@couchdb.apache.org
Subject: Am I doing something fundamentally wrong??!!

I have been working with CouchDB for a short while now (I'm traditionally a DBA and inherited this CouchDB project, and yes, I know it's not SQL!).

We use CouchDB to store Maven build statistics. Every time a build runs, a statistics report is generated and uploaded to CouchDB. Our builds are big and we are aiming to bring them down in size, hence the collection of statistics: to identify areas to focus on, to demonstrate improvement, and to confirm that developers are adopting new practices as we roll them out. Now, I've enjoyed working with Couch: JavaScript is powerful, replication is magic, and the schemaless datastore, RESTful API, incremental map/reduce, etc. are all great.
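For context, each uploaded report ends up as a single JSON document along roughly these lines (a heavily cut-down, illustrative sketch; the field names here are made up for readability, and the real structure is in the Gist linked further down):

    {
      "_id": "build-20120523-1204",
      "project": "example-module",
      "durationMs": 5400000,
      "statistics": [
        { "type": "compile", "module": "module-a", "value": 312000 },
        { "type": "test",    "module": "module-a", "value": 98000 }
      ]
    }

The bulk of each document is long arrays of small name/value-style entries like the "statistics" array above.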
However, I am increasingly thinking CouchDB does not fit our use case, and I've been asking myself the following set of questions:

* Are we doing something wrong?
* Is CouchDB the correct data store for our use case?
* Is this really big data? It seems relatively small to me.
* Are our documents bigger and more complex than the average CouchDB use case?
* Would BigCouch make a difference?
* Are people really prepared to keep throwing hardware at a problem like this? Is that cheaper than developer time or software licenses?

A few statistics (last 6 months) that put our CouchDB implementation into perspective:

* Number of documents: 96,848
* Total size of documents: 52GB (627 docs over 10MB, largest 16MB; compressed it's 8.5GB)
* Average size of documents: 0.5MB
* Total number of array elements across all docs: 256 million
* Number of array element types: 37 (i.e. each has a different structure which we have to handle)
* Example document structure (cut down, as Gist could not cope with the full thing): https://gist.github.com/2774454
* Views (no reduces, just maps): https://gist.github.com/2774491 and https://gist.github.com/2774485
* Analytics server: 4 CPUs and 8GB of RAM, running on a VMware farm

So what's the issue that's making me question our choice of CouchDB? Well, building a single name/value-pair (NVP), null-key map with no reduce takes 6 hours and burns a full CPU for all of that time, i.e. it does not seem to be IO-bound or short of memory (it only seems to be able to use a single CPU/core, which is odd, Erlang and all). The "Build Profile Detail" map referenced above takes up to 15 hours to build. Once I know what I want, that's not necessarily a major issue, but it is when I need to discover/explore the data I need to analyse: the feedback loop for ad-hoc analysis is not practical. I know we live in the world of the clever compromise/workaround, so people will say "use a smaller subset". I have; it's 19 documents and they are not representative. So I create a map, think I have what I need, apply it to the main data set, wait 16 hours, and then find I've missed something. Also, if I want to change the ordering (key) or the type of grouping (reduce), I have to change the view and wait 16 hours again.

To shorten the feedback loop I've hooked up LucidDB via its CouchDB connector and loaded the data into it. This gives me a dramatically shorter loop: 51 seconds to change a grouping (reduce) over 256 million rows, versus 16 hours to rebuild a CouchDB view (which is then instant to query).

However, this also highlighted how much disk space CouchDB takes. The two views take up 480MB and 5.6GB respectively, but when I load the same data (minus the name part of each pair) into LucidDB (column-oriented) it takes up 655MB, indexes included. What is in a Couch view? We're on CouchDB 1.2, so the views should be compressed, and the data can't be that big. Which leads me back to my set of questions above.

This isn't meant to be a CouchDB-bashing post; in fact I'll be continuing to use it. I'm just looking to see if I'm doing something fundamentally wrong, have picked the wrong horse for our course, or just need to throw some hardware at it. CouchDB/LucidDB is a pretty decent combo, so if I could bring down the view build time in Couch I'd be happy; on the flip side, it seems a bit of an anti-pattern if I have to throw a load of hardware at it.

Thanks

Mike
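P.S. For anyone wondering what I mean by a "single NVP and null-key map", it's along these lines (simplified; the real maps are in the Gists above, and the field names match my illustrative sketch earlier rather than our actual documents):

    function (doc) {
      // One row per array element, with a null key:
      // no ordering and no grouping, just a flat dump of the pairs.
      (doc.statistics || []).forEach(function (entry) {
        emit(null, { name: entry.module, value: entry.value });
      });
    }

And this is what I mean about rebuilds: if I later decide I want rows keyed by type with a sum per group, the map itself has to change, e.g.

    function (doc) {
      // Keyed by [type, module] so rows sort and group by type first.
      (doc.statistics || []).forEach(function (entry) {
        emit([entry.type, entry.module], entry.value);
      });
    }

with the built-in "_sum" as the reduce. That edit gives the view a new signature, so CouchDB rebuilds the index from scratch over all 96,848 documents, hence the 16-hour wait every time.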