incubator-couchdb-user mailing list archives

From Mike Kimber <mkim...@kana.com>
Subject FW: Am I doing something fundamentally wrong?
Date Thu, 24 May 2012 06:32:33 GMT
This didn't seem to get there the first time, so I'm having another go.

Mike

From: Mike Kimber
Sent: 23 May 2012 12:08
To: user@couchdb.apache.org
Subject: Am I doing something fundamentally wrong??!!

I have been working with CouchDB for a short while now (I'm traditionally a DBA and inherited
this CouchDB project, and yes, I know it's not SQL!).
We use CouchDB to store Maven build statistics. Every time a build is run, a statistics report
is generated and uploaded to CouchDB. Our builds are big and we are aiming to bring them down
in size, hence the collection of statistics for analysis: to identify areas to focus on, demonstrate
improvement, and confirm that developers are adopting new practices as we roll them out. Now,
I've enjoyed working with Couch: JavaScript is powerful, replication is magic, and there's the
schemaless datastore, RESTful API, incremental map/reduce, etc. However, I am increasingly
thinking CouchDB does not fit our use case, and I've been asking myself the following set of questions:

 *   Are we doing something wrong?
 *   Is CouchDB the correct data store for our use case?
 *   Is this really big data? It seems relatively small to me.
 *   Are our documents bigger and more complex than the average CouchDB use case?
 *   Would BigCouch make a difference?
 *   Are people really prepared to continue to throw hardware at a problem like this? Is that
cheaper than developer time or software licenses?
A few statistics (last 6 months) that put our CouchDB implementation into perspective:

 *   Number of Documents: 96,848
 *   Total Size of Documents: 52GB (627 docs over 10MB, largest 16MB); compressed, it's 8.5GB
 *   Average Size of Documents: 0.5MB
 *   Total Number of Array Elements in all docs: 256 Million
 *   Number of Array Element Types: 37 (i.e. each has a different structure which we have
to handle)
 *   Example Document Structure (cut down, as Gist could not cope!): https://gist.github.com/2774454
 *   Views (no reduce, just maps): https://gist.github.com/2774491 and https://gist.github.com/2774485
 *   Analytics Server: 4 CPUs and 8GB of RAM, running on a VMware farm
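
As a rough sanity check on the figures in the list above, the per-document averages can be recomputed from the totals. This is only a sketch: the variable names are illustrative, and it assumes 1 GB = 1024 MB, which the post does not state.

```python
# Figures quoted in the list above (last 6 months of build reports).
num_docs = 96_848
total_size_gb = 52
total_array_elements = 256_000_000

# Assumption: 1 GB = 1024 MB. The post does not say which convention it uses.
avg_doc_size_mb = total_size_gb * 1024 / num_docs
avg_elements_per_doc = total_array_elements / num_docs

print(f"average document size: {avg_doc_size_mb:.2f} MB")
print(f"average array elements per document: {avg_elements_per_doc:,.0f}")
```

The result is roughly half a megabyte per document, in line with the quoted average, and somewhere around 2,600 array elements per document.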
So what's the issue that's making me question our choice of CouchDB? Well, building a single
map-only view (no reduce) that emits a name/value pair (NVP) with a null key takes 6 hours and
burns a full CPU for all that time, i.e. it does not seem to be IO bound or short of memory (it
only seems to be able to use a single CPU/core, which is odd, Erlang and all). The "Build Profile
Detail" map referenced above takes up to 15 hours to build. Now, once I know what I want, that's
not necessarily a major issue, but it is when I need to discover/explore the data that I need to
analyse. The feedback loop for ad-hoc analysis is not practical. Now, I know we live in the world
of the clever compromise/workaround, so people will say "use a smaller subset". I have: it's 19
documents and they are not representative. So I create a map, think I have what I need, apply it
to the main data set, wait 16 hours, and then find that I've missed something. Also, if I want to
change the ordering (key) or the type of grouping (reduce), I have to change the view and wait
16 hours again.
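
To put those build times in perspective, here is a back-of-the-envelope throughput calculation. It is a sketch, not a measurement: it assumes each view build visits every document and every array element quoted earlier in the post (the views may in fact touch only a subset), and the helper name `rate_per_second` is made up for illustration.

```python
# Totals quoted earlier in the post.
num_docs = 96_848
total_array_elements = 256_000_000


def rate_per_second(count, hours):
    """Average items processed per second over a build lasting `hours`."""
    return count / (hours * 3600)


# Build times quoted in the post; the assumption that every element is
# visited by each view is ours, not the author's.
for label, hours in [("NVP / null-key map", 6), ("Build Profile Detail map", 15)]:
    docs_per_s = rate_per_second(num_docs, hours)
    elements_per_s = rate_per_second(total_array_elements, hours)
    print(f"{label}: {docs_per_s:.1f} docs/s, {elements_per_s:,.0f} elements/s")
```

Under those assumptions the 6-hour build works out to only a handful of documents per second on one core, which fits the observation that the build is CPU bound rather than IO bound.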
To shorten the feedback loop I've hooked up LucidDB using its CouchDB connector and loaded
the data into it. This gives me a significantly shorter feedback loop, i.e. 51 seconds
to change a grouping (reduce) on 256 million rows, rather than 16 hours to rebuild a view for
instant access.
However, this also highlighted how much disk space CouchDB takes. The two views take up 480MB
and 5.6GB respectively, but when I load them into LucidDB (column-oriented) the same data
(minus the name part of the pair) takes up 655MB (with indexes added). What's in a Couch view
(we have CouchDB 1.2, so they should be compressed; the data can't be that big)? Which leads
me back to my set of questions above.
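
The two comparisons with LucidDB can be expressed as ratios. This is a rough sketch using only the numbers quoted in the post, assuming 1 GB = 1024 MB. The storage comparison is not quite like for like, since the LucidDB copy omits the name part of each pair.

```python
# Feedback loop: rebuilding a CouchDB view vs. regrouping in LucidDB.
view_rebuild_seconds = 16 * 3600
luciddb_regroup_seconds = 51
speedup = view_rebuild_seconds / luciddb_regroup_seconds

# Disk usage: two CouchDB views vs. the LucidDB copy (indexes included).
# Assumption: 1 GB = 1024 MB.
couch_views_mb = 480 + 5.6 * 1024
luciddb_mb = 655
storage_ratio = couch_views_mb / luciddb_mb

print(f"regrouping is about {speedup:,.0f}x faster in LucidDB")
print(f"the CouchDB views are about {storage_ratio:.1f}x larger on disk")
```

That is a difference of roughly three orders of magnitude in turnaround time, and a little over nine times the disk footprint.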
This isn't aiming to be a CouchDB-bashing post; in fact I'll be continuing to use it. I'm just
looking to see if I'm doing something fundamentally wrong, have just picked the wrong horse
for our course, or just need to throw some hardware at it. CouchDB/LucidDB is a pretty decent
combo, so if I could bring down the view build time in Couch then I'd be happy, but on the
flip side it seems to be a bit of an anti-pattern if I have to throw a load of hardware at
it.
Thanks
Mike
