couchdb-user mailing list archives

From Mike Kimber <mkim...@kana.com>
Subject Am I doing something fundamentally wrong??!!
Date Wed, 23 May 2012 11:08:09 GMT
I have been working with CouchDB for a short while now (I'm traditionally a DBA and inherited
this CouchDB project, and yes, I know it's not SQL!).
We use CouchDB to store Maven build statistics. Every time a build is run, a statistics report
is generated and uploaded to CouchDB. Our builds are big and we are aiming to bring them down
in size, hence the collection of statistics for analysis: to identify areas to focus on, demonstrate
improvement, and confirm that developers are adopting new practices as we roll them out. Now,
I've enjoyed working with Couch: JavaScript is powerful, replication is magic, and there's the
schemaless datastore, RESTful API, incremental map/reduce, etc. However, I am increasingly thinking
CouchDB does not fit our use case, and I've been asking myself the following set of questions:

 *   Are we doing something wrong?
 *   Is CouchDB the correct data store for our use case?
 *   Is this really big data? It seems relatively small to me.
 *   Are our documents bigger and more complex than the average CouchDB use case?
 *   Would BigCouch make a difference?
 *   Are people really prepared to continue to throw hardware at a problem like this? Is that
cheaper than developer time or software licenses?
A few statistics (last 6 months) that put our CouchDB implementation into perspective:

 *   Number of Documents: 96,848
 *   Total Size of Documents: 52GB (627 over 10MB, largest 16MB)
 *   Average Size of Documents: 0.5MB
 *   Total Number of Array Elements in all docs: 256 Million
 *   Number of Array Element Types: 37 (i.e. each has a different structure which we have
to handle)
 *   Example Document Structure (cut down, as Gist could not cope!): https://gist.github.com/2774454
 *   Views (no reduce, just maps): https://gist.github.com/2774491 and https://gist.github.com/2774485
 *   Analytics Server: 4 CPUs and 8GB of RAM, running on a VMware farm
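As a rough sanity check on the figures in the list above, here is a minimal Python sketch. It uses only the numbers quoted in this message; treating 1GB as 1024MB is an assumption, and the variable names are purely illustrative.

```python
# Figures quoted in the statistics list above.
NUM_DOCS = 96_848
TOTAL_SIZE_GB = 52
TOTAL_ARRAY_ELEMENTS = 256_000_000

# Assumption: 1GB = 1024MB (the message does not say which unit is meant).
total_size_mb = TOTAL_SIZE_GB * 1024

avg_doc_mb = total_size_mb / NUM_DOCS
elements_per_doc = TOTAL_ARRAY_ELEMENTS / NUM_DOCS

print(f"average document size: {avg_doc_mb:.2f} MB")
print(f"average array elements per document: {elements_per_doc:.0f}")
```

The average size comes out close to the 0.5MB quoted, and each document carries a couple of thousand array elements on average, which is what any per-element map function has to walk.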
So what's the issue that's making me question our choice of CouchDB? Well, building a single
map-only view (an NVP and a null key, no reduce) takes 6 hours and burns a full CPU for all
that time, i.e. it does not seem to be IO bound or short of memory (it only seems able
to use a single CPU/core, which is odd, Erlang and all). The "Build Profile Detail" map referenced
above takes up to 15 hours to build. Now, once I know what I want, that's not necessarily a
major issue, but it is when I need to discover/explore the data that I need to analyse. The
feedback loop for ad-hoc analysis is not practical. Now, I know we live in the world of the
clever compromise/workaround, so people will say "use a smaller subset". I have: it's 19 documents
and they are not representative. So I create a map, think I have what I need, apply it to the main
data set, wait 16 hours and then find that I've missed something. Also, if I want to change
the ordering (key) or the type of grouping (reduce), I have to change the view and wait
16 hours again.
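To put the build times above in throughput terms, here is a back-of-the-envelope sketch. It assumes a full view build reads each of the 96,848 documents once and that the 52GB total is the raw document size (1GB taken as 1024MB); the function name is illustrative only.

```python
NUM_DOCS = 96_848
TOTAL_SIZE_MB = 52 * 1024  # assumption: 1GB = 1024MB


def build_throughput(hours):
    """Return (documents per second, MB per second) for a full view build."""
    seconds = hours * 3600
    return NUM_DOCS / seconds, TOTAL_SIZE_MB / seconds


# 6 hours: the simple map view; 16 hours: the wait quoted for the larger view.
for label, hours in [("simple map view", 6), ("larger view", 16)]:
    docs_per_s, mb_per_s = build_throughput(hours)
    print(f"{label}: {docs_per_s:.1f} docs/s, {mb_per_s:.2f} MB/s")
```

Even the faster build works out to only a handful of documents per second on one core, which fits the observation that the build is CPU bound rather than IO bound.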
To shorten the feedback loop I've hooked up LucidDB using its CouchDB connector and loaded
the data into it. This gives me a significantly shorter feedback loop, i.e. 51 seconds
to change a grouping (reduce) on 256 million rows, rather than 16 hours to rebuild a view for
instant access.
However, this also highlighted how much disk space CouchDB takes. The two views take up 480MB
and 5.6GB respectively, but when I load them into LucidDB (column oriented) the same data
(minus the name part of the pair) takes up 655MB (with indexes added). What's in a Couch view?
(We have CouchDB 1.2, so views should be compressed; the data can't be that big.) Which leads
me back to my set of questions above.
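For scale, the two LucidDB comparisons above work out roughly as follows. This is a sketch using only the figures in this message, with 1GB taken as 1024MB; the size comparison is not like-for-like, since the LucidDB copy drops the name part of each pair and includes indexes.

```python
# Feedback loop: rebuilding a view vs. re-running a grouping in LucidDB.
view_rebuild_seconds = 16 * 3600
luciddb_query_seconds = 51
speedup = view_rebuild_seconds / luciddb_query_seconds

# Disk usage: the two CouchDB views vs. the LucidDB tables plus indexes.
couch_views_mb = 480 + 5.6 * 1024  # assumption: 1GB = 1024MB
luciddb_mb = 655
size_ratio = couch_views_mb / luciddb_mb

print(f"feedback loop is about {speedup:.0f}x shorter")
print(f"view files are about {size_ratio:.1f}x larger")
```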
This isn't meant to be a CouchDB-bashing post; in fact I'll be continuing to use it. I'm just
looking to see if I'm doing something fundamentally wrong, have just picked the wrong horse
for our course, or just need to throw some hardware at it. CouchDB/LucidDB is a pretty decent
combo, so if I could bring down the view build time in Couch then I'd be happy, but on the
flip side it seems to be a bit of an anti-pattern if I have to throw a load of hardware at
it.
Thanks
Mike



MICHAEL KIMBER

PRINCIPAL ARCHITECT
