I have been working with CouchDB for a short while now (I'm traditionally a DBA and inherited this CouchDB project, and yes, I know it's not SQL!!!).
We use CouchDB to store Maven build statistics. Every time a build is run, a statistics report is generated and uploaded to CouchDB. Our builds are big and we are aiming to bring them down in size, hence the collection of statistics for analysis: to identify areas to focus on, demonstrate improvement and confirm that developers are adopting new practices as we roll them out. Now, I've enjoyed working with Couch: JavaScript is powerful, replication is magic, and there's the schema-less datastore, the RESTful API, incremental map/reduce, etc. However, I am increasingly thinking CouchDB does not fit our use case, and I've been asking myself the following set of questions:
A few statistics (last 6 months) that put our CouchDB implementation into perspective:
So what's the issue that's making me question our choice of CouchDB? Well, a view with a single name/value pair (NVP) map, a null key and no reduce takes 6 hours to build and burns a full CPU for all that time, i.e. it does not seem to be IO bound or short of memory (it only seems able to use a single CPU/core, which is odd, Erlang and all). The "Build Profile Detail" map referenced above takes up to 15 hours to build. Now, once I know what I want that's not necessarily a major issue, but it is when I need to discover/explore the data that I need to analyse. The feedback loop for ad-hoc analysis is not practical. I know we live in the world of the clever compromise/workaround, so people will say use a smaller subset. I have: it's 19 documents and they are not representative. So I create a map, think I have what I need, apply it to the main data set, wait 16 hours and then find that I've missed something. Also, if I want to change the ordering (key) or the type of grouping (reduce), I have to change the view and wait 16 hours again.
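To make the first view concrete, a map of that shape might look roughly like the sketch below. The document fields (`type`, `stats`), the sample values and the name `mapStats` are assumptions for illustration, not the real schema or design document; in CouchDB itself the map would be an anonymous `function (doc) { ... }` and `emit` is supplied by the view server, so the small harness here only exists to let the sketch run locally.

```javascript
// Hypothetical document shape: { type: "build-stats", stats: { name: value, ... } }
// The field names "type" and "stats" are assumptions, not the real schema.

// Map function: emits one row per name/value pair, always with a null key.
// No reduce is defined, so the view is just the flat list of emitted rows.
function mapStats(doc) {
  if (doc.type !== "build-stats" || !doc.stats) {
    return; // skip documents that are not statistics reports
  }
  Object.keys(doc.stats).forEach(function (name) {
    emit(null, { name: name, value: doc.stats[name] });
  });
}

// Local stand-in for the view server: collects whatever the map emits.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Two made-up documents; only the first matches the assumed shape.
var sampleDocs = [
  { _id: "build-1", type: "build-stats", stats: { durationSeconds: 5400, moduleCount: 120 } },
  { _id: "build-2", type: "other" }
];
sampleDocs.forEach(mapStats);

console.log(JSON.stringify(rows, null, 2));
```

Because every row shares the same null key, the view carries no useful ordering; any change to the key or the addition of a reduce means editing the map and rebuilding the whole index.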
To shorten the feedback loop I've hooked up LucidDB using its CouchDB connector and loaded the data into it. This gives me a significantly shorter feedback loop, i.e. 51 seconds to change a grouping (reduce) on 256 million rows, rather than 16 hours to rebuild a view for instant access.
However, this also highlighted how much disk space CouchDB takes. The two views take up 480MB and 5.6GB respectively, but when I load them into LucidDB (column-oriented) the same data (minus the name part of the pair) takes up 655MB (with indexes added). So what's in a Couch view? We run CouchDB 1.2, so views should be compressed; the data can’t be that big. Which leads me back to my set of questions above.
This isn't aiming to be a CouchDB-bashing post; in fact I'll be continuing to use it. I'm just looking to see if I'm doing something fundamentally wrong, have just picked the wrong horse for our course, or just need to throw some hardware at it. CouchDB/LucidDB is a pretty decent combo, so if I could bring down the view build time in Couch then I'd be happy, but on the flip side it seems to be a bit of an anti-pattern if I have to throw a load of hardware at it.