couchdb-user mailing list archives

From Mahesh Paolini-Subramanya <mah...@aptela.com>
Subject Re: Performance of many documents vs large documents
Date Wed, 11 Jan 2012 03:07:59 GMT
With a (somewhat. kinda. sorta. maybe.) similar requirement, I ended up handling it as follows:
	(1) created a 'daily' database that the data got dumped into in very small increments - approximately 5 docs/second
	(2) uni-directionally replicated the documents out of this database into a 'reporting' database that I could suck data out of (a sketch of the replication setup is just below this list)
	(3) sucked data out of the reporting database at 15-minute intervals, processed it somewhat, and dumped all of *those* into one single (highly sharded) bigcouch db
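For what it's worth, the replication in (2) is nothing exotic - it's a single one-way, continuous replication from 'daily' into 'reporting'. A minimal, untested sketch of the setup, assuming CouchDB's _replicator database and python with the 'requests' library (the host, doc id and db names are placeholders, not our actual config):

    # Sketch only: one-way continuous replication from 'daily' into 'reporting'.
    # Assumes CouchDB's _replicator database; host and db names are placeholders.
    import requests

    COUCH = "http://localhost:5984"

    def start_replication():
        repl_doc = {
            "_id": "daily-to-reporting",     # any id will do
            "source": COUCH + "/daily",      # the write-heavy db from (1)
            "target": COUCH + "/reporting",  # the db the reporting job reads from
            "continuous": True,              # keep replicating as new docs arrive
        }
        resp = requests.post(COUCH + "/_replicator", json=repl_doc)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(start_replication())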
	
The advantages here were:
	- My data was captured in the format best suited for the data-generating events (minimum processing of the event data), thanx to (1)
	- The processing of this data did not impact the writing of the data, thanx to (2), allowing for maximum throughput
	- I could compact and archive the 'daily' database every day, thus significantly minimizing disk space, thanx to (1). Also, we only retain the 'daily' data for 3 months, since anything beyond that is stale (for our purposes. YMMV)
	- The collated data that ends up in bigcouch per (3) is much *much* smaller. But if we end up needing a different collation (and yes, that happens every now and then), I can just rerun the reporting process (over the last 3 months only, of course). In fact, I can have multiple collations running in parallel (a rough sketch of the reporting pass is below)...
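To make (3) and the compact-and-archive bit concrete: the reporting pass is conceptually just "read the reporting db's _changes feed since the last checkpoint, collate, bulk-write the (much smaller) result into bigcouch, and compact the daily db once a day". Another rough, untested sketch with the same caveats (python + 'requests'; collate() and the host/db names are placeholders, not our actual code):

    # Rough sketch of the 15-minute reporting pass plus the daily compaction.
    # collate() is a placeholder for whatever aggregation you actually need.
    import requests

    COUCH = "http://localhost:5984"                # the 'daily'/'reporting' host
    BIGCOUCH = "http://bigcouch.example.com:5984"  # placeholder bigcouch cluster

    def pull_since(last_seq):
        """Read everything written to 'reporting' since the last checkpoint."""
        resp = requests.get(
            COUCH + "/reporting/_changes",
            params={"since": last_seq, "include_docs": "true"},
        )
        resp.raise_for_status()
        body = resp.json()
        docs = [row["doc"] for row in body["results"] if "doc" in row]
        return docs, body["last_seq"]

    def collate(docs):
        # Placeholder for the real aggregation: whatever it builds should be
        # new, much smaller documents (don't carry the raw docs' _id/_rev over).
        return [{"type": "collated_batch", "doc_count": len(docs)}]

    def push_to_bigcouch(docs):
        resp = requests.post(BIGCOUCH + "/collated/_bulk_docs", json={"docs": docs})
        resp.raise_for_status()

    def compact_daily():
        """Run once a day, after the day's data has replicated out."""
        resp = requests.post(
            COUCH + "/daily/_compact",
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        last_seq = 0  # persist this between runs in the real 15-minute job
        docs, last_seq = pull_since(last_seq)
        if docs:
            push_to_bigcouch(collate(docs))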

Hope this helps. If you need more info, just ping me...

Cheers

Mahesh Paolini-Subramanya
That Tall Bald Indian Guy...
Google+  | Blog   | Twitter

On Jan 11, 2012, at 4:13 AM, Martin Hewitt wrote:

> Hi all,
> 
> I'm currently scoping a project which will measure a variety of indicators over a long period, and I'm trying to work out where to strike the balance of document number vs document size.
> 
> I could have one document per metric, leading to a small number of documents, but with each document containing ticks for every 5-second interval of any given day, these documents would quickly become huge.
> 
> Clearly, I could decompose these huge per-metric documents down into smaller documents, and I'm in the fortunate position that, because I'm dealing with time, I can decompose by year, month, day, hour, minute or even second.
> 
> Going all the way to second-level would clearly create a huge number of documents, but all of very small size, so that's the other extreme.
> 
> I'm aware the usual response to this is "somewhere in the middle", which is my working hypothesis (decomposing to a "day" level), but I was wondering a) whether there's anything in CouchDB's architecture that would make one side of the "middle" more suited, or b) whether someone has experience architecting something like this.
> 
> Any help gratefully appreciated.
> 
> Martin

