incubator-couchdb-user mailing list archives

From: Dave Cottlehuber <d...@muse.net.nz>
Subject: Re: Performance of many documents vs large documents
Date: Wed, 11 Jan 2012 11:51:15 GMT
On 11 January 2012 04:07, Mahesh Paolini-Subramanya <mahesh@aptela.com> wrote:
> With a (somewhat. kinda. sorta. maybe.) similar requirement, I ended up doing this as follows:
>        (1) created a 'daily' database that data got dumped into in very small increments - approximately 5 docs/second
>        (2) uni-directionally replicated the documents out of this database into a 'reporting' database that I could suck data out of
>        (3) sucked data out of the reporting database at 15-minute intervals, processed it somewhat, and dumped all of *those* into one single (highly sharded) bigcouch db
>
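
For anyone wanting to copy (2): that one-way feed is just a continuous
replication. A minimal sketch with made-up database names, driving a local
node from Python with the requests library:

    import requests

    COUCH = "http://localhost:5984"   # assumed local node; add credentials if needed

    # One-way, continuous replication: 'daily' db -> 'reporting' db.
    # Storing it as a _replicator document keeps it running across restarts.
    repl_doc = {
        "source": COUCH + "/daily",       # hypothetical 'daily' database
        "target": COUCH + "/reporting",   # hypothetical 'reporting' database
        "continuous": True,
    }

    resp = requests.put(COUCH + "/_replicator/daily-to-reporting", json=repl_doc)
    resp.raise_for_status()
    print(resp.json())   # {'ok': True, 'id': 'daily-to-reporting', 'rev': '...'}

A one-off POST to /_replicate with the same body does the same job if you
don't need it to survive restarts.
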
> The advantages here were:
>        - My data was captured in the format best suited to the data-generating events (minimum processing of the event data), thanx to (1)
>        - The processing of this data did not impact the writing of the data, thanx to (2), allowing for maximum throughput
>        - I could compact and archive the 'daily' database every day, thus significantly minimizing disk space, thanx to (1). Also, we only retain the 'daily' data for 3 months, since anything beyond that is stale (for our purposes. YMMV)
>        - The collated data that ends up in bigcouch per (3) is much, *much* smaller. But if we end up needing a different collation (and yes, that happens every now and then), I can just rerun the reporting process (up to the last 3 months, of course). In fact, I can have multiple collations running in parallel...
>
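
And the daily compact-and-archive step is just a POST to _compact (admin
rights required). Roughly, with the same made-up naming:

    import requests

    COUCH = "http://localhost:5984"   # assumed local node
    db = "daily"                      # hypothetical per-day database

    # Trigger compaction; CouchDB wants an explicit JSON content type here.
    resp = requests.post(COUCH + "/" + db + "/_compact",
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    print(resp.json())   # {'ok': True}; compaction runs in the background
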
> Hope this helps. If you need more info, just ping me...
>
> Cheers
>
> Mahesh Paolini-Subramanya
> That Tall Bald Indian Guy...
> Google+  | Blog   | Twitter
>
> On Jan 11, 2012, at 4:13 AM, Martin Hewitt wrote:
>
>> Hi all,
>>
>> I'm currently scoping a project which will measure a variety of indicators over a long period, and I'm trying to work out where to strike the balance of document number vs document size.
>>
>> I could have one document per metric, leading to a small number of documents, but with each document containing ticks for every 5-second interval of any given day, these documents would quickly become huge.
>>
>> Clearly, I could decompose these huge per-metric documents down into smaller documents, and I'm in the fortunate position that, because I'm dealing with time, I can decompose by year, month, day, hour, minute or even second.
>>
>> Going all the way to second-level would clearly create a huge number of documents, but all of very small size, so that's the other extreme.
>>
>> I'm aware the usual response to this is "somewhere in the middle", which is my working hypothesis (decomposing to a "day" level), but I was wondering a) whether there is anything in CouchDB's architecture that would make one side of the "middle" more suited, or b) whether someone has experience architecting something like this.
>>
>> Any help gratefully appreciated.
>>
>> Martin
>

Simon & Mahesh,

These examples would be a great addition to the wiki :-))
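
Martin: on the many-vs-large question, one pattern worth a look is to let a
view do the time decomposition for you. Emit a [metric, year, month, day,
hour, minute, second] array as the key, reduce with _sum, and pick the
granularity at query time with group_level; the document layout then only
has to suit the writers. A rough sketch (database, design doc and field
names are all invented), pushing the design doc and querying it from Python:

    import json
    import requests

    COUCH = "http://localhost:5984"   # assumed local node
    DB = "metrics"                    # hypothetical database of per-tick docs

    # Hypothetical tick document:
    #   {"metric": "cpu", "time": "2012-01-11T11:51:15Z", "value": 0.42}
    # The map function splits the timestamp into an array key so any
    # granularity can be rolled up at query time with group_level.
    design_doc = {
        "_id": "_design/stats",
        "views": {
            "by_time": {
                "map": """
                    function (doc) {
                      if (doc.metric && doc.time && doc.value !== undefined) {
                        // split "2012-01-11T11:51:15Z" into year, month, day,
                        // hour, minute, second (kept as zero-padded strings)
                        var p = doc.time.split(/[-T:Z]/);
                        emit([doc.metric, p[0], p[1], p[2], p[3], p[4], p[5]],
                             doc.value);
                      }
                    }
                """,
                "reduce": "_sum",   # built-in; "_stats" adds min/max/avg too
            }
        },
    }

    # First run only; updating an existing design doc needs its current _rev.
    requests.put(COUCH + "/" + DB + "/_design/stats",
                 json=design_doc).raise_for_status()

    # Daily totals for one metric: group on [metric, year, month, day].
    params = {
        "group_level": 4,
        "startkey": json.dumps(["cpu", "2012", "01"]),
        "endkey": json.dumps(["cpu", "2012", "01", {}]),
    }
    resp = requests.get(COUCH + "/" + DB + "/_design/stats/_view/by_time",
                        params=params)
    print(resp.json()["rows"])

The nice part is that switching from day-level to hour-level rollups is just
a different group_level; the documents themselves don't have to change.
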

A+
Dave
