Subject: Re: Performance of many documents vs large documents
From: Dave Cottlehuber
To: user@couchdb.apache.org
Date: Wed, 11 Jan 2012 12:51:15 +0100

On 11 January 2012 04:07, Mahesh Paolini-Subramanya wrote:
> With a (somewhat. kinda. sorta. maybe.) similar requirement, I ended up doing this as follows:
>     (1) created a 'daily' database that data got dumped into in very small increments - approximately 5 docs/second
>     (2) uni-directionally replicated the documents out of this database into a 'reporting' database that I could suck data out of
>     (3) sucked data out of the reporting database at 15-minute intervals, processed them somewhat, and dumped all of *those* into one single (highly sharded) BigCouch db
>
> The advantages here were:
>     - My data was captured in the format best suited for the data-generating events (minimum processing of the event data), thanx to (1)
>     - The processing of this data did not impact the writing of the data, thanx to (2), allowing for maximum throughput
>     - I could compact and archive the 'daily' database every day, thus significantly minimizing disk space, thanx to (1). Also, we only retain the 'daily' data for 3 months, since anything beyond that is stale (for our purposes. YMMV)
>     - The collated data that ends up in BigCouch per (3) is much *much* smaller. But if we end up needing a different collation (and yes, that happens every now and then), I can just rerun the reporting process (up to the last 3 months, of course). In fact, I can have multiple collations running in parallel...
>
> Hope this helps. If you need more info, just ping me...
>
> Cheers
>
> Mahesh Paolini-Subramanya
> That Tall Bald Indian Guy...
> Google+  | Blog  | Twitter
>
> On Jan 11, 2012, at 4:13 AM, Martin Hewitt wrote:
>
>> Hi all,
>>
>> I'm currently scoping a project which will measure a variety of indicators over a long period, and I'm trying to work out where to strike the balance of document number vs document size.
>>
>> I could have one document per metric, leading to a small number of documents, but with each document containing ticks for every 5-second interval of any given day, these documents would quickly become huge.
>>
>> Clearly, I could decompose these huge per-metric documents down into smaller documents, and I'm in the fortunate position that, because I'm dealing with time, I can decompose by year, month, day, hour, minute or even second.
>>
>> Going all the way to second level would clearly create a huge number of documents, but all of very small size, so that's the other extreme.
>>
>> I'm aware the usual response to this is "somewhere in the middle", which is my working hypothesis (decomposing to a "day" level), but I was wondering if there was a) anything in CouchDB's architecture that would make one side of the "middle" more suited, or b) if someone has experience architecting something like this.
>>
>> Any help gratefully appreciated.
>>
>> Martin
>
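A minimal sketch of the pipeline Mahesh outlines in (1)-(3) above, in Python against CouchDB's plain HTTP API. The hostnames, database names, polling loop and the collate() roll-up are assumptions for illustration only; the _replicate, _changes and _bulk_docs endpoints are the standard API, but this is not Mahesh's actual code:

    # daily -> reporting -> BigCouch pipeline sketch
    import time
    import requests

    COUCH     = "http://127.0.0.1:5984"
    REPORTING = COUCH + "/reporting"
    BIGCOUCH  = "http://bigcouch.example.com:5984/collated"  # hypothetical target

    def start_replication():
        # (2) continuous one-way replication out of the write-heavy 'daily' db
        requests.post(COUCH + "/_replicate",
                      json={"source": "daily", "target": "reporting",
                            "continuous": True})

    def collate(docs):
        # (3) stand-in for whatever 15-minute roll-up you actually need
        return [{"type": "rollup", "ts": int(time.time()), "count": len(docs)}]

    def run_batch(since):
        # pull everything new from 'reporting' since the last checkpoint
        changes = requests.get(REPORTING + "/_changes",
                               params={"since": since,
                                       "include_docs": "true"}).json()
        docs = [r["doc"] for r in changes["results"] if r.get("doc")]
        if docs:
            requests.post(BIGCOUCH + "/_bulk_docs", json={"docs": collate(docs)})
        return changes["last_seq"]

    if __name__ == "__main__":
        start_replication()
        seq = 0
        while True:
            seq = run_batch(seq)
            time.sleep(15 * 60)  # (3) every 15 minutes

Because the batch job only ever reads from 'reporting', the 'daily' database stays free for fast writes and can be compacted or archived on its own schedule, which is the point of step (1).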
Simon & Mahesh,

These examples would be a great addition to the wiki :-))

A+
Dave
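On Martin's granularity question, a sketch of the "day" level split he mentions: one document per metric per day, holding that day's 5-second samples keyed by second-of-day, so each document tops out at 86400/5 = 17,280 entries. The 'metrics' database, the id scheme and the field names here are assumptions, not an established convention:

    import requests

    DB = "http://127.0.0.1:5984/metrics"

    def doc_id(metric, day):
        # deterministic id, e.g. "cpu_load:2012-01-11", so a given day's
        # samples always land in the same document
        return "%s:%s" % (metric, day)

    def add_sample(metric, day, seconds_since_midnight, value):
        url = "%s/%s" % (DB, doc_id(metric, day))
        resp = requests.get(url)
        doc = resp.json() if resp.status_code == 200 else \
              {"metric": metric, "day": day, "samples": {}}
        # one entry per 5-second tick
        doc["samples"][str(seconds_since_midnight)] = value
        # naive fetch-modify-put; concurrent writers would need to retry on 409
        requests.put(url, json=doc)

    # add_sample("cpu_load", "2012-01-11", 43515, 0.73)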