Message-ID: <4BCC643B.4030208@canonical.com>
Date: Mon, 19 Apr 2010 10:10:03 -0400
From: Eric Casteleijn <eric.casteleijn@canonical.com>
To: user@couchdb.apache.org
Subject: Re: getting most recent doc

On 04/19/2010 09:41 AM, Adam Kocoloski wrote:
> On Apr 17, 2010, at 11:09 AM, Eric Casteleijn wrote:
>
>> On 04/16/2010 04:46 AM, wolfgang haefelinger wrote:
>>> Thanks Robert
>>>
>>> for your answer. However, it is not exactly what I was looking for
>>> (due to my inappropriate problem description).
>>>
>>> Firstly, I do want to have the document instead of the time stamp in
>>> order to avoid that additional document fetch. That's obviously easy
>>> to fix:
>>>
>>> function(doc) { //
>>>   emit([doc.name, doc.timestamp], doc);
>>> }
>>
>> Don't do that. It's unnecessary, because you can call any view with '?include_docs=true' and it will add a 'doc' member to each row, containing the document. Worse, it's harmful, because it makes the indexes stored on disk many times larger than they need to be. (Depending on the size of your documents this can make a huge difference. Anecdotal evidence: gwibber used to do this, and when I changed it, the indexes stored on disk shrank by some 90%.)
>>
>> If you always want the whole document, just emit null for a value and always call the view with include_docs.
>>
>> If there are cases where you don't want the whole document, decide which data you need and only emit that.
>
> Hi Eric, I don't think it's correct to have a blanket recommendation to always use include_docs=true.  For large range queries on a view the query performance will be much better - up to 10x better throughput on large DBs in my experience - if the doc is already included.  Yes, the view index will balloon in size, but some people may be willing to make that tradeoff.  Cheers,

Oops, thanks for catching that, Adam, and my apologies: that was rather
myopic. I didn't think about the other side of the tradeoff, but that
makes a lot of sense.
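
For the archive, here is a minimal sketch of the two map functions being
compared. The emit() stand-in and the sample document are assumptions,
added only so the snippet runs outside CouchDB; inside a design document
you would keep just the map function you choose.

```javascript
// Stand-in for CouchDB's built-in emit(); it collects rows so the sketch
// can run outside CouchDB. (Assumption: not part of any design document.)
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Lean variant: no value in the index. Query the view with
// ?include_docs=true to have the document attached to each row.
function mapLean(doc) {
  emit([doc.name, doc.timestamp], null);
}

// Fat variant: the whole document is copied into the index. The index is
// larger on disk, but no per-row document lookup is needed at query time.
function mapFat(doc) {
  emit([doc.name, doc.timestamp], doc);
}

// Hypothetical sample document, for illustration only.
var sample = { _id: "a1", name: "wolfgang", timestamp: 1271686203, text: "hello" };

mapLean(sample);
mapFat(sample);
```

Which one is the better fit depends on whether disk space or range-query
throughput matters more for the deployment.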

I still wonder in that case if there is something you can do to shrink 
the stored views somewhat: gwibber had a number of views that emitted 
the whole document, but those documents (typically representing a 
twitter or identi.ca message) weren't very large in themselves. My
database, after compaction, was somewhere between 70 and 80 MB, whereas
the indexes took over a GB. Since gwibber+desktopcouch run on the
desktop, where only one client typically talks to couch, I still think
we made the right decision to sacrifice speed for disk space. On a
server, both are important though, considering we host multiple CouchDB
databases per user. Luckily we don't compute the views for the gwibber
dbs server-side, but I'm sure it's something we'll run into again elsewhere.
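
One middle ground, along the lines of "only emit the data you need" from
earlier in the thread, might look like the sketch below. The field names
(sender, text, raw) are hypothetical and not gwibber's actual schema, and
the emit() stand-in is only there so the snippet runs outside CouchDB.

```javascript
// Stand-in for CouchDB's built-in emit(), so the sketch runs on its own.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Emits only the fields a (hypothetical) message list would display,
// instead of the whole document.
function mapTrimmed(doc) {
  emit([doc.name, doc.timestamp], { sender: doc.sender, text: doc.text });
}

// Hypothetical message document; "raw" stands for data the list never shows.
var message = {
  _id: "m1",
  name: "twitter",
  timestamp: 1271686203,
  sender: "thisfred",
  text: "hello",
  raw: { source: "web", geo: null, entities: [] }
};

mapTrimmed(message);

// Rough size comparison of what the index would hold per row.
var storedBytes = JSON.stringify(rows[0].value).length;
var fullBytes = JSON.stringify(message).length;
console.log(storedBytes + " bytes emitted vs " + fullBytes + " for the whole doc");
```

How much this saves depends entirely on how much of each document the
view consumers actually use.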

-- 
eric casteleijn
https://code.launchpad.net/~thisfred
Canonical Ltd.