couchdb-user mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: getting most recent doc
Date Mon, 19 Apr 2010 14:22:42 GMT
On Apr 19, 2010, at 10:10 AM, Eric Casteleijn wrote:

> On 04/19/2010 09:41 AM, Adam Kocoloski wrote:
>> On Apr 17, 2010, at 11:09 AM, Eric Casteleijn wrote:
>> 
>>> On 04/16/2010 04:46 AM, wolfgang haefelinger wrote:
>>>> Thanks Robert
>>>> 
>>>> for your answer. However, it is not exactly what I was looking for
>>>> (due to my inappropriate problem description).
>>>> 
>>>> Firstly, I do want to have the document instead of the time stamp in
>>>> order to avoid that additional document fetch. That's obviously easy
>>>> to fix:
>>>> 
>>>> function(doc) {
>>>>  emit([doc.name, doc.timestamp], doc);
>>>> }
>>> 
>>> Don't do that. It's unnecessary, because you can always call any view with
>>> '?include_docs=true' and it will add a 'doc' member to each row, containing
>>> the document. Worse than that, it's harmful, as it makes the indexes stored
>>> on disk many times larger than they need to be. (Depending on the size of
>>> your documents this can really make a huge difference. Anecdotal evidence:
>>> gwibber used to do this, and when I changed it, the indexes stored on disk
>>> decreased some 90% in size.)
>>> 
>>> If you always want the whole document, just emit null for a value and always
>>> call the view with include_docs.
>>> 
>>> If there are cases where you don't want the whole document, decide which data
>>> you need and only emit that.
>> 
>> Hi Eric, I don't think it's correct to have a blanket recommendation to always
>> use include_docs=true. For large range queries on a view the query performance
>> will be much better - up to 10x better throughput on large DBs in my
>> experience - if the doc is already included. Yes, the view index will balloon
>> in size, but some people may be willing to make that tradeoff. Cheers,
> 
> Oops, thanks for catching that Adam, and my apologies, that was rather myopic.
> I didn't think about the other side of the tradeoff, but that makes a lot of
> sense.
> 
> I still wonder in that case if there is something you can do to shrink the
> stored views somewhat: gwibber had a number of views that emitted the whole
> document, but those documents (typically representing a twitter or identi.ca
> message) weren't very large in themselves. My database, after compaction, was
> somewhere between 70 and 80 MB, whereas the indexes took over a GB. Since
> gwibber+desktopcouch run on the desktop, where only one client typically talks
> to couch, I still think we made the right decision to sacrifice speed for disk
> space. On a server, both are important though, considering we host multiple
> couchdbs per user. Luckily we don't compute the views for the gwibber dbs
> server side, but I'm sure it's something we'll run into again elsewhere.
> 

Were the view indices also compacted?  If so, that's very surprising to me.  I should double-check
our numbers, but I seem to remember the compacted view indices for our case (which had similarly-sized
documents) being comparable in size to the DBs.

There are a few things we can do to decrease the size of uncompacted view indices.  Chief
among those is to put a lower bound on the size of a view index write, as reported by Henrik
Jensen last month (COUCHDB-700).  Cheers,

Adam

> -- 
> eric casteleijn
> https://code.launchpad.net/~thisfred
> Canonical Ltd.
> 

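The tradeoff discussed in this thread can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not CouchDB's implementation: `buildView` is a hypothetical stand-in for the view server, `emit` is passed as an argument rather than being the global that real map functions use, the sample documents are made up, and the JSON length is just a rough proxy for what ends up in the index.

```javascript
// Hypothetical sample documents; name and timestamp follow the map
// function quoted in the thread, text is invented for illustration.
const docs = [
  { _id: "a", name: "alice", timestamp: 1, text: "first message" },
  { _id: "b", name: "alice", timestamp: 2, text: "second message" },
];

// Hypothetical stand-in for the view server: runs a map function over
// every document and collects the emitted rows (no key sorting here).
function buildView(mapFn, documents) {
  const rows = [];
  for (const doc of documents) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows;
}

// Variant 1: the whole document is stored as the row value.
const mapWithDoc = (doc, emit) => emit([doc.name, doc.timestamp], doc);

// Variant 2: only the key is stored; the value is null.
const mapLean = (doc, emit) => emit([doc.name, doc.timestamp], null);

// Rough proxy for index size: characters of JSON held by the view rows.
const size = (rows) => JSON.stringify(rows).length;

const fat = buildView(mapWithDoc, docs);
const lean = buildView(mapLean, docs);

// Simulates ?include_docs=true: each row triggers an extra lookup of
// the document by id at query time.
const byId = new Map(docs.map((d) => [d._id, d]));
const leanWithDocs = lean.map((row) => ({ ...row, doc: byId.get(row.id) }));

console.log("rows with doc as value:", size(fat), "chars");
console.log("rows with null value:  ", size(lean), "chars");
console.log("latest doc via lookup: ", leanWithDocs[leanWithDocs.length - 1].doc);
```

The first variant pays with a larger index and answers a range query from the rows alone; the second keeps the index small and pays with one document lookup per row at query time, which is the cost Adam describes for large range queries.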