incubator-couchdb-user mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: A simple comparative benchmark and look at include_docs performance.
Date Tue, 20 Apr 2010 20:05:33 GMT
On Apr 20, 2010, at 3:51 PM, Chris Stockton wrote:

> Hello,
> 
> On Tue, Apr 20, 2010 at 10:58 AM, Adam Kocoloski <kocolosk@apache.org> wrote:
>> Hi Chris, for the type of access pattern in your benchmark I generally recommend
>> using emit(doc.model, doc) and avoiding include_docs=true.  include_docs introduces
>> an extra lookup back in the DB for every row of your view.  If you emit the document
>> into the view index, the index will get large, but streaming requests such as yours
>> can be accomplished with a minimum of disk IO.
> 
> We have tried this approach and it was indeed faster; however, we wound
> up with what I remember to be a view file of over 19 GB. For a 300 MB
> database this trade-off did not seem reasonable; although disk is cheap
> in many cases, we found the bloat to be unacceptable. Do you know of a
> way to limit the size of the view when including the doc? Additionally,
> may I ask if include_docs=true has potential room for optimization?

You should make sure to compact the view index; it doesn't take long and can offer some
huge space savings (as well as better query-time performance).  After compaction you should
expect the view index to be comparable in size to the DB.
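
For example, with a hypothetical database "mydb" and design document "app", view
compaction and a quick size check look roughly like this:

    # trigger compaction of the view index for design doc "app"
    curl -X POST -H "Content-Type: application/json" \
        http://127.0.0.1:5984/mydb/_compact/app

    # check the on-disk size of the index afterwards
    curl http://127.0.0.1:5984/mydb/_design/app/_info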

include_docs=true uses the same code path as a single-document GET request.  I'm not aware
of any extreme hot spots there.  Of course we're always keeping an eye on performance and
looking out for optimizations, especially as CouchDB stabilizes and heads towards 1.0.
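
To make the two access patterns concrete, here's a rough sketch (the database,
design doc, and view names are made up):

    # 1) emit a small value and fetch docs at query time; each row
    #    costs an extra lookup back into the DB
    #    map: function(doc) { emit(doc.model, null); }
    curl 'http://127.0.0.1:5984/mydb/_design/app/_view/by_model?include_docs=true'

    # 2) emit the whole doc into the index; larger index on disk, but
    #    the query streams straight out of the view file
    #    map: function(doc) { emit(doc.model, doc); }
    curl 'http://127.0.0.1:5984/mydb/_design/app/_view/by_model_docs'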

>> On the other hand, your sar report shows negligible iowait, so perhaps that's not
>> your immediate problem.  It may be the case that you're CPU-limited in the (pure
>> Erlang) JSON encoder, although I would've expected JSON encoding CPU usage to scale
>> with network traffic.
> 
> It would surprise me if 13 MB of JSON encoding could cause such spikes
> in CPU. I also expected network traffic to scale with our CPU usage.
> Have you seen issues in this area before? At first thought I would
> expect the encoding stage to be one of the lighter parts of the
> request, given the simple nature of JSON.

A thought -- do your documents have a very large number of edits?  I _have_ seen heavy CPU
utilization when dealing with documents containing 100+ revisions.  Even after those revisions
have been compacted away, the revision tree hangs around and is processed for every single-document
(and include_docs) request.  I've profiled couch_key_tree as a significant bottleneck in that
case.
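
You can check how deep a document's history is with something like this (database
and document names hypothetical):

    # _revisions.start is the current revision number, i.e. the number
    # of edits this branch of the document has seen
    curl 'http://127.0.0.1:5984/mydb/somedoc?revs=true'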

If you do have a large number of revisions and you aren't too worried about spurious
conflicts on replication, you can lower the _revs_limit setting for your DB to trim that
history down a bit.  The default value is 1000 revisions.
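
The setting is per-database and can be read and changed over HTTP, for example:

    # read the current limit (default 1000)
    curl http://127.0.0.1:5984/mydb/_revs_limit

    # keep only the last 50 revisions' worth of history
    curl -X PUT -d '50' http://127.0.0.1:5984/mydb/_revs_limit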

>> You might try running eprof while you do this test.  It's quite heavyweight and will
>> slow your system down.  If you start couchdb with the -i flag you can get an Erlang
>> shell and execute
>> <snip>
> 
> This was good information and I will look into profiling with Erlang.
> May I ask if any effort is currently being put into performance and
> optimization for CouchDB?

Yes, all the time, particularly as the codebase stabilizes.  I've submitted performance-related
patches for DB compaction and view key collation in the past week, for instance.

> I am also very interested in any reading on
> large-scale CouchDB deployments that is not so high-level (i.e.
> hardware specs, use cases, etc.).

Ah, I'm not aware of many low-level case studies like that.  We should get around to writing
up some of our accumulated experience at cloudant.com.  Cheers,

Adam

> 
> Kind Regards,
> 
> -Chris

