couchdb-dev mailing list archives

From Scott Shumaker <>
Subject Re: View Performance (was Re: The 1.0 Thread)
Date Sun, 05 Jul 2009 18:52:09 GMT
Every document in our database is in a view.  We have a wide variety
of different document types, but none of them constitutes the majority of
the docs in our database.  With a single design doc, we can't use view
filtering - which means our performance is far worse (not to mention
that we have nearly 100 views, so every view request will have to run
through 100 javascript functions - some of which are quite expensive
and are used only for offline (batch) processing).

"Once you've properly relaxed your problems should be less acute."

I'm not sure what you mean by that.  Our objects are quite different -
most of the time it doesn't make sense to return them in the same view
(e.g. users, restaurants, and comments).  We considered splitting into
multiple databases, but that's a lot of work: the code often has just an
id (no type) and doesn't necessarily know which database an object would
reside in, so we would probably have to recreate all of our ids to add a
db identifier - including patching all of the references contained
within objects.  The view filter patch takes advantage of this (since
most views only touch a small portion of our documents) without
requiring refactoring, and without preventing us from returning objects
of different types in the same view (which we do fairly frequently).
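
To illustrate the pattern being discussed: a minimal, hypothetical sketch
of a per-type map function, with the view server's emit() stubbed out so
it runs standalone.  Without filtering, CouchDB still serializes every
doc in the database and runs this function against it, even though only
the restaurant docs ever produce rows (the sample docs and field names
here are invented for illustration):

```javascript
// Minimal stub of the view server's emit(), collecting rows for illustration.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// A per-type map function: only restaurant docs produce rows, but without
// view filtering this function is still invoked for every doc in the db.
function mapRestaurants(doc) {
  if (doc.type === "restaurant") {
    emit(doc._id, doc);
  }
}

// Hypothetical sample documents of mixed types, as in the database described.
[
  { _id: "r1", type: "restaurant", name: "Chez Panisse" },
  { _id: "u1", type: "user", name: "scott" },
  { _id: "c1", type: "comment", text: "great" }
].forEach(mapRestaurants);

console.log(rows.length); // only the restaurant doc emitted a row
```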

It may very well be that the erlang view engine will help - since it
will cut down on the JSON -> erlang serialization, not to mention use
a far more efficient transport protocol.  Although to this point we've
been able to write our views in javascript, which has been convenient
because our objects map directly to Javascript classes on the client
(plus the added benefit of keeping down the number of languages in our
codebase, vis-à-vis erlang).   That said, there is almost certainly
also a far more efficient communication protocol for talking to
couchjs than just communicating over stdin and stdout - not to mention
some ways to avoid the JSON -> erlang cost.  :)

On Sun, Jul 5, 2009 at 11:02 AM, Chris Anderson<> wrote:
> On Sat, Jul 4, 2009 at 3:26 PM, Scott Shumaker<> wrote:
>> Ok - here's some more detailed stats:
>> Note that this is couch-0.9.0 with hipe enabled and the filter patch,
>> on my macbook pro.
>> ~53K db documents, ~1500 are type:restaurant
> Use of the design doc filter for view performance should be considered
> a smell. Let me see if I understand the scenario:
> You have a few restaurant docs in a big database, and you've got views
> to find them.
> Do you have other views? They should be consolidated into a single
> design document when possible.
> Are there documents in your database that are not in views at all?
> If you have say 1500 restaurants and 50k log entries in the same
> database, use two databases. If you have 1500 restaurants and 1500
> coffee shops and 1500 bars then you should consolidate your views into
> one design doc. Once you've properly relaxed your problems should be
> less acute.
> Thanks for the numbers. We think getting an Erlang view engine
> installed will make a difference, maybe even with the couchjs stuff,
> as we get more concurrent.
> Chris
>> We tested using Brian's bork.rb:
>> no filtering:
>> bork.rb - returning no values = 68s
>> bork.rb - returning 5 values per map(doc) call = 200s
>> couchjs - returning no values = 93s
>> couchjs - one doc emitted per type:restaurant = 104s
>> w/ filtering: (select ~1500 docs out of 53K)
>> couchjs - returning no values = 8.9s
>> couchjs - one doc emitted per type:restaurant = 19s
>> Couple of notes:
>> 53K docs apparently take 68s to be converted to JSON, and received by
>> the dummy server (with no docs emitted) - or about 780 docs/second.
>> couchjs is slower than bork.rb in this case (unsurprising - bork.rb is
>> not really parsing the data)
>> filtering on the couch side is an enormous win for our test case.
>> K/V inserts - (5*53K in (200-68)s) = ~2000 per second
>> This is a pretty big difference from Brian's results (8000/sec),
>> although we're dealing with many more docs, and without comparing
>> hardware specs, it's difficult to draw conclusions.
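
As a quick check on the throughput figures quoted above (the numbers
are taken directly from the stats in the message; only the arithmetic
is added here):

```javascript
// Throughput arithmetic from the benchmark numbers quoted above.
const docs = 53000;
const noEmitSeconds = 68;    // bork.rb, no values emitted
const fiveEmitSeconds = 200; // bork.rb, 5 values per map(doc) call

// Doc serialization/transfer rate with nothing emitted: ~780 docs/second.
console.log(Math.round(docs / noEmitSeconds)); // 779

// K/V insert rate: 5 emits per doc over the extra time spent emitting.
console.log(Math.round((5 * docs) / (fiveEmitSeconds - noEmitSeconds))); // 2008
```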
>> On Sat, Jul 4, 2009 at 11:39 AM, Scott Shumaker<> wrote:
>>> Compiling with HiPE didn't seem to make any difference in performance.  :(
>>> On Thu, Jul 2, 2009 at 4:17 PM, Scott Shumaker<> wrote:
>>>> I'll try that out tomorrow and post the results here.
>>>> On Thu, Jul 2, 2009 at 3:01 PM, Paul Davis<> wrote:
>>>>> On Thu, Jul 2, 2009 at 5:50 PM, Scott Shumaker<> wrote:
>>>>>> One question, though: Why are the emitted view results stored as
>>>>>> erlang terms, as opposed to storing the JSON returned from the view
>>>>>> server - which is what you'll be serving to the clients anyway?
>>>>>> If you skipped the reverse json->erlang encoding, and additionally
>>>>>> stored a cached json copy of each document alongside the document
>>>>>> whenever a document in couchdb was created/updated (which you could
>>>>>> incrementally generate in a separate erlang process so you don't
>>>>>> slow down write performance) - and just passed this json copy to the
>>>>>> view, you could basically eliminate the json->erlang conversion
>>>>>> overhead entirely (since it would only be done asynchronously).
>>>>>> Even if you need to store the emitted view results back into erlang,
>>>>>> you could have a special optimization case for emitting (key, doc)
>>>>>> because you already have the document as both erlang/json (assuming
>>>>>> you were storing cached json copies).  And include_docs would get
>>>>>> faster since you wouldn't need to do the json conversion there either.
>>>>>> Just a thought.
>>>>> Premature optimization is the root of all evil? Have you tried
>>>>> compiling CouchDB with HiPE enabled? I'm inclined to agree with you
>>>>> that the large JSON values are probably a significant cause here.
>>>>> Assuming your Erlang is HiPE enabled you can do something like this to
>>>>> compile CouchDB:
>>>>>    $ ./bootstrap
>>>>>    $ ERLC_FLAGS="+native +inline +inline_list_funcs" ./configure
>>>>>    $ make
>>>>>    $ sudo make install
>>>>>> Scott
>>>>>> On Thu, Jul 2, 2009 at 2:42 PM, Scott Shumaker<> wrote:
>>>>>>> I should mention that we tend to emit (doc._id, doc) in our views
>>>>>>> - as opposed to doc._id, null and using include_docs - because we
>>>>>>> found that doc._id, null gave us a 30% speedup on building the views,
>>>>>>> but cost us about the same on each additional hit to the view.
>>>>>>> Scott
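
[The trade-off described in the quoted message can be sketched as two
map-function variants.  emit() is stubbed here so the snippet runs
standalone; include_docs itself is a query-time option on the view URL
and is not shown.  The sample doc is invented for illustration.]

```javascript
// Stub of the view server's emit(), collecting rows for illustration.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Variant 1: emit the whole doc as the value. The view index stores a
// full copy of each doc, so reads need no extra lookup per row, but the
// index is larger and slower to build.
function mapWithDoc(doc) {
  emit(doc._id, doc);
}

// Variant 2: emit null and query with ?include_docs=true instead. The
// index stays small and builds faster, but each read pays an extra doc
// fetch per row.
function mapWithNull(doc) {
  emit(doc._id, null);
}

var doc = { _id: "r1", type: "restaurant", name: "Chez Panisse" };
mapWithDoc(doc);
mapWithNull(doc);
console.log(rows[0].value !== null, rows[1].value === null);
```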
>>>>>>> On Thu, Jul 2, 2009 at 2:15 PM, Scott Shumaker<> wrote:
>>>>>>>> We see times that are considerably worse.  We mostly have maps - very
>>>>>>>> few reduces.  We have 40k objects, about 25 design docs, and 90 views.
>>>>>>>> Although we're about to change the code to auto-generate the design
>>>>>>>> docs based on the view filters used (re: view filter patch) - see if
>>>>>>>> that helps.
>>>>>>>> Maybe it's because we have larger objects - but re-indexing a typical
>>>>>>>> new view takes > 5 minutes (with view filtering off).  Some are worse.
>>>>>>>> With view filtering on some can be quite fast - some views finish in
>>>>>>>> like 10 seconds.  Interestingly, reindexing all views takes about an
>>>>>>>> hour - with or without view filtering.  I'm guessing that a
>>>>>>>> substantial part of the bottleneck is erlang -> json serialization.
>>>>>>>> Many of our objects are heavily nested structures and exceed 10k in
>>>>>>>> size.  One other note - when we tried dropping in the optimized
>>>>>>>> 'main.js' posted on the mailing list, we saw an overall 20% speedup.
>>>>>>>> Unfortunately, it wasn't compatible with the authentication stuff, and
>>>>>>>> the deployment was a bit wacky, so we're holding off on that for now.
>>>>>>>> On Thu, Jul 2, 2009 at 11:30 AM, Damien Katz<> wrote:
>>>>>>>>> On Jul 2, 2009, at 1:55 PM, Paul Davis wrote:
>>>>>>>>>> On Thu, Jul 2, 2009 at 1:29 PM, Damien Katz<> wrote:
>>>>>>>>>>> On Jul 2, 2009, at 1:16 PM, Jason Davies wrote:
>>>>>>>>>>>> On 2 Jul 2009, at 15:38, Brian Candler wrote:
>>>>>>>>>>>>> For some fruit that was so low-hanging that I nearly stubbed my
>>>>>>>>>>>>> toe on it, see
>>>>>>>>>>>> Nice work!  I'd be interested to see what kind of performance
>>>>>>>>>>>> increase we get from Spidermonkey 1.8.1, which comes with native
>>>>>>>>>>>> JSON parsing/encoding.  See here for details:
>>>>>>>>>>>> .
>>>>>>>>>>>> Rumour has it 1.8.1 will be released any time soon (TM)
>>>>>>>>>>> I'm not sure the new engine is such a no-brainer.  One thing about
>>>>>>>>>>> the new generation of JS VMs is we've seen greatly increased memory
>>>>>>>>>>> usage with earlier versions. Also the startup times might be longer,
>>>>>>>>>>> or shorter.  Though I wonder if this can be improved by forking a JS
>>>>>>>>>>> process rather than spawning a new process.
>>>>>>>>>> Memory usage is a definite concern. I'm not sure I follow why startup
>>>>>>>>>> times would be important though. Am I missing something?
>>>>>>>>> Start up time isn't a huge concern, but it is something to consider.
>>>>>>>>> On a heavily loaded system, scripts that normally work might start to
>>>>>>>>> time out, requiring restarting the process. Lots of restarts may start
>>>>>>>>> to eat lots of cpu and memory IO.
>>>>>>>>> -Damien
>>>>>>>>>>> -Damien
>>>>>>>>>>>> --
>>>>>>>>>>>> Jason Davies
> --
> Chris Anderson
