incubator-couchdb-dev mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: View Performance (was Re: The 1.0 Thread)
Date Sun, 05 Jul 2009 22:03:47 GMT
On Sun, Jul 5, 2009 at 11:52 AM, Scott Shumaker<sshumaker@gmail.com> wrote:
> Every document in our database is in a view.  We have a wide variety
> of different documents - but none of them constitute the majority of
> the docs in our database.  In a single design doc, we can't use view
> filtering - which means our performance is far worse (not to mention
> that we have nearly 100 views, so every view request will have to run
> through 100 javascript functions - some of which are quite expensive -
> and are used for offline (batch) processing only).
>

Perhaps you've got more than one application there. In that case you
could split up your views into a small handful of design docs. The
mechanics of the inter-process communication mean that grouping your
views uses less I/O, so the more views you can cram into each design
doc, the better, although offline batch stuff should be in its own
doc.
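
As a rough sketch of that split (not anything CouchDB ships), the
Python below groups view definitions into one design doc for the
interactive application and one for batch work. The view names, the
"group" labels and the map functions are all made up for illustration;
only the design doc layout (_id, language, views) is standard.

```python
import json

# Hypothetical view definitions. "group" names the design doc a view
# should live in: "app" for interactive use, "batch" for offline jobs.
VIEWS = {
    "restaurants_by_name": {
        "group": "app",
        "map": "function(doc) { if (doc.type == 'restaurant') emit(doc.name, null); }",
    },
    "bars_by_name": {
        "group": "app",
        "map": "function(doc) { if (doc.type == 'bar') emit(doc.name, null); }",
    },
    "nightly_stats": {
        "group": "batch",
        "map": "function(doc) { emit(doc.type, 1); }",
    },
}


def build_design_docs(views):
    """Return one design doc per group, each holding that group's views."""
    docs = {}
    for name, spec in sorted(views.items()):
        doc_id = "_design/" + spec["group"]
        doc = docs.setdefault(
            doc_id, {"_id": doc_id, "language": "javascript", "views": {}}
        )
        doc["views"][name] = {"map": spec["map"]}
    return [docs[doc_id] for doc_id in sorted(docs)]


design_docs = build_design_docs(VIEWS)
print(json.dumps(design_docs, indent=2))
```

Each resulting document could then be saved to the database like any
other doc; the expensive batch views no longer run when the "app"
views are rebuilt.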

Unless your writes are coming so fast the view engine can't possibly
keep up, you might do well to use a cron job that queries the views
periodically to trigger index generation, so that users aren't faced
with a lot of indexing to wait for. Putting the view indexes on a
different physical disk will make a very big difference in overall
performance.
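
A minimal sketch of such a job, in Python, assuming a server on
127.0.0.1:5984 and made-up database, design doc and view names.
Querying any one view in a design doc should bring that doc's whole
index up to date, and limit=0 is used here on the assumption that it
returns no rows while still waiting for the index.

```python
import urllib.request
from urllib.parse import quote

# Assumed values for illustration only; adjust to your deployment.
COUCH_URL = "http://127.0.0.1:5984"
DATABASE = "mydb"
# Hypothetical names: one view per design doc is enough to warm it.
WARM_VIEWS = {"app": "restaurants_by_name", "batch": "nightly_stats"}


def view_urls(base, db, views):
    """Build one view URL per design doc, asking for zero rows."""
    return [
        "%s/%s/_design/%s/_view/%s?limit=0"
        % (
            base.rstrip("/"),
            quote(db, safe=""),
            quote(ddoc, safe=""),
            quote(view, safe=""),
        )
        for ddoc, view in sorted(views.items())
    ]


def warm(urls, timeout=3600):
    """Request each URL in turn; each call blocks while the index updates."""
    for url in urls:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # discard the body, only the side effect matters
```

A script that calls warm(view_urls(COUCH_URL, DATABASE, WARM_VIEWS))
can then be scheduled from cron every few minutes.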

>
> It may very well be that the erlang view engine will help - since it
> will cut down on the JSON -> erlang serialization, not to mention have
> a far more efficient transport protocol.

I think Erlang views will make a big difference for you because of the
size of your objects and the possibility to avoid serialization
overhead. We've clocked them at 2-10x faster which makes a difference.

> That said, there is almost certainly
> also a far more efficient communication protocol for talking to
> couchjs than just communicating over stdin and stdout - not to mention
> some ways to avoid the JSON -> erlang cost.  :)

Patches are definitely welcome.

You could get more view performance by running CouchDB on a
CouchDB-Lounge cluster.

http://code.google.com/p/couchdb-lounge/



>
> On Sun, Jul 5, 2009 at 11:02 AM, Chris Anderson<jchris@apache.org> wrote:
>> On Sat, Jul 4, 2009 at 3:26 PM, Scott Shumaker<sshumaker@gmail.com> wrote:
>>> Ok - here's some more detailed stats:
>>>
>>> Note that this is couch-0.9.0 with hipe enabled and the filter patch,
>>> on my macbook pro.
>>>
>>> ~53K db documents, ~1500 are type:restaurant
>>>
>>
>> Use of the design doc filter for view performance should be considered
>> a smell. Let me see if I understand the scenario:
>>
>> You have a few restaurant docs in a big database, and you've got views
>> to find them.
>>
>> Do you have other views? They should be consolidated into a single
>> design document when possible.
>>
>> Are there documents in your database that are not in views at all?
>>
>> If you have say 1500 restaurants and 50k log entries in the same
>> database, use two databases. If you have 1500 restaurants and 1500
>> coffee shops and 1500 bars then you should consolidate your views into
>> one design doc. Once you've properly relaxed, your problems should be
>> less acute.
>>
>> Thanks for the numbers. We think getting an Erlang view engine
>> installed will make a difference, maybe even with the couchjs stuff,
>> as we get more concurrent.
>>
>> Chris
>>
>>
>>> We tested using Brian's bork.rb:
>>>
>>> no filtering:
>>>
>>> bork.rb - returning no values = 68s
>>> bork.rb - returning 5 values per map(doc) call = 200s
>>> couchjs - returning no values = 93s
>>> couchjs - one doc emitted per type:restaurant = 104s
>>>
>>> w/ filtering: (select ~1500 docs out of 53K)
>>>
>>> couchjs - returning no values = 8.9s
>>> couchjs - one doc emitted per type:restaurant = 19s
>>>
>>>
>>> Couple of notes:
>>>
>>> 53K docs apparently take 68s to be converted to JSON, and received by
>>> the dummy server (with no docs emitted) - or about 780 docs/second.
>>> couchjs is slower than bork.rb in this case (unsurprising - bork.rb is
>>> not really parsing the data)
>>> filtering on the couch side is an enormous win for our test case.
>>>
>>> K/V inserts - (5*53K in (200-68)s) = ~2000 per second
>>>
>>> This is a pretty big difference from Brian's results (8000/sec),
>>> although we're dealing with many more docs, and without comparing
>>> hardware specs, it's difficult to draw conclusions.
>>>
>>> On Sat, Jul 4, 2009 at 11:39 AM, Scott Shumaker<sshumaker@gmail.com> wrote:
>>>> Compiling with HiPE didn't seem to make any difference in performance.  :(
>>>>
>>>> On Thu, Jul 2, 2009 at 4:17 PM, Scott Shumaker<sshumaker@gmail.com> wrote:
>>>>> I'll try that out tomorrow and post the results here.
>>>>>
>>>>> On Thu, Jul 2, 2009 at 3:01 PM, Paul Davis<paul.joseph.davis@gmail.com> wrote:
>>>>>> On Thu, Jul 2, 2009 at 5:50 PM, Scott Shumaker<sshumaker@gmail.com> wrote:
>>>>>>> One question, though: Why are the emitted view results stored as
>>>>>>> erlang terms, as opposed to storing the JSON returned from the view
>>>>>>> server - which is what you'll be serving to the clients anyway?
>>>>>>>
>>>>>>> If you skipped the reverse json->erlang encoding, and additionally
>>>>>>> stored a cached json copy of each document alongside the document
>>>>>>> whenever a document in couchdb was created/updated (which you could
>>>>>>> incrementally generate in a separate erlang process so you don't have
>>>>>>> to slow down write performance) - and just pass this json copy to the
>>>>>>> view, you could basically eliminate the json->erlang conversion
>>>>>>> overhead entirely (since it would only be done asynchronously).
>>>>>>>
>>>>>>> Even if you need to store the emitted view results back into erlang,
>>>>>>> you could have a special optimization case for emitting (key, doc) -
>>>>>>> because you already have the document as both erlang/json (assuming
>>>>>>> you were storing cached json copies).  And include_docs would get
>>>>>>> faster since you wouldn't need to do the json conversion there either.
>>>>>>>
>>>>>>> Just a thought.
>>>>>>>
>>>>>>
>>>>>> Premature optimization is the root of all evil? Have you tried
>>>>>> compiling CouchDB with HiPE enabled? I'm inclined to agree with you
>>>>>> that the large JSON values are probably a significant cause here.
>>>>>> Assuming your Erlang is HiPE enabled you can do something like this to
>>>>>> compile CouchDB:
>>>>>>
>>>>>>    $ ./bootstrap
>>>>>>    $ ERLC_FLAGS="+native +inline +inline_list_funcs" ./configure
>>>>>>    $ make
>>>>>>    $ sudo make install
>>>>>>
>>>>>>
>>>>>>> Scott
>>>>>>>
>>>>>>> On Thu, Jul 2, 2009 at 2:42 PM, Scott Shumaker<sshumaker@gmail.com> wrote:
>>>>>>>> I should mention that we tend to emit (doc._id, doc) in our views - as
>>>>>>>> opposed to doc._id, null and using include_docs - because we found
>>>>>>>> that doc._id,null gave us a 30% speedup on building the views, but
>>>>>>>> cost us about the same on each additional hit to the view.
>>>>>>>>
>>>>>>>> Scott
>>>>>>>>
>>>>>>>> On Thu, Jul 2, 2009 at 2:15 PM, Scott Shumaker<sshumaker@gmail.com> wrote:
>>>>>>>>> We see times that are considerably worse.  We mostly have maps - very
>>>>>>>>> few reduces.  We have 40k objects, about 25 design docs, and 90 views.
>>>>>>>>> Although we're about to change the code to auto-generate the design
>>>>>>>>> docs based on the view filters used (re: view filter patch) - see if
>>>>>>>>> that helps.
>>>>>>>>>
>>>>>>>>> Maybe it's because we have larger objects - but re-indexing a typical
>>>>>>>>> new view takes > 5 minutes (with view filtering off).  Some are worse.
>>>>>>>>> With view filtering on some can be quite fast - some views finish in
>>>>>>>>> like 10 seconds.  Interestingly, reindexing all views takes about an
>>>>>>>>> hour - with or without view filtering.  I'm guessing that a
>>>>>>>>> substantial part of the bottleneck is erlang -> json serialization.
>>>>>>>>> Many of our objects are heavily nested structures and exceed 10k in
>>>>>>>>> size.  One other note - when we tried dropping in the optimized
>>>>>>>>> 'main.js' posted on the mailing list, we saw an overall 20% speedup.
>>>>>>>>> Unfortunately, it wasn't compatible with the authentication stuff, and
>>>>>>>>> the deployment was a bit wacky, so we're holding off on that right
>>>>>>>>> now.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 2, 2009 at 11:30 AM, Damien Katz<damien@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On Jul 2, 2009, at 1:55 PM, Paul Davis wrote:
>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 2, 2009 at 1:29 PM, Damien Katz<damien@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 2, 2009, at 1:16 PM, Jason Davies wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 2 Jul 2009, at 15:38, Brian Candler wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> For some fruit that was so low-hanging that I nearly stubbed my toe on it,
>>>>>>>>>>>>>> see https://issues.apache.org/jira/browse/COUCHDB-399
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nice work!  I'd be interested to see what kind of performance increase
>>>>>>>>>>>>> we get from Spidermonkey 1.8.1, which comes with native JSON
>>>>>>>>>>>>> parsing/encoding.  See here for details:
>>>>>>>>>>>>> https://developer.mozilla.org/En/Using_native_JSON .
>>>>>>>>>>>>>
>>>>>>>>>>>>> Rumour has it 1.8.1 will be released any time soon (TM)
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure the new engine is such a no-brainer. One thing about the new
>>>>>>>>>>>> generation of JS VMs is we've seen greatly increased memory usage with
>>>>>>>>>>>> earlier versions. Also the startup times might be longer, or shorter.
>>>>>>>>>>>>
>>>>>>>>>>>> Though I wonder if this can be improved by forking a JS process rather
>>>>>>>>>>>> than spawning a new process.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Memory usage is a definite concern. I'm not sure I follow why startup
>>>>>>>>>>> times would be important though. Am I missing something?
>>>>>>>>>>
>>>>>>>>>> Start up time isn't a huge concern, but it is something to consider. On
>>>>>>>>>> a heavily loaded system, scripts that normally work might start to time out,
>>>>>>>>>> requiring restarting the process. Lots of restarts may start to eat lots of
>>>>>>>>>> CPU and memory IO.
>>>>>>>>>>
>>>>>>>>>> -Damien
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -Damien
>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jason Davies
>>>>>>>>>>>>>
>>>>>>>>>>>>> www.jasondavies.com
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Chris Anderson
>> http://jchrisa.net
>> http://couch.io
>>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
