couchdb-dev mailing list archives

From Matt Goodall <matt.good...@gmail.com>
Subject Re: batch=ok for bulk_docs and single doc implementation concerns
Date Wed, 14 Apr 2010 13:38:06 GMT
On 14 April 2010 13:23, Adam Kocoloski <kocolosk@apache.org> wrote:
> On Apr 14, 2010, at 7:59 AM, Matt Goodall wrote:
>
>> Hi,
>>
>> Over in couchdb-python land someone wanted to use batch=ok when
>> creating and updating documents, so we added support.
>>
>> I was semi-surprised to notice that _bulk_docs does not support
>> batch=ok. I realise _bulk_docs essentially is a batch update but a
>> _bulk_docs batch=ok would presumably allow CouchDB to buffer more in
>> memory before writing to disk. What are your thoughts?
>
> It's probably of limited utility.  If you're already batching on the client side, you
> can achieve the same effect by sending in a larger batch.  I'm not opposed to it per se,
> I just don't think it will help with throughput all that much.

:nod: given the new behaviour I'm inclined to agree.
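
For reference, the two request shapes being compared can be sketched as
follows. This is only an illustration, not couchdb-python code: the
helper names and BASE_URL are made up for the example, and the requests
are built but never sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5984"  # assumption: a local CouchDB instance


def batch_put_request(db, doc_id, doc):
    """Build (but do not send) a PUT for one document with batch=ok."""
    query = urlencode({"batch": "ok"})
    url = "{}/{}/{}?{}".format(BASE_URL, quote(db, safe=""), quote(doc_id, safe=""), query)
    body = json.dumps(doc).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="PUT")


def bulk_docs_request(db, docs):
    """Build (but do not send) one POST to _bulk_docs with a client-side batch."""
    url = "{}/{}/_bulk_docs".format(BASE_URL, quote(db, safe=""))
    body = json.dumps({"docs": list(docs)}).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")
```

A client that already collects documents can pass a longer list to
bulk_docs_request instead of issuing many batch=ok PUTs, which is the
"larger batch" Adam refers to.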

>
>>
>> Now, this buffering is where the "implementation concerns" come in.
>> According to the wiki:
>>
>> "There is a query option batch=ok which can be used to achieve higher
>> throughput at the cost of lower guarantees. When a PUT (or a document
>> POST as described below) is sent using this option, it is not
>> immediately written to disk. Instead it is stored in memory on a
>> per-user basis for a second or so (or the number of docs in memory
>> reaches a certain point). After the threshold has passed, the docs are
>> committed to disk."
>>
>> However, unless I'm missing something (quite likely ;-)), there is no
>> "stored in memory on a per-user basis" or any check for when "the
>> number of docs in memory reaches a certain point". All it seems to do
>> is spawn a new process so the update happens when the Erlang scheduler
>> gets around to it. In fact, I don't see any reference to the
>> batch_save_interval and batch_save_size configuration options in the
>> code.
>
> The wiki describes the 0.10 implementation of batch=ok.  In 0.11 batch mode takes advantage
> of the fact that couch_db_updater always merges all waiting updates to a DB into a single
> write, and so doesn't bother with the separate set of supervised processes accumulating documents.
> In effect the 0.11 batch=ok is "I'm not going to wait around, but save this as soon as you
> get a chance".

Ah, I didn't dig far enough into the code to see that happening.

So, purely for my understanding, it's now simplified to a delayed
commit that happens at most 1000ms after normal changes are received.
Anything that causes the commit to happen earlier cancels the pending
commit.
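
That understanding can be written down as a toy model. It is a sketch of
the behaviour as described in this paragraph, not code taken from
CouchDB; the class and method names are invented, and time is passed in
explicitly so the model stays deterministic.

```python
class DelayedCommitSketch:
    """Toy model: a commit is scheduled at most `delay_ms` after an update."""

    def __init__(self, delay_ms=1000):
        self.delay_ms = delay_ms
        self.pending_deadline = None  # when the scheduled commit should fire
        self.commits = []             # times at which commits happened

    def update(self, now_ms):
        # Only the first uncommitted update starts the timer; later ones ride along.
        if self.pending_deadline is None:
            self.pending_deadline = now_ms + self.delay_ms

    def commit_now(self, now_ms):
        # A commit that happens earlier cancels the pending delayed commit.
        self.commits.append(now_ms)
        self.pending_deadline = None

    def tick(self, now_ms):
        # Fire the scheduled commit once its deadline has been reached.
        if self.pending_deadline is not None and now_ms >= self.pending_deadline:
            self.commit_now(self.pending_deadline)
```

Calling update() twice within the delay window yields one commit, and
calling commit_now() before the deadline leaves nothing for tick() to do.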

Does that mean that batch=ok with delayed_commits=false is meaningless?

Anyway, it sounds like the two batch_save config options should be
removed from etc/couchdb/default.ini.tpl.in.

>
> This does change the performance characteristics quite a bit; in particular, when the
> underlying disk is fast the new batch=ok behavior will result in significantly larger uncompacted
> databases.

Agh, this suggests I didn't understand the updater's behaviour. A large
uncompacted database normally means lots of small additions to the
database file. How does fast disk speed affect that?
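
One possible reading, offered as an assumption rather than anything
confirmed in this thread: if each write takes whatever updates are
already waiting, a faster disk leaves fewer updates waiting per write,
so the same documents end up in more, smaller additions. The function
below is a made-up simulation of that idea, not couch_db_updater itself.

```python
def count_writes(arrival_times, write_duration):
    """Count writes when each write takes every update already waiting.

    arrival_times: sorted times at which single-document updates arrive.
    write_duration: how long one merged write keeps the updater busy.
    """
    writes = 0
    i = 0
    free_at = 0.0  # time at which the updater is next idle
    n = len(arrival_times)
    while i < n:
        # The updater starts once it is idle and at least one update is waiting.
        start = max(free_at, arrival_times[i])
        # Every update that has arrived by `start` goes into this one write.
        while i < n and arrival_times[i] <= start:
            i += 1
        writes += 1
        free_at = start + write_duration
    return writes
```

With updates arriving at a steady rate, a short write_duration gives
roughly one write per document, while a long one groups many documents
into each write.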

>
>> Shouldn't batch=ok send the doc off to some background process that
>> accumulates docs until either the batch interval or size threshold has
>> been reached? This would also ensure that batch=ok updates are handled
>> in the order they arrive, although I'm not sure if that matters given
>> that the user has basically said they don't care if it succeeds or not
>> by using batch=ok.
>
> I think the document updates are still handled in the order in which they were received.
>
>>
>> - Matt
>
>
> Best, Adam
