Subject: Re: batch=ok for bulk_docs and single doc implementation concerns
From: Adam Kocoloski <kocolosk@apache.org>
Date: Wed, 14 Apr 2010 08:23:45 -0400
To: dev@couchdb.apache.org

On Apr 14, 2010, at 7:59 AM, Matt Goodall wrote:

> Hi,
>
> Over in couchdb-python land someone wanted to use batch=ok when
> creating and updating documents, so we added support.
>
> I was semi-surprised to notice that _bulk_docs does not support
> batch=ok. I realise _bulk_docs essentially is a batch update but a
> _bulk_docs batch=ok would presumably allow CouchDB to buffer more in
> memory before writing to disk. What are your thoughts?

It's probably of limited utility.  If you're already batching on the
client side, you can achieve the same effect by sending in a larger
batch.  I'm not opposed to it per se, I just don't think it will help
with throughput all that much.
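
A rough sketch of the two client-side options, in Python since the
question came from couchdb-python land.  It only builds the requests
and sends nothing.  The helper names, the localhost URL and the chunk
size are made up for illustration and are not part of couchdb-python;
the only CouchDB-specific pieces are the ?batch=ok query option and the
{"docs": [...]} body that _bulk_docs accepts.

```python
import json
from urllib.parse import quote


def batch_ok_url(base_url, db, doc_id):
    """URL for a single-document PUT that asks for batch=ok."""
    return "%s/%s/%s?batch=ok" % (
        base_url.rstrip("/"), quote(db, safe=""), quote(doc_id, safe=""))


def bulk_docs_request(base_url, db, docs):
    """URL and JSON body for one POST to _bulk_docs."""
    url = "%s/%s/_bulk_docs" % (base_url.rstrip("/"), quote(db, safe=""))
    body = json.dumps({"docs": list(docs)}).encode("utf-8")
    return url, body


def chunked(docs, size):
    """Group docs into lists of at most `size` items, preserving order."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    docs = [{"_id": "doc%d" % i, "value": i} for i in range(5)]
    # Hypothetical chunk size; a larger value means fewer, bigger requests.
    for group in chunked(docs, 2):
        url, body = bulk_docs_request("http://localhost:5984", "mydb", group)
        print("POST", url, len(body), "bytes")
    print("PUT", batch_ok_url("http://localhost:5984", "mydb", "doc0"))
```

Raising the chunk size is the "larger batch" knob: the client decides
how many documents go into each request.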

>
> Now, this buffering is where the "implementation concerns" come in.
> According to the wiki:
>
> "There is a query option batch=ok which can be used to achieve higher
> throughput at the cost of lower guarantees. When a PUT (or a document
> POST as described below) is sent using this option, it is not
> immediately written to disk. Instead it is stored in memory on a
> per-user basis for a second or so (or the number of docs in memory
> reaches a certain point). After the threshold has passed, the docs are
> committed to disk."
>
> However, unless I'm missing something (quite likely ;-)), there is no
> "stored in memory on a per-user basis" or any check for when "the
> number of docs in memory reaches a certain point". All it seems to do
> is spawn a new process so the update happens when the Erlang scheduler
> gets around to it. In fact, I don't see any reference to the
> batch_save_interval and batch_save_size configuration options in the
> code.

The wiki describes the 0.10 implementation of batch=ok.  In 0.11 batch
mode takes advantage of the fact that couch_db_updater always merges all
waiting updates to a DB into a single write, and so doesn't bother with
the separate set of supervised processes accumulating documents.  In
effect the 0.11 batch=ok is "I'm not going to wait around, but save
this as soon as you get a chance".

This does change the performance characteristics quite a bit; in
particular, when the underlying disk is fast the new batch=ok behavior
will result in significantly larger uncompacted databases.
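
A toy model of that effect, in Python rather than Erlang.  This is not
couch_db_updater code; the ToyUpdater class and its methods are
invented here purely to illustrate "merge everything that is waiting
into one write".  When writes are slow, many updates pile up between
writes and get grouped together; when writes are fast, almost every
update ends up as its own write.

```python
from collections import deque


class ToyUpdater:
    """Hypothetical stand-in for an updater that groups waiting updates."""

    def __init__(self):
        self.waiting = deque()   # updates received but not yet written
        self.writes = []         # each entry is one simulated disk write

    def submit(self, doc):
        """Queue an update and return immediately (the batch=ok idea)."""
        self.waiting.append(doc)

    def run_once(self):
        """Write everything currently waiting as a single group."""
        if not self.waiting:
            return 0
        group = list(self.waiting)   # arrival order is preserved
        self.waiting.clear()
        self.writes.append(group)
        return len(group)


if __name__ == "__main__":
    docs = [{"_id": str(i)} for i in range(6)]

    slow = ToyUpdater()
    for doc in docs:          # slow disk: all six arrive before one write
        slow.submit(doc)
    slow.run_once()

    fast = ToyUpdater()
    for doc in docs:          # fast disk: a write finishes after each update
        fast.submit(doc)
        fast.run_once()

    print("slow disk writes:", len(slow.writes))
    print("fast disk writes:", len(fast.writes))
```

In the model the same six updates cost one write on the slow path and
six on the fast path.  More, smaller writes is the mechanism the toy
stands in for; how much extra file size each write costs in a real
database is not something this sketch measures.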

> Shouldn't batch=ok send the doc off to some background process that
> accumulates docs until either the batch interval or size threshold has
> been reached? This would also ensure that batch=ok updates are handled
> in the order they arrive, although I'm not sure if that matters given
> that the user has basically said they don't care if it succeeds or not
> by using batch=ok.

I think the document updates are still handled in the order in which
they were received.

>
> - Matt


Best, Adam