Date: Wed, 5 Aug 2009 11:54:48 +0100
From: Brian Candler <B.Candler@pobox.com>
To: "Jan Lehnardt (JIRA)" <jira@apache.org>
Cc: dev@couchdb.apache.org
Subject: Re: [jira] Commented: (COUCHDB-449) Turn off delayed commits by
 default
Message-ID: <20090805105448.GA9881@uk.tiscali.com>
References: <583082672.1249467074859.JavaMail.jira@brutus>
In-Reply-To: <583082672.1249467074859.JavaMail.jira@brutus>
User-Agent: Mutt/1.5.17+20080114 (2008-01-14)

On Wed, Aug 05, 2009 at 03:11:14AM -0700, Jan Lehnardt (JIRA) wrote:
> [delayed_commits]
> dbname = true
> dbname2  = false
> ...
> ...
> 
> so you can have a "safe" db for your app and a "fast" db for, say, logging.

Or perhaps you could set a different periodic flush interval for each
database, with 0 equivalent to no delayed commit.

For me, the specific question is: what guarantees does CouchDB give clients
about the safety of their data, and at what point? For example, what can a
client assume once it has received an HTTP response?

There are at least three different scenarios that I'm aware of at the
moment.
1. client supplies 'batch=ok' URL parameter
2. client supplies no special parameters
3. client supplies 'X-Couch-Full-Commit: true' header

From the client's perspective, I can see no difference between (1) and (2).
After receiving an HTTP response, the data is likely to make it to disk at
some time in the future, but it could be lost if the plug is pulled in the
next few seconds.

In case (3), the document is guaranteed to be on disk after the HTTP
response is returned [as long as drive internal write cache is disabled].
This is equivalent to "QOS level 1" in the MQTT protocol:
http://publib.boulder.ibm.com/infocenter/wmbhelp/v6r0m0/index.jsp?topic=/com.ibm.etools.mft.doc/ac10850_.htm

However, it also forces writes of everything received up to this point, so
it's very inefficient if you are doing lots of writes with this header on.
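
For reference, the three cases differ only in how the request is built. Here
is a rough sketch in Python that constructs (but does not send) each one. The
server address, database name and helper name are my own placeholders; only
the 'batch=ok' parameter and the X-Couch-Full-Commit header come from the
discussion above.

```python
import json
import urllib.request

BASE = "http://localhost:5984/mydb"  # assumed server and database name


def build_put(doc_id, doc, batch=False, full_commit=False):
    """Build (but do not send) a PUT request for one document."""
    url = f"{BASE}/{doc_id}"
    if batch:
        url += "?batch=ok"                       # case 1
    headers = {"Content-Type": "application/json"}
    if full_commit:
        headers["X-Couch-Full-Commit"] = "true"  # case 3
    return urllib.request.Request(
        url, data=json.dumps(doc).encode("utf-8"), headers=headers, method="PUT"
    )


case1 = build_put("doc1", {"v": 1}, batch=True)        # batch=ok
case2 = build_put("doc1", {"v": 1})                    # no special parameters
case3 = build_put("doc1", {"v": 1}, full_commit=True)  # full commit header
```

Passing any of these to urllib.request.urlopen() would send it to a running
server.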

Sometimes, you don't require data to be written to disk immediately, but you
do want to be notified *when* it has been written to disk in order to take
some subsequent action (such as acknowledging the successful save to a
downstream consumer).

I would like to propose an alternative approach similar to TCP sequence
numbers. We already have a sequence number which counts documents added to
the database (update_seq). I suggest we keep a separate watermark which is
the sequence number when the database was last flushed to disk (say
flush_seq).

Now:

- when you PUT a document, send the update_seq as part of the response
  (let's call it doc_seq)

- update_seq may continue to increment as more documents are updated

- at some point in the future, when data is flushed to disk, set
  flush_seq := update_seq

- if the client wants to know when its document has been flushed to disk,
  it can poll mydb to check for flush_seq >= doc_seq

- it could be an option in the HTTP request to delay the response until
  flush_seq >= doc_seq
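
To make the steps above concrete, here is a toy in-memory model in Python. It
is only an illustration of the bookkeeping, not CouchDB code: the class and
method names are made up, and flush() merely stands in for a real commit to
disk.

```python
class Database:
    """Toy in-memory model of the proposed update_seq / flush_seq watermark."""

    def __init__(self):
        self.update_seq = 0  # counts document updates accepted so far
        self.flush_seq = 0   # value of update_seq at the last flush to disk

    def put(self, doc):
        """Accept a write and return its doc_seq (not yet durable)."""
        self.update_seq += 1
        return self.update_seq

    def flush(self):
        """Stand-in for a full commit: everything accepted so far is on disk."""
        self.flush_seq = self.update_seq

    def is_durable(self, doc_seq):
        """The check a client would poll for: flush_seq >= doc_seq."""
        return self.flush_seq >= doc_seq


db = Database()
seq_a = db.put({"_id": "a"})
seq_b = db.put({"_id": "b"})
before = db.is_durable(seq_a)  # nothing flushed yet
db.flush()                     # one flush covers both a and b
seq_c = db.put({"_id": "c"})   # arrives after the flush
after = (db.is_durable(seq_a), db.is_durable(seq_b), db.is_durable(seq_c))
```

The point of the model is that a single flush makes every earlier write
durable, so many clients can share the cost of one commit.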

That means you would get the benefit of knowing that the document had been
committed to disk, without the cost of forcing a commit yourself. Rather, you
wait until someone else forces a full commit, or the periodic full commit
takes place.

Then the only per-database tunable you need is the periodic commit interval.
Set it to 5 seconds for logging databases; 0.2 seconds for RADIUS accounting
(where you want to generate a response within 200ms); and 0 if you want every
single document to be committed as soon as it arrives.
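
As a sketch of what such a tunable might look like, here is a purely
hypothetical config section, parsed with Python's configparser. The section
name [commit_interval] and the database names are invented for illustration;
nothing like this exists in CouchDB today.

```python
import configparser

# Hypothetical config: the section name and keys are NOT real CouchDB settings.
SAMPLE = """
[commit_interval]
logs = 5
radius_acct = 0.2
orders = 0
"""


def load_intervals(text):
    """Return {database name: commit interval in seconds}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {db: float(value) for db, value in parser["commit_interval"].items()}


def commits_immediately(intervals, db):
    """An interval of 0 means every document is committed as it arrives."""
    return intervals[db] == 0


intervals = load_intervals(SAMPLE)
```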

Thoughts?

Something like this is doable at present, but requires a buffering proxy.
For example, you can receive RADIUS accounting updates into a buffer, then
every 200ms do a POST to _bulk_docs with X-Couch-Full-Commit: true and
return success to all the clients.
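
A minimal sketch of such a proxy in Python might look like the following. The
class name, the ack callbacks and the injectable post function are my own
assumptions; the RADIUS side and the 200ms timer are left out, so the caller
is expected to invoke flush() periodically. It also ignores any per-document
errors in the _bulk_docs response body.

```python
import json
import urllib.request


class BufferingProxy:
    """Buffer incoming documents and write them in one fully committed batch."""

    def __init__(self, db_url, post=None):
        self.db_url = db_url
        self.post = post or self._http_post  # injectable for testing
        self.pending = []                    # (doc, ack callback) pairs

    def receive(self, doc, ack):
        """Queue a document; ack is called only after the batch is written."""
        self.pending.append((doc, ack))

    def flush(self):
        """POST the batch to _bulk_docs with a full commit, then ack everyone."""
        if not self.pending:
            return 0
        batch = list(self.pending)
        body = json.dumps({"docs": [doc for doc, _ in batch]}).encode("utf-8")
        headers = {
            "Content-Type": "application/json",
            "X-Couch-Full-Commit": "true",
        }
        # If the POST raises, the batch stays queued and nobody is acked.
        self.post(self.db_url + "/_bulk_docs", body, headers)
        del self.pending[:len(batch)]
        for _, ack in batch:
            ack()
        return len(batch)

    @staticmethod
    def _http_post(url, body, headers):
        req = urllib.request.Request(url, data=body, headers=headers, method="POST")
        with urllib.request.urlopen(req) as resp:
            resp.read()
```

Every client in the batch shares the cost of a single full commit, at the
price of waiting up to one flush interval for its acknowledgement.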

Since CouchDB has to buffer these documents in the VFS cache anyway, it
would be convenient (and more efficient) to let it handle the periodic
flushing too.

Regards,

Brian.