couchdb-dev mailing list archives

From Chris Anderson
Subject Re: Tail Append Headers
Date Sat, 23 May 2009 14:34:08 GMT
On Sat, May 23, 2009 at 4:21 AM, Robert Dionne wrote:
> Fyi, I've been using this branch in my testing lately and everything works
> fine, except that the latest db upgrade changes break hovercraft:test() in the
> attachment streaming. The call to couch_doc:bin_foldl now has different
> behavior. The fix was trivial: changing two calls in hovercraft to use the
> length function rather than size.
> I'm happy to push a patch to the hovercraft project when this tail append
> branch is merged to trunk.

I'd been planning on writing that, so if you send the patch my way I
can apply it now, maybe maintained as its own branch as well until the
merge to trunk.


> Cheers,
> Bob
> On May 22, 2009, at 4:56 PM, Damien Katz wrote:
>> On May 18, 2009, at 1:59 PM, Damien Katz wrote:
>>> In branches/tail_header on the svn repository is a working version of
>>> new pure tail append code for CouchDB.
>>> Right now in trunk we have zero-overwrite storage, which means we never
>>> overwrite any previously committed data, or meta data, or even index
>>> structures.
>>> The exception to the rule is the file header. In all previous versions,
>>> CouchDB stores the database header at the head of the file, but it's written
>>> twice, one copy after the other, each 2k header copy integrity checked. If the
>>> power fails in the middle of writing one copy, the other should still be
>>> available.
>>> With zero-overwrite storage, all btree updates happen at the end of the
>>> file, but document summaries were written into a buffer internally, so that
>>> docs are written contiguously in buffers (usually) near the end of the
>>> file. Big file attachments were written into internally linked buffers also
>>> near the end of the file. This design proves very robust, offers reasonable
>>> update performance and good document retrieval times. Its weakness, like
>>> that of all single storage formats, is that if the file gets corrupted, the
>>> database may be unrecoverable.
>>> One form of corruption that seems fairly common (we've seen it at least
>>> 3 times) is file truncation, which is to say the end of the file goes
>>> missing. This seems to happen sometimes after a file system fills up, or
>>> a machine suffers a power loss.
>>> Unfortunately, when file truncation happens with CouchDB, it's not just
>>> the last blocks of data that are lost, it's the whole file, because the last
>>> bits of data it writes are the root btree node that's necessary to find the
>>> remaining indexes. It's possible to write a tool to scan back and attempt to
>>> find the correct offset pointers to restore the file, but that's pretty
>>> expensive and wouldn't always be correct.
>>> To fix this, the tail_header branch I created uses something like zero
>>> overwrite storage, and takes it a little further and uses append-only
>>> storage, with every single update or deletion causing an update to the very
>>> end of the file, making the file grow.  Even the header is stored at the end
>>> of the file (more accurate to be called a trailer I guess).
>>> With this design, any file truncation simply results in an earlier
>>> version of the database. If a commit is interrupted before the header gets
>>> completely written, then the next time the database is opened, the commit data
>>> is skipped over as it scans backward looking for a valid header.
>>> Every 4k, a single byte has either a value of 1 or 0. A value of 1 means
>>> a header immediately follows the byte, otherwise it's a regular storage
>>> block. Every regular write to the file, if it spans the special byte, is
>>> split up and the special byte inserted. When reading from the file, the
>>> special bytes are automatically stripped out from the data.
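A minimal Python sketch of the marker-byte scheme described above, not CouchDB's actual Erlang code. The names `add_markers` and `strip_markers` are hypothetical, and the sketch assumes the marker sits at every file offset that is a multiple of 4096 and is 0 for regular storage blocks.

```python
BLOCK = 4096  # assumed block size: one marker byte at every multiple of 4096


def add_markers(data: bytes, pos: int) -> bytes:
    """Return the bytes to append at file offset `pos`, with a 0 marker
    byte inserted wherever the write crosses a block boundary."""
    out = bytearray()
    i = 0
    while i < len(data):
        if pos % BLOCK == 0:
            out.append(0)  # 0 = regular storage block (1 would mean header)
            pos += 1
        room = BLOCK - pos % BLOCK  # payload bytes left in this block
        chunk = data[i:i + room]
        out += chunk
        i += len(chunk)
        pos += len(chunk)
    return bytes(out)


def strip_markers(raw: bytes, pos: int) -> bytes:
    """Inverse of add_markers: `raw` was read starting at file offset `pos`;
    drop every byte that sits on a block boundary."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if pos % BLOCK == 0:
            i += 1  # skip the marker byte
            pos += 1
            continue
        room = BLOCK - pos % BLOCK
        chunk = raw[i:i + room]
        out += chunk
        i += len(chunk)
        pos += len(chunk)
    return bytes(out)
```

A caller only ever sees the payload: whatever goes through `add_markers` at a given offset comes back unchanged from `strip_markers` at the same offset.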
>>> When a file is first opened, the header is searched for by scanning back
>>> through the blocks, looking for a valid header that passes all the integrity
>>> checks. Usually this will be very fast, but could be a long scan depending
>>> on how much data was written but not committed before failure.
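The backward scan can be illustrated with a small in-memory Python sketch. The header layout used here (marker byte 1, a 4-byte length, a CRC32, then the payload) is an assumption made for illustration, not the branch's real on-disk format, and `append_header` / `find_last_header` are hypothetical names. The sketch also assumes a header fits inside one block.

```python
import struct
import zlib

BLOCK = 4096  # assumed block size


def append_header(buf: bytearray, payload: bytes) -> None:
    """Pad to the next block boundary, then write marker byte 1 followed by
    length, CRC32 and payload (layout assumed for this sketch only)."""
    buf.extend(b"\x00" * (-len(buf) % BLOCK))
    buf.append(1)
    buf.extend(struct.pack(">II", len(payload), zlib.crc32(payload)))
    buf.extend(payload)


def _read_header(buf, pos):
    """Return the header payload at block offset `pos`, or None if the block
    is not a header or fails the integrity check."""
    if pos >= len(buf) or buf[pos] != 1:
        return None
    start = pos + 1
    if start + 8 > len(buf):
        return None  # truncated before the fixed-size fields
    length, crc = struct.unpack(">II", bytes(buf[start:start + 8]))
    payload = bytes(buf[start + 8:start + 8 + length])
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None  # torn or corrupted header
    return payload


def find_last_header(buf):
    """Scan block boundaries from the end of the file toward the start and
    return (offset, payload) of the newest valid header, or None."""
    pos = (len(buf) // BLOCK) * BLOCK
    while pos >= 0:
        payload = _read_header(buf, pos)
        if payload is not None:
            return pos, payload
        pos -= BLOCK
    return None
```

If the file is cut off in the middle of the newest header, the scan falls back to the previous valid one, which is the "earlier version of the database" behavior described above.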
>>> Besides being very robust in the face of truncation, this format has the
>>> advantage of potentially speeding up the commits greatly, as everything is
>>> written sequentially at the end of the file, allowing tons of data to be
>>> written out without ever having to do a head seek. And fsync can be called
>>> fewer times now. If you have an application where you don't mind losing your
>>> most recent updates, you could turn off fsync altogether. However, this
>>> assumes ordered-sequential writes, that the FS will never write out the
>>> later bytes before the earlier bytes.
>>> Large file attachments have more overhead as the files are broken up into
>>> ~4k chunks, and a pointer to each chunk is stored. This means opening a document
>>> also requires loading up the pointers to each chunk, instead of a single
>>> pointer like before.
>>> Upsides:
>>> - Extremely robust storage format. Data truncations, as caused by OS
>>> crashes, incomplete copies, etc, still allow for earlier versions of the
>>> database to be recovered.
>>> - Faster commit speeds (in theory).
>>> - OS level backups can simply copy the new bytes over. (hmmm but this
>>> won't work with compaction or if we automatically truncate to valid header
>>> on file open).
>>> - View index updates never require an fsync. (assuming ordered-sequential
>>> writes)
>>> Downsides:
>>> - Every update to the database will have up to 4k of overhead for header
>>> writing (the actual header is smaller, but must be written 4k aligned).
>>> - Individually updated documents are more sparse on disk by default,
>>> making long view builds slower (in theory) as the disk will need to seek
>>> forward more often. (but compaction will fix this)
>>> - On file open, must seek back through the file to find a valid header.
>>> - More overhead for large file attachments.
>>> Work to be done:
>>> - More options for when to do fsync or not, to optimize for underlying
>>> file system (before header write, after header write, not at all, etc)
>>> - Rollback? Do we want to support rolling back the file to previous
>>> versions?
>>> - Truncate on open? - When we open a file, do we want to automatically
>>> truncate off any uncommitted garbage that could be left over?
>>> - Compact should write attachments in one stage of copying, then the
>>> documents themselves, right now attachment and document writes are
>>> interleaved per-document.
>>> - Live upgrade of 0.9.0. It would be nice to be able to serve old style
>>> files to allow for zero downtime on upgrade. Right now the branch doesn't
>>> understand old files at all.
>>> - Possibly we need to fsync on database file open, since the file might
>>> be in the FS cache but not on disk due to a previous CouchDB crash. This can
>>> cause problems if the view indexer (or any indexer, like lucene) updates its
>>> index and it gets committed to disk, but the most recent version of the
>>> database still isn't committed. Then if the OS crashes or power loss occurs,
>>> the index files might unknowingly reflect lost state in the database, which
>>> would be fixable only by doing a complete view rebuild.
>>> Feedback on all this welcome. Please try out the branch to shake out any
>>> bugs or performance problems that might be lurking.
>>> -Damien
>> So I think this patch is ready for trunk.
>> It now serves old files without downtime and I've tested it out manually,
>> but I haven't written any automated tests for it. If you can please try it
>> out on a trunk database and view(s) and see if everything still works
>> correctly. Also test out compacting the database to fully upgrade it to the
>> current format. Note: please make a backup of the database before doing this.
>> Just opening an old file with the new code causes it to partially upgrade, so
>> that previous versions don't recognize it.
>> This live upgrade code is sprinkled throughout the source and the places
>> are marked. We will remove these, probably after the next version (0.10).
>> The new code has ini configurable fsyncs:
>>   [couchdb]
>>   sync_options = [before_header, after_header, on_file_open]
>> By default, all three options are on; you can turn some or all off in the
>> local.ini like this:
>>   [couchdb]
>>   sync_options = []
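To make the effect of these options concrete, here is a rough Python sketch of how a commit might consult them. It is not how CouchDB reads its ini files: `parse_sync_options` and `commit` are hypothetical helpers, the bracketed list is parsed by naive string splitting, and `on_file_open` is only parsed here, not acted on.

```python
import configparser
import os

DEFAULT = "[before_header, after_header, on_file_open]"  # all three on by default


def parse_sync_options(ini_text: str) -> set:
    """Read [couchdb] sync_options from ini text and return the option names
    as a set (naive parsing of the bracketed list, for illustration only)."""
    cp = configparser.ConfigParser()
    cp.read_string(ini_text)
    raw = cp.get("couchdb", "sync_options", fallback=DEFAULT)
    inner = raw.strip().strip("[]")
    return {name.strip() for name in inner.split(",") if name.strip()}


def commit(fd: int, data: bytes, header: bytes, sync_options: set) -> None:
    """Append data, then the header, calling fsync only where configured."""
    os.write(fd, data)
    if "before_header" in sync_options:
        os.fsync(fd)  # data reaches disk before a header can refer to it
    os.write(fd, header)
    if "after_header" in sync_options:
        os.fsync(fd)  # the header itself reaches disk before commit returns
```

With `sync_options = []` the same `commit` call performs no fsync at all and leaves flushing to the OS, which is only safe under the ordered-sequential write assumption mentioned earlier in the thread.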
>> For default transactions, the header is only written out once per second,
>> reducing its size impact particularly in high volume writes. Also the tail
>> append stuff gives us the ability to have even more transaction options to
>> optimize for different systems, but that can all be done later.
>> After discovering how much CPU it was eating, I turned the term
>> compression stuff completely off. But it should be ini or database
>> configurable eventually.  As Adam Kocoloski pointed out, regardless of how it's
>> compressed when saved it's always readable later.
>> It does not have version rollback or "truncate to valid header on open",
>> but those are features that can be added later without much work if
>> necessary.
>> Feedback please!
>> -Damien

Chris Anderson
