couchdb-dev mailing list archives

From Robert Dionne <>
Subject Re: Tail Append Headers
Date Sat, 23 May 2009 11:21:56 GMT
FYI, I've been using this branch in my testing lately and everything
works fine, except that the latest db upgrade changes break
hovercraft:test() in the attachment streaming. The call to
couch_doc:bin_foldl now behaves differently. The fix was trivial:
changing two calls in hovercraft to use the length function rather
than size.

I'm happy to push a patch to the hovercraft project when this tail  
append branch is merged to trunk.



On May 22, 2009, at 4:56 PM, Damien Katz wrote:

> On May 18, 2009, at 1:59 PM, Damien Katz wrote:
>> In branches/tail_header on the svn repository is a working
>> version of the new pure tail append code for CouchDB.
>> Right now in trunk we have zero-overwrite storage, which means we
>> never overwrite any previously committed data, metadata, or
>> even index structures.
>> The exception to the rule is the file header. In all previous
>> versions, CouchDB stores the database header at the head of the
>> file, but it's written twice, one copy after another, with each 2k
>> header copy integrity checked. If the power fails in the middle of
>> writing one copy, the other should still be available.
>> With zero-overwrite storage, all btree updates happen at the end  
>> of the file, but document summaries were written into a buffer  
>> internally, so that docs are written contiguously in buffers  
>> (usually) near the end of the file. Big file attachments were
>> written into internally linked buffers, also near the end of the
>> file. This design proves very robust and offers reasonable update
>> performance and good document retrieval times. Its weakness,
>> as with all single storage formats, is that if the file gets
>> corrupted, the database may be unrecoverable.
>> One form of corruption that seems fairly common (we've seen it
>> at least 3 times) is file truncation, which is to say the end of
>> the file goes missing. This seems to happen sometimes after a file
>> system fills up, or a machine suffers a power loss.
>> Unfortunately, when file truncation happens with CouchDB, it's not  
>> just the last blocks of data that are lost, it's the whole file,  
>> because the last bits of data it writes are the root btree node
>> that's necessary to find the remaining indexes. It's possible to  
>> write a tool to scan back and attempt to find the correct offset  
>> pointers to restore the file, but that's pretty expensive and  
>> wouldn't always be correct.
>> To fix this, the tail_header branch I created starts from something
>> like zero-overwrite storage and takes it a little further, using
>> append-only storage: every single update or deletion causes a
>> write to the very end of the file, making the file grow. Even
>> the header is stored at the end of the file (more accurately
>> called a trailer, I guess).
>> With this design, any file truncation simply results in an earlier  
>> version of the database. If a commit is interrupted before the  
>> header gets completely written, then the next time the database is  
>> opened, the commit data is skipped over as it scans backward looking
>> for a valid header.
>> Every 4k, a single byte has either a value of 1 or 0. A value of 1  
>> means a header immediately follows the byte, otherwise it's a  
>> regular storage block. Every regular write to the file, if it  
>> spans the special byte, is split up and the special byte inserted.  
>> When reading from the file, the special bytes are automatically  
>> stripped out from the data.
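
The marker-byte scheme described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names add_markers, strip_markers and BLOCK_SIZE are hypothetical, the block size is taken as exactly 4096 bytes, and only regular data (marker value 0) is handled. It is not the code in the branch.

```python
BLOCK_SIZE = 4096  # "every 4k" in the message; the exact size is an assumption


def add_markers(offset: int, data: bytes) -> bytes:
    """Return the raw bytes to append at raw file position `offset`,
    inserting a 0 marker byte wherever a block boundary is crossed."""
    out = bytearray()
    pos = offset
    i = 0
    while i < len(data):
        if pos % BLOCK_SIZE == 0:
            out.append(0)  # 0 = regular storage block, 1 would mean "header follows"
            pos += 1
        room = BLOCK_SIZE - (pos % BLOCK_SIZE)  # bytes left in the current block
        chunk = data[i:i + room]
        out += chunk
        pos += len(chunk)
        i += len(chunk)
    return bytes(out)


def strip_markers(offset: int, raw: bytes) -> bytes:
    """Inverse of add_markers: drop the marker byte at each block boundary."""
    out = bytearray()
    pos = offset
    i = 0
    while i < len(raw):
        if pos % BLOCK_SIZE == 0:
            i += 1  # skip the marker byte
            pos += 1
            continue
        room = BLOCK_SIZE - (pos % BLOCK_SIZE)
        chunk = raw[i:i + room]
        out += chunk
        pos += len(chunk)
        i += len(chunk)
    return bytes(out)
```

With this layout, each 4096-byte block carries 4095 bytes of payload, so a write that starts at a block boundary and spans three blocks picks up three marker bytes.
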
>> When a file is first opened, the header is searched for by  
>> scanning back through the blocks, looking for a valid header that  
>> passes all the integrity checks. Usually this will be very fast,
>> but it could be a long scan, depending on how much data was written
>> but not committed before failure.
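
A rough Python sketch of that backward scan follows. The on-disk header layout used here (marker byte 1, a 4-byte big-endian length, a 16-byte MD5 digest, then the payload) is an assumption made for illustration, as are the names encode_header and find_last_header. The sketch also assumes a header fits inside a single block.

```python
import hashlib
import struct

BLOCK_SIZE = 4096  # assumed block size


def encode_header(payload: bytes) -> bytes:
    """Hypothetical header layout: marker 1, length, MD5 digest, payload."""
    return (b"\x01" + struct.pack(">I", len(payload))
            + hashlib.md5(payload).digest() + payload)


def find_last_header(raw: bytes):
    """Walk block boundaries from the end of the file toward the start and
    return (offset, payload) for the first header that passes the integrity
    check, or None if no valid header exists."""
    last_block = (len(raw) - 1) // BLOCK_SIZE if raw else -1
    for block in range(last_block, -1, -1):
        start = block * BLOCK_SIZE
        if raw[start] != 1:
            continue  # regular storage block
        body = raw[start + 1:start + BLOCK_SIZE]
        if len(body) < 20:
            continue  # truncated before length and digest were written
        (length,) = struct.unpack(">I", body[:4])
        digest = body[4:20]
        payload = body[20:20 + length]
        if len(payload) == length and hashlib.md5(payload).digest() == digest:
            return start, payload
    return None
```

An interrupted or corrupted commit simply fails the check, and the scan carries on to the previous header, which is why truncation yields an earlier version of the database rather than an unreadable file.
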
>> Besides being very robust in the face of truncation, this format  
>> has the advantage of potentially speeding up the commits greatly,  
>> as everything is written sequentially at the end of the file,  
>> allowing tons of data to be written out without ever having to do  
>> a head seek. And fsync can be called fewer times now. If you have  
>> an application where you don't mind losing your most recent  
>> updates, you could turn off fsync altogether. However, this
>> assumes ordered-sequential writes, meaning that the FS will never
>> write out the later bytes before the earlier bytes.
>> Large file attachments have more overhead, as the files are broken
>> up into ~4k chunks, with a pointer stored to each chunk. This means
>> opening a document also requires loading the pointers to each
>> chunk, instead of a single pointer like before.
>> Upsides:
>> - Extremely robust storage format. Data truncations, as caused by  
>> OS crashes, incomplete copies, etc, still allow for earlier  
>> versions of the database to be recovered.
>> - Faster commit speeds (in theory).
>> - OS-level backups can simply copy the new bytes over. (hmmm
>> but this won't work with compaction or if we automatically
>> truncate to valid header on file open).
>> - View index updates never require an fsync (assuming
>> ordered-sequential writes).
>> Downsides:
>> - Every update to the database will have up to 4k of overhead for  
>> header writing (the actual header is smaller, but must be written  
>> 4k aligned).
>> - Individually updated documents are more sparse on disk by  
>> default, making long view builds slower (in theory) as the disk  
>> will need to seek forward more often. (but compaction will fix this)
>> - On file open, must seek back through the file to find a valid  
>> header.
>> - More overhead for large file attachments.
>> Work to be done:
>> - More options for when to do fsync or not, to optimize for  
>> underlying file system (before header write, after header write,  
>> not at all, etc)
>> - Rollback? Do we want to support rolling back the file to  
>> previous versions?
>> - Truncate on open? - When we open a file, do we want to  
>> automatically truncate off any uncommitted garbage that could be  
>> left over?
>> - Compact should write attachments in one stage of copying, then
>> the documents themselves. Right now attachment and document writes
>> are interleaved per-document.
>> - Live upgrade of 0.9.0. It would be nice to be able to serve old  
>> style files to allow for zero downtime on upgrade. Right now the  
>> branch doesn't understand old files at all.
>> - Possibly we need to fsync on database file open, since the file  
>> might be in the FS cache but not on disk due to a previous CouchDB  
>> crash. This can cause problems if the view indexer (or any  
>> indexer, like lucene) updates its index and it gets committed to  
>> disk, but the most recent version of the database still isn't  
>> committed. Then if the OS crashes or a power loss occurs, the index
>> files might unknowingly reflect lost state in the database, which  
>> would be fixable only by doing a complete view rebuild.
>> Feedback on all this welcome. Please try out the branch to shake  
>> out any bugs or performance problems that might be lurking.
>> -Damien
> So I think this patch is ready for trunk.
> It now serves old files without downtime and I've tested it out  
> manually, but I haven't written any automated tests for it. If you
> can, please try it out on a trunk database and view(s) and see if
> everything still works correctly. Also test out compacting the
> database to fully upgrade it to the current format. Note: please
> make a backup of the database before doing this. Just opening an old
> file with the new code causes it to be partially upgraded, so that
> previous versions don't recognize it.
> This live upgrade code is sprinkled throughout the source and the  
> places are marked. We will remove these, probably after the next  
> version (0.10).
> The new code has ini configurable fsyncs:
>   [couchdb]
>   sync_options = [before_header, after_header, on_file_open]
> By default, all three options are on. You can turn some or all off
> in the local.ini like this:
>   [couchdb]
>   sync_options = []
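
To make the three option names concrete, here is a minimal Python sketch of where each fsync could sit relative to the header write. The functions open_db, commit and write_all are hypothetical, and the placement of each sync is inferred from the option names in the message. It does not reproduce CouchDB's Erlang code.

```python
import os

# Option names come from the message; everything else is an assumption.
DEFAULT_SYNC_OPTIONS = frozenset({"before_header", "after_header", "on_file_open"})


def write_all(fd, data):
    """os.write may write fewer bytes than asked, so loop until done."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def open_db(path, sync_options=DEFAULT_SYNC_OPTIONS, fsync=os.fsync):
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o644)
    if "on_file_open" in sync_options:
        fsync(fd)  # flush anything a crashed process left only in the FS cache
    return fd


def commit(fd, data, header, sync_options=DEFAULT_SYNC_OPTIONS, fsync=os.fsync):
    write_all(fd, data)
    if "before_header" in sync_options:
        fsync(fd)  # data is on disk before the header that points at it
    write_all(fd, header)
    if "after_header" in sync_options:
        fsync(fd)  # the commit itself is on disk before we report success
```

With an empty option set, no fsync is issued at all, which matches the "don't mind losing your most recent updates" case and relies on the file system writing bytes out in order.
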
> For default transactions, the header is only written out once per  
> second, reducing its size impact particularly in high volume  
> writes. Also the tail append stuff gives us the ability to have  
> even more transaction options to optimize for different systems,  
> but that can all be done later.
> After discovering how much CPU it was eating, I turned the term  
> compression stuff completely off. But it should be ini or database  
> configurable eventually. As Adam Kocoloski pointed out, regardless
> of how it's compressed when saved, it's always readable later.
> It does not have version rollback or "truncate to valid header on  
> open", but those are features that can be added later without much  
> work if necessary.
> Feedback please!
> -Damien
