couchdb-dev mailing list archives

From Damien Katz
Subject Tail Append Headers
Date Mon, 18 May 2009 17:59:08 GMT
In branches/tail_header on the svn repository is a working version of
the new pure tail-append code for CouchDB.

Right now in trunk we have zero-overwrite storage, which means we
never overwrite any previously committed data, metadata, or even
index structures.

The exception to the rule is the file header. In all previous
versions, CouchDB stores the database header at the head of the file,
but it's written twice, one copy after the other, with each 2k header
copy integrity checked. If the power fails in the middle of writing
one copy, the other should still be available.
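
As a rough illustration of the double-written header, here is a
minimal Python sketch. The layout used here (a 4-byte length, an MD5
digest, then the payload, padded to 2k) and the names pack_header,
unpack_header and read_header are assumptions made for the example,
not CouchDB's actual on-disk format.

```python
import hashlib
import struct

HEADER_SIZE = 2048  # each header copy is 2k, per the description above


def pack_header(payload: bytes) -> bytes:
    """Build one 2k header copy (hypothetical layout: length, MD5, payload)."""
    body = struct.pack(">I", len(payload)) + hashlib.md5(payload).digest() + payload
    if len(body) > HEADER_SIZE:
        raise ValueError("header payload does not fit in one 2k copy")
    return body.ljust(HEADER_SIZE, b"\x00")


def unpack_header(copy: bytes):
    """Return the payload if the copy passes its integrity check, else None."""
    if len(copy) < 20:
        return None
    (length,) = struct.unpack(">I", copy[:4])
    digest = copy[4:20]
    payload = copy[20:20 + length]
    if len(payload) != length or hashlib.md5(payload).digest() != digest:
        return None
    return payload


def read_header(file_head: bytes):
    """Try the first copy, then fall back to the second one."""
    for offset in (0, HEADER_SIZE):
        payload = unpack_header(file_head[offset:offset + HEADER_SIZE])
        if payload is not None:
            return payload
    return None
```

If a write is torn in the middle of one copy, its digest no longer
matches and read_header falls through to the other copy.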

With zero-overwrite storage, all btree updates happen at the end of
the file, but document summaries are written into a buffer
internally, so that docs are written contiguously in buffers (usually)
near the end of the file. Big file attachments are written into
internally linked buffers, also near the end of the file. This design
has proven very robust, and offers reasonable update performance and
good document retrieval times. Its weakness, like that of all single
storage formats, is that if the file gets corrupted, the database may
be unrecoverable.

One form of corruption that seems fairly common (we've seen it at
least 3 times) is file truncation, which is to say the end of the
file goes missing. This seems to happen sometimes after a file system
fills up, or a machine suffers a power loss.

Unfortunately, when file truncation happens with CouchDB, it's not
just the last blocks of data that are lost, it's the whole file,
because the last bits of data it writes are the root btree node that's
necessary to find the remaining indexes. It's possible to write a tool
to scan back and attempt to find the correct offset pointers to  
restore the file, but that's pretty expensive and wouldn't always be
correct.

To fix this, the tail_header branch I created starts from something
like zero-overwrite storage and takes it a little further, using
append-only storage: every single update or deletion causes a write to
the very end of the file, making the file grow. Even the header is
stored at the end of the file (so it's more accurately called a
trailer, I guess).

With this design, any file truncation simply results in an earlier
version of the database. If a commit is interrupted before the header
gets completely written, then the next time the database is opened,
the commit data is skipped over as it scans backward looking for a
valid header.

At every 4k boundary in the file, a single byte has a value of either
1 or 0. A value of 1 means a header immediately follows the byte;
otherwise it's a regular storage block. Every regular write to the
file, if it spans the special byte, is split up and the special byte
is inserted. When reading from the file, the special bytes are
automatically stripped out of the data.
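
The marker-byte scheme can be sketched in a few lines of Python. This
is only an illustration of the idea: the names add_markers and
strip_markers are hypothetical, and the sketch always writes a 0
marker (regular storage block), leaving header blocks aside.

```python
BLOCK_SIZE = 4096  # one marker byte at every 4k boundary


def add_markers(data: bytes, pos: int) -> bytes:
    """Return the bytes to append at file offset pos, with a zero marker
    byte inserted wherever the write crosses a block boundary."""
    out = bytearray()
    i = 0
    while i < len(data):
        if pos % BLOCK_SIZE == 0:
            out.append(0)  # 0 = regular storage block
            pos += 1
        room = BLOCK_SIZE - pos % BLOCK_SIZE  # bytes left in this block
        chunk = data[i:i + room]
        out += chunk
        i += len(chunk)
        pos += len(chunk)
    return bytes(out)


def strip_markers(raw: bytes, pos: int) -> bytes:
    """Inverse of add_markers: raw was read starting at file offset pos."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if pos % BLOCK_SIZE == 0:
            i += 1  # skip the marker byte
            pos += 1
            continue
        room = BLOCK_SIZE - pos % BLOCK_SIZE
        chunk = raw[i:i + room]
        out += chunk
        i += len(chunk)
        pos += len(chunk)
    return bytes(out)
```

For example, a 10-byte write starting at offset 4090 comes out as 11
bytes on disk: 6 data bytes, the marker at offset 4096, then the
remaining 4 data bytes.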

When a file is first opened, the header is searched for by scanning
back through the blocks, looking for a valid header that passes all
the integrity checks. Usually this will be very fast, but it could be
a long scan, depending on how much data was written but not committed
before a failure.
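
The backward scan might look roughly like the following Python sketch.
It assumes, purely for illustration, that a header fits inside one 4k
block and is laid out as marker byte 1, a 4-byte length, an MD5 digest
and the payload; the function names are hypothetical and the real
integrity checks in the branch may differ.

```python
import hashlib
import struct

BLOCK_SIZE = 4096


def make_header_block(payload: bytes) -> bytes:
    """Build a block holding a header (hypothetical layout, see above)."""
    body = b"\x01" + struct.pack(">I", len(payload)) + hashlib.md5(payload).digest() + payload
    if len(body) > BLOCK_SIZE:
        raise ValueError("header does not fit in one block")
    return body.ljust(BLOCK_SIZE, b"\x00")


def parse_header(block: bytes):
    """Return the header payload if the block holds a valid header, else None."""
    if len(block) < 21 or block[0] != 1:
        return None  # truncated block, or marker says "regular storage block"
    (length,) = struct.unpack(">I", block[1:5])
    digest = block[5:21]
    payload = block[21:21 + length]
    if len(payload) != length or hashlib.md5(payload).digest() != digest:
        return None  # incomplete or corrupted header
    return payload


def find_last_header(data: bytes):
    """Scan back from the last block; return (offset, payload) or None."""
    last = (len(data) - 1) // BLOCK_SIZE
    for index in range(last, -1, -1):
        start = index * BLOCK_SIZE
        payload = parse_header(data[start:start + BLOCK_SIZE])
        if payload is not None:
            return start, payload
    return None
```

With a file image of data, header v1, data, header v2, truncating the
image in the middle of header v2 makes find_last_header return header
v1, that is, the earlier version of the database.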

Besides being very robust in the face of truncation, this format has
the advantage of potentially speeding up commits greatly, as
everything is written sequentially at the end of the file, allowing
tons of data to be written out without ever having to do a head seek.
And fsync can be called fewer times now. If you have an application
where you don't mind losing your most recent updates, you could turn
off fsync altogether. However, this assumes ordered-sequential
writes, that is, that the FS will never write out the later bytes
before the earlier bytes.
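
To make the fsync trade-off concrete, here is a small Python sketch of
an append-only commit with optional syncs. The function name commit
and its sync_body / sync_header flags are hypothetical, chosen only for
this example; they are not options that exist in the branch.

```python
import os


def commit(f, body: bytes, header: bytes, sync_body=True, sync_header=True):
    """Append body, then header, to a file opened in binary mode.

    sync_body forces the body to disk before the header is written, so
    a valid header never points at data that is missing on disk.
    sync_header forces the header itself to disk before returning.
    Turning both off relies on the FS writing bytes out in order.
    """
    f.write(body)
    if sync_body:
        f.flush()
        os.fsync(f.fileno())
    f.write(header)
    f.flush()
    if sync_header:
        os.fsync(f.fileno())
```

With both flags off, a crash can lose the most recent commits, and the
file is only guaranteed to open at an earlier header if the ordering
assumption above actually holds.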

Large file attachments have more overhead, as the files are broken up
into ~4k chunks and a pointer to each chunk is stored. This means
opening a document also requires loading the pointers to each chunk,
instead of a single pointer like before.
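
A toy Python sketch of the chunking overhead, using an in-memory
bytearray in place of the database file. The (offset, length) pointer
shape and the names append_attachment and read_attachment are
assumptions for illustration only.

```python
CHUNK_SIZE = 4096  # attachments are split into roughly 4k chunks


def append_attachment(storage: bytearray, attachment: bytes):
    """Append the attachment in chunks; return one (offset, length)
    pointer per chunk."""
    pointers = []
    for start in range(0, len(attachment), CHUNK_SIZE):
        chunk = attachment[start:start + CHUNK_SIZE]
        pointers.append((len(storage), len(chunk)))
        storage += chunk
    return pointers


def read_attachment(storage: bytearray, pointers) -> bytes:
    """Reassemble the attachment by following every chunk pointer."""
    return b"".join(bytes(storage[off:off + n]) for off, n in pointers)
```

A 10,000-byte attachment produces three pointers here, all of which
have to be loaded along with the document, where the old format needed
only one.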

Pros:
- Extremely robust storage format. Data truncations, as caused by OS
crashes, incomplete copies, etc., still allow for earlier versions of
the database to be recovered.
- Faster commit speeds (in theory).
- OS-level backups can simply copy the new bytes over. (Hmmm, but
this won't work with compaction, or if we automatically truncate to
the last valid header on file open.)
- View index updates never require an fsync (assuming
ordered-sequential writes).

Cons:
- Every update to the database will have up to 4k of overhead for
header writing (the actual header is smaller, but must be written 4k
aligned).
- Individually updated documents are more sparse on disk by default,  
making long view builds slower (in theory) as the disk will need to  
seek forward more often. (but compaction will fix this)
- On file open, must seek back through the file to find a valid header.
- More overhead for large file attachments.

Work to be done:
- More options for when to do fsync or not, to optimize for underlying  
file system (before header write, after header write, not at all, etc)
- Rollback? Do we want to support rolling back the file to previous
versions?
- Truncate on open? - When we open a file, do we want to automatically  
truncate off any uncommitted garbage that could be left over?
- Compact should write attachments in one stage of copying, then the
documents themselves. Right now, attachment and document writes are
interleaved per-document.
- Live upgrade of 0.9.0. It would be nice to be able to serve old  
style files to allow for zero downtime on upgrade. Right now the  
branch doesn't understand old files at all.
- Possibly we need to fsync on database file open, since the file  
might be in the FS cache but not on disk due to a previous CouchDB  
crash. This can cause problems if the view indexer (or any indexer,
like Lucene) updates its index and it gets committed to disk, but the
most recent version of the database still isn't committed. Then if the
OS crashes or a power loss occurs, the index files might unknowingly
reflect lost state in the database, which would be fixable only by  
doing a complete view rebuild.

Feedback on all this welcome. Please try out the branch to shake out  
any bugs or performance problems that might be lurking.

