couchdb-dev mailing list archives

From Brian Candler
Subject Re: Tail Append Headers
Date Wed, 20 May 2009 08:12:57 GMT
On Mon, May 18, 2009 at 01:59:08PM -0400, Damien Katz wrote:
> If you have an application where you don't mind 
> losing your most recent updates, you could turn off fsync altogether. 
> However, this assumes ordered-sequential writes, that the FS will never 
> write out the later bytes before the earlier bytes.

... and also that the drive doesn't reorder the writes itself.

You could checksum each 4K data block (if that's not done already), but then
you'd need to scan from the front of the file to the end to find the first
invalid block. Perhaps that's the job of a recovery tool, rather than
something to be done every time the database is opened.
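
A minimal sketch of such a scan, assuming each 4KB block ends in a 4-byte
CRC32 of the rest of the block. That layout, and the names `seal_block` and
`count_valid_prefix_blocks`, are illustrative assumptions, not CouchDB's
actual file format:

```python
import zlib

BLOCK_SIZE = 4096                    # block size discussed in the thread
CRC_SIZE = 4                         # assumption: CRC32 trailer on every block
PAYLOAD_SIZE = BLOCK_SIZE - CRC_SIZE


def seal_block(payload: bytes) -> bytes:
    """Zero-pad payload to PAYLOAD_SIZE and append the CRC32 of the padded body."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one block")
    body = payload.ljust(PAYLOAD_SIZE, b"\x00")
    return body + zlib.crc32(body).to_bytes(CRC_SIZE, "big")


def count_valid_prefix_blocks(data: bytes) -> int:
    """Scan from the front and return how many leading blocks pass their CRC.

    The result is also the index of the first invalid block. A trailing
    partial block is never counted as valid.
    """
    full_blocks = len(data) // BLOCK_SIZE
    for i in range(full_blocks):
        block = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        body, stored = block[:PAYLOAD_SIZE], block[PAYLOAD_SIZE:]
        if zlib.crc32(body).to_bytes(CRC_SIZE, "big") != stored:
            return i
    return full_blocks
```

The scan is linear in file size, which is why it looks more like a recovery
tool than an open-time check.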

> Downsides:
> - Every update to the database will have up to 4k of overhead for header 
> writing (the actual header is smaller, but must be written 4k aligned).

At 4KB, one write (or batch of writes) per second equates to ~337MB per day
of overhead (4096 bytes x 86,400 seconds). Fairly significant, although
perhaps not too bad with daily or weekly compaction.
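
The arithmetic behind that figure, as a quick check. The function name is
just for illustration, and "MB" is taken to mean 1024 x 1024 bytes:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def header_overhead_mib_per_day(block_bytes: int, writes_per_second: float = 1.0) -> float:
    """Worst-case header padding per day, in MiB, at one header block per write."""
    return block_bytes * writes_per_second * SECONDS_PER_DAY / (1024 * 1024)


print(header_overhead_mib_per_day(4096))  # 337.5
print(header_overhead_mib_per_day(1024))  # 84.375
```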

1KB blocks would be a lot better from that point of view, presumably at the
cost of more work breaking up docs and attachments.

For writes of individual small docs, do you always write out a 4KB data
block followed by a 4KB header block? If so, a simple optimisation would be
a mixed data+header block:

00 .... data
01 hh hh .... <<hhhh bytes of data followed by header>>

I'd think that it's pointless to write out a new header unless there has
been some data to write as well.
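
A rough sketch of that mixed block, assuming a one-byte type tag, a
big-endian two-byte data length, and zero padding out to 4KB. The byte order,
the padding and the function names are my assumptions for illustration only:

```python
import struct

BLOCK_SIZE = 4096
TYPE_DATA = 0x00    # "00 .... data"
TYPE_MIXED = 0x01   # "01 hh hh ...." data followed by header


def encode_mixed_block(data: bytes, header: bytes) -> bytes:
    """Pack data and header into a single block, zero-padded to BLOCK_SIZE."""
    used = 1 + 2 + len(data) + len(header)
    if used > BLOCK_SIZE:
        raise ValueError("data and header do not fit in one block")
    block = struct.pack(">BH", TYPE_MIXED, len(data)) + data + header
    return block.ljust(BLOCK_SIZE, b"\x00")


def decode_block(block: bytes):
    """Return (data, header_region); header_region is None for a pure data block."""
    kind = block[0]
    if kind == TYPE_DATA:
        return block[1:], None
    if kind == TYPE_MIXED:
        (data_len,) = struct.unpack_from(">H", block, 1)
        # The header region still carries the zero padding, so the header
        # itself has to be self-delimiting (e.g. carry its own length).
        return block[3:3 + data_len], block[3 + data_len:]
    raise ValueError(f"unknown block type {kind:#04x}")
```

A small document update would then cost one 4KB block instead of two.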

> - Individually updated documents are more sparse on disk by default,  
> making long view builds slower (in theory) as the disk will need to seek 
> forward more often. (but compaction will fix this)

Maybe it won't do head seeks, if the O/S or application does read-ahead, but
you'll certainly be reading through a larger file.

In general, I would happily trade some speed for better crash-resistance.

As for timing of fsync: ideally what I would like is for each write
operation to return some sort of checkpoint tag (which could just be the
current file size). Then have a new HTTP operation "wait for sync <tag>"
which blocks until the file has been fsync'd at or after that position.
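
The waiting side could look something like this in-process sketch. The
`SyncTracker` class and its methods are hypothetical, and the HTTP layer and
the fsync call itself are left out; the tag is taken to be a file offset:

```python
import threading


class SyncTracker:
    """Tracks the highest file offset known to be fsync'd and lets callers wait on it."""

    def __init__(self):
        self._cond = threading.Condition()
        self._synced_upto = 0

    def mark_synced(self, position: int) -> None:
        """Called by the writer after an fsync covering everything up to position."""
        with self._cond:
            if position > self._synced_upto:
                self._synced_upto = position
                self._cond.notify_all()

    def wait_for_sync(self, tag: int, timeout=None) -> bool:
        """Block until the file is synced at or past tag; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._synced_upto >= tag, timeout)
```

Each write would hand back the file size after its append as the tag, and a
"wait for sync" request would call `wait_for_sync(tag)` before replying.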

This would allow useful semantics in proxies. e.g. I don't want to return an
acknowledgement to a client until the document has been safely written to at
least two locations, but I don't want to force an fsync for those requests.

One other thought. Is there value in extending the file in chunks of, say,
16MB - in the hope that the O/S is more likely to allocate contiguous
regions of storage?
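
For concreteness, a sketch of chunked extension, assuming a Unix-like
system. `extent_size` and `reserve` are made-up names; `os.posix_fallocate`
is only available on some platforms, and the `ftruncate` fallback may just
create a sparse file rather than reserve real blocks:

```python
import os

CHUNK = 16 * 1024 * 1024  # 16MB, the figure floated above


def extent_size(needed_bytes: int, chunk: int = CHUNK) -> int:
    """Round needed_bytes up to the next multiple of chunk."""
    return -(-needed_bytes // chunk) * chunk


def reserve(fd: int, needed_bytes: int, chunk: int = CHUNK) -> int:
    """Grow the file behind fd to a chunk boundary covering needed_bytes.

    Returns the resulting file size. Never shrinks the file.
    """
    target = extent_size(needed_bytes, chunk)
    current = os.fstat(fd).st_size
    if target <= current:
        return current
    if hasattr(os, "posix_fallocate"):
        os.posix_fallocate(fd, current, target - current)  # asks the FS for real blocks
    else:
        os.ftruncate(fd, target)  # fallback: extends length only, possibly sparse
    return target
```

One catch: once the file is pre-extended, the file size no longer marks the
end of written data, so anything that locates the latest header (or derives
a checkpoint tag) from the file size would need a separate end-of-data
marker.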
