couchdb-dev mailing list archives

From Damien Katz <>
Subject Re: Tail Append Headers
Date Wed, 20 May 2009 15:48:10 GMT

On May 20, 2009, at 4:12 AM, Brian Candler wrote:

> On Mon, May 18, 2009 at 01:59:08PM -0400, Damien Katz wrote:
>> If you have an application where you don't mind losing your most
>> recent updates, you could turn off fsync altogether. However, this
>> assumes ordered-sequential writes, that the FS will never write out
>> the later bytes before the earlier bytes.
> ... and also that the drive doesn't reorder the writes itself.

Correct, that's what I mean by FS, file system. Different setups will
have different behaviors, which is why we'll make the flush settings
customizable in the ini, to optimize for the underlying FS (fsync
before the header write, after the header write, not at all, etc.).
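The write ordering being discussed can be sketched roughly like this (a minimal Python sketch, not CouchDB's actual Erlang code; the 4k block size, the CRC-based header layout, and the mode names are all assumptions standing in for the ini settings):

```python
import os
import struct
import zlib

HEADER_BLOCK = 4096  # hypothetical block size; the thread treats this as tunable

def append_update(f, data, fsync_mode="after_header"):
    """Append a data chunk, then a block-aligned header pointing at it.

    fsync_mode is one of "before_header", "after_header", or "none",
    mirroring the configurable flush settings described above.
    """
    f.seek(0, os.SEEK_END)
    data_pos = f.tell()
    f.write(data)
    if fsync_mode == "before_header":
        f.flush()
        os.fsync(f.fileno())
    # Pad so the header starts on a block boundary.
    f.write(b"\x00" * ((-f.tell()) % HEADER_BLOCK))
    # Assumed header layout: CRC32 over a 16-byte body (data offset + length).
    body = struct.pack(">QQ", data_pos, len(data))
    header = struct.pack(">I", zlib.crc32(body)) + body
    f.write(header.ljust(HEADER_BLOCK, b"\x00"))
    if fsync_mode == "after_header":
        f.flush()
        os.fsync(f.fileno())
    return data_pos
```

With "none", a crash can lose recent updates but (assuming ordered writes) the last completed header still points at consistent data, which is the trade-off described above.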

> You could checksum each 4K data block (if that's not done already),
> but then you'd need to scan from the front of the file to the end to
> find the first invalid block. Perhaps that's the job of a recovery
> tool, rather than something to be done every time the database is
> opened.
>> Downsides:
>> - Every update to the database will have up to 4k of overhead for
>> header writing (the actual header is smaller, but must be written
>> 4k aligned).
> At 4KB, one (batch of) writes per second equates to ~337MB per day of
> overhead. Fairly significant, although perhaps not too bad with daily
> or weekly compaction.

The header overhead is more like 2k on average. But I don't mind  
wasting a little disk space, it's an extremely cheap resource.
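Rather than scanning from the front for the first invalid block, recovery can scan backward from the end of the file for the last block whose checksum verifies (a hypothetical sketch of the checksum idea quoted above; the CRC-plus-16-byte-body header layout is an assumption, not CouchDB's actual format):

```python
import os
import struct
import zlib

BLOCK = 4096  # assumed block size, per the discussion above

def find_last_valid_header(f):
    """Scan backward over block-aligned offsets, returning the offset
    of the last header whose CRC32 checks out, or None if none does.

    Assumed layout: 4-byte big-endian CRC32, then a 16-byte body
    (data offset and length), zero-padded to the block size. A real
    implementation would also need a marker distinguishing header
    blocks from data blocks that happen to checksum correctly.
    """
    f.seek(0, os.SEEK_END)
    end = f.tell()
    pos = (end // BLOCK) * BLOCK
    if pos == end and pos > 0:
        pos -= BLOCK
    while pos >= 0:
        f.seek(pos)
        block = f.read(BLOCK)
        if len(block) >= 20:
            crc, = struct.unpack(">I", block[:4])
            if crc == zlib.crc32(block[4:20]):
                return pos
        pos -= BLOCK
    return None
```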

> 1K blocks would be a lot better from that point of view, presumably
> at the cost of more work breaking up docs and attachments.

Sure, I picked 4k as a number out of the air; it's something that  
can be tuned.

> For writes of individual small docs, do you always write out a 4KB
> data block followed by a 4KB header block? If so, a simple
> optimisation would be a mixed data+header block:
> 00 .... data
> 01 hh hh .... <<hhhh bytes of data followed by header>>>
> I'd think that it's pointless to write out a new header unless there
> has been some data to write as well.

We don't write out a db header unless it's changed.
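The mixed-block proposal quoted above amounts to a one-byte tag plus an optional length prefix. A rough encode/decode sketch (the 00/01 tags and the two-byte length follow Brian's example; everything else is an assumption):

```python
import struct

def encode_block(data, header=None):
    """Tag 0x00: pure data. Tag 0x01: <2-byte big-endian data
    length><data><header>, packing both into one block write."""
    if header is None:
        return b"\x00" + data
    return b"\x01" + struct.pack(">H", len(data)) + data + header

def decode_block(block):
    """Return (data, header); header is None for a pure data block."""
    if block[0] == 0x00:
        return block[1:], None
    dlen, = struct.unpack(">H", block[1:3])
    return block[3:3 + dlen], block[3 + dlen:]
```

For small documents this halves the blocks written per update, at the cost of a slightly more involved read path.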

>> - Individually updated documents are more sparse on disk by default,
>> making long view builds slower (in theory) as the disk will need to
>> seek forward more often. (but compaction will fix this)
> Maybe it won't do head seeks, if the O/S or application does
> read-ahead, but you'll certainly be reading through a larger file.
> In general, I would happily trade some speed for better
> crash-resistance.
> As for timing of fsync: ideally what I would like is for each write
> operation to return some sort of checkpoint tag (which could just be
> the current file size). Then have a new HTTP operation "wait for sync
> <tag>" which blocks until the file has been fsync'd at or after that
> position.

Just use the X-Couch-Full-Commit: true HTTP header.
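A per-request full commit looks roughly like this (a sketch using Python's urllib; the database name, document, and URL are made-up placeholders):

```python
import json
import urllib.request

# Build a document PUT that asks CouchDB to fsync before responding.
req = urllib.request.Request(
    "http://127.0.0.1:5984/mydb/doc1",   # hypothetical database/doc
    data=json.dumps({"value": 42}).encode(),
    method="PUT",
    headers={
        "X-Couch-Full-Commit": "true",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; omitted here so the
# sketch doesn't require a running server.
```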

> This would allow useful semantics in proxies. e.g. I don't want to
> return an acknowledgement to a client until the document has been
> safely written to at least two locations, but I don't want to force
> an fsync for those requests.
> One other thought. Is there value in extending the file in chunks
> of, say, 16MB - in the hope that the O/S is more likely to allocate
> contiguous regions of storage?

I don't know, probably depends on the FS.
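For anyone wanting to experiment, the usual mechanism is posix_fallocate, which asks the FS to reserve real blocks rather than leaving a sparse hole (a hedged sketch; the 16MB figure is Brian's, whether it helps is indeed FS-dependent, and os.posix_fallocate is Unix-only):

```python
import os

CHUNK = 16 * 1024 * 1024  # 16MB growth increments, per the suggestion above

def ensure_capacity(fd, needed):
    """Grow the file in CHUNK-sized steps so 'needed' bytes fit,
    in the hope that the FS returns contiguous extents.
    Unix-only: os.posix_fallocate does not exist on Windows."""
    size = os.fstat(fd).st_size
    if needed > size:
        target = ((needed + CHUNK - 1) // CHUNK) * CHUNK
        os.posix_fallocate(fd, 0, target)
```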

