couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lehnardt <>
Subject Re: Detailed info on the B-tree store? Native implementations thereof?
Date Tue, 11 Aug 2009 23:15:20 GMT

On 12 Aug 2009, at 01:08, Robert Newson wrote:

>> The worst problem is that the disk controller will reorder sector  
>> writes to reduce seek time, which in effect means that if power is  
>> lost, some random subset of the last writes may not happen. So you  
>> won't just end up with a truncated file
> But what about that issue? It think it's tolerated because couchdb
> searches backward for the last non-corrupt header, right?
> find_header(_Fd, -1) ->
>    no_valid_header;
> find_header(Fd, Block) ->
>    case (catch load_header(Fd, Block)) of
>    {ok, Bin} ->
>        {ok, Bin};
>    _Error ->
>        find_header(Fd, Block -1)
>    end.

Jens' next sentence was:

> So you won't just end up with a truncated file — you could have a  
> file that seems intact and has a correct header at the end, but has  
> 4k bytes of garbage somewhere within the last transaction. Does  
> CouchDB's file structure guard against that?

As far as I understand the file format, we're not safe against that.


> On Tue, Aug 11, 2009 at 6:37 PM, Chris Anderson<>  
> wrote:
>> On Tue, Aug 11, 2009 at 8:43 AM, Jens Alfke<>  
>> wrote:
>>> I'm interested in the underpinnings of the CouchDB server — the  
>>> crash-proof
>>> concurrent B-tree store. There's a blog post linked to in the wiki  
>>> that
>>> describes the basic concepts (leaves and updated intermediate  
>>> nodes are
>>> appended to the file; the start of the file stores two links to  
>>> the root
>>> node) but is there any more detailed description[1]? And is there  
>>> any
>>> similar technology available that's implemented in native code (C/C 
>>> ++)?[2]
>> Jens,
>> Glad to have you interested. There are a few posts about the B-tree
>> store. This is probably the best, but slightly out of date:
>> Since this article, we've changed the header handling, so that we
>> don't keep it at the top of the file, but instead append the header  
>> at
>> the end of the file at every commit. The strict append-only nature of
>> the storage engine is the source of it's robustness. Even an extreme
>> action, like truncating the file, will not result in an inconsistent
>> state.
>> The other aspect our API that web storage will need to be
>> concurrency-friendly is MVCC. Without MVCC you end up needing long
>> transactions between page-loads, like localStorage currently has,
>> which makes it useless for sharing state between windows. As I
>> analyzed in that blog post, once you have CouchDB-style MVCC tokens,
>> you pretty much need to start dealing in documents to manage the {id,
>> rev} tuple.
>> Maybe the easiest thing would be to just start bundling CouchDB with
>> your browser. :)
>> I'll be living in Berkeley starting next month, so if you'd like to
>> get together perhaps I can help get you oriented in the source code  
>> so
>> you can see this stuff in action, yourself. Erlang is surprisingly
>> simple once you get started.
>> Chris
>>> Basically I'm interested in whether it's feasible to build a  
>>> simple storage
>>> system (for use in an HTML5 Web browser) that a CouchDB-compatible  
>>> client
>>> library could be built on top of. JChris has posted about this topic
>>> recently[3], and pointed out that the hashtable-oriented key-value  
>>> store
>>> currently speced in HTML5 is a poor match for CouchDB. Moreover,  
>>> the SQLite
>>> database engine underneath it doesn't guarantee data integrity  
>>> after a hard
>>> system crash (as I know from painful experience.) So: could we  
>>> build a
>>> fault-tolerant B-tree based API into the browser? (This isn't just  
>>> academic
>>> curiosity: I recently started work on the Chrome team at Google,  
>>> and HTML5
>>> local storage is one of my group's responsibilities.)
>>> Thanks!
>>> —Jens
>>> [1] Alas, I cannot Use The Source, Luke, as I do not have Erlang  
>>> skillz. :(
>>> [2] I know of many, many B-tree libraries (Berkeley DB,  
>>> TokyoCabinet...) but
>>> none that are fault-tolerant.
>>> [3]
>> --
>> Chris Anderson

View raw message