couchdb-dev mailing list archives

From Paul Davis <>
Subject Re: Next-generation attachment storage.
Date Wed, 26 Jan 2011 16:46:49 GMT
On Wed, Jan 26, 2011 at 9:47 AM, Robert Newson <> wrote:
> Luwak looks very interesting, thanks!
> As I noted originally, the harder part of the work is integrating
> with couchdb and/or replacing the current attachment code entirely
> (which is my preference), so I went with the simplest approach to
> externalizing attachments (one attachment per file).
> The issue of synchronizing the data between the two storage systems
> needs some careful thought. My current approach is to put data into
> the attachment store (whether haystack, luwak or custom) with a
> 'provisional' marker. After we write_and_commit, we go back and mark
> it as final. We do something similar for removal ('provisionally
> removed' -> 'removed'). This will allow us, in most circumstances, to
> know the status of an item in the attachment store without
> cross-referencing it with couchdb. This will be important when
> compacting the attachment storage files (necessary in haystack, no
> clue yet for luwak).
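
The provisional-marker scheme quoted above can be sketched as a small
state machine. This is only an illustration under assumptions: the
AttachmentStore class, its method names and the in-memory dict are
hypothetical stand-ins for haystack, luwak or a custom store, and
write_and_commit is passed in as a plain callable rather than being the
real couchdb function.

```python
from enum import Enum


class State(Enum):
    PROVISIONAL = "provisional"
    FINAL = "final"
    PROVISIONALLY_REMOVED = "provisionally removed"
    REMOVED = "removed"


class AttachmentStore:
    """Hypothetical in-memory stand-in for the external attachment store."""

    def __init__(self):
        self.items = {}  # key -> (State, data)

    def put_provisional(self, key, data):
        self.items[key] = (State.PROVISIONAL, data)

    def mark(self, key, expected, new):
        # Move an item from one state to the next, refusing unexpected jumps.
        state, data = self.items[key]
        if state is not expected:
            raise ValueError(f"{key}: expected {expected}, found {state}")
        self.items[key] = (new, data)

    def state(self, key):
        return self.items[key][0]

    def compactable(self):
        # Fully removed items can be dropped without asking couchdb.
        return [k for k, (s, _) in self.items.items() if s is State.REMOVED]

    def unresolved(self):
        # In-between states mean the second step never ran; these are the
        # only items that would need cross-referencing with couchdb.
        pending = (State.PROVISIONAL, State.PROVISIONALLY_REMOVED)
        return [k for k, (s, _) in self.items.items() if s in pending]


def add_attachment(store, write_and_commit, key, data):
    store.put_provisional(key, data)
    write_and_commit()  # stand-in for the couchdb commit; may raise
    store.mark(key, State.PROVISIONAL, State.FINAL)


def remove_attachment(store, write_and_commit, key):
    store.mark(key, State.FINAL, State.PROVISIONALLY_REMOVED)
    write_and_commit()
    store.mark(key, State.PROVISIONALLY_REMOVED, State.REMOVED)
```

If write_and_commit fails, the item simply stays in its provisional
state, which is what lets a later compaction pass tell settled items
from ones that need a second look.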

I'm going to make a simplifying assumption of a haystack layout with a
single file but this should extend to having multiple files easily.

All we should need to do to keep the files in sync is ensure that
anything referenced by the main database has been synced to disk
before we sync the main db. So, if the database keeps a bit of state
(say, the last byte in the haystack file that is referenced, assuming
append-only attachments) and haystack keeps the last byte of the file
that it ran an fsync on, then we can sync efficiently (i.e., only sync
haystack when necessary) each time we sync the main db. For multiple
files, it's just an integer per file.
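
A minimal sketch of that ordering rule for the single-file case. The
Haystack and MainDb classes and all their attribute names are
hypothetical, the main db header write is left as a comment, and the
sketch assumes whatever is already in the file at open time is durable.

```python
import os


class Haystack:
    """Append-only attachment file that remembers how far it was fsynced."""

    def __init__(self, path):
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        self.fd = os.open(path, flags, 0o644)
        self.end = os.fstat(self.fd).st_size  # bytes written so far
        self.synced_to = self.end  # assumption: existing bytes are durable
        self.fsync_count = 0  # only here to make the behaviour observable

    def append(self, data):
        view = memoryview(data)
        while len(view) > 0:  # os.write may write fewer bytes than asked
            written = os.write(self.fd, view)
            view = view[written:]
        self.end += len(data)
        return self.end  # offset just past the new attachment

    def sync_up_to(self, offset):
        # Skip the fsync when everything up to `offset` is already on disk.
        if offset > self.synced_to:
            os.fsync(self.fd)
            self.synced_to = self.end
            self.fsync_count += 1


class MainDb:
    def __init__(self, haystack):
        self.haystack = haystack
        self.last_referenced = 0  # last haystack byte any document points at

    def add_attachment(self, data):
        self.last_referenced = self.haystack.append(data)

    def commit(self):
        # Attachments first, then the main db, never the other way round.
        self.haystack.sync_up_to(self.last_referenced)
        # ... write and fsync the main db header here ...
```

Committing twice without adding an attachment costs no extra haystack
fsync, which is the "only sync when necessary" part. With multiple
haystack files, last_referenced and synced_to would each become one
integer per file.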

The part that I like about this is that it's almost identical to what
we're doing now, except our append-only strategy makes it slightly more
impossible to be syncing a db header that indirectly references beyond
the end of the file where we're writing the header.

When we start talking about pre-sync/post-sync I start thinking
"failure permutation".

> B.
> On Wed, Jan 26, 2011 at 2:35 PM, Benoit Chesneau <> wrote:
>> On Wed, Jan 26, 2011 at 2:20 PM, Robert Newson <> wrote:
>>> All,
>>> Most of you know that I'm currently working on 'external attachments'.
>>> I've spent quite some time reading and modifying the current code and
>>> have tried several approaches to the problem. I've implemented one
>>> version fairly completely
>>> ( which
>>> places any attachment over a threshold (defaulting to 256 kb) into a
>>> separate file (and all files that are sent chunked). This branch works
>>> for PUT/GET/DELETE, local and remote replication and compaction.
>>> External attachments do not support compression or ranges yet.
>>> At this point, I'd like to get some feedback. I don't believe
>>> file-per-attachment is a solution that works for everyone but it was
>>> necessary to make a choice in order to understand how to integrate any
>>> kind of external attachment into couchdb.
>>> So, here's my real proposal for CouchDB 1.2 (or 2.0?);
>>> Attachments are stored contiguously in compound files following a
>>> simplified form of Haystack
>>> ( I won't
>>> describe Haystack in detail as the article covers it, and it's not
>>> exactly what we need (the indexes, for example, are pointless, given
>>> we have a database). The basic idea is we have a small number of files
>>> that we append to, the limit of concurrency being the number of files
>>> (i.e., we will not interleave attachments in these files).
>>> There are several consequences to this;
>>> Pro
>>> 1) we can remove the 4k blocking in .couch files.
>>> 2) .couch files are smaller, improving all i/o operations (especially
>>> compaction).
>>> 3) we can use more efficient primitives (like sendfile) to fetch attachments.
>>> Con
>>> 1) haystack files need compaction (though this involves no seeking so
>>> should be far better than .couch compaction)
>>> 2) more file descriptors
>>> 3) .couch files are no longer self-contained (complicating backup
>>> schemes, migration)
>>> I had originally planned for each database to have exclusive access to
>>> N haystack files (N is configurable, of course) since this aids with
>>> backups. However, another compelling option is to have N haystack
>>> files for all databases. This reduces the number of file descriptors
>>> needed, but complicates backup (we'd probably have to write a tool to
>>> extract matching attachments).
>> I would go for one file / db, so we could remove attachments at the
>> same time we delete a db.
>> The con of that is that we can't share attachments between dbs if
>> their signatures are the same. Another way would be to maintain an
>> index of attachments / dbs so we could remove them if they don't
>> appear in any other db after one has been removed.
>>> I've rushed through that rather breezily, I apologize. I've been
>>> thinking about this for quite some time so I likely have answers to
>>> most questions on this.
>>> B.
>> That's a good idea anyway. Also, did you have a look at luwak from basho?
>> I know the implementation is different but I like the idea of
>> reusing the db to store attachments / chunks. So we could imagine
>> dispatching chunks as we do for docs in cluster solutions. We could
>> also imagine handling metadata.
>> - benoit
