couchdb-user mailing list archives

From Jean-Pierre Fiset ...@fiset.ca>
Subject Re: Contribution: CouchDb dump and reload
Date Mon, 07 Oct 2013 15:51:28 GMT
The tool we have developed dumps using a strategy similar to _all_docs?include_docs=true. On
disk, each document is broken up by top-level attribute, with each attribute saved in its own
file, to make it easy for humans to understand and edit. The format is similar to that used by
the Python tool "couchapp". Furthermore, attachments are saved in their own files, along with
some supporting information such as the content type.
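
A minimal sketch of such an on-disk layout, in Python. The file names used here (one
`<attribute>.json` per top-level attribute, an `_attachments` directory, and an `_info.json`
file holding content types) are assumptions made for illustration, not the tool's actual
format. It also assumes attachment bodies are already available as base64 in a `data` field.

```python
import base64
import json
from pathlib import Path


def dump_doc(doc, root):
    """Write one document to its own directory, one file per top-level attribute.

    Hypothetical layout (not the tool's real format):
      <root>/<_id>/<attribute>.json
      <root>/<_id>/_attachments/<name>
      <root>/<_id>/_attachments/_info.json
    """
    doc_dir = Path(root) / doc["_id"]
    doc_dir.mkdir(parents=True, exist_ok=True)

    for key, value in doc.items():
        if key == "_attachments":
            continue  # attachments are handled separately below
        # Pretty-printed, sorted JSON is easier for a human to read and diff.
        text = json.dumps(value, indent=2, sort_keys=True)
        (doc_dir / f"{key}.json").write_text(text)

    attachments = doc.get("_attachments", {})
    if attachments:
        att_dir = doc_dir / "_attachments"
        att_dir.mkdir(exist_ok=True)
        info = {}
        for name, att in attachments.items():
            # Assumes the body is inline as base64; names containing "/"
            # are not handled in this sketch.
            (att_dir / name).write_bytes(base64.b64decode(att["data"]))
            info[name] = {"content_type": att["content_type"]}
        (att_dir / "_info.json").write_text(json.dumps(info, indent=2, sort_keys=True))

    return doc_dir
```

Keeping the content type in a side file means the attachment itself stays a plain file
that can be opened and edited with ordinary tools.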

The restore process is more elaborate. Here are some features:

- When restoring, the restore tool ensures that a document modified since the last dump is not
replaced by the document on disk. This ensures that the documents in the database remain the
authoritative ones. An operator can force the restoration, but by default the tool attempts to
protect the database.

- It is possible to restore only specific documents. This is currently done by specifying the
identifiers of the documents to be restored.

- When uploading a document, all the changes are applied at once, increasing the document
revision by only one.

- Disk documents that are equivalent to the ones found in the database are detected and the
upload is skipped.
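
One way to get the "single revision" behaviour described in the list above is to send the
attachments inline with the document body, so that one PUT carries everything. The sketch
below only builds the request body. The helper name `build_upload_body` is made up for this
example, and whether the tool works this way is an assumption. The `_attachments` entries
with `content_type` and base64 `data` follow CouchDB's inline attachment form.

```python
import base64
import json


def build_upload_body(fields, attachments, current_rev=None):
    """Build one JSON body holding all fields and attachments of a document.

    fields:       dict of top-level attributes read from disk
    attachments:  dict mapping name -> (content_type, raw_bytes)
    current_rev:  revision currently in the database, or None for a new document
    """
    body = dict(fields)
    # The revision stored on disk may be stale; always use the database's one.
    body.pop("_rev", None)
    if current_rev is not None:
        body["_rev"] = current_rev
    if attachments:
        body["_attachments"] = {
            name: {
                "content_type": content_type,
                "data": base64.b64encode(data).decode("ascii"),
            }
            for name, (content_type, data) in attachments.items()
        }
    return json.dumps(body)
```

Because fields and attachments travel in the same request, the database records one new
revision instead of one per attachment.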

The only "proprietary" facets of this process are a digest installed on the database document
to keep track of the document's content, the name of the attribute used to save the digest, and
the method by which the digest is computed. Every time a document is uploaded using the restore
tool, a digest of the document is calculated and added to the document. This allows the restore
tool to find out whether a database document has been manually modified since the last time it
was uploaded. Since it would be difficult for someone to compute the digest by hand, or to craft
a new version of the document that collides with the current digest, this ensures that the
restore tool does not inadvertently overwrite changes made to the database by a human.
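
A minimal sketch of such a digest check, in Python. The message does not specify the
algorithm, so everything here is an assumption: SHA-256 over a canonical JSON serialization,
a hypothetical attribute name `restore_digest`, and the exclusion of `_rev`, `_attachments`
and the digest attribute itself from the hashed content (attachments are left out for
brevity).

```python
import hashlib
import json

DIGEST_KEY = "restore_digest"  # hypothetical attribute name
EXCLUDED = ("_rev", "_attachments", DIGEST_KEY)


def compute_digest(doc):
    """Hash the document content in a canonical, key-order-independent form."""
    content = {k: v for k, v in doc.items() if k not in EXCLUDED}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stamp(doc):
    """Return a copy of the document with its digest added, ready for upload."""
    stamped = dict(doc)
    stamped[DIGEST_KEY] = compute_digest(doc)
    return stamped


def modified_since_upload(db_doc):
    """True if the stored digest is missing or no longer matches the content."""
    stored = db_doc.get(DIGEST_KEY)
    return stored is None or stored != compute_digest(db_doc)


def restore_action(disk_doc, db_doc, force=False):
    """Decide what a restore should do with one document."""
    if db_doc is None:
        return "upload"  # not in the database yet
    if compute_digest(disk_doc) == compute_digest(db_doc):
        return "skip"  # equivalent content, nothing to do
    if modified_since_upload(db_doc) and not force:
        return "protect"  # edited in the database, keep that version
    return "upload"
```

In this sketch a document that was never uploaded by the tool carries no digest and is
therefore treated as modified, which errs on the side of protecting the database.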

The method to compute the digest is straightforward and could be standardized to allow various
tools to interact with a single database.

As for standardizing how documents should be stored on disk, it would probably be a
worthwhile endeavour.

JP

On 2013-10-03 20:28, Filippo Fadda wrote:
> Is it basically an all_docs + include_docs?
> 
> -Filippo
> 
> On Oct 3, 2013, at 11:51 PM, Alexander Shorin wrote:
> 
>> On Fri, Oct 4, 2013 at 1:28 AM, Jens Alfke <jens@couchbase.com> wrote:
>>> On Oct 3, 2013, at 11:43 AM, Vivek Pathak <vpathak@orgmeta.com> wrote:
>>>
>>>> Just fyi,  there is couchdb-dump available in
>>>> http://code.google.com/p/couchdb-python/
>>>
>>> Looks like these two tools use entirely different data formats. Has anyone thought of defining a common format for database dumps?
>>
>> I think it will be hard to define such.
>>
>> Dumping CouchDB data as JSON looks intuitive and requires fewer
>> additional actions for import/export. The couchdb-python approach
>> with a multipart format has a smaller footprint, but requires
>> trickier processing (boundaries, headers). Both solutions may use the
>> CouchDB API without any additional data conversion.  And both require
>> a lot of disk space, much more than if you just copy the database file or
>> make a replica of it, unless you xz-zip the output.
>>
>> --
>> ,,,^..^,,,
> 

