Date: Mon, 07 Oct 2013 11:51:28 -0400
From: Jean-Pierre Fiset
To: user@couchdb.apache.org
Subject: Re: Contribution: CouchDb dump and reload

The tool we have developed dumps documents using a strategy similar to _all_docs?include_docs=true. The on-disk format splits each document's top-level attributes into separate files, which makes it easy for humans to understand and edit. The format is similar to that used by the python tool "couchapp". Furthermore, the attachments are saved in their own files, along with supporting information such as the content type.

The restore process is more elaborate. Here are some features:

- When restoring, the restore tool ensures that a document modified since the last dump is not replaced by the document on disk. This ensures that the documents located in the database remain the authoritative ones. An operator can force the restoration, but by default the tool attempts to protect the database.
- It is possible to restore only specific documents. This is currently done by specifying the identifiers of the documents to be restored.
- When uploading a document, all the changes are applied at once, increasing the document revision by only one.
- Disk documents that are equivalent to the ones found in the database are detected and the upload is skipped.

The only "proprietary" facets of this process are a digest installed on the database document to keep track of the document's content, the name of the attribute used to save the digest, and the method by which the digest is computed.
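As a rough illustration of the restore rules listed above, here is a minimal Python sketch. It is not the tool's actual code: the attribute name "tool_digest", the use of SHA-256 over canonical JSON, and the function names are all assumptions made for the example, and attachments are ignored for simplicity.

```python
import hashlib
import json

# Hypothetical attribute name; the real tool's attribute is not given here.
DIGEST_KEY = "tool_digest"


def compute_digest(doc):
    """Digest of the document content, ignoring _rev and the digest itself.

    Assumption: SHA-256 over JSON serialized with sorted keys.
    """
    content = {k: v for k, v in doc.items() if k not in ("_rev", DIGEST_KEY)}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_restore(disk_doc, db_doc, force=False):
    """Decide what to do with one disk document.

    Returns "create", "skip", "protect" or "update".
    """
    if db_doc is None:
        return "create"   # nothing in the database yet
    if compute_digest(db_doc) == compute_digest(disk_doc):
        return "skip"     # disk and database content are equivalent
    if db_doc.get(DIGEST_KEY) != compute_digest(db_doc) and not force:
        return "protect"  # database copy was changed outside the tool
    return "update"       # safe (or forced) to replace the database copy


def prepare_upload(disk_doc, db_doc=None):
    """Build the single document body to upload, with a fresh digest."""
    doc = {k: v for k, v in disk_doc.items() if k != DIGEST_KEY}
    if db_doc is not None:
        doc["_rev"] = db_doc["_rev"]  # one upload, so one new revision
    doc[DIGEST_KEY] = compute_digest(doc)
    return doc


disk = {"_id": "a", "title": "hello"}
in_db = dict(prepare_upload(disk), _rev="1-abc")
edited = dict(in_db, title="changed by hand", _rev="2-def")
print(plan_restore(disk, in_db))                # skip
print(plan_restore(disk, edited))               # protect
print(plan_restore(disk, edited, force=True))   # update
```

Because the stored digest is compared with a digest recomputed from the database copy, any edit made without going through the tool shows up as a mismatch, and the document is left alone unless the operator forces the restore.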
Every time a document is uploaded using the restore tool, a digest of the document is calculated and added to the document. This allows the restore tool to find out whether a database document has been manually modified since the last time it was uploaded. Since it would be difficult for someone to compute the digest, or to craft a new version of a document that collides with the current digest, this ensures that the restore tool does not inadvertently overwrite changes made to the database by a human.

The method to compute the digest is straightforward and could be standardized to allow various tools to interact with a single database.

As for standardizing how documents should be stored on disk, it would probably be a worthwhile endeavour.

JP

On 2013-10-03 20:28, Filippo Fadda wrote:
> Is it basically an all_docs + include_docs?
>
> -Filippo
>
> On Oct 3, 2013, at 11:51 PM, Alexander Shorin wrote:
>
>> On Fri, Oct 4, 2013 at 1:28 AM, Jens Alfke wrote:
>>> On Oct 3, 2013, at 11:43 AM, Vivek Pathak wrote:
>>>
>>>> Just fyi, there is couchdb-dump available in
>>>> http://code.google.com/p/couchdb-python/
>>>
>>> Looks like these two tools use entirely different data formats. Has anyone thought of defining a common format for database dumps?
>>
>> I think it will be hard to define such.
>>
>> Dumping CouchDB data as JSON looks intuitive and requires less
>> additional actions for import/export. Having couchdb-python approach
>> with multipart format provides lesser footprint, but requires a more
>> tricky processing (boundaries, headers). Both solutions may use
>> CouchDB API without any additional data conversion. And both requires
>> a lot of disk space, much more than if you just copy database file or
>> make a replica of it, unless you xz-zip the output.
>>
>> --
>> ,,,^..^,,,
>