couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Inability to back up and restore properly
Date Thu, 09 Apr 2009 03:25:32 GMT
On Wed, Apr 8, 2009 at 10:45 PM, Jeff Hinrichs - DM&T
<dundeemt@gmail.com> wrote:
> On Tue, Apr 7, 2009 at 10:59 AM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>>
>> On Mon, Apr 6, 2009 at 11:04 PM, Jeff Hinrichs - DM&T
>> <jeffh@dundeemt.com> wrote:
>> > What is the proper way to back up and restore a CouchDB database?  I
>> > mean a real, proper dump/load cycle.
>> >
>> > CouchDB doesn't provide a way to do a proper dump/reload cycle, which
>> > leaves us to try to write our own.  However, if you dump a document like
>> >
>> > {'_id':'foo','_rev':'2-xyz',...}
>> >
>> > there is no way that I can find to load an empty database and
>> > recreate that same record.  If you PUT
>> > {'_id':'foo','_rev':'2-xyz',...}, you get
>> > {'_id':'foo','_rev':'3-mno',...}, which is not the same as
>> > {'_id':'foo','_rev':'2-xyz',...}.
>> >
>>
>> You'll need to look into the _bulk_docs API, as it's used by
>> replication. Admittedly this isn't properly documented just yet, but
>> that is exactly the endpoint that would allow for this.
>>
>> Glancing at the code, it looks like accomplishing what you want would
>> be something like:
>>
>> POST /db_name/_bulk_docs
>>
>> {
>>    "new_edits": false,
>>    "docs": [
>>        {"_id": "foo", "_rev": "2-xyz", ...}
>>    ]
>> }
>>
>> > In some use cases it is necessary to be able to restore data to the
>> > way it was at a point in time: sometimes for logical reasons, sometimes
>> > for error recovery and debugging, and sometimes for legal reasons.
>> > Seemingly, the only way to do that is to bring up another CouchDB
>> > instance and replicate to it.  However, that is a bit problematic for
>> > normal long-term storage methodologies.
>> >
>> > What is the API I should be using?  If no such API exists, is it an
>> > oversight or just a matter of resources?  There should be a way to load
>> > data into CouchDB and have it just accept it, keeping the _rev
>> > information that is passed.  I am not proposing to change the existing
>> > mode of operation, but to add a new one.  Even better would be for
>> > CouchDB to offer a /database/_dump that streams out documents and a
>> > POST /database/_load that accepts the file produced by /database/_dump.
>> >
>> > So, given some CouchDB database foo in state 'A': you dump it, create
>> > database bar, and load the dump from foo.  When the process is
>> > finished, a replication from foo in state 'A' to bar results in
>> >
>> > {"start_time":"Tue, 07 Apr 2009 03:02:16 GMT",
>> >  "end_time":"Tue, 07 Apr 2009 03:02:16 GMT",
>> >  "start_last_seq":0,"end_last_seq":100,"missing_checked":100,
>> >  "missing_found":0,"docs_read":0,"docs_written":0,
>> >  "doc_write_failures":0}
>> >
>> >
>> > Regards,
>> >
>> > Jeff Hinrichs
>> >
>>
>> You should also check out couchdb-python's dump-load scripts and Chris
>> Anderson's Ruby script on the breaking changes 0.9 page for other
>> examples of scripts to dump/reload a database.
>
> I am aware of the python scripts (I wrote and submitted my own to
> overcome problems with dumping large databases); however, they are not
> true dump/load scripts in the sense that the database created by
> loading a dump is not the same as the original, because _rev is
> modified when PUTting the data.  What I am really looking for is best
> imagined if you do a thought experiment and assume you can replicate
> to a file, and then replicate from that file to an empty CouchDB
> database.  In the end you would have two databases that were the same,
> and replication between them would occur with no unexpected results.
>

I *think* that using the _bulk_docs new_edits=false flag would not
alter the revisions posted, but I'm not overly familiar with the
replication API. Having an exact copy of the DB in some clear-text
format, as in your thought experiment, is a very desirable trait that
we should support. If it's really not currently possible, then we
should definitely figure out why not and how to overcome that if
possible.
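
Off the top of my head, the whole round trip might look something like
this in Python. This is an untested sketch: it assumes a local CouchDB
on the default port 5984 and the third-party requests library, the
database names foo and bar are just examples, and note that _all_docs
with include_docs=true only returns the winning revision of each
document, without attachment bodies.

import json

import requests  # assumed third-party HTTP client

COUCH = "http://localhost:5984"  # assumed: local CouchDB, default port

def dump(db, path):
    # Fetch every current document, _rev included. This captures only
    # the winning revision of each doc and attachment stubs, not
    # attachment bodies.
    resp = requests.get("%s/%s/_all_docs" % (COUCH, db),
                        params={"include_docs": "true"})
    resp.raise_for_status()
    docs = [row["doc"] for row in resp.json()["rows"]]
    with open(path, "w") as f:
        json.dump(docs, f)

def load(db, path):
    # Re-insert the documents with their original _revs intact by
    # telling _bulk_docs not to generate new edits.
    with open(path) as f:
        docs = json.load(f)
    resp = requests.post("%s/%s/_bulk_docs" % (COUCH, db),
                         data=json.dumps({"new_edits": False, "docs": docs}),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()

dump("foo", "foo.json")  # dump database foo to a file
load("bar", "foo.json")  # bar must already exist (PUT /bar) and be empty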

> Currently, if you dump from a database A whose records have
> attachments, load that dump file into a new empty database B, and then
> replicate from A->B, things don't work right.  It looks like it
> replicated, but if you check the file size, you will note that it has
> grown by twice the size of the attachments.  No amount of compaction
> will reduce the database size, which is now
> SizeOfDocs + 2*SizeOfAttachments.
>

This is the first time I've heard of such behavior and it definitely
sounds like a bug. If there's not a ticket already, you should file one
so that we can keep track of it as we work toward 0.10.
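
If you do file that ticket, hard numbers would help. A quick sketch for
grabbing them, under the same assumptions as above (local CouchDB,
requests library, example database names): GET /db returns the database
info document, which includes the on-disk file size.

import requests  # assumed third-party HTTP client

COUCH = "http://localhost:5984"  # assumed: local CouchDB, default port

def disk_size(db):
    # The database info document includes disk_size, the size of the
    # database file in bytes.
    info = requests.get("%s/%s" % (COUCH, db)).json()
    return info["disk_size"]

# Compare the two databases before and after compaction.
print("A:", disk_size("foo"), "B:", disk_size("bar"))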

> I will take a look at _bulk_docs and see if it helps, but I don't have
> high hopes, because by spec, when you insert or update a record, the
> _rev is automatically incremented and returned.  This pretty much
> guarantees that A will never equal B and that replication from A->B is
> broken, since the _revs in B look different than the _revs in A.
>
> I'll post back if I get unexpected positive results from using the
> _bulk_docs API.
>

I'm pretty sure the point of the new_edits flag is to make sure that
you end up with the same current revision and revision history on both
nodes (assuming you replicate A -> B and B -> A). If every save updated
the _rev, then replication would be unable to provide eventual
consistency.
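
An easy way to check would be to compare the stored revision histories
directly. Another sketch under the same assumptions, with a placeholder
document id: fetching a document with ?revs=true adds a _revisions
field listing the revision ids CouchDB has stored for it.

import requests  # assumed third-party HTTP client

COUCH = "http://localhost:5984"  # assumed: local CouchDB, default port

def rev_history(db, doc_id):
    # ?revs=true returns the document plus its stored revision history.
    resp = requests.get("%s/%s/%s" % (COUCH, db, doc_id),
                        params={"revs": "true"})
    resp.raise_for_status()
    return resp.json()["_revisions"]

# After a load with new_edits=false, both databases should report the
# same history for any given document.
print(rev_history("foo", "some_doc") == rev_history("bar", "some_doc"))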

>> HTH,
>> Paul Davis
>
> Regards,
>
> Jeff Hinrichs
>

HTH,
Paul Davis
