couchdb-user mailing list archives

From "Jeff Hinrichs - DM&T" <dunde...@gmail.com>
Subject Re: Inability to back up and restore properly
Date Thu, 09 Apr 2009 02:45:26 GMT
On Tue, Apr 7, 2009 at 10:59 AM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>
> On Mon, Apr 6, 2009 at 11:04 PM, Jeff Hinrichs - DM&T
> <jeffh@dundeemt.com> wrote:
> > What is the proper way to backup and restore a couchdb?  I mean a real
> > proper dump/load cycle.
> >
> > couchdb doesn't provide a way to do a proper dump/reload cycle, which leaves
> > us to try and write our own.  However, if you dump a document like
> >
> > {"_id": "foo", "_rev": "2-xyz", ...}
> >
> > There is not a single way that I can find to load an empty database and
> > recreate that same record.  If you put the
> > {"_id": "foo", "_rev": "2-xyz", ...}, you get
> > {"_id": "foo", "_rev": "3-mno", ...}, which is not the same as
> > {"_id": "foo", "_rev": "2-xyz", ...}.
> >
>
> You'll need to look into the bulk_docs api as it's used by
> replication. Admittedly this isn't properly documented just yet but
> that is exactly the endpoint that would allow for this.
>
> Glancing at the code, it looks like to accomplish what you want it'd
> be something like:
>
> POST /db_name/_bulk_docs
>
> {
>    "new_edits": false,
>    "docs": [
>        {"_id": "foo", "_rev": "2-xyz", ...}
>    ]
> }
>
> > In some use cases it is necessary to be able to restore data to the way it
> > was at a point in time.  Sometimes for logic reasons, sometimes for error
> > recovery and debugging, and sometimes for legal reasons.  Seemingly the only
> > way possible to do that is to bring up another couchdb instance and
> > replicate to it.  However, that is a bit problematic for normal long term
> > storage methodologies.
> >
> > What is the API I should be using?  If no such API exists, is it an
> > oversight or just a matter of resources?  There should be a way to load data
> > into couch and have couchdb just accept it, keeping the _rev information
> > that is passed.  I am not proposing to change the mode of operation, but to
> > create a new one.  Even better would be to have couchdb do a /database/_dump
> > that streams out documents, and a POST /database/_load that accepts the file
> > produced by /database/_dump.
> >
> > So, given some couchdb database foo in state 'A': you dump it, then
> > create database bar and load the dump from foo.  When the process is
> > finished, a replication from foo in state 'A' to bar results in
> > {"start_time":"Tue, 07 Apr 2009 03:02:16 GMT","end_time":"Tue, 07 Apr 2009
> > 03:02:16
> > GMT","start_last_seq":0,"end_last_seq":100,"missing_checked":100,"missing_found":0,"docs_read":0,"docs_written":0,"doc_write_failures":0}
> >
> >
> > Regards,
> >
> > Jeff Hinrichs
> >
>
> You should also check out couchdb-python's dump-load scripts and Chris
> Anderson's Ruby script on the breaking changes 0.9 page for other
> examples of scripts to dump/reload a database.
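
The _bulk_docs call Paul describes could be sketched as a small helper that
builds the POST request with new_edits set to false (the server URL and
database name below are placeholders for a local test setup, not anything
from this thread):

```python
import json
from urllib import request

def bulk_docs_request(base_url, db, docs):
    """Build a _bulk_docs POST whose new_edits=false flag asks CouchDB to
    store each document under the _rev it already carries, instead of
    generating a fresh revision."""
    body = json.dumps({"new_edits": False, "docs": docs}).encode()
    return request.Request(
        f"{base_url}/{db}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Illustrative: restore one dumped document with its original revision.
req = bulk_docs_request("http://localhost:5984", "bar",
                        [{"_id": "foo", "_rev": "2-xyz"}])
# request.urlopen(req) would perform the restore against a running server.
```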

I am aware of the python scripts (I wrote and submitted my own to
overcome problems with dumping large databases); however, they are not
true dump/load scripts in the sense that the database created by
loading a dump is not the same as the original, because _rev is
modified when PUTting the documents.  What I am really looking for is
best imagined with a thought experiment: assume you could replicate to
a file, and then replicate from that file to an empty couchdb
database.  In the end you would have two databases that are the same,
and replication between them would occur with no unexpected results.
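
That replicate-to-a-file idea could be sketched as a dump/load pair, assuming
a one-JSON-document-per-line dump format of my own invention and relying on
the new_edits=false behavior of _bulk_docs to keep the original revisions:

```python
import json

def dump_to_file(docs, path):
    """Write each document, _rev included, as one JSON object per line."""
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")

def load_from_file(path):
    """Read a dump back into a _bulk_docs body that keeps each original
    _rev; new_edits=false tells CouchDB not to mint new revisions."""
    with open(path) as f:
        docs = [json.loads(line) for line in f]
    return {"new_edits": False, "docs": docs}

# Hypothetical dump of database foo, then a reload payload for bar:
docs = [{"_id": "foo", "_rev": "2-xyz"}, {"_id": "baz", "_rev": "1-abc"}]
dump_to_file(docs, "dump.jsonl")
payload = load_from_file("dump.jsonl")
# POST json.dumps(payload) to http://localhost:5984/bar/_bulk_docs
```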

Currently, if you dump from a database A whose documents have
attachments, load that dump file into a new empty database B, and then
replicate from A->B, things don't work right.  It looks like the
replication succeeded, but if you check the file size, you will note
that it has grown to twice the size of the attachments.  No amount of
compaction will reduce the database size, which is now
SizeOfDocs + 2 x SizeOfAttachments.

I will take a look at bulk docs and see if it helps, but I don't have
high hopes, because by spec, when you insert or update a document, the
_rev is automatically incremented and returned.  This pretty much
guarantees that A will never equal B and that replication from A->B is
broken, since the _revs in B look different than the _revs in A.

I'll post back if I get unexpectedly positive results from using the
_bulk_docs API.

> HTH,
> Paul Davis

Regards,

Jeff Hinrichs
