couchdb-user mailing list archives

From "Ronny Hanssen" <super.ro...@gmail.com>
Subject Re: Bulk Load
Date Thu, 18 Sep 2008 15:13:29 GMT
Ok, I get it... I understand _bulk_docs is atomic, but I missed that you
actually preserve the *original* doc._id (doh). I thought that by "clone" you
meant a new doc in CouchDB with its own id, and I just couldn't understand
why you did that :). This now makes more sense to me. Sorry.
> As to replication, what you'd need is a flag that says if a particular
> node is the head node. Then your history docs should never clash. If
> you get conflicts on the head node you resolve them and store all
> conflicting previous revisions. In this manner your linked list
> becomes a linked directed acyclic graph. (Yay college) This does mean
> that at any given point in the history you could possibly have
> multiple versions of the same doc, but replication works.

Ok, but how is that flag supposed to be set? At the time of inserting with
_bulk_docs the system needs to update the current head, which means that any
nodes racing during an update will each flag their own doc as the current one.
Doesn't that mean replication will conflict under race conditions?

I am only asking because the single-node case could be handled by CouchDB's
internal revision control, so the elaborate scheme you propose isn't really
needed for that scenario. My impression was that we cannot use the internal
revision control due to the difficulty of handling conflicts when multiple
nodes are involved (because conflicts could/would occur), and that this would
be better handled by hand-coded revision control.

It seems to me that there is no solution for doing this by hand-coding
either. So, in effect, we are saying "don't use the built-in rev-control for
rev-control of data" just to avoid people blaming CouchDB when the built-in
"revision control" conflicts.
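[Archive editor's note: Paul's clone-and-link scheme below can be sketched as
a small helper that builds the atomic _bulk_docs payload. This is only a
sketch under assumptions: the field names (previous_version, is_head) and the
helper name are hypothetical illustrations, not anything CouchDB itself
defines.]

```python
import uuid

def build_versioned_update(head_doc, changes):
    """Build an atomic _bulk_docs payload for the linked-list scheme.

    head_doc: the current head document as fetched from CouchDB
    changes:  dict of fields to update on the head

    Returns a {"docs": [...]} payload: a history clone of the old head
    (under a fresh _id, pointing back via previous_version) plus the
    updated head, which keeps its original _id and _rev so a racing
    edit makes the whole bulk request fail.
    """
    # Clone the old head into a history doc with its own fresh _id.
    history = dict(head_doc)
    history["_id"] = uuid.uuid4().hex
    history.pop("_rev", None)      # the history doc is brand new
    history["is_head"] = False     # hypothetical flag for views

    # Update the head in place; it keeps its _id and _rev, so CouchDB
    # rejects the whole bulk request if someone else edited it first.
    new_head = dict(head_doc)
    new_head.update(changes)
    new_head["previous_version"] = history["_id"]
    new_head["is_head"] = True

    return {"docs": [history, new_head]}
```

The payload would then be POSTed to /db/_bulk_docs in a single request; per
the atomicity discussed in this thread, on a revision conflict neither doc is
written, and on success both are.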

Thanks for your patience guys.

~Ronny

2008/9/18 Paul Davis <paul.joseph.davis@gmail.com>

> Ronny,
>
> There are two points that I think you're missing.
>
> 1. _bulk_docs is atomic. As in, if one doc fails, they all fail.
> 2. I was trying to make sure that the latest _id of a doc is constant.
>
> Think of this as a linked list. You grab the head document (most
> current revision) and clone it. Then we change the uuid of the second
> doc and fix up its pointer links so it fits into the list. Then, after
> making the necessary changes, we edit the head node as desired. Now we
> post *both* docs (in the same HTTP request!) to _bulk_docs. This
> ensures that if someone else edited this particular doc, the revisions
> will differ and the second edit will fail. Thus, on success 2
> docs are inserted; on failure, 0 docs.
>
> As to replication, what you'd need is a flag that says if a particular
> node is the head node. Then your history docs should never clash. If
> you get conflicts on the head node you resolve them and store all
> conflicting previous revisions. In this manner your linked list
> becomes a linked directed acyclic graph. (Yay college) This does mean
> that at any given point in the history you could possibly have
> multiple versions of the same doc, but replication works.
>
> For views, you'd just want a flag that says "not the most recent
> version," so in your view you would know whether to emit key/value
> pairs for it. This could be something like "no next-version pointer"
> or some such. Actually, it couldn't be a next pointer, because then
> every update would need two initial gets: one for the head node and
> one for the next node. A boolean flag indicating head-node status
> would be sufficient, though. And then you could have a history view
> if you ever need to walk from tail to head.
>
> HTH,
> Paul
>
>
> On Wed, Sep 17, 2008 at 9:35 PM, Ronny Hanssen <super.ronny@gmail.com>
> wrote:
> > Hm.
> >
> > In Paul's case I am not 100% sure what is going on. Here's a use case for
> > two concurrent edits:
> >  * First, two users get the original.
> >  * Both make a copy, which they save.
> > This means that there are two fresh docs in CouchDB (even on a single
> > node).
> >  * Both save the original under a new doc._id (which each copy persists
> > in copy.previous_version).
> > This means that the two new docs know where to find their previous
> > versions. The problem I have with this scheme is that every change of a
> > document means storing not only the new version, but also its old
> > version (in addition to the original). The fact that two racing updates
> > will generate 4(!) new docs in addition to the original document is
> > worrying. I guess Paul also wants the original to be marked as deleted
> > in the _bulk_docs? But in any case, the previous version is now two new
> > docs, and they look exactly the same, except for the doc._id, naturally...
> >
> > Wouldn't this be enough, Paul?
> > 1. old = get_doc()
> > 2. update = clone(old);
> > 3. update.previous_version = old._id;
> > 4. post via _bulk_docs
> >
> > This way there won't be multiple old docs around.
> >
> > Jan's way ensures that for a view there is always only one current
> > version of a doc, since it is using the built-in rev-control. Competing
> > updates on the same node may fail, which is what CouchDB is designed to
> > handle. If they happen on different nodes, the rev-control history might
> > come "out of synch" via concurrent updates. How does CouchDB handle
> > this? Which update wins? On a single node this is intercepted when
> > saving the doc. With multiple nodes, both might get a response saying
> > "save complete". So these then need merging. How is that done? Jan
> > further secures the previous version by storing it as a new doc,
> > allowing it to be persisted beyond compaction. I guess Jan's sample
> > would benefit nicely from _bulk_docs too. I like this method because it
> > allows only one current doc. But I worry about how revision control
> > handles conflicts, Jan?
> >
> > Paul's and my updated suggestion always posts new versions, not using
> > the revision system at all. The downside is that there may be multiple
> > current versions around... And this is a bit tricky, I believe... Anyone?
> >
> > Paul's suggestion also keeps multiple copies of the previous version. I
> > am not sure why, Paul?
> >
> >
> > Regards,
> > Ronny
> >
> > 2008/9/17 Paul Davis <paul.joseph.davis@gmail.com>
> >
> >> Good point, Chris.
> >>
> >> On Wed, Sep 17, 2008 at 11:39 AM, Chris Anderson <jchris@apache.org>
> >> wrote:
> >> > On Wed, Sep 17, 2008 at 11:34 AM, Paul Davis
> >> > <paul.joseph.davis@gmail.com> wrote:
> >> >> Alternatively something like the following might work:
> >> >>
> >> >> Keep an eye on the specifics of _bulk_docs though. There have been
> >> >> requests to make it non-atomic, but I think in the face of something
> >> >> like this we might make non-atomic _bulk_docs a non-default or some
> >> >> such.
> >> >
> >> > I think the need for non-transaction bulk-docs will be obviated when
> >> > we have the failure response say which docs caused failure, that way
> >> > one can retry once to save all the non-conflicting docs, and then loop
> >> > back through to handle the conflicts.
> >> >
> >> > upshot: I bet you can count on bulk docs being transactional.
> >> >
> >> >
> >> > --
> >> > Chris Anderson
> >> > http://jchris.mfdz.com
> >> >
> >>
> >
>

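[Archive editor's note: Paul's idea of a boolean head flag, so that views
skip history docs, would in CouchDB be a small JavaScript map function. The
Python below only mimics that filter to illustrate the idea; the is_head
field name is an assumption, not something defined in the thread.]

```python
def map_head_docs(doc):
    """Python analogue of a CouchDB map function: emit a key/value pair
    only for head documents, so history docs stay out of the view.
    The is_head flag is the hypothetical boolean Paul describes."""
    if doc.get("is_head"):
        yield doc["_id"], None

# A head doc, a history doc (skipped), and another head doc.
docs = [
    {"_id": "a", "is_head": True},
    {"_id": "b", "is_head": False},
    {"_id": "c", "is_head": True},
]
heads = [key for doc in docs for key, _ in map_head_docs(doc)]
```

A separate "history view" could emit previous_version pointers instead, for
the tail-to-head walk Paul mentions.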