couchdb-user mailing list archives

From "Paul Davis" <paul.joseph.da...@gmail.com>
Subject Re: Bulk Load
Date Thu, 18 Sep 2008 17:51:07 GMT
So two things here. First the update steps similar to before:

1. Get current version of document id X
2. Clone doc X making doc Y
3. Make doc Y a history doc:
    a. Y._id = new_uuid()
    b. Y.is_current_revision = false
    c. Delete Y._rev
    d. X.previous_version = Y._id
4. Edit doc X as desired
5. In a single HTTP request, send both documents to _bulk_docs
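
A minimal Python sketch of steps 2 through 5, for illustration only. The
function name build_versioned_update and its parameters are hypothetical;
the field names come from the steps above. It only builds the request
body, and it assumes the all-or-nothing _bulk_docs behaviour described in
this thread, so check how your CouchDB version behaves before relying on it.

```python
import copy
import uuid


def build_versioned_update(current, changes, new_id=None):
    """Build the body for one _bulk_docs POST: the edited head plus a history doc."""
    # Steps 2-3: clone the head (X) into a history doc (Y) with a fresh _id.
    history = copy.deepcopy(current)
    history["_id"] = new_id or uuid.uuid4().hex
    history["is_current_revision"] = False
    history.pop("_rev", None)  # Y is a brand-new doc, so it carries no _rev

    # Step 4: edit the head, keeping its original _id and _rev so that a
    # concurrent edit of X makes this update fail.
    head = copy.deepcopy(current)
    head.update(changes)
    head["previous_version"] = history["_id"]  # step 3d

    # Step 5: both docs go to _bulk_docs in a single request.
    return {"docs": [head, history]}
```

The returned dict would be sent as the JSON body of a single POST to
/yourdb/_bulk_docs (the database name is a placeholder).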

So, given a document A, we have a current history of A.previous = C._id.
Get A, clone A to get B, and edit B as per step 3.
Now we have A.previous = B and B.previous = C, or A -> B -> C.
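
To make the A -> B -> C chain concrete, here is a small Python sketch that
walks it from the head backwards. The name walk_history is hypothetical,
and docs is assumed to be a plain dict of already-fetched documents keyed
by _id. It assumes a linear history where previous_version holds a single
id, and it stops quietly if a history doc is missing.

```python
def walk_history(docs, head_id):
    """Yield document ids from the head back to the oldest version we have."""
    current_id = head_id
    seen = set()  # guards against an accidental cycle in the pointers
    while current_id is not None and current_id not in seen:
        doc = docs.get(current_id)
        if doc is None:  # history doc not available, stop here
            break
        seen.add(current_id)
        yield current_id
        current_id = doc.get("previous_version")
```

For example, list(walk_history(docs, "A")) gives the ids in order from
newest to oldest.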

To make this permanent, we post both A and B to _bulk_docs. If A was
edited simultaneously, our A will get rejected, as will B. So nothing
changed, and you'd resolve this as per any other normal conflict.

This will work in the face of replication. Just as with any other
replication, we may have to resolve conflicts, but our histories should
never conflict. What this system does introduce is this:

Given that someone did the A->B->C above, say someone else
simultaneously did A->D->C and we replicate.

B, C, and D will not conflict.

The two versions of A will. We resolve this as we would any other
conflict. Then we indicate that A now has *two* previous histories:
A -> (B or D) -> C
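
One possible way to record the two histories, sketched in Python. Storing
previous_version as a list of ids is an assumption made for illustration;
the scheme above only says that A ends up with two previous histories.
The name merge_previous is hypothetical, and picking the winning body of A
is left to whatever conflict resolution you already use.

```python
def merge_previous(winner, loser):
    """Return a copy of the winning head that points at both histories."""

    def parents(doc):
        # Accept either a single id or a list of ids in previous_version.
        prev = doc.get("previous_version")
        if prev is None:
            return []
        return list(prev) if isinstance(prev, list) else [prev]

    combined = []
    for parent_id in parents(winner) + parents(loser):
        if parent_id not in combined:  # keep order, drop duplicates
            combined.append(parent_id)

    merged = dict(winner)
    merged["previous_version"] = combined
    return merged
```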

Full Stop.

Using the built-in revisioning for app-dependent revisioning is a Bad
Idea™. It's not meant for that and shouldn't be relied on. I am
not saying "Don't use the built in rev-control for rev-control." I'm
saying "Don't use the builtin collision detection system that is not
at all meant for rev-control for rev-control." I know, shades of gray
and all.

The single node case can't be handled by the internal revision
control. You may think it can. It may look like it can. But it just
can't. You'll be whistling along and then wham! Something will happen
and you'll be up shit creek. (something will happen = accidental
compaction, need for replication, changes to couchdb internals
invalidating this approach, meteor hits your datacenter, you get the
idea)

We can't use the internal _rev system for multi-node stuff because old
revisions are never replicated. Replication doesn't even attempt to
copy them. CouchDB ideology says that there is one version of each
document, the most recent revision. Yes, it is possible to obtain
previous revisions, making it look like revision control, but that's a
side effect of the implementation and hence should not be relied upon.
(Caveats apply: using it for things like undo is probably kosher as
long as you handle the possibly missing revision, etc.)

HTH,
Paul


On Thu, Sep 18, 2008 at 11:13 AM, Ronny Hanssen <super.ronny@gmail.com> wrote:
> Ok, I get it... I understand _bulk_docs is atomic, but I missed that you
> actually preserved the *original* doc.id (doh). I thought that by clone you
> meant a new doc in CouchDB, with its own id. And I just couldn't
> understand why you did that :). This now makes more sense to me. Sorry.
>
>> As to replication, what you'd need is a flag that says if a particular
>> node is the head node. Then your history docs should never clash. If
>> you get conflicts on the head node you resolve them and store all
>> conflicting previous revisions. In this manner your linked list
>> becomes a linked directed acyclic graph. (Yay college) This does mean
>> that at any given point in the history you could possibly have
>> multiple versions of the same doc, but replication works.
>
> Ok, but how is that flag supposed to be set? At the time of inserting with
> _bulk_docs the system needs to update the current, which means that any node
> racing during an update will flag it to be current and actual. Which means
> that replication in race conditions will conflict(?).
>
> I am just asking because the single node case could be handled by the
> internal CouchDB revision control. So, using the elaborate scheme you
> propose isn't really helping for that scenario. My impression was that we
> cannot use the internal CouchDB due to the difficulties in handling
> conflicts with multiple nodes involved (because conflicts could/would
> occur), and that this would be better handled by manual hand-coded
> rev-control.
>
> It seems to me that there are no solutions on how to do this by hand coding
> either. So, it seems we are saying "don't use the built-in rev-control for
> rev-control of data" to avoid people blaming CouchDB when the built in
> "revision control" conflicts.
>
> Thanks for your patience guys.
>
> ~Ronny
>
> 2008/9/18 Paul Davis <paul.joseph.davis@gmail.com>
>
>> Ronny,
>>
>> There are two points that I think you're missing.
>>
>> 1. _bulk_docs is atomic. As in, if one doc fails, they all fail.
>> 2. I was trying to make sure that the latest _id of a doc is constant.
>>
>> Think of this as a linked list. You grab the head document (most
>> current revision) and clone it. Then we change the uuid of the second
>> doc and make our pointer links to fit into the list. Then after making
>> the necessary changes, we edit the head node to our desire. Now we
>> post *both* (in the same HTTP request!) docs to _bulk_docs. This
>> ensures that if someone else edited this particular doc, the revisions
>> will be different and the second edit would fail. Thus, on success 2
>> docs are inserted, on failure, 0 docs.
>>
>> As to replication, what you'd need is a flag that says if a particular
>> node is the head node. Then your history docs should never clash. If
>> you get conflicts on the head node you resolve them and store all
>> conflicting previous revisions. In this manner your linked list
>> becomes a linked directed acyclic graph. (Yay college) This does mean
>> that at any given point in the history you could possibly have
>> multiple versions of the same doc, but replication works.
>>
>> For views, you'd just want to have a flag that says "Not the most
>> recent version." Then in your view you would know whether to emit
>> key/value pairs for it. This could be something like "No next version
>> pointer" or some such. Actually, this couldn't be a next pointer
>> without two initial gets because you'd need to get the head node and
>> next node. A boolean flag indicating head node status would be
>> sufficient though. And then you could have a history view if you ever
>> need to walk from tail to head
>>
>> HTH,
>> Paul
>>
>>
>> On Wed, Sep 17, 2008 at 9:35 PM, Ronny Hanssen <super.ronny@gmail.com>
>> wrote:
>> > Hm.
>> >
>> > In Paul's case I am not 100% sure what is going on. Here's a use case for
>> > two concurrent edits:
>> >  * First two users get the original.
>> >  * Both make a copy which they save.
>> > This means that there are two fresh docs in CouchDB (even on a single
>> > node).
>> >  * Save the original using a new doc._id (which the copy is to persist in
>> > copy.previous_version).
>> > This means that the two new docs know where to find their previous
>> > versions. The problem I have with this scheme is that every change of a
>> > document means that it needs to store not only the new version, but also
>> > its old version (in addition to the original). The fact that two racing
>> > updates will generate 4(!) new docs in addition to the original document
>> > is worrying. I guess Paul also wants the original to be marked as deleted
>> > in the _bulk_docs? But in any case the previous versions are now two new
>> > docs, but they look exactly the same, except for the doc._id, naturally...
>> >
>> > Wouldn't this be enough Paul?
>> > 1. old = get_doc()
>> > 2. update = clone(old);
>> > 3. update.previous_version = old._id;
>> > 4. post via _bulk_docs
>> >
>> > This way there won't be multiple old docs around.
>> >
>> > Jan's way ensures that for a view there is always only one current version
>> > of a doc, since it is using the built-in rev-control. Competing updates on
>> > the same node may fail, which is then what CouchDB is designed to handle.
>> > If on different nodes, then the rev-control history might come "out of
>> > synch" via concurrent updates. How does CouchDB handle this? Which update
>> > wins? On a single node this is intercepted when saving the doc. For
>> > multiple nodes they might both get a response saying "save complete". So
>> > these then need merging. How is that done? Jan further secures the
>> > previous version by storing it as a new doc, allowing it to be persisted
>> > beyond compaction. I guess Jan's sample would benefit nicely from
>> > _bulk_docs too. I like this method due to the fact that it allows only one
>> > current doc. But I worry about how revision control handles conflicts, Jan?
>> >
>> > Paul's and my updated suggestion always posts new versions, not using the
>> > revision system at all. The downside is that there may be multiple current
>> > versions around... And this is a bit tricky I believe... Anyone?
>> >
>> > Paul's suggestion also keeps multiple copies of the previous version. I am
>> > not sure why, Paul?
>> >
>> >
>> > Regards,
>> > Ronny
>> >
>> > 2008/9/17 Paul Davis <paul.joseph.davis@gmail.com>
>> >
>> >> Good point chris.
>> >>
>> >> On Wed, Sep 17, 2008 at 11:39 AM, Chris Anderson <jchris@apache.org>
>> >> wrote:
>> >> > On Wed, Sep 17, 2008 at 11:34 AM, Paul Davis
>> >> > <paul.joseph.davis@gmail.com> wrote:
>> >> >> Alternatively something like the following might work:
>> >> >>
>> >> >> Keep an eye on the specifics of _bulk_docs though. There have been
>> >> >> requests to make it non-atomic, but I think in the face of something
>> >> >> like this we might make non-atomic _bulk_docs a non-default or some
>> >> >> such.
>> >> >
>> >> > I think the need for non-transaction bulk-docs will be obviated when
>> >> > we have the failure response say which docs caused failure, that way
>> >> > one can retry once to save all the non-conflicting docs, and then loop
>> >> > back through to handle the conflicts.
>> >> >
>> >> > upshot: I bet you can count on bulk docs being transactional.
>> >> >
>> >> >
>> >> > --
>> >> > Chris Anderson
>> >> > http://jchris.mfdz.com
>> >> >
>> >>
>> >
>>
>
