incubator-couchdb-user mailing list archives

From: "Paul Davis" <paul.joseph.da...@gmail.com>
Subject: Re: Bulk Load
Date: Thu, 18 Sep 2008 14:37:18 GMT

Ronny,

There are two points that I think you're missing.

1. _bulk_docs is atomic. As in, if one doc fails, they all fail.
2. I was trying to make sure that the _id of the latest version of a
   doc stays constant.

Think of this as a linked list. You grab the head document (the most
current revision) and clone it. The clone gets a new uuid as its _id,
and we set the pointer links so that it fits into the list. Then we
edit the head node as desired and post *both* docs (in the same HTTP
request!) to _bulk_docs. If someone else has edited this particular
doc in the meantime, the head's revision will no longer match and its
update will fail, and because _bulk_docs is atomic the clone fails
with it. Thus, on success 2 docs are inserted; on failure, 0 docs.
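
A minimal sketch of that flow, in Python. It does not talk to CouchDB:
FakeDB is a hypothetical in-memory stand-in that assumes the
all-or-nothing _bulk_docs behaviour described above, and it uses plain
integers for _rev. The names save_with_history and previous_version
are also just illustrative.

```python
import copy
import uuid


class ConflictError(Exception):
    """Raised when any doc in a batch carries a stale _rev."""


class FakeDB:
    """Hypothetical in-memory store with all-or-nothing bulk writes."""

    def __init__(self):
        self.docs = {}

    def bulk_docs(self, docs):
        # Pass 1: check every doc. Nothing is written if one is stale.
        for doc in docs:
            stored = self.docs.get(doc["_id"])
            if stored is not None and stored["_rev"] != doc.get("_rev"):
                raise ConflictError(doc["_id"])
        # Pass 2: all checks passed, so write the whole batch.
        for doc in docs:
            saved = copy.deepcopy(doc)
            saved["_rev"] = doc.get("_rev", 0) + 1  # integer revs: a simplification
            self.docs[saved["_id"]] = saved

    def get(self, doc_id):
        return copy.deepcopy(self.docs[doc_id])


def save_with_history(db, head, changes):
    """Archive the fetched head under a fresh _id and update the head, in one batch."""
    archived = copy.deepcopy(head)
    archived["_id"] = uuid.uuid4().hex   # the clone gets a new uuid
    archived.pop("_rev", None)           # it is a brand-new doc, so no _rev

    new_head = copy.deepcopy(head)       # keeps the constant _id and the fetched _rev
    new_head.update(changes)
    new_head["previous_version"] = archived["_id"]

    db.bulk_docs([archived, new_head])   # both docs or neither
    return archived["_id"]
```

If two clients fetch the same head and both call save_with_history,
the first call adds one history doc and bumps the head. The second
call raises ConflictError and leaves the store untouched, so no orphan
history doc is left behind.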

As to replication, what you'd need is a flag that says if a particular
node is the head node. Then your history docs should never clash. If
you get conflicts on the head node you resolve them and store all
conflicting previous revisions. In this manner your linked list
becomes a linked directed acyclic graph. (Yay college) This does mean
that at any given point in the history you could possibly have
multiple versions of the same doc, but replication works.
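
One way to picture the conflict case, as a Python sketch that only
builds the documents and does no I/O. The field names is_head and
previous_versions, and the function resolve_head_conflict, are
assumptions made for illustration; how the conflicting revisions are
fetched and how their fields are merged is left to the application.

```python
import copy
import uuid


def resolve_head_conflict(winner, loser, merged_fields):
    """Return (new_head, history_docs) for two conflicting head revisions.

    Both conflicting revisions are kept as history docs, so the new head
    points at two parents and the list becomes a DAG.
    """
    history = []
    for rev in (winner, loser):
        old = copy.deepcopy(rev)
        old["_id"] = uuid.uuid4().hex  # fresh _id, so history docs never clash
        old.pop("_rev", None)          # stored as a new doc
        old["is_head"] = False
        history.append(old)

    new_head = copy.deepcopy(winner)   # keeps the constant head _id and its _rev
    new_head.update(merged_fields)
    new_head["is_head"] = True
    new_head["previous_versions"] = [doc["_id"] for doc in history]
    return new_head, history
```

The new head and both history docs would then go to _bulk_docs in one
request, as with an ordinary edit.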

For views, you'd just want to have a flag that says "Not the most
recent version." Then in your view you would know whether to emit
key/value pairs for it. This could be something like "No next version
pointer" or some such. Actually, this couldn't be a next pointer
without two initial gets because you'd need to get the head node and
next node. A boolean flag indicating head node status would be
sufficient though. And then you could have a history view if you ever
need to walk from tail to head.
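
A rough illustration of both ideas in Python. A real CouchDB map
function would be written in JavaScript and call emit(); the generator
below only mimics that shape. The field names is_head and
previous_version, and both function names, are assumptions.

```python
def current_docs_map(doc):
    """Mimic a view map function: yield a row only for head documents."""
    if doc.get("is_head"):
        yield doc["_id"], doc


def history_tail_to_head(docs_by_id, head_id):
    """Follow previous_version pointers from the head, then reverse the chain."""
    chain = []
    doc_id = head_id
    while doc_id is not None:
        chain.append(doc_id)
        doc_id = docs_by_id[doc_id].get("previous_version")
    return list(reversed(chain))


docs = {
    "h": {"_id": "h", "is_head": True, "previous_version": "b"},
    "b": {"_id": "b", "is_head": False, "previous_version": "a"},
    "a": {"_id": "a", "is_head": False},
}
rows = [row for doc in docs.values() for row in current_docs_map(doc)]
```

Here rows holds a single entry for "h", and
history_tail_to_head(docs, "h") gives the ids in order from the oldest
version to the head.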

HTH,
Paul


On Wed, Sep 17, 2008 at 9:35 PM, Ronny Hanssen <super.ronny@gmail.com> wrote:
> Hm.
>
> In Paul's case I am not 100% sure what is going on. Here's a use case for
> two concurrent edits:
>  * First two users get the original.
>  * Both make a copy which they save.
> This means that there are two fresh docs in CouchDB (even on a single
> node).
>  * Save the original using a new doc._id (which the copy is to persist in
> copy.previous_version).
> This means that the two new docs know where to find their previous
> versions. The problem I have with this scheme is that every change of a
> document means that it needs to store not only the new version, but also
> its old version (in addition to the original). The fact that two racing
> updates will generate 4(!) new docs in addition to the original document is
> worrying. I guess Paul also wants the original to be marked as deleted in the
> _bulk_docs? But in any case the previous versions are now two new docs
> that look exactly the same, except for the doc._id, naturally...
>
> Wouldn't this be enough Paul?
> 1. old = get_doc()
> 2. update = clone(old);
> 3. update.previous_version = old._id;
> 4. post via _bulk_docs
>
> This way there won't be multiple old docs around.
>
> Jan's way ensures that for a view there is always only one current version
> of a doc, since it is using the built-in rev-control. Competing updates on
> the same node may fail which is then what CouchDB is designed to handle. If
> on different nodes, then the rev-control history might come "out of sync"
> via concurrent updates. How does CouchDB handle this? Which update wins? On
> a single node this is intercepted when saving the doc. For multiple nodes
> they might both get a response saying "save complete". So, these then need
> merging. How is that done? Jan further on secures the previous version by
> storing the previous version as a new doc, allowing them to be persisted
> beyond compaction. I guess Jan's sample would benefit nicely from _bulk_docs
> too. I like this method due to the fact that it allows only one current doc.
> But, I worry about how revision control handles conflicts, Jan?
>
> Paul's suggestion and my updated one always post new versions, not using the
> revision system at all. The downside is that there may be multiple current
> versions around... And this is a bit tricky I believe... Anyone?
>
> Paul's suggestion also keeps multiple copies of the previous version. I am
> not sure why, Paul?
>
>
> Regards,
> Ronny
>
> 2008/9/17 Paul Davis <paul.joseph.davis@gmail.com>
>
>> Good point chris.
>>
>> On Wed, Sep 17, 2008 at 11:39 AM, Chris Anderson <jchris@apache.org>
>> wrote:
>> > On Wed, Sep 17, 2008 at 11:34 AM, Paul Davis
>> > <paul.joseph.davis@gmail.com> wrote:
>> >> Alternatively something like the following might work:
>> >>
>> >> Keep an eye on the specifics of _bulk_docs though. There have been
>> >> requests to make it non-atomic, but I think in the face of something
>> >> like this we might make non-atomic _bulk_docs a non-default or some
>> >> such.
>> >
>> > I think the need for non-transaction bulk-docs will be obviated when
>> > we have the failure response say which docs caused failure, that way
>> > one can retry once to save all the non-conflicting docs, and then loop
>> > back through to handle the conflicts.
>> >
>> > upshot: I bet you can count on bulk docs being transactional.
>> >
>> >
>> > --
>> > Chris Anderson
>> > http://jchris.mfdz.com
>> >
>>
>
