couchdb-dev mailing list archives

From: Adam Kocoloski <kocol...@apache.org>
Subject: Re: reiterating transactions vs. replication
Date: Thu, 21 May 2009 17:48:33 GMT
Hi Yuval, thanks for this well-written proposal.  I don't really want
to rehash all the discussion from back in February (see the thread
beginning at
http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e
for a particularly detailed discussion), but I do want to comment on
one aspect.

Updating the replicator to be smart about atomic bulk transactions is  
doable (although a major undertaking), but when you throw DB  
compaction and revision stemming into the mix things get really  
hairy.  Recall that CouchDB revisions are used for concurrency  
control, not for maintaining history.  Consider the following sequence  
of events:

1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
2) Update foo -> foo/2
3) Compact the DB (foo/1 is deleted)
4) Start replicating to a mirror
5) Replication crashes before it reaches foo/2

In your proposal, we should expect foo/1 to exist on the mirror,  
right?  I think this means we'd need to modify the compaction  
algorithm to keep revisions of documents if a) the revision was part  
of an atomic _bulk_docs, and b) any of the documents in that  
transaction are still at the revision generated by the transaction.   
Same thing goes for revision stemming -- we can never drop revisions  
if they were part of an atomic upload and at least one of the document  
revs in the upload is still current.
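
To make that rule concrete, here is a minimal sketch in Python of the
keep/drop check I have in mind.  All the names (rev_info,
transaction_members, current_rev) are invented for illustration; the
real compactor is Erlang and looks nothing like this:

def must_keep_revision(rev_info, transaction_members, current_rev):
    """Return True if compaction/stemming may NOT drop this revision."""
    txn = rev_info.get("transaction_id")
    if txn is None:
        # Not written by an atomic _bulk_docs: the normal rules apply.
        return False
    # Keep the revision as long as any document written by the same
    # transaction is still sitting at the revision that transaction
    # generated.
    return any(current_rev(doc_id) == txn_rev
               for doc_id, txn_rev in transaction_members[txn])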

Do you agree?  Best, Adam

On May 21, 2009, at 7:00 AM, Yuval Kogman wrote:

> Hello,
>
> In 0.9 CouchDB removed the transactional bulk docs feature in favour
> of simplifying sharding/replication.
>
> The priorities behind this decision, as I understood them, are:
>
>    1. ensure that applications developed on a single server don't
> suffer from a degradation of guarantees if deployed using sharding
>
>    2. avoid the issues that transactional bulk updates raise for
> replication
>
>
> I apologize if this proposal has already been dismissed before. I did
> search the mailing list archives, but mostly found a discussion on why
> this stuff should not be done on IRC. I blame Jan for encouraging me
> to post ;-)
>
>
>
> So anyway, I think that we can have both features without needing to
> implement something like Paxos, and without silently breaking apps
> when they move to a sharding setup from a single machine setup.
>
>
> The basic idea is to keep the conflicts-are-data approach and the
> current user-visible replication and conflict resolution, but to allow
> the user to ask for stricter conflict checking.
>
> The API I propose is for the bulk docs operation to have an optional
> 'atomic' flag. When this flag is set, CouchDB would atomically verify
> that all documents were committed without conflict (with respect to
> the supplied _rev), and if any one document conflicts, mark all of
> them as conflicting.
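>
> To make that concrete, here is roughly what it would look like from
> the client's side. This is a sketch only: the 'atomic' flag does not
> exist in 0.9, and the response shape is my assumption based on the
> existing per-document conflict errors.
>
> import requests
>
> payload = {
>     "atomic": True,  # proposed flag, not part of CouchDB 0.9
>     "docs": [
>         {"_id": "foo", "_rev": "1-abc", "balance": 10},
>         {"_id": "bar", "_rev": "1-def", "balance": 90},
>     ],
> }
> resp = requests.post("http://localhost:5984/mydb/_bulk_docs", json=payload)
>
> # If any _rev check fails, every document in the batch is reported (and
> # stored) as conflicting, so the application never observes a
> # half-applied transaction.
> for row in resp.json():
>     status = "conflict" if row.get("error") == "conflict" else row.get("rev")
>     print(row["id"], status)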
>
> Transaction recovery, conflict resolution, etc. are still the
> responsibility of the application, but the server provides an atomic
> guarantee that a transaction which tries to write inconsistent data
> to the database will fail as a whole, a guarantee that cannot be made
> by a client library (there are race conditions).
>
>
>
> Now the hard parts:
>
>
> 1. Replication
>
> The way I understand it, replication currently works on a per-document
> basis. If 'atomic' was specified in a bulk operation, I propose that
> all the revisions created in that bulk operation be kept linked. If
> these linked revisions are being replicated, the same conflict
> resolution must be applied (the replication of the document itself is
> executed as a bulk operation with atomic=true, replicating all
> associated documents as well).
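>
> In pseudo-Python (the real replicator is Erlang; transaction_id,
> get_rev, transaction_members and bulk_docs are invented names), the
> per-change logic might look like this:
>
> def replicate_change(change, source, target):
>     """Push one change, expanding it to its whole transaction if needed."""
>     txn = change.get("transaction_id")
>     if txn is None:
>         # Ordinary document: replicate it on its own, as today.
>         target.bulk_docs([source.get_rev(change["id"], change["rev"])])
>         return
>     # Fetch every revision created by the same atomic _bulk_docs and
>     # replay them as one atomic write, so the target either accepts the
>     # whole transaction or marks all of its documents as conflicting.
>     docs = [source.get_rev(doc_id, rev)
>             for doc_id, rev in source.transaction_members(txn)]
>     target.bulk_docs(docs, atomic=True)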
>
> The caveat is that even if you always use bulk docs with the atomic
> flag, if you switch replicas you could lose the D out of ACID:
> documents which are marked as non-conflicting in your replica might be
> conflicting in the replica you switch to, in which case transactions
> that have already been committed appear to be rolled back from the
> application's point of view.
>
> This problem obviously already exists in the current implementation,
> but when 'atomic' is specified it could potentially happen a lot more
> often.
>
>
> 2. Data sharding
>
> This one is trickier. There are two solutions, both of which I think
> are acceptable, and either or both of which could be used:
>
>
> The easy way is to ignore this problem. Well, not really: the client
> must ensure that all the documents affected by a single transaction
> are in the same shard, by using a partitioning scheme that allows
> this.
>
> If a bulk_docs operation with atomic set to true would affect multiple
> shards, that is an error (the data could still be written as a
> conflict, of course).
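>
> A sharding proxy could enforce that check with very little code. A
> sketch, assuming some partitioning function shard_for() chosen by the
> deployment (e.g. hashing a user id embedded in the _id):
>
> def route_atomic_bulk_docs(docs, shard_for):
>     """Return the single shard an atomic batch belongs to, or refuse it."""
>     shards = {shard_for(doc["_id"]) for doc in docs}
>     if len(shards) > 1:
>         # The data could still be written as conflicts, but an atomic
>         # commit across shards is refused under this scheme.
>         raise ValueError("atomic _bulk_docs spans %d shards" % len(shards))
>     return shards.pop()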
>
> If you want to enable the 'atomic' flag you'll need to be careful
> about how you use sharding. You can still use it for some of the
> transactions, but not all the time. I think this is a flexible and
> pragmatic solution.
>
> This means that if you choose to opt in to fully atomic bulk doc
> operations your app might not be deployable unmodified to a sharded
> setup, but it's never unsafe (no data inconsistencies).
>
> In my opinion this satisfies the requirement for no degradation of
> guarantees. It might not Just Work, but you can't have your cake and
> eat it too at the end of the day.
>
>
>
>
> The second way is harder but potentially still interesting. I've
> included it mostly for the sake of discussion.
>
> The core idea is to provide low-level primitives on top of which a
> client or proxy can implement a multi-phase commit protocol.
>
> The number of nodes involved in the transaction depends on the data
> in the transaction (it doesn't need to coordinate all the nodes in the
> cluster).
>
> Basically this would break down bulk doc calls into several steps.
> First all the data is inserted into the backend, but it's marked as
> conflicting so that it's not accidentally visible.
>
> This operation returns an identifier for the bulk doc operation
> (essentially a ticket for a prepared transaction).
>
> Once the data is available on all the shards it must be made live
> atomically. A two-phase commit starts by acquiring locks on all the
> transaction tickets and trying to apply them (the 'promise'
> phase), and then finalizing that application atomically (the 'accept'
> phase).
>
> To keep things simple the two phases should be scoped to a single
> keep-alive connection. If the connection drops, the locks should be
> released.
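>
> Put together, a coordinator (client or proxy) would drive the shards
> roughly like this. This is a sketch only, with prepare, promise,
> accept and release standing in for the proposed (hypothetical)
> primitives:
>
> def commit_transaction(docs_by_shard):
>     """Two-phase commit over the proposed primitives (invented names)."""
>     # Phase 0: write each shard's documents as hidden/conflicting
>     # revisions and collect one transaction ticket per shard.
>     tickets = {shard: shard.prepare(docs)
>                for shard, docs in docs_by_shard.items()}
>     try:
>         # 'promise' phase: lock every ticket and verify the _revs still hold.
>         for shard, ticket in tickets.items():
>             shard.promise(ticket)
>         # 'accept' phase: flip the prepared revisions live on every shard.
>         for shard, ticket in tickets.items():
>             shard.accept(ticket)
>     except Exception:
>         # Dropping the keep-alive connection would release the locks
>         # anyway; do it explicitly here. A crash mid-accept is the hard
>         # case discussed in the footnote below.
>         for shard, ticket in tickets.items():
>             shard.release(ticket)
>         raise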
>
> Obviously Paxos ensues, but here's the catch:
>
> - The synchronization can be implemented first as a 3rd party
> component; it doesn't need to affect CouchDB's core
>
> - The atomic primitives are also useful for writing safe conflict
> resolution tools that handle conflicts that span multiple documents.
>
> So even if no one ends up implementing real Multi Paxos in the end,
> CouchDB still benefits from having reusable synchronization
> primitives. (If this is interesting to anyone, see below [1])
>
>
>
>
> I'd like to stress that this is all possible to do on top of the
> current 0.9 semantics. The default behavior in 0.9 does not change at
> all. You have to opt in to this more complex behavior.
>
> The problem with 0.9 is that there is no way to ensure atomicity and
> isolation from a client library; it must be done on the server. By
> removing the ability to do this at all, CouchDB is essentially no
> longer transactional. It's protected from internal data corruption,
> and it's protected from data loss (unlike, say, MyISAM, which will
> happily overwrite your correct data), but it's still a potentially
> lossy model since conflicting revisions are not correlated. This means
> that you cannot have a transactional graph model; it's one or the
> other.
>
>
>
> Finally, you should know that I have no personal stake in this. I
> don't rely on CouchDB (yet), but I think it's somewhat relevant for a
> project I work on, and the requirements for this project are not
> that far-fetched. I'm the author of an object graph storage engine for
> Perl called KiokuDB. It serializes every object to a document unless
> told otherwise, but the data is still a highly interconnected graph. As
> a user of this system I care a lot about transactions, but not at all
> about sharding (this might not hold for all the users of KiokuDB).
> Basically I already have what I need from KiokuDB; there are numerous
> backends for this system that satisfy me (BerkeleyDB, a transactional
> plain file backend, DBI (PostgreSQL, SQLite, MySQL)), and a number
> that don't fit my personal needs due to lack of transactions
> (SimpleDB, MongoDB, and CouchDB since 0.9).
>
> If things stay this way, then CouchDB is simply not intended for
> users like me (though I'll certainly still maintain
> KiokuDB::Backend::CouchDB).
>
> However, I do *want* to use CouchDB. I think that under many scenarios
> it has clear advantages compared to the other backends (mostly the
> fact that it's so easy, but the views support is also nice). I think
> it'd be a shame if what was keeping me away turned out to be a fix
> that was low-hanging fruit to which no one objected.
>
>
> Regards,
> Yuval
>
>
>
> [1] in Paxos terms, the CouchDB shards would play the Acceptor role and
> the client (be it the actual client or a sharding proxy, whatever
> delegates and consolidates the views) performs the Learner role.
> Only the Learner is considered authoritative with respect to the final
> status of a transaction.
>
> Hard crashes of a shard during the 'accept' phase may produce
> inconsistent results if more than one Learner is used to proxy write
> operations. Focusing on this scenario is a *HUGE* pessimization. High
> availability of the Learner role can still be achieved using
> BerkeleyDB style master failover (voting).
>
> This transactional sharding proxy could of course also guarantee
> redundancy of shards.
>
> My point is that Paxos can be implemented as a 3rd party component if
> anyone actually wants/needs it, by providing comparatively simple
> primitives.

