couchdb-dev mailing list archives

From Yuval Kogman <nothingm...@woobling.org>
Subject Re: reiterating transactions vs. replication
Date Fri, 22 May 2009 03:30:26 GMT
2009/5/21 Adam Kocoloski <kocolosk@apache.org>:
> Hi Yuval, thanks for this well-written proposal.  I don't really want to
> rehash all the discussion from back in February (see the thread beginning at
> http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c84F66023-030A-4669-B75C-3DCC92D71A78@yahoo.com%3e for
> a particularly detailed discussion), but I do want to comment on one aspect.
>
> Updating the replicator to be smart about atomic bulk transactions is doable
> (although a major undertaking), but when you throw DB compaction and
> revision stemming into the mix things get really hairy.  Recall that CouchDB
> revisions are used for concurrency control, not for maintaining history.
>  Consider the following sequence of events:
>
> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> 2) Update foo -> foo/2
> 3) Compact the DB (foo/1 is deleted)
> 4) Start replicating to a mirror
> 5) Replication crashes before it reaches foo/2

By crash you mean an error due to a conflict between foo/2 and foo/1'
(the mirror's version of foo), right?

> In your proposal, we should expect foo/1 to exist on the mirror, right?  I
> think this means we'd need to modify the compaction algorithm to keep
> revisions of documents if a) the revision was part of an atomic _bulk_docs,
> and b) any of the documents in that transaction are still at the revision
> generated by the transaction.  Same thing goes for revision stemming -- we
> can never drop revisions if they were part of an atomic upload and at least
> one of the document revs in the upload is still current.

Yep. Personally I see this as a tradeoff, not a limitation per se. If
you specify 'atomic' then you must pay more in terms of data size,
performance, etc.

In 0.8 you would theoretically have had to pay that cost by default,
but in practice you didn't, because replication broke transactions.

The basic algorithm is still the same, but the garbage-collected unit
changes: instead of collecting individual document revisions, it
collects revision sets, with the current case being a set with one
member. The same rule still applies: if the set is wholly shadowed by
non-conflicting changes then it can be disposed of. IIRC the compaction
algorithm is a copying garbage collector, so this is pretty easy to do
(you walk a DAG instead of a linked list).
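
To make the retention rule concrete, here is a minimal sketch in Python
(the data structures and the db interface are made up for illustration;
this says nothing about the actual Erlang internals). The idea is just
that compaction keeps a revision whenever it belongs to an atomic set
with at least one still-current member:

# Hypothetical model: a lone document update is a revision set with one
# member, an atomic _bulk_docs produces a set with several members.

def is_current(db, doc_id, rev):
    """True if `rev` is still the winning revision of `doc_id`."""
    return db.current_rev(doc_id) == rev

def keep_during_compaction(db, rev_set):
    """Keep the whole set if any (doc_id, rev) member is still current;
    otherwise the set is wholly shadowed and can be disposed of."""
    return any(is_current(db, doc_id, rev) for doc_id, rev in rev_set)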

Under the proposed model you'd choose which operations are
transactional, and only those would carry the extra cost.


Anyway, thanks for your link as well. I was reading through a rather
boring thread and didn't see this one, so I guess I did miss out. It
seemed to imply the discussion happened only on IRC.

Anyway, here goes...

The fundamental problem is that any consistent data model needs, at the
very least, atomic primitives and ordered message passing (with
transactional message handlers) at the per-partition level; otherwise
atomicity and consistency are restricted to a single document.

What concerns me is Damien's post
(http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%3c451872B8-152C-42A6-9324-DD52534D9A32@apache.org%3e):

> No, CouchDB replication doesn't support replicating the transactions.
> Never has, never will. That's more like transaction log replication
> that's in traditonal dbs, a different beast.
>
> For the new bulk transaction model, I'm only proposing supporting
> eventual consistency. All changes are safe to disk, but the db may not
> be in a consistent state right away.

From what I know this assumption is wrong. Eventual consistency still
needs atomic primitives. It's not about whether or not you have
transactions, it's about what data they affect (eventual consistency
involves breaking them down).

Anyway, "never will" sounds pretty binding, but for the sake of argument:

By using only insertions and idempotent updates for the bulk of the
data changes, plus a message queue whose handlers use atomic updates to
integrate that data, one can implement a truly atomic distributed
model, or an eventually consistent one. Without this, updates need to
be restricted to exactly one document.

Eventual consistency is still possible, either by using locks or by
breaking down what would have been large distributed transactions into
smaller ones, but the key is that the code that makes things actually
consistent must still have ACID guarantees (and be dispatched in
order).
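
Roughly, the shape I have in mind looks like this (Python, with
invented store/queue APIs; this is a sketch of the protocol, not a
proposal for actual CouchDB interfaces):

# 1) The bulk of the data is written with idempotent, insert-only ops.
# 2) An ordered, durable message per partition triggers one small atomic
#    update that integrates the new data on the partition that owns it.

def add_data(store, queue, partition, payload):
    doc_id = store.insert_immutable(payload)    # safe outside any transaction
    queue.enqueue(partition, {"link": doc_id})  # ordered message to the owner
    return doc_id

def handle_link(store, message):
    # Runs on the owning partition; the handler itself must be atomic
    # and messages must be processed in order.
    with store.atomic():
        index = store.get("index")
        index["links"].append(message["link"])
        store.put("index", index)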

The CouchDB 0.9 model is effectively MyISAM without the data loss, but
just because the data is still around doesn't mean it's possible to
know what to do with it (loss of context), or even to fix it safely
(the conflict resolution code is susceptible to conflicts too).

Unfortunately, for eventual consistency to actually work, the breaking
down of operations must be done at the application level; the database
can't decide which data can be deferred and which cannot.

All immutable data and all new data can obviously be added to the
database outside of a transaction, but eventually the update that links
this data together must be an atomic mutation.

The only way to support this without atomic operations on a unit
larger than a document is to have a "master" document for every
transitive closure of the graph structure requiring consistency, which
in effect means the master can only refer to immutable snapshot
documents (e.g. where the ID is a hash of the data). If these closures
overlap then a single "master" for the whole graph is needed.


To illustrate, let's make up a social networking example. Say you are
adding a friend on this social network, and that this operation
involves three updates: one to add a link from your profile to your
friend's ID, another for the inverse, and a third to send a "hello"
message to the friend, updating their inbox. The first update lives in
one partition, and the second and third updates live on a second one.

The back pointers in your new friend's profile must be updated. In a
fully transactional model this would lock the friend's document and
yours at the same time. In an eventual consistency model this would
queue a message for the friend's partition, and a message handler on
that partition would apply the update atomically, "eventually". It's
fine for the link to be out of date for a while, but eventually it
needs to be fixed (e.g. if you want to remove the friend, message
them, etc.).

In CouchDB 0.9 one of the writes will get a "conflict" error back, and
the client can refetch the updated version and try the edit again. The
problem is that the third update, written to another document on the
same node under assumptions about the same data, may already have
succeeded, leaving the data inconsistent. Under an eventual consistency
model you still use transactions to do these updates; you just have to
design your model to break them down into smaller units.
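
For reference, the per-document retry loop looks roughly like this
(Python using the `requests` library, which is an assumption on my
part; the URL shape and the 409 Conflict status follow CouchDB's HTTP
API). The point is that retrying one document does nothing for the
other document that was already written under stale assumptions:

import requests

def update_with_retry(base, db, doc_id, mutate, attempts=5):
    """Fetch, mutate, and PUT a single document, retrying on 409.
    This only protects the one document; a related write to another
    document may already have succeeded."""
    for _ in range(attempts):
        doc = requests.get(f"{base}/{db}/{doc_id}").json()  # includes _rev
        mutate(doc)                                  # apply the edit in memory
        resp = requests.put(f"{base}/{db}/{doc_id}", json=doc)
        if resp.ok:                                  # accepted with a new _rev
            return resp.json()
        if resp.status_code != 409:                  # not a revision conflict
            resp.raise_for_status()
    raise RuntimeError("gave up after repeated conflicts")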

The reason a graph structure is more susceptible to inconsistency is
that while in a relational model many data linkage operations can be
done with a single insert/update (e.g. `insert into edges (node1_id,
node2_id)`), in a document-based database the same operation involves
modifying all the affected documents. The chance of inconsistency is
increased because contention is higher and there is more data that must
be kept in sync.
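
To spell out the contrast (the document shapes below are made up for
illustration):

# Relational: one row insert creates the relationship:
#   INSERT INTO edges (node1_id, node2_id) VALUES (42, 99);
#
# Document model: the same relationship touches every document that
# embeds it, so there are two writes, each of which can conflict.
alice = {"_id": "user-42", "friends": []}
bob   = {"_id": "user-99", "friends": []}

alice["friends"].append("user-99")   # write #1
bob["friends"].append("user-42")     # write #2; both must be saved, and
                                     # each save can conflict independently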

However, in another post Damien said:

> Which is why in general you want to avoid inter-document dependencies,
> or be relaxed in how you deal with them.

So I think I'd best shut up after this unless some decision maker tells
me not to; if my use case is not covered by the intended design then
that's that. But I do think this part of that thread sort of covers
it:

> As far as distributed transactions go, I'd be thrilled if we could
> implement it and also support the rest of couchdb, like views and bi-
> directional replication. Please start up a discussion here in dev@
> about it and see if you can work out a design.

Without going too pie-in-the-sky.

Cheers,
Yuval
