cassandra-commits mailing list archives

From "Blake Eggleston (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6246) EPaxos
Date Sat, 20 Sep 2014 00:10:35 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141590#comment-14141590 ]

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I'm still poring over the discussion in CASSANDRA-5062, and the current implementation, but
wanted to expand on some of the advantages, list a few disadvantages and caveats of using
egalitarian paxos, and talk about a few areas where we'd probably want to deviate from the
process as described in the paper by Moraru et al.

Advantages:
* In the ideal case we should be able to answer a client's query after the same number of
inter-node messages it takes to do a quorum write. (There will be more total messages, but
we don't need to wait for them to complete before responding to the client)
** This assumes that each node performs the CAS locally instead of using paxos to set up
a quorum read/write
* Even in the non-ideal case, you're still looking at 2 network round trips before reaching
commit (it looks like the current implementation has 4 network round trips for CAS?)
* Much higher throughput on interfering queries is possible. Multiple in-flight queries on
the same row are not a problem.
** livelock is not a risk during normal operation, only during failure recovery. However,
this can be mitigated by specifying an order of succession for query leaders. Of course, really
heavy 'normal' operation might start causing failure cases.
* Granular control over which operations interfere with each other
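
To make the last point concrete, here is a minimal sketch of what an interference check could look like. The `Query` type, its fields, and the rule itself (same partition, at least one write) are assumptions made for illustration; they come from neither the paper nor the Cassandra code base.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Query:
    # Hypothetical description of a serialized query.
    keyspace: str
    table: str
    partition_key: str
    is_write: bool


def interferes(a: Query, b: Query) -> bool:
    """Assumed rule: two queries interfere if they target the same
    partition and at least one of them writes."""
    same_partition = (
        (a.keyspace, a.table, a.partition_key)
        == (b.keyspace, b.table, b.partition_key)
    )
    return same_partition and (a.is_write or b.is_write)


def dependencies(new: Query, in_flight: Iterable[Query]) -> List[Query]:
    """In-flight queries the new query would have to be ordered against."""
    return [q for q in in_flight if interferes(new, q)]
```

Under a rule like this, two reads of the same row, or writes to different rows, would never have to be ordered against each other.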

Disadvantages:
* the epaxos optimizations are possible because it has a pretty complex failure recovery procedure
* the concurrent programming side of things will be more complicated than the current implementation
* because execution is more asynchronous than classic paxos, I think we'd have to perform
the operations locally rather than using paxos to set up a normal quorum read/write. On one
hand, this saves us a network round trip. On the other hand, if people are doing non-serialized
writes at the same time as serialized writes that affect the same cells, it's likely that
different nodes will record different results for a query. Obviously, it's not a good idea
to do this, but that doesn't mean people won't do it.
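
A toy illustration of that last disadvantage, assuming each replica executes the CAS locally and that a non-serialized write can reach replicas at different times relative to that execution. The `Replica` class and `run` helper are hypothetical and ignore timestamps and everything else a real node does.

```python
class Replica:
    """Toy single-cell replica; not Cassandra's storage model."""

    def __init__(self, value):
        self.value = value
        self.cas_results = []

    def plain_write(self, value):
        # Non-serialized write: applied as soon as it arrives.
        self.value = value

    def local_cas(self, expected, new):
        # Serialized compare-and-set, executed locally on this replica.
        applied = self.value == expected
        if applied:
            self.value = new
        self.cas_results.append(applied)
        return applied


def run(write_arrives_first):
    replica = Replica(value=1)
    if write_arrives_first:
        replica.plain_write(5)
        replica.local_cas(expected=1, new=2)
    else:
        replica.local_cas(expected=1, new=2)
        replica.plain_write(5)
    return replica


replica_a = run(write_arrives_first=True)   # records CAS as not applied
replica_b = run(write_arrives_first=False)  # records CAS as applied
```

In this sketch both replicas end up holding the same value, but they disagree about whether the CAS was applied, which is the kind of divergence described above.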

Caveats:
* with rf>3, or a non-replica coordinator, responses from more than a quorum of replicas
_may_ be needed to commit in the ideal case. Or we just use the 2 message commit path in those
situations. I'm still working out the details, but I'm pretty sure there are failure scenarios
where not doing that could result in different values being committed after recovery.
* Epaxos is pretty new. I was talking to the authors about it a few months ago, and the only
implementations we were aware of were mine and theirs... I'm pretty sure there aren't any
production deployments of it. That's not _necessarily_ a bad thing, but I just wanted to point
out that we are in fairly new territory, and that should be weighed against the advantages.
There is no 'Making EPaxos Live' paper out there.

Places where Cassandra's architecture will likely require doing things a bit differently than
outlined in the paper:
* Sequence values will cause problems, but they shouldn't be necessary.
*# since each node is responsible for different ranges of data, and therefore would have seen
different queries, encountering different seq values would be very likely, and would result
in a lot of otherwise unnecessary accept phases. We could get around this by using different
seq values for different token ranges, but...
*# Since we'd wait until the query is actually executed before returning a result to the client
(don't know why we wouldn't), it's a superfluous requirement. I discussed this with Iulian
Moraru a few months ago and he agreed.
* Using a non-replica coordinator:
*# The paper assumes that an instance leader is also a replica of the data being queried.
I'd imagine we'd want to avoid optimistically forwarding queries to a single replica and hoping
it's up, which would mean allowing coordinators to lead queries for keys they don't know anything
about. This would prevent the non-leaders from recording that they agree with the leader,
preventing some optimizations in failure recovery. It would make a good case for using prepared
statements and token aware routing.
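
A rough sketch of how token aware routing could keep the instance leader on a replica whenever one is known to be up, falling back to a non-replica coordinator otherwise. The ring layout, the MD5-based `token_for`, and the `pick_leader` helper are all assumptions for illustration; Cassandra's actual partitioners and replication strategies work differently.

```python
import bisect
import hashlib
from typing import List, Set, Tuple


def token_for(key: str) -> int:
    # Stand-in hash for illustration only, not Cassandra's partitioner.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def replicas_for(key: str, ring: List[Tuple[int, str]], rf: int) -> List[str]:
    """ring is a list of (token, node) pairs sorted by token, one entry
    per node. Walk clockwise from the key's token and take rf nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def pick_leader(key: str, ring: List[Tuple[int, str]], rf: int,
                live: Set[str], fallback: str) -> Tuple[str, bool]:
    """Return (leader, leader_is_replica). Prefer a live replica of the
    key; otherwise the fallback node has to lead without owning the data."""
    for node in replicas_for(key, ring, rf):
        if node in live:
            return node, True
    return fallback, False
```

The second element of the returned pair marks the case discussed above, where the leader knows nothing about the key and the failure recovery optimizations would not apply.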


> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is that Multi-paxos
requires leader election and hence, a period of unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than multi-paxos, (2) is particularly
useful across multiple datacenters, and (3) allows any node to act as coordinator: http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to implement
it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
