cassandra-commits mailing list archives

From "Blake Eggleston (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6246) EPaxos
Date Mon, 01 Dec 2014 19:52:13 GMT


Blake Eggleston commented on CASSANDRA-6246:

Since it looks like the performance improvements from epaxos could be worth the (substantial)
added complexity, I’ve been thinking through the problems caused by the need to garbage
collect instances, and by repair causing inconsistencies by sending data from ‘the future’.

For repair, the only approach I’ve thought of that would work 100% of the time is to
count executed instances per partition, and to send that count along with the repair request.
If the remote count is higher than the local count, we know for sure that the remote node has
data from the future, and the repair for that partition should be deferred.
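A minimal Java sketch of that check (RepairGuard and its methods are hypothetical names for
illustration, not existing Cassandra code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical bookkeeping: count of executed instances per partition.
    public class RepairGuard
    {
        private final Map<String, Long> executionCounts = new ConcurrentHashMap<>();

        public void recordExecution(String partitionKey)
        {
            executionCounts.merge(partitionKey, 1L, Long::sum);
        }

        // The sender attaches its count to the repair request. If the remote
        // count exceeds ours, the remote data is from 'the future' and the
        // repair for this partition should be deferred.
        public boolean shouldDefer(String partitionKey, long remoteCount)
        {
            return remoteCount > executionCounts.getOrDefault(partitionKey, 0L);
        }
    }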

For garbage collection, we’ll need to support a failure recovery mode that works without
all historical instances. We also need a way to quickly determine whether a prepare phase will
suffice, or whether an epaxos repair-type operation is needed to bring a node up to speed.

Breaking the continuous execution space of partition ranges into discrete epochs would give
us a relatively straightforward way of solving all of these problems. Each partition range
will have its own epoch number. At a given instance-count threshold, time threshold, or
event, epaxos will run an epoch increment instance, which takes every active instance in
its partition range as a dependency. Any instance executed before the epoch instance belongs
to the old epoch; any executed after it belongs to the new one.
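A rough Java sketch of that mechanism (PartitionRangeEpoch and EpochInstance are illustrative
names; in the real protocol the epoch would advance when the increment instance executes, not
when it is created):

    import java.util.Set;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical per-partition-range epoch state.
    public class PartitionRangeEpoch
    {
        public record EpochInstance(long newEpoch, Set<UUID> dependencies) {}

        private final AtomicLong epoch = new AtomicLong(0);
        private final Set<UUID> active = ConcurrentHashMap.newKeySet();

        public void instanceStarted(UUID id)  { active.add(id); }
        public void instanceExecuted(UUID id) { active.remove(id); }

        // Run at an instance-count threshold, time threshold, or event. The
        // increment instance depends on every active instance in the range,
        // so anything executed before it belongs to the old epoch and
        // anything executed after it belongs to the new one.
        public EpochInstance createEpochIncrement()
        {
            return new EpochInstance(epoch.incrementAndGet(), Set.copyOf(active));
        }
    }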

How this would solve the outstanding problems:

Garbage Collection: Any instance from 2 or more epochs ago can be deleted. Although an epoch
increment instance doesn’t prevent dependencies on the previous epoch, it does prevent
dependencies on the epoch before that.
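Expressed as a predicate (InstanceGc is a hypothetical name):

    // Once the partition range's current epoch is at least two ahead of the
    // epoch an instance executed in, nothing live can still depend on it.
    public final class InstanceGc
    {
        public static boolean isDeletable(long instanceEpoch, long currentEpoch)
        {
            return currentEpoch - instanceEpoch >= 2;
        }
    }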

Repair: Counting executions allows us to determine whether repair data is from the future, and
epochs let us scope those execution counts to an epoch. If the epoch has been incremented twice
without new executions for a partition, the bookkeeping data for that partition can be deleted.
This gives us a race-free way to delete old execution counts, so the bookkeeping data doesn’t
accumulate forever.
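A sketch of that bookkeeping, again under hypothetical names, assuming each partition’s count
is stamped with the epoch of its last execution:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical execution counts scoped to an epoch.
    public class ExecutionCountLog
    {
        private record Entry(long epoch, long count) {}

        private final Map<String, Entry> counts = new ConcurrentHashMap<>();

        public void recordExecution(String partitionKey, long currentEpoch)
        {
            counts.merge(partitionKey, new Entry(currentEpoch, 1),
                         (old, fresh) -> old.epoch() == currentEpoch
                                         ? new Entry(currentEpoch, old.count() + 1)
                                         : fresh);
        }

        // Called after the partition range's epoch increments: entries with
        // no executions for two epochs can be dropped without a race.
        public void purgeStale(long currentEpoch)
        {
            counts.values().removeIf(e -> currentEpoch - e.epoch() >= 2);
        }
    }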

Failure recovery: Using epochs makes the choice between a prepare phase and failure recovery
unambiguous. If a node is missing instances from 2 or more epochs ago, it will need to run a
failure recovery; otherwise, prepare phases will work. Additionally, using an epaxos instance
as the method of incrementing epochs guarantees that a given instance has been executed once
the epoch has been incremented twice.
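The decision rule, sketched with a hypothetical RecoveryPolicy type:

    // Instances missing from 2 or more epochs ago may already have been
    // garbage collected, so a prepare phase can't recover them and a full
    // failure recovery is needed instead.
    public final class RecoveryPolicy
    {
        public enum Action { PREPARE, FAILURE_RECOVERY }

        public static Action choose(long oldestMissingEpoch, long currentEpoch)
        {
            return currentEpoch - oldestMissingEpoch >= 2
                   ? Action.FAILURE_RECOVERY
                   : Action.PREPARE;
        }
    }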

> EPaxos
> ------
>                 Key: CASSANDRA-6246
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Minor
> One reason we haven't optimized our Paxos implementation with Multi-Paxos is that Multi-Paxos
requires leader election and hence a period of unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-Paxos, (2) is particularly
useful across multiple datacenters, and (3) allows any node to act as coordinator:
> However, there is substantial additional complexity involved if we choose to implement it.
