hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10641) Introduce Coordination Engine
Date Mon, 21 Jul 2014 21:42:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069344#comment-14069344
] 

Steve Loughran commented on HADOOP-10641:
-----------------------------------------

bq. This is a good idea in the abstract, but the notion of applying Amazon's process to a
volunteer open source project is problematic.



Consensus protocols are expected to provide proofs of the algorithm's correctness; anything
derived from Paxos, Raft et al. relies on those algorithms being considered valid, and on the
implementors being able to understand the algorithms. Open source consensus protocol *implementations*
are expected to publish their inner workings, else they can't be trusted. I will cite Apache
Zookeeper's [ZAB protocol|http://web.stanford.edu/class/cs347/reading/zab.pdf], and [Anubis's
consistent T-space model|http://www.hpl.hp.com/techreports/2005/HPL-2005-72.html], as examples
of two OSS products that I have used and implementations that I trust. 


bq.  In terms of the Hadoop contribution process, this is a novel requirement. 

Implementations of distributed consensus protocols are already one place where the team needs
people who understand the maths. If a team implementing a protocol isn't able to specify
it formally in some form or other: run. And if someone who can't prove that their change works
tries to submit changes to the core protocols of an OSS implementation, I would hope that the
patch will be rejected. 


Which is why I believe this specific JIRA "provide an API and reference implementation of
distributed updates" is suitable for the criteria "provide a strict specification". I'm confident
that someone in the WanDisco dev team will be able to do this, and would make "understand
this specification" a prerequisite for anyone else doing their own implementation. 

Even so, we can't expect complete proofs of correctness. Which is why I said "any maths that
can be provided, and test cases".

For HADOOP-9361, the test cases were the main outcome: by enumerating invariants and pre/post
conditions, some places where we didn't have enough tests became apparent. These were mostly
failure modes of some operations (e.g. what happens when preconditions aren't met).
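
As an illustration of how preconditions and postconditions turn into tests, here is a minimal
sketch in Python. The ToyFS class and the check_mkdir helper are hypothetical names invented
for this example; they are not Hadoop APIs and are not taken from the HADOOP-9361 specification.
The assumed rule is only that mkdir requires its parent directory to exist.

```python
class ToyFS:
    """Hypothetical in-memory filesystem, for illustration only."""

    def __init__(self):
        self.dirs = {"/"}

    def mkdir(self, path):
        parent = path.rsplit("/", 1)[0] or "/"
        # Assumed precondition: the parent directory must already exist.
        if parent not in self.dirs:
            raise FileNotFoundError(parent)
        self.dirs.add(path)
        return True


def check_mkdir(fs, path):
    """Specification-derived check of mkdir against its pre/postconditions."""
    parent = path.rsplit("/", 1)[0] or "/"
    before = set(fs.dirs)
    if parent in before:
        fs.mkdir(path)
        # Postcondition: exactly the new path was added, nothing else changed.
        assert fs.dirs == before | {path}
        return "ok"
    try:
        fs.mkdir(path)
    except FileNotFoundError:
        # Failure mode: an unmet precondition must leave the state untouched.
        assert fs.dirs == before
        return "rejected"
    raise AssertionError("mkdir succeeded despite an unmet precondition")
```

The second branch is the kind of case that tends to be missed: it tests what happens when the
precondition is not met, not just the success path.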

Derived tests are great as:
# Jenkins can run them; you can't get mathematicians to prove things during automated regression
tests.
# It makes it easier to decide if a test failure is due to an error in the test, or a failure
of the code. If a specification-derived test fails, then it is now due to either an error
in the specification or the code.

I think we need to do the same here: from a specification of the API, build the test cases
which can verify the behavior as well as local tests can. Implementors of a back end
then get those tests alongside a specification which defines what they have to implement. 
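
A sketch of what such a specification-derived test could look like, in Python. Everything
here is an assumption made for illustration: LocalEngine, submit, deliver and agreed_sequence
are hypothetical names, not the API proposed in the HADOOP-10641 patches. The assumed
invariants are that every node sees a prefix of the same agreed sequence, that only proposed
events are agreed, and that no event is agreed twice.

```python
class LocalEngine:
    """Hypothetical single-process stand-in for a coordination engine."""

    def __init__(self, nodes):
        self.log = []                           # global agreed order
        self.delivered = {n: 0 for n in nodes}  # events delivered per node

    def submit(self, node, proposal):
        # A single process trivially agrees: append to the one global log.
        self.log.append((node, proposal))

    def deliver(self, node, count=1):
        # Nodes may lag: each one learns the agreed log at its own pace.
        self.delivered[node] = min(len(self.log), self.delivered[node] + count)

    def agreed_sequence(self, node):
        return self.log[: self.delivered[node]]


def check_agreement(engine, nodes, proposed):
    """Back-end independent check of the assumed invariants."""
    seqs = [engine.agreed_sequence(n) for n in nodes]
    longest = max(seqs, key=len)
    for seq in seqs:
        # Agreement: each node's view is a prefix of the longest view.
        assert seq == longest[: len(seq)]
        # Validity: only events that were proposed are ever agreed.
        assert all(event in proposed for event in seq)
        # Integrity: no event appears twice in the agreed sequence.
        assert len(set(seq)) == len(seq)
    return True
```

The check only touches the hypothetical interface, so the same function could be run against
any other back end exposing it. A local test like this cannot exercise real network
partitions; it verifies the behavior only as far as local tests can.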

The next issue becomes "can people implementing things understand the specification?". It's
why I used a notation that uses Python expressions and data structures; one that should be
easy to understand. It's also why users of the TLA+ stuff in the Java & C/C++ world tend
to use the curly-braced form of the language. 

I'm sorry if this appears harsh or that I've suddenly added a new criterion to what Hadoop
patches have to do, but given this Coordination Manager is proposed as a central part of a
future HDFS and YARN RM, then yes, we do have to define it properly. 


> Introduce Coordination Engine
> -----------------------------
>
>                 Key: HADOOP-10641
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10641
>             Project: Hadoop Common
>          Issue Type: New Feature
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Plamen Jeliazkov
>         Attachments: HADOOP-10641.patch, HADOOP-10641.patch, HADOOP-10641.patch, hadoop-coordination.patch
>
>
> Coordination Engine (CE) is a system, which allows to agree on a sequence of events in
a distributed system. In order to be reliable CE should be distributed by itself.
> Coordination Engine can be based on different algorithms (paxos, raft, 2PC, zab) and
have different implementations, depending on use cases, reliability, availability, and performance
requirements.
> CE should have a common API, so that it could serve as a pluggable component in different
projects. The immediate beneficiaries are HDFS (HDFS-6469) and HBase (HBASE-10909).
> First implementation is proposed to be based on ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
