hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-10641) Introduce Coordination Engine interface
Date Thu, 24 Jul 2014 10:48:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073075#comment-14073075 ]

Steve Loughran commented on HADOOP-10641:

bq. this jira is not proposing new Consensus protocols, as stated in this comment. CoordinationEngine
here is an interface to be used with existing consensus algorithms, 

Exactly. This JIRA proposes a plugin interface to coordination systems that use consensus
algorithms, a plugin point intended for use by HDFS and others. It is absolutely critical
that all implementations of this plugin do exactly what is expected of them, and we cannot
ensure that without a clear definition of what they are meant to do, what guarantees must
be met and what failure modes are expected.

The consensus node design document is not such a document. It's an outline of what can be
done, but it doesn't specify the API. The current patch for this JIRA contains some interfaces,
a ZK class and a single test case. Can we trust this ZK class to do what is required? Not
without a clear definition of what is required. Can we trust the test case to verify that
the ZK implementation does what is required? Not now, no. What do we do if there is a difference
between what the ZK implementation does and what the interface defines: is it the interface
at fault, or the ZK implementation? What if a third-party implementation does something
differently? Whose implementation is considered the correct one?

For the filesystems, HDFS defines the behavior; my HADOOP-9361 JIRA derived a specification
from that implementation, generated more corner-case tests, and captured the details of how
every other filesystem behaves differently in a declarative bit of XML for each FS, so now
we can see how they differ. We've even used it to bring the other filesystems (especially S3N)
more in line with what is expected.

This new plugin point is intended to become a critical failure point for HDFS and YARN, where
the incorrect behaviour of an implementation potentially places data at risk. Yet to date,
all we have is a PDF file of the kind Amazon describes: "conventional design documents consist
of prose, static diagrams, and perhaps pseudo-code in an ad hoc untestable language."

This interface is not a full consensus protocol; it will be straightforward to specify strictly
enough to derive tests and to tell implementors of consensus protocol-based systems how to hook
up their work to Hadoop. And, as those implementors are expected to be experts in distributed
systems and such topics, we should be able to expect them to pick up basic specification
languages, just as we expect submitters of all patches to be able to write JUnit tests.
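
As a rough illustration of what "strict enough to derive tests" could mean, here is a minimal
sketch of two guarantees written as executable checks. Everything in it is an assumption made
for illustration: the Engine interface, its method names, the in-memory stand-in and the two
guarantees themselves are hypothetical, and none of it is taken from the HADOOP-10641 patch
or the design document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoordinationContractSketch {

    /** Hypothetical plugin interface for illustration; NOT the API in the HADOOP-10641 patch. */
    interface Engine {
        /** Submits a proposal and returns the position at which it was agreed. */
        long propose(String proposal);

        /** Returns the agreed proposals, in order, as seen by the given node. */
        List<String> agreedSequence(int nodeId);
    }

    /** Single-process stand-in so the sketch runs; it performs no real consensus. */
    static final class InMemoryEngine implements Engine {
        private final List<String> log = new ArrayList<>();

        @Override
        public synchronized long propose(String proposal) {
            log.add(proposal);
            return log.size() - 1;
        }

        @Override
        public synchronized List<String> agreedSequence(int nodeId) {
            // Every node reads the same log here; a real engine would have to earn this.
            return new ArrayList<>(log);
        }
    }

    /** Assumed guarantee 1: each accepted proposal is agreed at a strictly higher position. */
    static boolean positionsStrictlyIncrease(Engine engine, List<String> proposals) {
        long previous = Long.MIN_VALUE;
        for (String proposal : proposals) {
            long position = engine.propose(proposal);
            if (position <= previous) {
                return false;
            }
            previous = position;
        }
        return true;
    }

    /** Assumed guarantee 2: every node observes the same sequence of agreed proposals. */
    static boolean allNodesSeeSameSequence(Engine engine, int nodeCount) {
        List<String> reference = engine.agreedSequence(0);
        for (int node = 1; node < nodeCount; node++) {
            if (!reference.equals(engine.agreedSequence(node))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Engine engine = new InMemoryEngine();
        List<String> proposals = Arrays.asList("mkdir /a", "create /a/b", "delete /a/b");
        System.out.println("positions increase: " + positionsStrictlyIncrease(engine, proposals));
        System.out.println("nodes agree: " + allNodesSeeSameSequence(engine, 3));
    }
}
```

The point of the sketch is that the checks are written against the interface only, so the same
checks could be run against a ZK-backed implementation or a third-party one, and a failure would
then point at the implementation rather than leaving the question open.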

> Introduce Coordination Engine interface
> ---------------------------------------
>                 Key: HADOOP-10641
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10641
>             Project: Hadoop Common
>          Issue Type: New Feature
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Plamen Jeliazkov
>         Attachments: HADOOP-10641.patch, HADOOP-10641.patch, HADOOP-10641.patch, hadoop-coordination.patch
> Coordination Engine (CE) is a system, which allows to agree on a sequence of events in a distributed system. In order to be reliable CE should be distributed by itself.
> Coordination Engine can be based on different algorithms (paxos, raft, 2PC, zab) and have different implementations, depending on use cases, reliability, availability, and performance.
> CE should have a common API, so that it could serve as a pluggable component in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and HBase (HBASE-10909).
> First implementation is proposed to be based on ZooKeeper.
