hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs
Date Thu, 28 Jun 2012 21:37:46 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403494#comment-13403494 ]

Todd Lipcon commented on HDFS-3077:
-----------------------------------

The following are the diffs introduced by the HDFS-3092 branch as of 1346682:

{code}
todd@todd-w510:~/git/hadoop-common/hadoop-hdfs-project$ git diff --stat origin/trunk..HDFS-3092
 .../hadoop-hdfs/CHANGES.HDFS-3092.txt              |   42 +++
 hadoop-hdfs-project/hadoop-hdfs/pom.xml            |   23 ++
 .../java/org/apache/hadoop/hdfs/DFSConfigKeys.java |   12 +-
 .../src/main/webapps/journal/index.html            |   29 ++
 .../src/main/webapps/journal/journalstatus.jsp     |   42 +++
 .../src/main/webapps/proto-journal-web.xml         |   17 +
 .../main/java/org/apache/hadoop/hdfs/DFSUtil.java  |   58 ++++
 .../java/org/apache/hadoop/hdfs/TestDFSUtil.java   |   41 +++
 .../hdfs/protocol/UnregisteredNodeException.java   |    4 +
 .../hdfs/protocolPB/JournalSyncProtocolPB.java     |   41 +++
 .../JournalSyncProtocolServerSideTranslatorPB.java |   60 ++++
 .../JournalSyncProtocolTranslatorPB.java           |   79 +++++
 .../server/protocol/JournalServiceProtocols.java   |   27 ++
 .../hdfs/server/protocol/JournalSyncProtocol.java  |   58 ++++
 .../src/main/proto/JournalSyncProtocol.proto       |   57 +++
 .../journalservice/GetJournalEditServlet.java      |  177 ++++++++++
 .../hadoop/hdfs/server/journalservice/Journal.java |  130 +++++++
 .../server/journalservice/JournalDiskWriter.java   |   61 ++++
 .../server/journalservice/JournalHttpServer.java   |  172 ++++++++++
 .../server/journalservice/JournalListener.java     |    4 +-
 .../hdfs/server/journalservice/JournalService.java |  359 +++++++++++++++++---
 .../hadoop/hdfs/server/namenode/FSEditLog.java     |   65 ++++-
 .../hadoop/hdfs/server/namenode/FSImage.java       |    4 +-
 .../hdfs/server/namenode/GetImageServlet.java      |   10 +-
 .../hadoop/hdfs/server/namenode/NNStorage.java     |    4 +-
 .../hdfs/server/namenode/TransferFsImage.java      |    4 +-
 .../hadoop/hdfs/server/namenode/JournalSet.java    |   34 ++
 .../org/apache/hadoop/hdfs/MiniDFSCluster.java     |    9 +
 .../hdfs/server/journalservice/TestJournal.java    |   71 ++++
 .../journalservice/TestJournalHttpServer.java      |  311 +++++++++++++++++
 .../server/journalservice/TestJournalService.java  |  150 +++++++--
 31 files changed, 2071 insertions(+), 84 deletions(-)
{code}

For each of these, I'll explain how the code was incorporated into the 3077 work or, where
the code was not carried over, why carrying it over doesn't make sense.


{code}
 hadoop-hdfs-project/hadoop-hdfs/pom.xml            |   23 ++
 .../java/org/apache/hadoop/hdfs/DFSConfigKeys.java |   12 +-
 .../src/main/webapps/journal/index.html            |   29 ++
 .../src/main/webapps/journal/journalstatus.jsp     |   42 +++
 .../src/main/webapps/proto-journal-web.xml         |   17 +
{code}

These are carried over, with the exception of the HTTPS-related keys. Trunk has moved away
from Kerberized HTTPS and towards SPNEGO for authenticated HTTP.

{code}
 .../main/java/org/apache/hadoop/hdfs/DFSUtil.java  |   58 ++++
 .../java/org/apache/hadoop/hdfs/TestDFSUtil.java   |   41 +++
{code}

These diffs provided a way to parse out the list of journal nodes from the Configuration object.
But it makes more sense to provide this list in the actual URI of the edit logs, for consistency
with the existing BKJM implementation. So, the equivalent method now exists as {{QuorumJournalManager.getLoggerAddresses(URI)}}.
Moving it from DFSUtil to QJM also made sense so that all of the new code is self-contained
in its own package (in the spirit of Journal Managers being pluggable components, I don't
think we should refer to them from the main DFS code).
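
A minimal sketch of what carrying the journal node list in the edits URI could look like. The {{qjournal://host:port;host:port/journalId}} layout, the class name {{LoggerAddressSketch}}, and the default port are assumptions for illustration only; this is not the code of {{QuorumJournalManager.getLoggerAddresses(URI)}}.

```java
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: extract journal node addresses from an edits URI. */
class LoggerAddressSketch {
  // Assumed default port, used only when an entry omits one.
  static final int DEFAULT_PORT = 8485;

  static List<InetSocketAddress> getLoggerAddresses(URI uri) {
    // With several hosts, java.net.URI keeps the whole list as the authority.
    String authority = uri.getAuthority();
    if (authority == null || authority.isEmpty()) {
      throw new IllegalArgumentException("URI has no authority: " + uri);
    }
    List<InetSocketAddress> addrs = new ArrayList<>();
    for (String part : authority.split(";")) {
      String hostPort = part.trim();
      if (hostPort.isEmpty()) {
        throw new IllegalArgumentException("Empty address in URI: " + uri);
      }
      int colon = hostPort.lastIndexOf(':');
      String host = colon < 0 ? hostPort : hostPort.substring(0, colon);
      int port = colon < 0
          ? DEFAULT_PORT
          : Integer.parseInt(hostPort.substring(colon + 1));
      // Unresolved on purpose: the sketch does no DNS lookups.
      addrs.add(InetSocketAddress.createUnresolved(host, port));
    }
    return addrs;
  }
}
```

Keeping the parsing next to the journal manager, rather than in a shared utility class, means nothing outside the package needs to know how the URI is laid out.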

{code}
 .../hdfs/protocol/UnregisteredNodeException.java   |    4 +
{code}

Given that the NN is configured with a static list of Journal Managers to write to, and that
membership is static, there's no need for a registration concept with the JNs -- the NN sends
RPCs _to_ the JNs, rather than the other way around.

{code}
 .../hdfs/protocolPB/JournalSyncProtocolPB.java     |   41 +++
 .../JournalSyncProtocolServerSideTranslatorPB.java |   60 ++++
 .../JournalSyncProtocolTranslatorPB.java           |   79 +++++
 .../server/protocol/JournalServiceProtocols.java   |   27 ++
 .../hdfs/server/protocol/JournalSyncProtocol.java  |   58 ++++
 .../src/main/proto/JournalSyncProtocol.proto       |   57 +++
{code}

I merged this protocol in with the other RPC protocol, since it reused the same types anyway.
If there's a strong motivation to have this as a separate protocol, I could be convinced,
but I think we need to make use of JN-specific items like epoch ID in here, which wouldn't
make sense in the context of a NN or other edits storage.
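
To illustrate why an epoch in every RPC is specific to the JNs, here is a minimal sketch of a node that fences writers holding a stale epoch. The class and method names ({{EpochGuardSketch}}, {{newEpoch}}, {{checkRequest}}) are hypothetical, and the rule shown is the usual Paxos-style promise, assumed here rather than taken from the 3077 patch.

```java
/** Hypothetical sketch of per-RPC epoch checking on a journal node. */
class EpochGuardSketch {
  // Highest epoch this node has promised to honor.
  private long lastPromisedEpoch = 0;

  /** A writer claims a new epoch; only strictly higher epochs are accepted. */
  synchronized void newEpoch(long epoch) {
    if (epoch <= lastPromisedEpoch) {
      throw new IllegalStateException("Proposed epoch " + epoch
          + " is not greater than last promised epoch " + lastPromisedEpoch);
    }
    lastPromisedEpoch = epoch;
  }

  /** Every later RPC carries the writer's epoch; stale writers are rejected. */
  synchronized void checkRequest(long epoch) {
    if (epoch < lastPromisedEpoch) {
      throw new IllegalStateException("Request epoch " + epoch
          + " is less than last promised epoch " + lastPromisedEpoch);
    }
  }
}
```

A NameNode or other edits storage has no notion of a promised epoch, which is why a check like this does not fit a shared protocol.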

{code}
 .../journalservice/GetJournalEditServlet.java      |  177 ++++++++++
{code}

This has been moved into the qjournal package. It is otherwise mostly the same. The one salient
difference is that it now only accepts the {{startTxId}} of the segment to be downloaded.
This was necessary because, during the recovery step, the source node for synchronization
may finalize its log segment while other nodes are in the process of synchronizing from it.
So, we look for either in-progress or finalized logs. I can explain this in further detail
if necessary.
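
A minimal sketch of the lookup described above, where a segment is located by its {{startTxId}} alone so that the request still succeeds if the source finalizes the segment mid-recovery. The file naming ({{edits_<start>-<end>}} for finalized segments, {{edits_inprogress_<start>}} for open ones) and the class name {{SegmentLookupSketch}} are assumptions for illustration, not the servlet's actual code.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: find a log segment by its first transaction ID. */
class SegmentLookupSketch {
  // Assumed naming conventions for finalized and in-progress segments.
  private static final Pattern FINALIZED =
      Pattern.compile("edits_(\\d+)-(\\d+)");
  private static final Pattern IN_PROGRESS =
      Pattern.compile("edits_inprogress_(\\d+)");

  static Optional<String> findSegment(List<String> fileNames, long startTxId) {
    for (String name : fileNames) {
      Matcher fin = FINALIZED.matcher(name);
      Matcher inp = IN_PROGRESS.matcher(name);
      // Either form is acceptable as long as it starts at startTxId.
      if ((fin.matches() && Long.parseLong(fin.group(1)) == startTxId)
          || (inp.matches() && Long.parseLong(inp.group(1)) == startTxId)) {
        return Optional.of(name);
      }
    }
    return Optional.empty();
  }
}
```

If the request also named an end transaction ID, a caller that asked for the in-progress form would miss the file once it had been renamed to its finalized form.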

{code}
 .../hadoop/hdfs/server/journalservice/Journal.java |  130 +++++++
 .../server/journalservice/JournalDiskWriter.java   |   61 ++++
{code}
These have been merged and moved to the qjournal package, and simplified to work directly
with a single FileJournalManager instead of an FSEditLog and NNStorage. The reasoning is that
a journal's storage is not the same as a NameNode's storage, nor is it a fully general-purpose
wrapper around edit logs. For example, we are not trying to support running a JournalNode
which itself logs to a pluggable backend log. Additionally, at this point, we can only correctly
support a single directory under a quorum participant, or else there are a lot more edge cases
to consider where a node may renege on its promises if its set of directories changes during
a restart.

{code}
 .../server/journalservice/JournalHttpServer.java   |  172 ++++++++++
{code}

This has been moved to the qjournal package, and is otherwise mostly the same. The one difference
is that I switched to SPNEGO instead of the kerberized SSL, to match trunk.

{code}
 .../server/journalservice/JournalListener.java     |    4 +-
{code}
The change here in the 3092 branch was just adding a new exception to a method, which didn't
turn out to be necessary anymore.

{code}
 .../hdfs/server/journalservice/JournalService.java |  359 +++++++++++++++++---
{code}
This class is called JournalNode now. We no longer have a state machine here - the recovery
process/syncing process is coordinated by the NameNode/client side, and the commit protocol
ensures that every segment is always finalized on a majority of nodes. So the state machine
isn't necessary to ensure that all the log segments are replicated successfully.
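
The majority rule that the commit protocol relies on can be sketched as follows. The class and method names are hypothetical, and the sketch assumes a simple majority quorum over a fixed set of nodes.

```java
/** Hypothetical sketch of the majority rule behind a quorum commit. */
class QuorumSketch {
  /** Smallest number of nodes that forms a majority of numNodes. */
  static int majoritySize(int numNodes) {
    return numNodes / 2 + 1;
  }

  /** A segment counts as finalized once a majority has acknowledged it. */
  static boolean isCommitted(int numNodes, int acks) {
    return acks >= majoritySize(numNodes);
  }
}
```

With three nodes, {{majoritySize(3)}} is 2, so a segment is treated as finalized once two of the three nodes acknowledge it.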

{code}
 .../hadoop/hdfs/server/namenode/FSEditLog.java     |   65 ++++-
 .../hadoop/hdfs/server/namenode/FSImage.java       |    4 +-
{code}
The new functions added here were necessary for the synchronization process above, but not
necessary for the recovery process implemented by 3077.

{code}
 .../hdfs/server/namenode/GetImageServlet.java      |   10 +-
 .../hadoop/hdfs/server/namenode/NNStorage.java     |    4 +-
 .../hdfs/server/namenode/TransferFsImage.java      |    4 +-
{code}

Just changes to make things public -- same changes in 3077 patch.

{code}
 .../hadoop/hdfs/server/namenode/JournalSet.java    |   34 ++
{code}
Since the Journal talks directly to FJM in 3077, the {{getFinalizedSegments}} function added
here is instead fulfilled by making the existing function {{FileJournalManager.getLogFiles()}}
public.

{code}
 .../org/apache/hadoop/hdfs/MiniDFSCluster.java     |    9 +
{code}
The 3077 branch has a MiniJournalCluster which can start/stop/restart JournalNodes. The change
here in the 3092 branch appears to be incomplete -- it adds functions which are never called.

{code}
 .../hdfs/server/journalservice/TestJournal.java    |   71 ++++
 .../journalservice/TestJournalHttpServer.java      |  311 +++++++++++++++++
 .../server/journalservice/TestJournalService.java  |  150 +++++++--
{code}

Several of these tests are carried over into similarly named tests in 3077. Others didn't
make sense due to changes mentioned above. The test coverage in the patch attached here is
comparable to the coverage in 3092, and I'm working on adding a lot more tests currently.


So, to summarize the key similarities and differences:
- most of the HTTP server, RPC server, node wrapper code, JSP pages, journal HTTP servlet
carried over
-- SPNEGO support instead of KSSL
-- Lifecycle code a little different in order to work with MiniJournalCluster
- the edits "synchronization" code is mostly removed, since the synchronization is now NN-led.
If we feel strongly that the JNs should synchronize "old" log segments, instead of just ensuring
that every segment is stored on a quorum, we can bring some of this back. But we're already
guaranteed that every segment has two replicas by this design.
- new JNStorage class, since the JN doesn't actually share most of its storage code with the
NameNode (e.g. we have no images, no checkpoints, no distributed upgrade coordination, etc.)
- Journal updated to use above JNStorage class instead of NNStorage and FSEditLog
- Not attempting to share the RPC protocol with BackupNode or NameNode, since we need quorum-specific
information like epoch numbers in every RPC, and those don't make sense in the other contexts.

                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on shared storage
such as an NFS filer for the shared edit log. One alternative that has been proposed is to
depend on BookKeeper, a ZooKeeper subproject which provides a highly available replicated
edit log on commodity hardware. This JIRA is to implement another alternative, based on a
quorum commit protocol, integrated more tightly in HDFS and with the requirements driven only
by HDFS's needs rather than more generic use cases. More details to follow.


        
