hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3797) QJM: add segment txid as a parameter to journal() RPC
Date Tue, 14 Aug 2012 23:23:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434626#comment-13434626 ]

Aaron T. Myers commented on HDFS-3797:

Patch looks pretty good to me. One question: have you considered adding a test case that ensures
that a JN which experiences this scenario will return to participating in the quorum after
the next finalize/new segment?

Nit: looks like the method comment for testMissFinalizeAndNextStart got messed up a little
bit: "+   *    */"
> QJM: add segment txid as a parameter to journal() RPC
> -----------------------------------------------------
>                 Key: HDFS-3797
>                 URL: https://issues.apache.org/jira/browse/HDFS-3797
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Minor
>         Attachments: hdfs-3797.txt
> During fault testing of QJM, I saw the following issue:
> 1) NN sends txn 5 to JN
> 2) NN gets partitioned from JN while JN remains up. The next two RPCs are missed while
> the partition has happened:
> 2a) finalizeSegment(1-5)
> 2b) startSegment(6)
> 3) NN sends txn 6 to JN
> This caused one of the JNs to end up with a segment 1-10 while the others had two segments;
> 1-5 and 6-10. This broke some invariants of the QJM protocol and prevented the recovery protocol
> from running properly.
> This can be addressed on the client side by HDFS-3726, which would cause the NN to not
> send the RPC in #3. But it makes sense to also add an extra safety check here on the server
> side: with every journal() call, we can send the segment's txid. Then if the JN and the client
> get "out of sync", the JN can reject the RPCs.
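
The server-side check described in the issue can be sketched roughly as follows. This is a toy model for illustration only, not the real JournalNode code and not the contents of hdfs-3797.txt: the class ToyJournalNode, its fields, and its method signatures are all hypothetical, and the rule that startSegment() simply abandons a stale open segment is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a JournalNode (hypothetical; not the Hadoop class). */
class ToyJournalNode {
    private static final long NO_SEGMENT = -1;

    /** First txid of the segment currently open on this node. */
    private long curSegmentTxId = NO_SEGMENT;
    /** Next txid this node expects to append. */
    private long nextTxId = 1;
    /** Finalized segments as {firstTxId, lastTxId} pairs. */
    private final List<long[]> finalizedSegments = new ArrayList<>();

    /** Opens a new segment. Assumption: any stale open segment is abandoned. */
    void startSegment(long txid) {
        curSegmentTxId = txid;
        nextTxId = txid;
    }

    /** Closes the open segment if it matches the range the writer names. */
    void finalizeSegment(long firstTxId, long lastTxId) {
        if (curSegmentTxId != firstTxId || lastTxId != nextTxId - 1) {
            throw new IllegalStateException("no matching open segment "
                + firstTxId + "-" + lastTxId);
        }
        finalizedSegments.add(new long[] {firstTxId, lastTxId});
        curSegmentTxId = NO_SEGMENT;
    }

    /**
     * Appends one transaction. The writer passes the first txid of the
     * segment it believes is open; a mismatch means the two sides are
     * out of sync, so the call is rejected instead of appended.
     */
    void journal(long segmentTxId, long txid) {
        if (segmentTxId != curSegmentTxId) {
            throw new IllegalStateException("out of sync: writer is in segment "
                + segmentTxId + " but this node has segment " + curSegmentTxId);
        }
        if (txid != nextTxId) {
            throw new IllegalStateException("expected txid " + nextTxId
                + " but got " + txid);
        }
        nextTxId++;
    }

    long getCurrentSegmentTxId() {
        return curSegmentTxId;
    }

    int getFinalizedSegmentCount() {
        return finalizedSegments.size();
    }
}
```

In this model, a node that has journaled txns 1-5 in segment 1 and then misses finalizeSegment(1-5) and startSegment(6) still has segment 1 open, so journal(6, 6) throws rather than extending segment 1. A later startSegment(11) puts the node back in step with the writer, which is roughly the rejoin behavior the test case question above is about.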
