hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10659) Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory
Date Thu, 21 Jul 2016 00:05:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386863#comment-15386863 ]

Jing Zhao commented on HDFS-10659:

I don't think we need to manually recreate the "current" directory or copy the VERSION file
here. After restarting JN1, and before shutting down JN2, try rolling the edit log segment
(hdfs dfsadmin -rollEdits). That way every JN starts a new segment, so JN1 can take part in
the protocol normally. Shutting down JN2 afterwards should then be fine.

> Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory
> --------------------------------------------------------------------------------------------------
>                 Key: HDFS-10659
>                 URL: https://issues.apache.org/jira/browse/HDFS-10659
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, journal-node
>    Affects Versions: 2.7.1
>            Reporter: Amit Anand
> In my environment I am seeing {{Namenodes}} crash after a majority of the {{Journalnodes}}
are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling
re-install of each node, including the master (NN, JN, RM, ZK) nodes. When a journal node is re-installed
or moved to a new disk/host, instead of running the {{"initializeSharedEdits"}} command, I copy
the {{VERSION}} file from one of the other {{Journalnodes}}, and that allows my {{NN}} to start
writing data to the newly installed {{Journalnode}}.
> To achieve quorum for the JNs and recover unfinalized segments, the NN creates NNNN.tmp
files under the {{"<disk>/jn/current/paxos"}} directory during startup. In the current implementation the "paxos"
directory is only created by the {{"initializeSharedEdits"}} command, so if a JN is re-installed
the "paxos" directory is not created on JN startup or by the NN while writing the NNNN.tmp files,
which causes the NN to crash with the following error message:
> {code}
> /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>         at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}
> The current [getPaxosFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java#L128-L130]
method simply returns a path to a file under the "paxos" directory without verifying that the directory exists.
Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum,
my proposed solution is to add a check to the "getPaxosFile" method and create the {{"paxos"}}
directory if it is missing.
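
For illustration only, the proposed check could look roughly like the sketch below. It is not the actual JNStorage code or the eventual patch: the class name PaxosDirSketch, the helper getOrCreatePaxosDir, and the "current directory plus segment txid" signature are all assumptions made so the example is self-contained. The ".tmp" suffix in main only mimics the NNNN.tmp files mentioned in the description.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical stand-in for the JN storage class; not the real JNStorage. */
public class PaxosDirSketch {

    /** Returns currentDir/paxos, creating the directory when it is missing. */
    public static File getOrCreatePaxosDir(File currentDir) throws IOException {
        File paxosDir = new File(currentDir, "paxos");
        // createDirectories does nothing if the directory already exists and
        // throws if the path is a regular file or cannot be created.
        Files.createDirectories(paxosDir.toPath());
        return paxosDir;
    }

    /** Returns the path of the paxos file for a segment (assumed naming: the txid). */
    public static File getPaxosFile(File currentDir, long segmentTxId) throws IOException {
        return new File(getOrCreatePaxosDir(currentDir), String.valueOf(segmentTxId));
    }

    public static void main(String[] args) throws IOException {
        // Simulate a re-installed JN: "current" exists but "paxos" does not.
        File currentDir = Files.createTempDirectory("jn-current").toFile();
        File paxosFile = getPaxosFile(currentDir, 64044L);

        // Without the directory check, opening this stream would fail with
        // FileNotFoundException (No such file or directory), as in the trace.
        File tmp = new File(paxosFile.getParentFile(), paxosFile.getName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write("recovery data".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("wrote " + tmp.getName() + " under " + tmp.getParentFile().getName());
    }
}
```

Creating the directory lazily on lookup keeps the check in one place, at the cost of giving a getter a side effect; creating it once at JN startup would be an alternative.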
