hadoop-hdfs-issues mailing list archives

From "Arpit Agarwal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10659) Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos directory
Date Tue, 24 Oct 2017 23:52:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217904#comment-16217904 ]

Arpit Agarwal commented on HDFS-10659:
--------------------------------------

bq. During the transition to active state, the Namenode crashes if a quorum of JNs do not have the
paxos directory. This is because the NN tries to recover the log segments and in the process
needs to write recovery data into the paxos dir. Data is written into the paxos dir only during
the Journal#acceptRecovery() phase. So all we need to do is add a check and create the paxos dir
if it does not exist during this phase.

+1 for this approach. I will review the v04 patch.
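
For illustration only, here is a minimal sketch of the kind of check described above: make sure the paxos directory exists immediately before the recovery data is written. This is not the v04 patch and not Hadoop code. The class {{PaxosDirSketch}} and its methods are hypothetical stand-ins, and the write-to-.tmp-then-rename step is an assumption based on the NNNN.tmp files and the AtomicFileOutputStream frame in the stack trace below.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in for the JournalNode storage logic; not Hadoop code. */
public class PaxosDirSketch {

    /** Mirrors the "<disk>/jn/current/paxos" layout mentioned in the report. */
    static Path paxosDir(Path currentDir) {
        return currentDir.resolve("paxos");
    }

    /**
     * Writes recovery data for one segment. The paxos directory is created
     * first if it is missing, so a re-installed node does not fail here.
     */
    static Path persistPaxosData(Path currentDir, long segmentTxId, byte[] data)
            throws IOException {
        Path dir = paxosDir(currentDir);
        // Does nothing if the directory already exists.
        Files.createDirectories(dir);

        // Assumed pattern: write to NNNN.tmp, then rename to the final name.
        Path tmp = dir.resolve(segmentTxId + ".tmp");
        Path dst = dir.resolve(Long.toString(segmentTxId));
        Files.write(tmp, data);
        Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
        return dst;
    }

    public static void main(String[] args) throws IOException {
        // Simulates a freshly re-installed JN: "current" exists, "paxos" does not.
        Path current = Files.createTempDirectory("jn-current");
        byte[] data = "recovery-data".getBytes(StandardCharsets.UTF_8);
        Path written = persistPaxosData(current, 64044L, data);
        System.out.println("wrote " + written + ", exists=" + Files.exists(written));
    }
}
```

Creating the directory at write time, rather than at JN startup, keeps the check next to the only code path that needs the directory.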

> Namenode crashes after Journalnode re-installation in an HA cluster due to missing paxos
directory
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10659
>                 URL: https://issues.apache.org/jira/browse/HDFS-10659
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, journal-node
>    Affects Versions: 2.7.0
>            Reporter: Amit Anand
>            Assignee: Hanisha Koneru
>         Attachments: HDFS-10659.000.patch, HDFS-10659.001.patch, HDFS-10659.002.patch,
HDFS-10659.003.patch, HDFS-10659.004.patch
>
>
> In my environment I am seeing {{Namenodes}} crash after a majority of the {{Journalnodes}}
are re-installed. We manage multiple clusters and do rolling upgrades followed by a rolling
re-install of each node, including the master (NN, JN, RM, ZK) nodes. When a journal node is re-installed
or moved to a new disk/host, instead of running the {{"initializeSharedEdits"}} command, I copy
the {{VERSION}} file from one of the other {{Journalnodes}}, which allows my {{NN}} to start
writing data to the newly installed {{Journalnode}}.
> To achieve JN quorum and recover unfinalized segments, the NN creates NNNN.tmp files
under the {{"<disk>/jn/current/paxos"}} directory during startup. In the current implementation, the "paxos"
directory is created only by the {{"initializeSharedEdits"}} command. If a JN is re-installed,
the "paxos" directory is not created on JN startup or by the NN while writing NNNN.tmp files,
which causes the NN to crash with the following error message:
> {code}
> 192.168.100.16:8485: /disk/1/dfs/jn/Test-Laptop/current/paxos/64044.tmp (No such file or directory)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>         at org.apache.hadoop.hdfs.util.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:58)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.persistPaxosData(Journal.java:971)
>         at org.apache.hadoop.hdfs.qjournal.server.Journal.acceptRecovery(Journal.java:846)
>         at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.acceptRecovery(JournalNodeRpcServer.java:205)
>         at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.acceptRecovery(QJournalProtocolServerSideTranslatorPB.java:249)
>         at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25435)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}
> The current [getPaxosFile|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JNStorage.java#L128-L130]
method simply returns a path to a file under the "paxos" directory without verifying that the directory exists.
Since the "paxos" directory holds files that are required for NN recovery and for achieving JN quorum,
my proposed solution is to add a check to the "getPaxosFile" method and create the {{"paxos"}}
directory if it is missing.
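
As a hedged aside, the failure mode in the stack trace can be reproduced outside Hadoop: java.io.FileOutputStream throws FileNotFoundException when the parent directory of the target file does not exist. The sketch below assumes nothing about Hadoop internals. The class {{MissingParentDemo}} and the helper {{canOpenForWrite}} are hypothetical names used only for this demonstration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

/** Hypothetical demo; shows why a missing parent directory breaks the write. */
public class MissingParentDemo {

    /** Returns true if the file can be opened for writing, false if the open fails. */
    static boolean canOpenForWrite(File f) throws IOException {
        try (FileOutputStream out = new FileOutputStream(f)) {
            return true;
        } catch (FileNotFoundException e) {
            // Raised when the parent directory is missing, among other causes.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // "current" exists, but "current/paxos" has not been created.
        File current = Files.createTempDirectory("jn-current").toFile();
        File tmp = new File(new File(current, "paxos"), "64044.tmp");

        System.out.println("before mkdirs: " + canOpenForWrite(tmp));
        tmp.getParentFile().mkdirs(); // the proposed check, in its simplest form
        System.out.println("after mkdirs:  " + canOpenForWrite(tmp));
    }
}
```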



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

