hadoop-hdfs-issues mailing list archives

From "zhouguangwei (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x
Date Wed, 17 Apr 2019 02:42:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818740#comment-16818740
] 

zhouguangwei edited comment on HDFS-13596 at 4/17/19 2:41 AM:
--------------------------------------------------------------

After rolling upgrade of the NN nodes to 3.x while the DNs remain on 2.x, reads and writes
to HDFS from a 2.x client fail.

Write failure sample:

{color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color}
 {color:#d04437}xxx org.apache.hadoop.hdfs.security.token.block.InvalidBlockTokenException:
Got access token error, status message , ack with firstBadLink as x.x.x.x:x{color}
 {color:#d04437} at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134){color}
 {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1823){color}
 {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1724){color}
 {color:#d04437} at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:713){color}
 {color:#d04437}x.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Abandoning BP-1321128176-x-1552442036118:blk_1073742246_1422{color}
 {color:#d04437}.x.x.x 19/04/16 15:21:42 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[x:25009,DS-63920a14-79b9-497a-b741-21bdf1401ad1,DISK]{color}
 {color:#d04437}19/04/16 15:21:42 INFO hdfs.DataStreamer: Exception in createBlockOutputStream{color}



> NN restart fails after RollingUpgrade from 2.x to 3.x
> -----------------------------------------------------
>
>                 Key: HDFS-13596
>                 URL: https://issues.apache.org/jira/browse/HDFS-13596
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>            Reporter: Hanisha Koneru
>            Assignee: Fei Hui
>            Priority: Critical
>         Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, HDFS-13596.003.patch,
HDFS-13596.004.patch, HDFS-13596.005.patch, HDFS-13596.006.patch, HDFS-13596.007.patch
>
>
> After rolling upgrade of the NN from 2.x to 3.x, if the NN is restarted, it fails while replaying
edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to editLogs (before
finalizing the upgrade) is the pre-upgrade layout version (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout version. In
3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the upgrade will have
the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old layout version
from the editLog file. When parsing the transactions, it assumes that the transactions are
also from the previous layout and hence skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and leads to NN
shutting down.
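
The skew described above can be sketched with a toy writer/reader pair. This is a minimal sketch under assumed field layouts, not Hadoop's actual edit-log encoding: the "new-format" writer emits an extra erasure-coding byte before the inode id, and a reader gated on the pre-upgrade layout version never consumes that byte, so every subsequent field is decoded from shifted bytes.

```java
import java.io.*;

public class LayoutSkewDemo {
    // Hypothetical 3.x-style transaction: an erasure-coding byte, then an inode id.
    static byte[] writeOpNewFormat(long inodeId) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeByte(1);        // erasure-coding bits (written only by 3.x)
            out.writeLong(inodeId);  // inode id
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parsing is gated on the layout version recorded in the edit-log header.
    static long readOp(byte[] data, int layoutVersion) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            if (layoutVersion >= 3) {
                in.readByte(); // consume the erasure-coding bits
            }
            // With the old layout version the EC byte is never consumed,
            // so this long is read from shifted bytes and comes back garbled.
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] op = writeOpNewFormat(16388L);
        System.out.println("new-layout reader: " + readOp(op, 3));
        System.out.println("old-layout reader: " + readOp(op, 2)); // corrupted value
    }
}
```

The same shift is what produces the "Invalid clientId - length is 0" and inode-id errors in the sample output below: fields after the unconsumed erasure-coding bits are decoded at the wrong offsets.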
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
>  at org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered
exception loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than the current
value (=16389), where newValue=16388
>  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value
(=16389), where newValue=16388
>  at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
>  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
> {code}
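
The `Cannot skip to less than the current value` failure follows from the monotonicity check in `org.apache.hadoop.util.SequentialNumber.skipTo`. A minimal sketch of that invariant (an assumption about its shape, not Hadoop's exact code): the counter only moves forward, so replaying a misparsed inode id of 16388 against a current value of 16389 must throw.

```java
// Sketch of a forward-only counter in the style of SequentialNumber.skipTo
// (hypothetical simplification, not the actual Hadoop class).
public class SequentialCounter {
    private long current;

    public SequentialCounter(long initial) {
        current = initial;
    }

    // Reject any attempt to move the counter backwards.
    public void skipTo(long newValue) {
        if (newValue < current) {
            throw new IllegalStateException("Cannot skip to less than the current value (="
                + current + "), where newValue=" + newValue);
        }
        current = newValue;
    }

    public long get() {
        return current;
    }
}
```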



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
