hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9874) Long living DataXceiver threads cause volume shutdown to block.
Date Mon, 21 Mar 2016 21:05:26 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205137#comment-15205137 ]

Wei-Chiu Chuang commented on HDFS-9874:
---------------------------------------

Thanks for looking into it. Maybe the NPE is unrelated.
I'm not able to reproduce the test failure; it could be an intermittently flaky test.
In any case, it would be great if you could improve the test diagnostics using {{GenericTestUtils#assertExceptionContains}}. This utility method includes the full stack trace of the exception in the assertion failure when the exception message doesn't contain the expected value.
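A minimal sketch of the pattern, assuming a hypothetical test that expects the "Directory is not writable" failure from DiskChecker (the {{volume}} variable and expected message are illustrative, not the actual test in this patch):
{code:java}
import org.apache.hadoop.test.GenericTestUtils;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;
import org.junit.Assert;
import org.junit.Test;

@Test
public void testFailedVolumeDiagnostics() throws Exception {
  try {
    volume.checkDirs();  // hypothetical call expected to throw on the failed volume
    Assert.fail("Expected a DiskErrorException");
  } catch (DiskErrorException e) {
    // On mismatch, the raised AssertionError carries the stringified stack
    // trace of the original exception, making flaky failures easier to triage.
    GenericTestUtils.assertExceptionContains("Directory is not writable", e);
  }
}
{code}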

> Long living DataXceiver threads cause volume shutdown to block.
> ---------------------------------------------------------------
>
>                 Key: HDFS-9874
>                 URL: https://issues.apache.org/jira/browse/HDFS-9874
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Critical
>             Fix For: 2.7.3
>
>         Attachments: HDFS-9874-trunk-1.patch, HDFS-9874-trunk-2.patch, HDFS-9874-trunk.patch
>
>
> One failed-volume shutdown took 3 days to complete.
> Below are the relevant datanode logs from shutting down a volume (due to a disk failure):
> {noformat}
> 2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current: 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
>         at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
>         at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
> 2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
> 2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
> java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
>         at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
>         at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
> java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
> 2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
> 2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
> 2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
> {noformat} 
> The datanode waits for the reference count to become zero before shutting down the volume.
> {code:title=FsVolumeImpl.java|borderStyle=solid}
> while (this.reference.getReferenceCount() > 0) {
>   if (FsDatasetImpl.LOG.isDebugEnabled()) {
>     FsDatasetImpl.LOG.debug(String.format(
>         "The reference count for %s is %d, wait to be 0.",
>         this, reference.getReferenceCount()));
>   }
>   try {
>     Thread.sleep(SLEEP_MILLIS);
>   } catch (InterruptedException e) {
>     throw new IOException(e);
>   }
> }
> {code}
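> For context, each writer (e.g. a DataXceiver's block receiver) obtains a volume reference before touching the disk and releases it when done; the loop above spins until every such reference is gone. A simplified sketch of this pattern (illustrative names, not the actual {{FsVolumeImpl}} code):
> {code:java}
> import java.io.Closeable;
> import java.io.IOException;
> import java.util.concurrent.atomic.AtomicInteger;
> 
> class Volume {
>   private final AtomicInteger refCount = new AtomicInteger(0);
> 
>   // A writer takes a reference before using the volume...
>   Closeable obtainReference() {
>     refCount.incrementAndGet();
>     return () -> refCount.decrementAndGet();  // ...and drops it on close()
>   }
> 
>   // Shutdown spins until every outstanding reference has been released.
>   void shutdown() throws IOException {
>     while (refCount.get() > 0) {
>       try {
>         Thread.sleep(1000);
>       } catch (InterruptedException e) {
>         throw new IOException(e);
>       }
>     }
>   }
> }
> {code}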
> Just before the datanode logged the following line,
> {noformat} 
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
> {noformat}
> I saw the following stack trace:
> {noformat}
> 2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
> 2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Premature EOF from inputStream
> 2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation  src: /upStreamDNIp:55710 dst: /thisDNIp:1004
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Tracking the block blk_7059462432_1109832821906, I see that it was created on 2016-02-17 15:06:28,256:
> {noformat}
> 2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
> {noformat}
> The job that created this file had been running for more than 7 days; the client eventually failed to renew its delegation token, so the lease manager could not renew the lease for this file.
> Once commitBlockSynchronization kicked in, it tried to recover the block, and the DataXceiver thread eventually died, decrementing the reference count.
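> In other words, the xceiver held its volume reference for the whole multi-day write; continuing the sketch above, the count only drops when the thread finally exits ({{receiveBlock}} is a hypothetical stand-in for the block-receive path):
> {code:java}
> // Illustrative only: a stalled writer pins the reference count for days,
> // so Volume.shutdown() above cannot finish until this thread dies.
> void runXceiver(Volume volume, java.io.InputStream upstream) throws IOException {
>   try (Closeable ref = volume.obtainReference()) {  // count: 0 -> 1
>     receiveBlock(upstream);  // can block indefinitely on a stalled client
>   }                          // count: 1 -> 0 only when this frame unwinds
> }
> {code}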



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
