hadoop-hdfs-issues mailing list archives

From "Rohit Kochar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2095) org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check storm making data node unavailable
Date Tue, 05 Mar 2013 17:58:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593673#comment-13593673 ]

Rohit Kochar commented on HDFS-2095:
------------------------------------

We have hit this issue in production as well. In our case, most of the checkDiskError() invocations
are triggered by network-related exceptions in BlockReceiver$PacketResponder.run() and
DataNode$DataTransfer.run().
Since the current code has a common catch clause for all types of exceptions, checkDiskError()
is executed even for network-related IOExceptions, which leads to a check storm and thereby
slows down the datanode.
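
To make the problem concrete, here is a toy, self-contained illustration of that pattern (a
simplified sketch only, not the actual BlockReceiver/DataNode code; the transfer() method and
the counter are made up for demonstration):

    import java.io.IOException;
    import java.net.SocketException;

    // Toy illustration of the behaviour described above, not Hadoop code:
    // one common catch clause treats a network error exactly like a disk
    // error, so the expensive disk check runs on every failed transfer.
    public class CheckStormIllustration {

        static int diskChecks = 0;

        // Stand-in for DataNode#checkDiskError(): in the real datanode this
        // walks and locks the volumes, which is what makes the storm costly.
        static void checkDiskError() {
            diskChecks++;
        }

        static void transfer(boolean networkFailure) {
            try {
                if (networkFailure) {
                    throw new SocketException("Connection reset"); // network problem
                }
                throw new IOException("Input/output error");       // genuine disk problem
            } catch (IOException e) {
                // Common catch clause: no distinction between the two cases.
                checkDiskError();
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 100; i++) {
                transfer(true); // 100 transfers failing purely on the network...
            }
            System.out.println(diskChecks + " disk checks triggered"); // ...still 100 disk scans
        }
    }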

One way to fix this could be to check, in the catch clause, whether the exception class belongs
to the "java.net" package and, in those cases, skip the disk check, as sketched below.
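
A minimal sketch of that guard could look like the following (isNetworkRelated() is just a
hypothetical helper name, not part of any existing patch):

    import java.io.IOException;
    import java.net.SocketTimeoutException;

    // Sketch of the proposed guard: skip the disk check when the exception
    // class comes from the java.net package. Hypothetical helper, not a patch.
    public class NetworkExceptionCheck {

        static boolean isNetworkRelated(Throwable t) {
            Package p = (t == null) ? null : t.getClass().getPackage();
            return p != null && p.getName().startsWith("java.net");
        }

        public static void main(String[] args) {
            Throwable networkError = new SocketTimeoutException("read timed out");
            Throwable diskError = new IOException("Input/output error");

            System.out.println(isNetworkRelated(networkError)); // true  -> skip checkDiskError()
            System.out.println(isNetworkRelated(diskError));    // false -> run checkDiskError() as today
        }
    }

In the catch clauses of BlockReceiver$PacketResponder.run() and DataNode$DataTransfer.run(),
checkDiskError() would then only be called when such a check returns false. One caveat is that
some network-related failures surface as exceptions outside java.net (for example the
java.io.InterruptedIOException in the trace quoted below), so inspecting the cause chain or
message might also be needed.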

Folks,
Please suggest whether you think the above-mentioned fix is the right approach to take.
If so, I can then submit a patch for this issue.

                
> org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check storm making data node unavailable
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2095
>                 URL: https://issues.apache.org/jira/browse/HDFS-2095
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.21.0
>            Reporter: Vitalii Tymchyshyn
>            Assignee: Todd Lipcon
>         Attachments: patch2.diff, patch.diff, pathch3.diff
>
>
> I can see that if a data node receives some IO error, this can cause a checkDir storm.
> What I mean:
> 1) any error produces DataNode.checkDiskError call
> 2) this call locks volume:
>  java.lang.Thread.State: RUNNABLE
>        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>        at java.io.File.exists(File.java:733)
>        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
>        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
>        - locked <0x000000080a8faec0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 3) This produces timeouts on other calls, e.g.
> 2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
> java.io.InterruptedIOException
>        at java.io.FileOutputStream.writeBytes(Native Method)
>        at java.io.FileOutputStream.write(FileOutputStream.java:260)
>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 4) This, in turn, produces more "dir check calls".
> 5) The whole cluster works very slowly because of the half-working node.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
