hadoop-hdfs-issues mailing list archives

From "Inder SIngh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2095) org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check storm making data node unavailable
Date Mon, 19 Nov 2012 06:03:06 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500037#comment-13500037
] 

Inder SIngh commented on HDFS-2095:
-----------------------------------

Folks,

We are hitting this in production with the same kind of effects described here. We are running cdh3u3.
Until the fix makes it into another update, can anyone suggest a workaround for this problem?


                
> org.apache.hadoop.hdfs.server.datanode.DataNode#checkDiskError produces check storm making data node unavailable
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2095
>                 URL: https://issues.apache.org/jira/browse/HDFS-2095
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.21.0
>            Reporter: Vitalii Tymchyshyn
>         Attachments: patch2.diff, patch.diff, pathch3.diff
>
>
> I can see that if a data node receives an IO error, this can cause a checkDir storm.
> What I mean is:
> 1) any error produces DataNode.checkDiskError call
> 2) this call locks volume:
>  java.lang.Thread.State: RUNNABLE
>        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>        at java.io.File.exists(File.java:733)
>        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
>        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
>        - locked <0x000000080a8faec0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
>        at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
>        at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 3) This produces timeouts on other calls, e.g.
> 2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
> java.io.InterruptedIOException
>        at java.io.FileOutputStream.writeBytes(Native Method)
>        at java.io.FileOutputStream.write(FileOutputStream.java:260)
>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
>        at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
>        at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
>        at java.lang.Thread.run(Thread.java:619)
> 4) This, in turn, produces more "dir check calls".
> 5) The whole cluster works very slowly because of the half-working node.
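
For readers unfamiliar with the pattern, here is a minimal, self-contained Java sketch of the feedback loop described in steps 1-5 above. The class and method names (CheckStormSketch, VolumeSet, writeBlock) are invented for illustration; this is not the real DataNode/FSDataset code, just the shape of the problem.

import java.io.IOException;

// Simplified illustration of the check storm: every IO error triggers a
// full, synchronized directory scan, and concurrent writers pile up behind
// that single lock.
public class CheckStormSketch {

    static class VolumeSet {
        // Stand-in for the volume-set checkDirs(): one big synchronized scan
        // of every block directory, so all callers serialize behind this lock.
        synchronized void checkDirs() {
            try {
                Thread.sleep(5_000);   // pretend the directory walk is slow
            } catch (InterruptedException ignored) {
            }
        }
    }

    static final VolumeSet volumes = new VolumeSet();

    // Stand-in for a DataXceiver write path: any IOException on close()
    // triggers another full disk check (step 1), which grabs the lock
    // (step 2) and stalls every other writer (steps 3-5).
    static void writeBlock(int id) {
        try {
            throw new IOException("simulated disk error on block " + id);
        } catch (IOException e) {
            volumes.checkDirs();
        }
    }

    public static void main(String[] args) {
        // Eight concurrent writers each hit an error; the checks queue up
        // behind the single lock, so this toy "data node" spends ~40 seconds
        // doing nothing but disk checks.
        for (int i = 0; i < 8; i++) {
            final int id = i;
            new Thread(() -> writeBlock(id)).start();
        }
    }
}

The point of the sketch: because the scan is both slow and serialized, each error-triggered check delays the others, and the resulting timeouts generate further errors and further checks.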

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
