hadoop-hdfs-issues mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1158) HDFS-457 increases the chances of losing blocks
Date Mon, 30 Aug 2010 01:21:56 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904059#action_12904059 ]

dhruba borthakur commented on HDFS-1158:

> If we make the DN automatically decommission itself when it finds a bad disk then we might as well revert HDFS-457, no?

I think there is a major difference. A datanode that shuts down as soon as it encounters a disk error is more likely to result in "missing" blocks than one that follows the policy of decommissioning itself first and shutting down afterwards.

> The solution for the first issue seems straightforward: DNs should handle system disk failure gracefully by decommissioning themselves.


> For the second issue, it seems like the desired behavior (assuming we want DNs to tolerate disk failures) is for the DN to restart successfully

This sounds good as long as there is a way for the administrator to figure out that a disk has gone bad, so that he/she can schedule a repair.
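As a rough illustration of what such a check could look like (this is a hypothetical admin command, not an existing Hadoop tool), the operator could scan for mounts that have dropped to read-only, which is exactly the failure mode quoted below:

```shell
# Hypothetical admin check, not part of Hadoop: print every mount point
# whose mount options include the read-only flag "ro".
# On a healthy datanode host this prints nothing.
awk '$4 ~ /(^|,)ro(,|$)/ {print $2}' /proc/mounts
```

In practice this kind of probe would be wired into whatever monitoring system the cluster already uses, so the repair can be scheduled before a restart turns the bad disk into missing blocks.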

>  HDFS-457 increases the chances of losing blocks
> ------------------------------------------------
>                 Key: HDFS-1158
>                 URL: https://issues.apache.org/jira/browse/HDFS-1158
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.20.2
>            Reporter: Koji Noguchi
>             Fix For: 0.20.3
>         Attachments: rev-HDFS-457.patch
> Whenever we restart a cluster, there's a chance of losing some blocks if more than three datanodes don't come up.
> HDFS-457 increases this chance by keeping the datanodes up even when 
>    # /tmp disk goes read-only
>    # /disk0 that is used for storing PID goes read-only 
> and probably more.
> In our environment, /tmp and /disk0 are from the same device.
> When trying to restart a datanode, it would fail with
> 1) 
> {noformat}
> 2010-05-15 05:45:45,575 WARN org.mortbay.log: tmpdir
> java.io.IOException: Read-only file system
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.checkAndCreate(File.java:1704)
>         at java.io.File.createTempFile(File.java:1792)
>         at java.io.File.createTempFile(File.java:1828)
>         at org.mortbay.jetty.webapp.WebAppContext.getTempDirectory(WebAppContext.java:745)
> {noformat}
> or 
> 2) 
> {noformat}
> hadoop-daemon.sh: line 117: /disk/0/hadoop-datanode....com.out: Read-only file system
> hadoop-daemon.sh: line 118: /disk/0/hadoop-datanode.pid: Read-only file system
> {noformat}
> I can recover the missing blocks but it takes some time.
> Also, we are losing track of block movements, since the log directory can also go read-only while the datanode continues running.
> For 0.21 release, can we revert HDFS-457 or make it configurable?
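The jetty `tmpdir` failure in the first stack trace above comes from the JVM's default temp directory. One possible mitigation, sketched here with a purely illustrative path (the path and the choice to use `hadoop-env.sh` are assumptions, not something proposed in this thread), is to point `java.io.tmpdir` at a disk that is still writable:

```shell
# Hypothetical hadoop-env.sh fragment: redirect the JVM temp directory
# away from /tmp so WebAppContext.getTempDirectory() can still create
# its temp files when /tmp goes read-only. /grid/0/tmp is illustrative.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.io.tmpdir=/grid/0/tmp"
```

This only papers over one of the two failure modes; the PID/out files written by hadoop-daemon.sh would still need their own writable location.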

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
