hadoop-hdfs-issues mailing list archives

From "Aaron T. Myers (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2422) temporary loss of NFS mount causes NN safe mode
Date Mon, 10 Oct 2011 22:50:29 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124547#comment-13124547 ]

Aaron T. Myers commented on HDFS-2422:
--------------------------------------

Thanks a lot for the comments, Milind. Answers inline.

bq. I think it is a "good thing" (tm) that NN makes HDFS readonly when nfs is not accessible.

I can see arguments for both. In fact, I originally argued in favor of the behavior you're
describing. Upon further reflection, I think I've changed my opinion, however. At least, whatever
policy is being used for the number of failed volumes that can be tolerated when syncing edit
logs should also be used when checking for available resources in the {{NameNodeResourceChecker}},
for the purpose of consistency.
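
To make the consistency argument concrete, here is a minimal sketch of a resource check that tolerates a configurable number of unavailable volumes. This is not the actual {{NameNodeResourceChecker}} code: the class name, the method, and the {{toleratedFailures}} parameter are hypothetical, and the only value taken from this issue is the 104857600-byte reserve that appears in the reported log line.

```java
import java.util.List;

/** Hypothetical sketch; not the real NameNodeResourceChecker. */
public class VolumeTolerancePolicy {
    /** Reserved bytes per volume (100 MB), the value shown in the reported log line. */
    public static final long RESERVED_BYTES = 104857600L;

    /**
     * Decides whether enough volumes have space for the NN to stay writable.
     *
     * @param availableBytes    free space reported for each configured volume
     * @param toleratedFailures assumed knob: how many volumes may be below the reserve
     * @return true if the low-space volumes are within tolerance and at least
     *         one volume is still above the reserve
     */
    public static boolean hasAvailableResources(List<Long> availableBytes, int toleratedFailures) {
        int low = 0;
        for (long free : availableBytes) {
            if (free < RESERVED_BYTES) {
                low++;
            }
        }
        // Never report resources as available when no healthy volume remains.
        return low <= toleratedFailures && low < availableBytes.size();
    }
}
```

Under these assumptions, an NFS volume reporting 0 bytes next to a healthy local volume fails the check with a tolerance of 0, which matches the reported behavior, and passes it with a tolerance of 1.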

bq. HDFS is getting public criticism about "losing" data, and if hdfs modifications are allowed
by modifying a single destination, then it opens up a window for losing data.

The purpose of configuring multiple {{dfs.name.dir}} directories is exactly so that the NN
can tolerate multiple failures and keep on humming. It's not going to lose any data just because
one goes offline - it will just write to the other directories.
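
The redundancy described above can be sketched roughly as follows. This is an illustration only, not how the NN edit log is implemented: the class, the {{edits}} file name, and the newline-delimited record format are all assumptions made for the example. The idea shown is that a write goes to every active directory, a directory that fails is dropped from the active set, and the write as a whole fails only when no directory is left.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of writing one record to several storage directories. */
public class RedundantEditWriter {
    private final List<Path> activeDirs;

    public RedundantEditWriter(List<Path> dirs) {
        this.activeDirs = new ArrayList<>(dirs);
    }

    /**
     * Appends a record to the "edits" file in every active directory.
     *
     * @return the number of directories still active after the write
     * @throws IOException if every directory has failed
     */
    public int append(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        Iterator<Path> it = activeDirs.iterator();
        while (it.hasNext()) {
            Path dir = it.next();
            try {
                Files.write(dir.resolve("edits"), bytes,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                // This directory is unusable (for example, an unreachable mount); stop using it.
                it.remove();
            }
        }
        if (activeDirs.isEmpty()) {
            throw new IOException("no storage directories left");
        }
        return activeDirs.size();
    }
}
```

In this sketch a failed directory is never retried; restoring a directory once it becomes reachable again would need a separate policy.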

bq. The right thing to do is to return from safemode when the NFS volume becomes available
again.

Please see [this comment|https://issues.apache.org/jira/browse/HDFS-1594?focusedCommentId=13020373&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13020373]
for the reasoning as to why the {{NameNodeResourceChecker}} doesn't automatically take the
NN out of SM when it detects a volume being low on space.
                
> temporary loss of NFS mount causes NN safe mode
> -----------------------------------------------
>
>                 Key: HDFS-2422
>                 URL: https://issues.apache.org/jira/browse/HDFS-2422
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.24.0
>            Reporter: Jeff Bean
>            Assignee: Aaron T. Myers
>
> We encountered a situation where the namenode dropped into safe mode after a temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline
> Oct  8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '<nfs host>' is 0, which is below the configured reserved amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
