hadoop-hdfs-issues mailing list archives

From "Harsh J (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1603) Namenode gets sticky if one of namenode storage volumes disappears (removed, unmounted, etc.)
Date Wed, 01 Feb 2012 05:42:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197601#comment-13197601 ]

Harsh J commented on HDFS-1603:
-------------------------------

bq. I've noticed a failure during an unlock call that occurs AFTER a SD has been detected
as a failed point. The unlock call went ahead and blocked via a native call to the NFS lock
daemon - and since the NFS server was down, it just hung (odd that the timeout did not apply,
probably an nfs lockd issue, but I do not feel its OK to unlock after a directory has caused
a processIOError call).

Disregard the above. It was caused by a lockd bug in an earlier release of CentOS, as I'd suspected.

For:
bq. ATM and I just brainstormed about this a little bit over some iced coffee. Though on the
surface it doesn't look too hard to implement timeouts on namedir operations, it would actually
have to be done in a lot of places (eg mkdirs/move calls on storage directories, writing edits,
saving images, etc). Timing out some of these things isn't entirely straightforward, since
the underlying calls aren't interruptible.

Since the hang is in processIOError, or in a similar call that handles the directory errors,
let's have a timeout there instead? That should solve the same issue. Though in a case like the
one above (a call blocked in a native lock daemon), a thread may still hang forever.
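
To make the suggestion concrete, here is a minimal sketch of what such a timeout could look like.
It is not taken from the HDFS code base: the class TimedDirErrorHandler and the method
runWithTimeout are hypothetical names, and the Runnable stands in for whatever unlock or cleanup
work processIOError does for a failed storage directory. The caller gets control back after the
timeout, but a worker blocked in a non-interruptible native call stays stuck.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of HDFS. */
public class TimedDirErrorHandler {
    // Daemon threads, so a permanently stuck worker cannot keep the JVM alive.
    // A cached pool starts a new thread when earlier workers are still stuck.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "dir-error-handler");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis for it.
     * Returns true if the task finished in time, false if the wait timed out.
     */
    public static boolean runWithTimeout(Runnable task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        Future<?> result = POOL.submit(task);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // cancel(true) only delivers an interrupt. A thread blocked in a
            // native call (such as an NFS lock request) will not react to it,
            // so that worker may hang forever while the caller moves on.
            result.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a directory cleanup that returns promptly.
        System.out.println(runWithTimeout(() -> { }, 1000));

        // Stand-in for a cleanup that hangs; the caller gives up after 200 ms.
        System.out.println(runWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // This stand-in is interruptible; a real native hang is not.
            }
        }, 200));
    }
}
```

The point of the sketch is only that the NameNode thread calling the handler would stop waiting;
it does not free the hung worker, which is the leak mentioned above.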
                
> Namenode gets sticky if one of namenode storage volumes disappears (removed, unmounted, etc.)
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1603
>                 URL: https://issues.apache.org/jira/browse/HDFS-1603
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.21.0
>            Reporter: Konstantin Boudnik
>
> While investigating failures on HDFS-1602 it became apparent that once a namenode storage
> volume is pulled out, the NN becomes completely "sticky" until {{FSImage:processIOError: removing
> storage}} removes the storage from the active set. During this time none of the normal NN
> operations are possible (e.g. creating a directory on HDFS eventually times out).
> In the case of NFS this can be worked around with the soft,intr,timeo,retrans settings. However,
> better handling of the situation is apparently possible and needs to be implemented.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
