hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager
Date Thu, 30 May 2013 01:20:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669946#comment-13669946 ]

Colin Patrick McCabe commented on HDFS-4859:

bq. Last time I checked, which is not very long ago, FJM was not prohibited from being used
for shared.edits.dir and the manual failover was not deprecated. IIRC, CDH4 doc also states
that this combination is supported. I am sure that you are not implying improving it is a
bad idea. Please explain further. I seem to keep failing to understand your point.

I guess I should clarify.  There's nothing wrong with using NFS HA if you already have a
substantial investment in NFS filers.  Keep in mind, though, that if your NFS filer itself is
not HA, you are just moving the single point of failure around, not eliminating it.  Some
Hadoop users have invested quite a lot in highly available NFS filers and are comfortable
using them, which is one reason NFS HA is supported in Hadoop and will probably continue to
be for a long time.  In some cases these filers were installed prior to Hadoop.  But supported
!= encouraged for all new installs.  In any case, NFS HA isn't the issue here; the issue is
that using either kind of HA should solve your problem without requiring FJM changes.  Another
simple thing that would solve the problem is using RAID on the local edit directories.  You
would probably have to use hardware RAID, though, since in my experience software RAID on
Linux has the same issue: threads hit really slow timeouts when reading from an array where
one disk is failing.
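
For illustration, here is a minimal sketch of what "add a timeout" could mean mechanically: run the flush on a worker thread and bound the wait with Future.get.  This is not actual HDFS code; the class and method names (TimedFlush, flushWithTimeout) are hypothetical, and a real patch would need to integrate with FJM's error handling.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: bound a potentially slow edit-log flush with a timeout. */
public class TimedFlush {
    // Daemon worker thread so a stuck flush cannot keep the JVM alive.
    private static final ExecutorService FLUSH_POOL =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-flush");
            t.setDaemon(true);
            return t;
        });

    /**
     * Runs flushTask, throwing TimeoutException if it does not finish
     * within timeoutMs instead of blocking the caller indefinitely.
     */
    public static void flushWithTimeout(Runnable flushTask, long timeoutMs)
            throws TimeoutException, InterruptedException {
        Future<?> f = FLUSH_POOL.submit(flushTask);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            throw new RuntimeException("flush failed", e.getCause());
        } catch (TimeoutException e) {
            f.cancel(true);  // caller can now mark the journal as failed
            throw e;
        }
    }
}
```

One caveat with this approach: interrupting the worker does not unblock a thread stuck in local disk I/O, since file writes are not interruptible.  The timeout only lets the calling NameNode thread fail fast and mark the journal bad, which is the latency problem described in this issue.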
> Add timeout in FileJournalManager
> ---------------------------------
>                 Key: HDFS-4859
>                 URL: https://issues.apache.org/jira/browse/HDFS-4859
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Kihwal Lee
> Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur
a long delay (usually until the driver timeout) can make the namenode unresponsive for a long
time. This directly affects the NN's failure-detection latency, which is critical in HA.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
