Date: Wed, 29 May 2013 23:22:20 +0000 (UTC)
From: "Colin Patrick McCabe (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

    [ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669873#comment-13669873 ]

Colin Patrick McCabe commented on HDFS-4859:
--------------------------------------------

Can you be more clear about why just using {{QuorumJournalManager}} plus {{ZKFC}} doesn't solve this problem? You don't actually even need local storage directories any more; we only ever recommended them because QJM was new and untested.

It's not just fsync that can block forever, but any write, any read, any fstat; really, any blocking operation that touches the filesystem. I have seen ls go out to lunch forever on a corrupted filesystem. Are you going to add "check if I timed out and kill myself if so" recovery logic after every operation that touches the filesystem? Every {{FileInputStream}}, {{FileOutputStream}}, or {{FileChannel}} method? Are you going to carefully monitor each new patch so that nobody adds back in a use of {{FileChannel#size}} or whatever?

> Add timeout in FileJournalManager
> ---------------------------------
>
>                 Key: HDFS-4859
>                 URL: https://issues.apache.org/jira/browse/HDFS-4859
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Kihwal Lee
>
> Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur long delays (usually lasting until a driver timeout) can make the namenode unresponsive for a long time. This directly affects the NN's failure detection latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
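For illustration only, here is a rough sketch of the per-operation "check if I timed out and kill myself if so" wrapper that the comment above argues against. It is not part of the original thread or of HDFS; the class name, timeout value, and exit-on-timeout fencing action are assumptions, and it assumes plain Java with {{java.util.concurrent}}.

{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedFsOp {
  private static final ExecutorService EXEC = Executors.newCachedThreadPool();

  // Run one blocking filesystem operation with a deadline.  If it does not
  // finish in time, treat the local disk as wedged and abort the process so
  // that HA failover can proceed.  The stuck worker thread generally cannot
  // be interrupted out of disk I/O, so it is simply abandoned.
  public static <T> T runWithTimeout(Callable<T> op, long timeoutMs)
      throws IOException {
    Future<T> future = EXEC.submit(op);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      System.err.println("Local journal operation timed out; aborting NameNode.");
      System.exit(1);
      throw new AssertionError("unreachable");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for journal operation", e);
    } catch (ExecutionException e) {
      throw new IOException("Journal operation failed", e);
    }
  }

  // Hypothetical usage: every write/fsync/stat in FileJournalManager would
  // need to be routed through a wrapper like this, e.g.:
  //
  //   TimedFsOp.runWithTimeout(new Callable<Void>() {
  //     public Void call() throws IOException {
  //       channel.force(true);   // the fsync that might hang forever
  //       return null;
  //     }
  //   }, 20000);
}
{code}

Note that the timed-out thread stays blocked in the kernel and every call site must be rewritten to go through the wrapper, which is exactly the maintenance burden the comment describes.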