Date: Wed, 29 May 2013 23:22:20 +0000 (UTC)
From: "Colin Patrick McCabe (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4859) Add timeout in FileJournalManager

    [ https://issues.apache.org/jira/browse/HDFS-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669873#comment-13669873 ]

Colin Patrick McCabe commented on HDFS-4859:
--------------------------------------------

Can you be more clear about why just using {{QuorumJournalManager}} plus {{ZKFC}} doesn't solve this problem? You don't actually even need local storage directories any more; we only ever recommended them because QJM was new and untested.

It's not just fsync that can block forever, but any write, any read, any fstat; really, any blocking operation that touches the filesystem. I have seen ls go out to lunch forever on a corrupted filesystem. Are you going to add "check if I timed out and kill myself if so" recovery logic after every operation that touches the filesystem? Every {{FileInputStream}}, {{FileOutputStream}}, or {{FileChannel}} method? Are you going to carefully monitor each new patch so that nobody adds back in a use of {{FileChannel#size}} or whatever?

> Add timeout in FileJournalManager
> ---------------------------------
>
>                 Key: HDFS-4859
>                 URL: https://issues.apache.org/jira/browse/HDFS-4859
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Kihwal Lee
>
> Due to the absence of an explicit timeout in FileJournalManager, error conditions that incur long delays (usually lasting until a driver timeout) can make the namenode unresponsive for a long time. This directly affects the NN's failure detection latency, which is critical in HA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
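For illustration only, here is a rough sketch of the per-operation "check if I timed out and kill myself if so" wrapper that the comment above argues against. It is not part of the original thread or of HDFS; the class name, timeout value, and exit-on-timeout fencing action are assumptions, and it assumes plain Java with {{java.util.concurrent}}.

{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedFsOp {
  private static final ExecutorService EXEC = Executors.newCachedThreadPool();

  // Run one blocking filesystem operation with a deadline.  If it does not
  // finish in time, treat the local disk as wedged and abort the process so
  // that HA failover can proceed.  The stuck worker thread generally cannot
  // be interrupted out of disk I/O, so it is simply abandoned.
  public static <T> T runWithTimeout(Callable<T> op, long timeoutMs)
      throws IOException {
    Future<T> future = EXEC.submit(op);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      System.err.println("Local journal operation timed out; aborting NameNode.");
      System.exit(1);
      throw new AssertionError("unreachable");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for journal operation", e);
    } catch (ExecutionException e) {
      throw new IOException("Journal operation failed", e);
    }
  }

  // Hypothetical usage: every write/fsync/stat in FileJournalManager would
  // need to be routed through a wrapper like this, e.g.:
  //
  //   TimedFsOp.runWithTimeout(new Callable<Void>() {
  //     public Void call() throws IOException {
  //       channel.force(true);   // the fsync that might hang forever
  //       return null;
  //     }
  //   }, 20000);
}
{code}

Note that the timed-out thread stays blocked in the kernel and every call site must be rewritten to go through the wrapper, which is exactly the maintenance burden the comment describes.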