Date: Tue, 7 Aug 2012 17:02:10 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <1255317925.307.1344358930406.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1293417085.139.1344357370923.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

    [ https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430432#comment-13430432 ]

Todd Lipcon commented on HDFS-3771:
-----------------------------------

The following is interesting:
{quote}
3. FSEditLog: Ending log segment 23963
4. FSEditLog: Starting log segment at 23967
{quote}
That's not a typo? i.e., there's a gap between the end of the previous segment and the start of the next? Perhaps it's just an unrelated logging error, though.

> Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.3
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, using Kerberos based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently encountered a particularly nasty issue, which resulted in the cluster's default Namenode being unable to restart. This was on a 20 node Federated cluster with security. The cause appears to be that the NN was just starting to roll its edit log when a shutdown occurred. The shutdown was intentional, to restart the cluster as part of an automated test.
> The tests that were running do not appear to be the issue in themselves. The cluster was just wrapping up an adminReport subset, and this failure case has not reproduced so far, nor was it failing previously. It looks like a chance occurrence of sending the shutdown just as the edit log roll was begun.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 4. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 5. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were are all in-progress
> 6. FSImageTransactionalStorageInspector: Marking log at /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no transactions in it.
> 7. NameNode: Exception in namenode join [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => NN start attempts continue to cycle trying to restart but can't, failing on the same exception due to lack of non-corrupt edit logs
> If the observations are correct and the issue comes from the shutdown happening as the edit logs are rolling, does the NN have an equivalent to the conventional fs 'sync' blocking action that should be called, or is there perhaps a timing hole?
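
For readers following the thread, the two observations above (the apparent txid gap in steps 3 and 4, and the startup failure in steps 5 to 7) can be illustrated with a small sketch. This is not NameNode code: `Segment`, `find_gaps`, and `startup_check` are hypothetical names, and the first txid of the finalized segment (23960) is invented for the example. It also assumes the reading in Todd's question, namely that 23963 is the last txid of the ended segment; if the logged number is instead the segment's first txid, the gap disappears.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """Hypothetical model of one edit log segment (not a Hadoop class)."""
    first_txid: int
    last_txid: Optional[int]  # None while the segment is still in progress
    num_txns: int             # transactions actually found in the file


def find_gaps(segments: List[Segment]) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) txid ranges between consecutive segments."""
    gaps = []
    ordered = sorted(segments, key=lambda s: s.first_txid)
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.last_txid is None:
            continue  # nothing can be concluded after an unfinished segment
        expected = prev.last_txid + 1
        if nxt.first_txid > expected:
            gaps.append((expected, nxt.first_txid - 1))
    return gaps


def startup_check(segments: List[Segment], start_txid: int) -> List[Segment]:
    """Raise if every log starting at start_txid is an empty in-progress segment."""
    candidates = [s for s in segments if s.first_txid == start_txid]
    usable = [s for s in candidates
              if not (s.last_txid is None and s.num_txns == 0)]
    if not usable:
        raise RuntimeError(f"No non-corrupt logs for txid {start_txid}")
    return usable


segments = [
    Segment(first_txid=23960, last_txid=23963, num_txns=4),  # first txid is assumed
    Segment(first_txid=23967, last_txid=None, num_txns=0),   # empty in-progress log
]
gaps = find_gaps(segments)  # txids 23964..23966 are unaccounted for under this reading
```

Calling `startup_check(segments, 23967)` on this data raises, which mirrors the restart loop in the report: the only segment starting at 23967 is in progress and empty, so nothing usable is left for that txid.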