Date: Sat, 3 Aug 2013 06:31:53 +0000 (UTC)
From: "Todd Lipcon (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-5058) QJM should validate startLogSegment() more strictly

    [ https://issues.apache.org/jira/browse/HDFS-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13728467#comment-13728467 ]

Todd Lipcon commented on HDFS-5058:
-----------------------------------

Yep, the problem occurs if you restart the SBN and then try to transition it to active after you've loaded an fsimage that fell in the middle of a log segment.

> QJM should validate startLogSegment() more strictly
> ---------------------------------------------------
>
>                 Key: HDFS-5058
>                 URL: https://issues.apache.org/jira/browse/HDFS-5058
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: qjm
>    Affects Versions: 3.0.0, 2.1.0-beta
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-5058.txt
>
>
> On a handful of occasions we've seen one of the NNs in an HA cluster end up with an fsimage checkpoint that falls in the middle of an edit segment. We're not sure yet how this happens, but the following problem can result:
> - Node has fsimage_500. Cluster has edits_1-1000, edits_1001_inprogress
> - Node restarts, loads fsimage_500
> - Node wants to become active. It calls selectInputStreams(500). Currently, this API logs a WARN that 500 falls in the middle of the 1-1000 segment, but continues and returns no results.
> - Node calls startLogSegment(501).
> Currently, the QJM will accept this (incorrectly). The node then crashes when it first tries to journal a real transaction, but it leaves edits_501_inprogress lying around, potentially causing more issues later.
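
The sequence in the report can be illustrated with a small, self-contained sketch of a stricter check. This is not the attached patch (hdfs-5058.txt) and does not use any Hadoop API: the class `StartSegmentCheckSketch`, the `Segment` type, and the method `checkStartLogSegment` are hypothetical names made up for illustration. It also assumes that only finalized segments are considered, i.e. that any in-progress segment has already been recovered and finalized before the check runs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not Hadoop classes.
class StartSegmentCheckSketch {

    /** A finalized edit segment covering txids firstTxId..lastTxId inclusive. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;

        Segment(long firstTxId, long lastTxId) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    /**
     * Throws if a new segment starting at newFirstTxId would begin inside
     * or before a segment that is already finalized.
     */
    static void checkStartLogSegment(List<Segment> finalized, long newFirstTxId) {
        for (Segment s : finalized) {
            if (newFirstTxId <= s.lastTxId) {
                throw new IllegalStateException(
                    "Cannot start segment at txid " + newFirstTxId
                    + ": transactions " + s.firstTxId + "-" + s.lastTxId
                    + " are already finalized");
            }
        }
    }

    public static void main(String[] args) {
        // State from the report: edits_1-1000 is finalized.
        List<Segment> finalized = Arrays.asList(new Segment(1, 1000));

        // Accepted: starts right after the finalized range.
        checkStartLogSegment(finalized, 1001);

        try {
            // The request from the report: the node loaded fsimage_500.
            checkStartLogSegment(finalized, 501);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With a check of this kind, the startLogSegment(501) call from the scenario would fail up front instead of leaving a stray in-progress file behind. The sketch only guards against starting inside or behind finalized data; it does not detect gaps, so a real validation would likely need further conditions.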