Date: Wed, 30 Jan 2013 02:41:13 +0000 (UTC)
From: "Tsz Wo (Nicholas), SZE (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4423) Checkpoint exception causes fatal damage to fsimage.

[ https://issues.apache.org/jira/browse/HDFS-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566095#comment-13566095 ]

Tsz Wo (Nicholas), SZE commented on HDFS-4423:
----------------------------------------------

I have run the tests with the patch. All tests passed except TestNetUtils, which failed because of my local network environment and is not related to the patch.

> Checkpoint exception causes fatal damage to fsimage.
> ----------------------------------------------------
>
>                 Key: HDFS-4423
>                 URL: https://issues.apache.org/jira/browse/HDFS-4423
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 1.0.4, 1.1.1
>         Environment: CentOS 6.2
>            Reporter: ChenFolin
>            Assignee: Chris Nauroth
>            Priority: Blocker
>         Attachments: HDFS-4423-branch-1.1.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage:
> {code}
> boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
>   ...
>   latestNameSD.read();
>   needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
>   LOG.info("Image file of size " + imageSize + " loaded in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>
>   // Load latest edits
>   if (latestNameCheckpointTime > latestEditsCheckpointTime)
>     // the image is already current, discard edits
>     needToSave |= true;
>   else // latestNameCheckpointTime == latestEditsCheckpointTime
>     needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
>
>   return needToSave;
> }
> {code}
> In the normal checkpoint flow, latestNameCheckpointTime is equal to latestEditsCheckpointTime, so the "else" branch is executed.
> The problem arises when latestNameCheckpointTime > latestEditsCheckpointTime:
> The SecondaryNameNode starts a checkpoint,
> ...
> NameNode: rollFSImage; the NameNode shuts down after writing latestNameCheckpointTime and before writing latestEditsCheckpointTime.
> The NameNode restarts: because latestNameCheckpointTime > latestEditsCheckpointTime, needToSave is true, but "rootDir"'s nsCount (the cluster's file count) is never updated; that update normally happens in loadFSEdits, via FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota(). "saveNamespace" then writes the file count to the fsimage with the default value "1".
> On the next restart, loadFSImage will fail.
> The following change may fix it:
> {code}
> boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
>   ...
>   latestNameSD.read();
>   needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
>   LOG.info("Image file of size " + imageSize + " loaded in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>
>   // Load latest edits
>   if (latestNameCheckpointTime > latestEditsCheckpointTime) {
>     // the image is already current, discard edits
>     needToSave |= true;
>     FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota();
>   }
>   else // latestNameCheckpointTime == latestEditsCheckpointTime
>     needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
>
>   return needToSave;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
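The failure sequence described in the issue can be illustrated with a minimal, self-contained simulation of the timestamp comparison. All names here (CheckpointBugSketch, nsCountAfterLoad, DEFAULT_NS_COUNT) are illustrative stand-ins for the HDFS internals, not actual Hadoop code; it is a sketch of the reasoning only.

```java
// Sketch of the loadFSImage decision: when the NameNode crashed between
// writing the two checkpoint timestamps, the "discard edits" branch skips
// the quota recomputation, so the namespace count stays at its default.
public class CheckpointBugSketch {

    static final long DEFAULT_NS_COUNT = 1; // rootDir nsCount before recomputation

    /**
     * Returns the nsCount that a subsequent saveNamespace would persist,
     * given the two checkpoint timestamps found on disk.
     */
    static long nsCountAfterLoad(long nameCkptTime, long editsCkptTime,
                                 long realFileCount, boolean applyFix) {
        long nsCount = DEFAULT_NS_COUNT;
        if (nameCkptTime > editsCkptTime) {
            // Crash window: edits are discarded. Without the proposed fix,
            // the count is never recomputed and the default value leaks
            // into the saved fsimage.
            if (applyFix) {
                nsCount = realFileCount; // updateCountForINodeWithQuota()
            }
        } else {
            // Normal flow: loading the edit log recomputes the counts.
            nsCount = realFileCount;
        }
        return nsCount;
    }

    public static void main(String[] args) {
        long real = 123456; // actual number of files in the cluster
        System.out.println("buggy restart : " + nsCountAfterLoad(101, 100, real, false));
        System.out.println("fixed restart : " + nsCountAfterLoad(101, 100, real, true));
        System.out.println("normal restart: " + nsCountAfterLoad(100, 100, real, true));
    }
}
```

Running it shows the buggy restart persisting a count of 1 while the fixed and normal paths keep the real file count, which is exactly why the next loadFSImage fails against a corrupted image.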