Date: Wed, 30 Jan 2013 02:41:13 +0000 (UTC)
From: "Tsz Wo (Nicholas), SZE (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4423) Checkpoint exception causes fatal damage to fsimage.

[ https://issues.apache.org/jira/browse/HDFS-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566095#comment-13566095 ]

Tsz Wo (Nicholas), SZE commented on HDFS-4423:
----------------------------------------------

I have run the tests with the patch. All tests passed except TestNetUtils, which failed because of my local network environment and is not related to the patch.

> Checkpoint exception causes fatal damage to fsimage.
> ----------------------------------------------------
>
>                 Key: HDFS-4423
>                 URL: https://issues.apache.org/jira/browse/HDFS-4423
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 1.0.4, 1.1.1
>         Environment: CentOS 6.2
>            Reporter: ChenFolin
>            Assignee: Chris Nauroth
>            Priority: Blocker
>         Attachments: HDFS-4423-branch-1.1.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage:
> {code}
> boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
>   ...
>   latestNameSD.read();
>   needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
>   LOG.info("Image file of size " + imageSize + " loaded in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>
>   // Load latest edits
>   if (latestNameCheckpointTime > latestEditsCheckpointTime)
>     // the image is already current, discard edits
>     needToSave |= true;
>   else // latestNameCheckpointTime == latestEditsCheckpointTime
>     needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
>
>   return needToSave;
> }
> {code}
> In the normal checkpoint flow, latestNameCheckpointTime is equal to latestEditsCheckpointTime, so the "else" branch is executed.
> The problem arises when latestNameCheckpointTime > latestEditsCheckpointTime:
> The SecondaryNameNode starts a checkpoint,
> ...
> NameNode: rollFSImage; the NameNode shuts down after writing latestNameCheckpointTime and before writing latestEditsCheckpointTime.
> The NameNode restarts: because latestNameCheckpointTime > latestEditsCheckpointTime, needToSave is true, but "rootDir"'s nsCount (the cluster's file count) is never updated; that update normally happens in loadFSEdits, via FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota(). "saveNamespace" then writes the file count to the fsimage with the default value "1".
> On the next restart, loadFSImage will fail.
> The following change may fix it:
> {code}
> boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
>   ...
>   latestNameSD.read();
>   needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
>   LOG.info("Image file of size " + imageSize + " loaded in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>
>   // Load latest edits
>   if (latestNameCheckpointTime > latestEditsCheckpointTime) {
>     // the image is already current, discard edits
>     needToSave |= true;
>     FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota();
>   }
>   else // latestNameCheckpointTime == latestEditsCheckpointTime
>     needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
>
>   return needToSave;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
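The failure sequence described in the issue can be illustrated with a minimal, self-contained simulation of the timestamp comparison. All names here (CheckpointBugSketch, nsCountAfterLoad, DEFAULT_NS_COUNT) are illustrative stand-ins for the HDFS internals, not actual Hadoop code; it is a sketch of the reasoning only.

```java
// Sketch of the loadFSImage decision: when the NameNode crashed between
// writing the two checkpoint timestamps, the "discard edits" branch skips
// the quota recomputation, so the namespace count stays at its default.
public class CheckpointBugSketch {

    static final long DEFAULT_NS_COUNT = 1; // rootDir nsCount before recomputation

    /**
     * Returns the nsCount that a subsequent saveNamespace would persist,
     * given the two checkpoint timestamps found on disk.
     */
    static long nsCountAfterLoad(long nameCkptTime, long editsCkptTime,
                                 long realFileCount, boolean applyFix) {
        long nsCount = DEFAULT_NS_COUNT;
        if (nameCkptTime > editsCkptTime) {
            // Crash window: edits are discarded. Without the proposed fix,
            // the count is never recomputed and the default value leaks
            // into the saved fsimage.
            if (applyFix) {
                nsCount = realFileCount; // updateCountForINodeWithQuota()
            }
        } else {
            // Normal flow: loading the edit log recomputes the counts.
            nsCount = realFileCount;
        }
        return nsCount;
    }

    public static void main(String[] args) {
        long real = 123456; // actual number of files in the cluster
        System.out.println("buggy restart : " + nsCountAfterLoad(101, 100, real, false));
        System.out.println("fixed restart : " + nsCountAfterLoad(101, 100, real, true));
        System.out.println("normal restart: " + nsCountAfterLoad(100, 100, real, true));
    }
}
```

Running it shows the buggy restart persisting a count of 1 while the fixed and normal paths keep the real file count, which is exactly why the next loadFSImage fails against a corrupted image.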