From: Jameson Li <hovlj.ei@gmail.com>
Date: Tue, 15 Feb 2011 11:50:26 +0800
Subject: Re: my hadoop cluster namenode crashed after modifying the timestamp in some of the nodes
To: hdfs-user@hadoop.apache.org
Hi Todd,

Thanks very much. I think you are really right.

I had used the hadoop-0.20-append patches that are mentioned here: http://github.com/lenn0x/Hadoop-Append

After reading the patch 0002-HDFS-278.patch, I found that the file "src/hdfs/org/apache/hadoop/hdfs/DFSClient.java" in my cluster does not contain these lines:

    this.maxBlockAcquireFailures =
        conf.getInt("dfs.client.max.block.acquire.failures",
                    MAX_BLOCK_ACQUIRE_FAILURES);

It just looks like this:
    this.maxBlockAcquireFailures = getMaxBlockAcquireFailures(conf);
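
I assume getMaxBlockAcquireFailures() is just a small helper around the same
lookup; if so, it would be something like this sketch (my guess, not code
copied from my tree):

    // Presumably defined elsewhere in DFSClient.java, wrapping the same
    // conf lookup that the patch writes inline; MAX_BLOCK_ACQUIRE_FAILURES
    // is the existing default constant referenced in the patch hunk.
    public static int getMaxBlockAcquireFailures(Configuration conf) {
      return conf.getInt("dfs.client.max.block.acquire.failures",
                         MAX_BLOCK_ACQUIRE_FAILURES);
    }

If that is right, both versions read the same setting and only the context
lines of the patch differ.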

So I changed 0002-HDFS-278.patch, and the diff between the original 0002-HDFS-278.patch and the new patch after my change is:
diff 0002-HDFS-278.patch ../hadoop-new/patch-origion/0002-HDFS-278.patch
0a1,10
> From 56463073cf051f1e11b4d3921542979e53daead4 Mon Sep 17 00:00:00 2001
> From: Chris Goffinet <cg@chrisgoffinet.com>
> Date: Mon, 20 Jul 2009 17:20:13 -0700
> Subject: [PATCH 2/4] HDFS-278
> 
> ---
>  src/hdfs/org/apache/hadoop/hdfs/DFSClient.java |   70 ++++++++++++++++++++++--
>  1 files changed, 64 insertions(+), 6 deletions(-)
> 
> diff --git a/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java b/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
2,3c12,13
< --- src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
< +++ src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
---
> --- a/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
> +++ b/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
19,20c29,32
< @@ -188,5 +192,7 @@ public class DFSClient implements FSConstants, java.io.Closeable {
<      this.maxBlockAcquireFailures = getMaxBlockAcquireFailures(conf);
---
> @@ -167,7 +171,9 @@ public class DFSClient implements FSConstants, java.io.Closeable {
>      this.maxBlockAcquireFailures =
>                conf.getInt("dfs.client.max.block.acquire.failures",
>                            MAX_BLOCK_ACQUIRE_FAILURES);
118a131,133
> --
> 1.6.3.1
> 

Did I miss some of the patches for hadoop-0.20-append?
How could I recover my NN and let it work so that I can export the data?
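
From the stack trace quoted below, the ReplicationMonitor thread dies inside
the generation stamp validation as soon as it meets a block whose stamp is
still the wildcard value (=1). My rough reading of that check, reconstructed
from the exception text rather than from the real Block.java, is:

    // Reconstruction from the exception message, not the actual Hadoop source:
    // ordering blocks is refused while the generation stamp is still the
    // wildcard placeholder, so the TreeSet insert throws and the
    // ReplicationMonitor thread (and then the NameNode) goes down.
    private void validateGenerationStamp(long generationStamp) {
      if (generationStamp == GenerationStamp.WILDCARD_STAMP) {
        throw new IllegalStateException("generationStamp (=" + generationStamp
            + ") == GenerationStamp.WILDCARD_STAMP");
      }
    }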

2011/2/14 Todd Lipcon <todd@cloudera.com>
Hi Jameson,

My first instinct is that you have an incomplete patch series for hdfs append, and that's what caused your problem. There were many bug fixes along the way for hadoop-0.20-append and maybe you've missed some in your manually patched build.

-Todd


On Mon, Feb 14, 2011 at 5:49 AM, Jameson Li <hovlj.ei@gmail.com> wrote:
Hi,

My hadoop version is based on the hadoop 0.20.2 release, patched with
HADOOP-4675, 5745, MAPREDUCE-1070, 551, 1089 (support for
ganglia31, fair scheduler preemption, hdfs append), and patched with
HADOOP-6099, HDFS-278, Patches-from-Dhruba-Borthakur, HDFS-200 (support
for scribe).

Last Friday I found that the time on some of my test hadoop cluster nodes
was not in the normal state; they were a number of hours ahead of the
normal time.
So I ran the following command, and added it to a crontab job:
/usr/bin/rdate -s time-b.nist.gov

And then my hadoop cluster namenode crashed after I restarted the
namenode.
And I don't know whether it is related to modifying the time.

The error log:
2011-02-12 18:44:46,603 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of
blocks = 196
2011-02-12 18:44:46,603 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid
blocks = 0
2011-02-12 18:44:46,603 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
under-replicated blocks = 29
2011-02-12 18:44:46,603 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
over-replicated blocks = 41
2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange:
STATE* Leaving safe mode after 69 secs.
2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange:
STATE* Safe mode is OFF.
2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange:
STATE* Network topology has 1 racks and 5 datanodes
2011-02-12 18:44:46,603 INFO org.apache.hadoop.hdfs.StateChange:
STATE* UnderReplicatedBlocks has 29 blocks
2011-02-12 18:44:46,886 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 192.168.1.14:50010 to replicate
blk_-8806907658071633346_1750 to datanode(s) 192.168.1.83:50010
2011-02-12 18:44:46,887 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 192.168.1.83:50010 to replicate
blk_-7689075547598626554_1800 to datanode(s) 192.168.1.10:50010
2011-02-12 18:44:46,887 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 192.168.1.84:50010 to replicate
blk_-7587424527299099175_1717 to datanode(s) 192.168.1.10:50010
2011-02-12 18:44:46,887 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 192.168.1.84:50010 to replicate
blk_-6925943363757944243_1909 to datanode(s) 192.168.1.13:50010
2011-02-12 18:44:46,888 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 192.168.1.14:50010 to replicate
blk_-6835423500788375545_1928 to datanode(s) 192.168.1.10:50010
2011-02-12 18:44:46,888 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* ask 192.168.1.83:50010 to replicate
blk_-6477488774631498652_1742 to datanode(s) 192.168.1.84:50010
2011-02-12 18:44:46,889 WARN
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
ReplicationMonitor thread received Runtime exception.
java.lang.IllegalStateException: generationStamp (=1) ==
GenerationStamp.WILDCARD_STAMP java.lang.IllegalStateException:
generationStamp (=1) == GenerationStamp.WILDCARD_STAMP
        at org.apache.hadoop.hdfs.protocol.Block.validateGenerationStamp(Block.java:148)
        at org.apache.hadoop.hdfs.protocol.Block.compareTo(Block.java:156)
        at org.apache.hadoop.hdfs.protocol.Block.compareTo(Block.java:30)
        at java.util.TreeMap.put(TreeMap.java:545)
        at java.util.TreeSet.add(TreeSet.java:238)
        at org.apache.hadoop.hdfs.server.namenode.DatanodeDescriptor.addBlocksToBeInvalidated(DatanodeDescriptor.java:284)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.invalidateWorkForOneNode(FSNamesystem.java:2743)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.computeInvalidateWork(FSNamesystem.java:2419)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.computeDatanodeWork(FSNamesystem.java:2412)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:2357)
        at java.lang.Thread.run(Thread.java:619)
2011-02-12 18:44:46,892 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop5/192.168.1.84
************************************************************/


Thanks,
Jameson



--
Todd Lipcon
Software Engineer, Cloudera
