Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTikf9xgsA5zqNfJhqaiV9N+oGY_QEc5TWuXhEVf=@mail.gmail.com>
References: <AANLkTinRJjt6r4MHf7m4O1QmWOmU83iBEQCzb-cedxu+@mail.gmail.com>
	<AANLkTinU2+8aQZF2y5d7wS+kKHnC+0j5b9F6YGpU2rrn@mail.gmail.com>
	<AANLkTikf9xgsA5zqNfJhqaiV9N+oGY_QEc5TWuXhEVf=@mail.gmail.com>
Date: Fri, 28 Jan 2011 10:01:55 -0800
Message-ID: <AANLkTiniwSK0Wr+DbYeLFz0TQTk1Gqy2ZMNtH+ze_fe2@mail.gmail.com>
Subject: Re: HBASE-3234 and bad datanode error
From: Ted Yu <yuzhihong@gmail.com>
To: dev@hbase.apache.org
Content-Type: multipart/alternative; boundary=20cf3054a87d94a663049aebdcbd

--20cf3054a87d94a663049aebdcbd
Content-Type: text/plain; charset=ISO-8859-1

Last night, four reduce tasks failed with 'All datanodes .. are bad'
although no region server came down.

I wanted to try out hadoop-core-0.20-append-r1056497.jar while preserving
the existing data on HDFS.

I renamed hadoop-core-0.20.2+320.jar on all nodes under $HADOOP_HOME and
copied hadoop-core-0.20-append-r1056497.jar to $HADOOP_HOME.
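
A minimal sketch of that swap for a single node, in case the exact steps
matter. The function name and the ".bak" suffix are assumptions made for
illustration (the rename target is not stated above); the only point is that
the old jar no longer ends in ".jar" and the new one sits next to it.

```python
import shutil
from pathlib import Path


def swap_hadoop_core_jar(hadoop_home, old_jar, new_jar_path, suffix=".bak"):
    """Rename old_jar under hadoop_home, then copy the new jar alongside it.

    Hypothetical helper: the ".bak" suffix is an assumption, not the
    rename that was actually used on the cluster.
    Returns (renamed_path, installed_path) as Path objects.
    """
    home = Path(hadoop_home)
    old = home / old_jar
    new_src = Path(new_jar_path)

    # Fail before touching anything if either jar is missing.
    if not old.is_file():
        raise FileNotFoundError(old)
    if not new_src.is_file():
        raise FileNotFoundError(new_src)

    # Keep the old jar for rollback, under a name that does not end in ".jar".
    renamed = old.with_name(old.name + suffix)
    old.rename(renamed)

    # Copy the replacement jar into $HADOOP_HOME, keeping its file name.
    installed = Path(shutil.copy2(new_src, home / new_src.name))
    return renamed, installed
```

This only covers the file shuffle on one machine; it would have to be run on
every node, with Hadoop stopped, and it says nothing about whether the two
builds are compatible on disk or over RPC.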

After restarting Hadoop, jobtracker.jsp returned Error 503.
dfshealth.jsp is accessible and shows all data nodes.
I verified that the namenode is out of safe mode.

Here is the tail of the jobtracker log: http://pastebin.com/2sJv07wy

Here is the tail of the namenode log: http://pastebin.com/M5nv2fEy

Here is the stack trace for the jobtracker: http://pastebin.com/xhadk1YA

Here is the jstack output for the namenode: http://pastebin.com/0CmE4qkV

Since hadoop-core-0.20-append-r1056497.jar came with the HBase 0.90 release,
I would like to get some opinions here before posting elsewhere.
Hopefully someone can recommend the correct upgrade procedure.

Thanks

Here is fsck output:
Status: HEALTHY
 Total size:    1775777698618 B (Total open files size: 866 B)
 Total dirs:    28384
 Total files:   306547 (Files currently being written: 8)
 Total blocks (validated):      312296 (avg. block size 5686200 B) (Total open file blocks (not validated): 4)
 Minimally replicated blocks:   312296 (100.0 %)
 Over-replicated blocks:        1 (3.2020904E-4 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.000003
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          7
 Number of racks:               1


The filesystem under path '/' is HEALTHY
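
For anyone who wants to check a summary like this mechanically, here is a
rough sketch that pulls the "Key: value" fields out of the text above. The
function names are made up for illustration, and the parsing assumes the
layout shown in this message, not any documented fsck output format.

```python
import re

# "Key: value" at the start of a line; an optional " (validated)" after the
# key is dropped so that "Total blocks (validated)" becomes "Total blocks".
_FIELD = re.compile(r"\s*([A-Za-z][A-Za-z \-]*?)(?: \(validated\))?:\s+(\S+)")


def parse_fsck_summary(text):
    """Return a dict of the summary fields found in fsck-style text."""
    summary = {}
    for line in text.splitlines():
        m = _FIELD.match(line)
        if not m:
            continue
        key, value = m.group(1).strip(), m.group(2)
        if key == "Status":
            summary[key] = value
            continue
        try:
            summary[key] = float(value) if "." in value else int(value)
        except ValueError:
            # Not a numeric field; ignore it.
            pass
    return summary


def looks_healthy(summary):
    """True if the status is HEALTHY and nothing is corrupt, missing, or under-replicated."""
    return (
        summary.get("Status") == "HEALTHY"
        and summary.get("Corrupt blocks") == 0
        and summary.get("Under-replicated blocks") == 0
        and summary.get("Missing replicas") == 0
    )
```

By these checks the output above counts as healthy, which only says that the
block data survived the jar swap; it does not explain the 503 from
jobtracker.jsp.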

On Thu, Jan 27, 2011 at 10:02 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Ted,
>
> I don't see anything relating to hbase in that log so... how can we be of
> help?
>
> J-D
>
> On Thu, Jan 27, 2011 at 3:41 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > This seems to be the last issue blocking hbase 0.90 upgrade.
> >
> > Please comment.
> >
> > On Mon, Jan 24, 2011 at 4:58 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> >> Hi,
> >> Running 0.90 in dev cluster where I used cdh3b2 hadoop jar, I frequently
> >> saw the following in reduce task log:
> >>
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 22:55:39,009
> >> INFO com.carrieriq.m2m.platform.mmp3.output.DimensionMapper: Total
> >> requets=15523640 cache hit ratio=0.84543097 avg time=90.1465879780713
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,216
> >> WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
> >> exception  for block blk_8207645655823156697_2836871java.io.IOException: Bad
> >> response 1 for block blk_8207645655823156697_2836871 from datanode
> >> 10.202.50.71:50010
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) -       at
> >> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2497)
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) -
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,217
> >> WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
> >> blk_8207645655823156697_2836871 bad datanode[1] 10.202.50.71:50010
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,217
> >> WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
> >> blk_8207645655823156697_2836871 in pipeline 10.202.50.78:50010,
> >> 10.202.50.71:50010: bad datanode 10.202.50.71:50010
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,252
> >> INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /
> >> 10.202.50.78:50020. Already tried 0 time(s).
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:27:27,931
> >> WARN org.apache.hadoop.mapred.TaskRunner: Parent died.  Exiting
> >>
> >> HDFS-895 <https://issues.apache.org/jira/browse/HDFS-895> is in
> >> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320.releasenotes.html
> >>
> >> Expert opinion on what I saw is appreciated.
> >>
> >
>

--20cf3054a87d94a663049aebdcbd--