From: Ted Yu
To: dev@hbase.apache.org
Date: Fri, 28 Jan 2011 10:01:55 -0800
Subject: Re: HBASE-3234 and bad datanode error

Last night, four reduce tasks failed with 'All datanodes .. are bad'
although no region server came down.

I wanted to try out hadoop-core-0.20-append-r1056497.jar and at the same
time preserve data on hdfs.
I renamed hadoop-core-0.20.2+320.jar on all nodes under $HADOOP_HOME and
copied hadoop-core-0.20-append-r1056497.jar to $HADOOP_HOME.
After restarting hadoop, jobtracker.jsp gave me Error 503.
dfshealth.jsp is accessible and shows all data nodes. I verified that
namenode is out of safemode.

Here is tail of jobtracker log:
http://pastebin.com/2sJv07wy
Here is tail of namenode log:
http://pastebin.com/M5nv2fEy
Here is stack trace for job tracker:
http://pastebin.com/xhadk1YA
Here is jstack for namenode:
http://pastebin.com/0CmE4qkV

Since hadoop-core-0.20-append-r1056497.jar came with 0.90 release, I want
to get some opinion here before posting elsewhere.
Hopefully someone would recommend the correct upgrade procedure.

Thanks

Here is fsck output:

Status: HEALTHY
 Total size: 1775777698618 B (Total open files size: 866 B)
 Total dirs: 28384
 Total files: 306547 (Files currently being written: 8)
 Total blocks (validated): 312296 (avg. block size 5686200 B) (Total open file blocks (not validated): 4)
 Minimally replicated blocks: 312296 (100.0 %)
 Over-replicated blocks: 1 (3.2020904E-4 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.000003
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 7
 Number of racks: 1

The filesystem under path '/' is HEALTHY

On Thu, Jan 27, 2011 at 10:02 AM, Jean-Daniel Cryans wrote:

> Ted,
>
> I don't see anything relating to hbase in that log so... how can we be of
> help?
>
> J-D
>
> On Thu, Jan 27, 2011 at 3:41 AM, Ted Yu wrote:
> > This seems to be the last issue blocking hbase 0.90 upgrade.
> >
> > Please comment.
> >
> > On Mon, Jan 24, 2011 at 4:58 PM, Ted Yu wrote:
> >
> >> Hi,
> >> Running 0.90 in dev cluster where I used cdh3b2 hadoop jar, I frequently
> >> saw the following in reduce task log:
> >>
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 22:55:39,009
> >> INFO com.carrieriq.m2m.platform.mmp3.output.DimensionMapper: Total
> >> requets=15523640 cache hit ratio=0.84543097 avg time=90.1465879780713
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,216
> >> WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
> >> exception for block blk_8207645655823156697_2836871java.io.IOException: Bad
> >> response 1 for block blk_8207645655823156697_2836871 from datanode
> >> 10.202.50.71:50010
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - at
> >> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2497)
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) -
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,217
> >> WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
> >> blk_8207645655823156697_2836871 bad datanode[1] 10.202.50.71:50010
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,217
> >> WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
> >> blk_8207645655823156697_2836871 in pipeline 10.202.50.78:50010,
> >> 10.202.50.71:50010: bad datanode 10.202.50.71:50010
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:17:03,252
> >> INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /
> >> 10.202.50.78:50020. Already tried 0 time(s).
> >> INFO [2011-01-24 15:27:39] (ExecUtil.java:258) - 2011-01-24 23:27:27,931
> >> WARN org.apache.hadoop.mapred.TaskRunner: Parent died. Exiting
> >>
> >> HDFS-895 is in
> >> http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320.releasenotes.html
> >>
> >> Expert opinion on what I saw is appreciated.
> >>
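
The DFSClient warnings quoted in this thread name the datanode that was dropped from the write pipeline. When many reduce task logs have to be checked, it can help to tally those warnings per datanode address. Below is a minimal Python sketch of that idea. The regular expression is inferred only from the two "bad datanode" lines quoted in the thread and may not cover other Hadoop versions; `count_bad_datanodes` and `SAMPLE_LOG` are hypothetical names used for illustration, not part of Hadoop or HBase.

```python
import re
from collections import Counter

# Assumption: the address follows "bad datanode" or "bad datanode[N]",
# as in the two DFSClient "Error Recovery" warnings quoted in the thread.
BAD_DATANODE = re.compile(
    r"bad datanode(?:\[\d+\])?\s+(\d{1,3}(?:\.\d{1,3}){3}:\d+)"
)


def count_bad_datanodes(lines):
    """Count how often each datanode address is reported as bad."""
    counts = Counter()
    for line in lines:
        for addr in BAD_DATANODE.findall(line):
            counts[addr] += 1
    return counts


# Log lines copied from the thread, with the mail line wraps joined.
SAMPLE_LOG = [
    "WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor "
    "exception for block blk_8207645655823156697_2836871java.io.IOException: "
    "Bad response 1 for block blk_8207645655823156697_2836871 from datanode "
    "10.202.50.71:50010",
    "WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block "
    "blk_8207645655823156697_2836871 bad datanode[1] 10.202.50.71:50010",
    "WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block "
    "blk_8207645655823156697_2836871 in pipeline 10.202.50.78:50010, "
    "10.202.50.71:50010: bad datanode 10.202.50.71:50010",
]

if __name__ == "__main__":
    # Print each reported address with its warning count, most frequent first.
    for addr, n in count_bad_datanodes(SAMPLE_LOG).most_common():
        print(addr, n)
```

Only the two "Error Recovery" lines are counted; the "Bad response" line is deliberately ignored so that a single pipeline failure is not tallied three times. A datanode that dominates the tally across many task logs would be the first place to look at the datanode-side log.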