From: Todd Lipcon
Date: Mon, 10 May 2010 22:01:11 -0700
Subject: Re: HBase fail-over/reliability issues
To: hbase-user@hadoop.apache.org

Hi James,

I'd recommend just the following in your log4j properties to tone down
the log volume:

log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN

This will keep the INFO-level logs that are very useful for debugging
the issue, but shouldn't generate GB after GB.

I'm both sorry and glad to hear your problem reproduces every day. The
bad news is you've got a problem; the good news is we should be able to
get to the bottom of it much faster! :)

-Todd
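A minimal sketch of rolling that change out, assuming CDH's packaged
layout; the config path and service names below are assumptions, not
details from the thread:

# Assumption: the live config is /etc/hadoop/conf/log4j.properties (CDH default).
sudo tee -a /etc/hadoop/conf/log4j.properties <<'EOF'
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN
EOF

# log4j config is read at daemon startup, so restart the daemons for it
# to take effect (service names are an assumption based on CDH packaging).
sudo /etc/init.d/hadoop-0.20-namenode restart
sudo /etc/init.d/hadoop-0.20-datanode restart

Note the clienttrace override only helps if it reaches every datanode's
config, however that's distributed in your setup; the root logger stays
at INFO so the block-level NN/DN messages are kept.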
On Mon, May 10, 2010 at 9:59 AM, James Baldassari wrote:
> Hi Todd,
>
> Our log files were getting to be several gigabytes in size at the INFO
> level (particularly the datanode logs), so we changed the log level in
> all log4j configs to WARN. Do you think we're potentially missing some
> useful information at INFO and lower? I could lower the log level if
> you think it would help.
>
> I did go back as far as I could in the logs to try to find out what
> was going on as soon as the problem appeared. I didn't see anything
> interesting, possibly because the log level was at WARN. The only two
> errors I see are these:
>
> 2010-05-06 19:15:56,176 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.25.119.115:50010,
> storageID=DS-2139827131-10.25.119.115-50010-1266376030910,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.EOFException
>
> This happens once every 5 minutes like clockwork. I'm not sure if this
> has anything to do with the problem. Then all of a sudden those errors
> about the block being invalid started showing up:
>
> 2010-05-06 19:17:37,946 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.25.119.115:50010,
> storageID=DS-2139827131-10.25.119.115-50010-1266376030910,
> infoPort=50075, ipcPort=50020):Got exception while serving
> blk_-26696347696547536_49275 to /10.24.166.67:
> java.io.IOException: Block blk_-26696347696547536_49275 is not valid.
>
> So let me know if I should change the log level and whether I need to
> enable NN clienttrace. I don't think I'll have to wait long for this
> problem to reappear. It seems to be happening almost every day now.
>
> Thanks,
> James
>
>
> On Sat, May 8, 2010 at 6:43 PM, Todd Lipcon wrote:
>
>> Hi James,
>>
>> You'll need to go farther back in the logs to find what happened to
>> the block that caused it to get deleted. All of the logs below are
>> too late (the block's already gone, we need to figure out why).
>>
>> Can you look backwards through the past several days of the NN logs?
>> Have you disabled the NN clienttrace log in log4j.properties?
>>
>> -Todd
>>
>> On Fri, May 7, 2010 at 11:17 PM, James Baldassari wrote:
>>
>> > OK, these logs are huge, so I'm just going to post the first 1,000
>> > lines from each for now. Let me know if it would be helpful to have
>> > more. The namenode logs didn't contain either of the strings you
>> > were interested in. A few of the datanode logs had
>> > '4841840178880951849':
>> >
>> > http://pastebin.com/4MxzkxjW
>> > http://pastebin.com/C7XErfTJ
>> > http://pastebin.com/tn8pk2Xa
>> >
>> > One of the region servers had log entries about the region 73382377:
>> >
>> > http://pastebin.com/8uh6hQxQ
>> >
>> > We rebooted the name node as well as all data nodes and region
>> > servers; then we ran a major compaction on both of our tables. The
>> > cluster seems stable at this point, but I have a bad feeling that
>> > this problem may hit us again soon because it has happened twice in
>> > the past two days now.
>> >
>> > -James
>> >
>> >
>> > On Sat, May 8, 2010 at 12:30 AM, Todd Lipcon wrote:
>> >
>> > > If you can grep for '4841840178880951849' as well as
>> > > /hbase/users/73382377/data/312780071564432169 across all of your
>> > > datanode logs plus your NN, and put that online somewhere, that
>> > > would be great. If you can grep with -C 20 to get some context
>> > > that would help as well.
>> > >
>> > > Grepping for the region in question (73382377) in the RS logs
>> > > would also be helpful.
>> > >
>> > > Thanks
>> > > -Todd
>> > >
>> > > On Fri, May 7, 2010 at 9:16 PM, James Baldassari
>> > > <jbaldassari@gmail.com> wrote:
>> > >
>> > > > On Sat, May 8, 2010 at 12:02 AM, Stack wrote:
>> > > >
>> > > > > On Fri, May 7, 2010 at 8:27 PM, James Baldassari
>> > > > > <jbaldassari@gmail.com> wrote:
>> > > > > > java.io.IOException: Cannot open filename
>> > > > > > /hbase/users/73382377/data/312780071564432169
>> > > > >
>> > > > > This is the regionserver log? Is this deploying the region?
>> > > > > It fails?
>> > > >
>> > > > This error is on the client accessing HBase. This exception was
>> > > > thrown on a get call to an HTable instance. I'm not sure if it
>> > > > was deploying the region.
>> > > > All I know is that the system had been running with all regions
>> > > > available (as far as I know), and then all of a sudden these
>> > > > errors started showing up on the client.
>> > > >
>> > > > > > Our cluster throughput goes from around 3k requests/second
>> > > > > > down to 500-1000 and does not recover without manual
>> > > > > > intervention. The region server log for that region says:
>> > > > > >
>> > > > > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
>> > > > > > /10.24.166.74:50010 for file
>> > > > > > /hbase/users/73382377/data/312780071564432169 for block
>> > > > > > -4841840178880951849:java.io.IOException: Got error in
>> > > > > > response to OP_READ_BLOCK for file
>> > > > > > /hbase/users/73382377/data/312780071564432169 for block
>> > > > > > -4841840178880951849
>> > > > > >
>> > > > > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler
>> > > > > > 40 on 60020, call get([B@25f907b4,
>> > > > > > row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
>> > > > > > timeRange=[0,9223372036854775807), families={(family=data,
>> > > > > > columns=ALL}) from 10.24.117.100:2365: error:
>> > > > > > java.io.IOException: Cannot open filename
>> > > > > > /hbase/users/73382377/data/312780071564432169
>> > > > > > java.io.IOException: Cannot open filename
>> > > > > > /hbase/users/73382377/data/312780071564432169
>> > > > > >
>> > > > > > The datanode log for 10.24.166.74 says:
>> > > > > >
>> > > > > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > > > > > DatanodeRegistration(10.24.166.74:50010,
>> > > > > > storageID=DS-14401423-10.24.166.74-50010-1270741415211,
>> > > > > > infoPort=50075, ipcPort=50020):
>> > > > > > Got exception while serving blk_-4841840178880951849_50277
>> > > > > > to /10.25.119.113:
>> > > > > > java.io.IOException: Block blk_-4841840178880951849_50277
>> > > > > > is not valid.
>> > > > >
>> > > > > What's your hadoop? Is it 0.20.2 or CDH? Any patches?
>> > > >
>> > > > Hadoop is vanilla CDH 2. HBase is 0.20.3 + HBASE-2180.
>> > > >
>> > > > > > Running a major compaction on the users table fixed the
>> > > > > > problem the first time it happened, but this time the major
>> > > > > > compaction didn't fix it, so we're in the process of
>> > > > > > rebooting the whole cluster. I'm wondering a few things:
>> > > > > >
>> > > > > > 1. What could trigger this problem?
>> > > > > > 2. Why can't the system fail over to another
>> > > > > > block/file/datanode/region server? We're using 3x
>> > > > > > replication in HDFS, and we have 8 data nodes which double
>> > > > > > as our region servers.
>> > > > > > 3. Are there any best practices for achieving high
>> > > > > > availability in an HBase cluster? How can I configure the
>> > > > > > system to gracefully (and automatically) handle these types
>> > > > > > of problems?
>> > > > >
>> > > > > Let us know what your hadoop is and then we can figure out
>> > > > > more on the issues above.
>> > > >
>> > > > If you need complete stack traces or any additional
>> > > > information, please let me know.
>> > > >
>> > > > > Thanks James,
>> > > > > St.Ack
>> > > > > P.S. It's an eight-node cluster on what kinda hw?
>> > > > > (You've probably said in the past and I can dig through
>> > > > > mail -- just say -- and then what kind of loading are you
>> > > > > applying? Ditto for if you've said this already.)
>> > >
>> > > --
>> > > Todd Lipcon
>> > > Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
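A sketch of the grep pass Todd asks for above, under assumed log
locations (/var/log/hadoop and /var/log/hbase; adjust the paths and
filename patterns to your layout):

# Trace the bad block and the HFile it backed, with 20 lines of context,
# across every datanode log plus the namenode log.
grep -C 20 -e '4841840178880951849' \
    -e '/hbase/users/73382377/data/312780071564432169' \
    /var/log/hadoop/hadoop-*-datanode-*.log* \
    /var/log/hadoop/hadoop-*-namenode-*.log*

# Trace the affected region in the regionserver logs.
grep -C 20 '73382377' /var/log/hbase/hbase-*-regionserver-*.log*

It can also help to ask HDFS directly whether it still considers the
file healthy and fully replicated:

hadoop fsck /hbase/users/73382377/data/312780071564432169 -files -blocks -locations

If fsck reports the block missing or corrupt, the data really is gone
and the question is what deleted it; if it shows three healthy
replicas, the question becomes why reads keep landing on a datanode
that claims the block "is not valid".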