Return-Path: Delivered-To: apmail-hadoop-hbase-dev-archive@minotaur.apache.org Received: (qmail 51538 invoked from network); 2 Oct 2009 14:25:49 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 2 Oct 2009 14:25:49 -0000 Received: (qmail 15773 invoked by uid 500); 2 Oct 2009 14:25:48 -0000 Delivered-To: apmail-hadoop-hbase-dev-archive@hadoop.apache.org Received: (qmail 15703 invoked by uid 500); 2 Oct 2009 14:25:48 -0000 Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hbase-dev@hadoop.apache.org Delivered-To: mailing list hbase-dev@hadoop.apache.org Received: (qmail 15691 invoked by uid 99); 2 Oct 2009 14:25:48 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 02 Oct 2009 14:25:48 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 02 Oct 2009 14:25:45 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 7A447234C004 for ; Fri, 2 Oct 2009 07:25:23 -0700 (PDT) Message-ID: <1082491813.1254493523495.JavaMail.jira@brutus> Date: Fri, 2 Oct 2009 07:25:23 -0700 (PDT) From: "stack (JIRA)" To: hbase-dev@hadoop.apache.org Subject: [jira] Created: (HBASE-1882) Put bad edits into a failed edits log rather than puke and crash MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org Put bad edits into a failed edits log rather than puke and crash ---------------------------------------------------------------- Key: HBASE-1882 URL: https://issues.apache.org/jira/browse/HBASE-1882 Project: Hadoop HBase Issue Type: Bug Reporter: stack Bits got flipped (it seems) on a read so a key going into hfile was < the previous. We throw an exception that causes HRS restart. Just put bad edits aside rather than die. >From elsif up on list: {code} Our HBase system ended up in a looping situation trying to continuously re-assign a damaged region across the HBase cluster. We could not properly scan or store data in the affected table. The triggering event that caused this cascade of errors was an java.io.IOException: Added a key not lexically larger than previous >From the HBase shell "scan '.META.' command we confirmed the name of the damaged encoded region stored in hdfs. In an attempt to fix this, the data directory for the impacted region was moved off hdfs and the region was able to be restarted with a blank slate. Is there a better way to handle this type of failure? Is there a way to generate an hlog to re-import the data files we moved away? HBase Version: 0.20.0, r805538 Hadoop Version: 0.20.0-plus4681, r767961 Here are the log entries leading up to the event: 2009-09-25 06:57:53,836 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/hfs-030035,60020,1253122636703/hlog.dat.1253883473798, entries=1479, calcsize=49659443, filesize=49589196. New hlog /hbase/.logs/hfs-030035,60020,1253122636703/hlog.dat.1253887073833 2009-09-25 07:05:50,089 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of fc_test,\x2Fdata\x2Fmalware\x2F873fc5b61af807827381f120ad7d746984d49426eb3c8b5523b6142285ee0844027f6b45eb8f18aff21b52071a9664e05ceca112a9ff5949c16cda3f27c9c7c2,1253728360248 because global memstore limit of 396.9m exceeded; currently 397.0m and flushing till 24 8.1m 2009-09-25 07:05:53,294 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of fc_test,\x2Fdata\x2Fdir\x2Fdfd2e9615461d03ba7c22d6d804ec23c3dba1587ffb91736d0e90df0b8aa659d6f55efe49552a83271d9e115bd1d706b865dc42008 52493400819628a7be0fc2\x2F2009-09-23_143101\x2Fxml,1253765882755 because global memstore limit of 396.9m exceeded; currently 357.1m and flushing till 248.1m 2009-09-25 07:05:55,583 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing serve r shutdown org.apache.hadoop.hbase.DroppedSnapshotException: region: fc_test,\x2Fdata\x2Fdir\x2Fdfd2e9615461d03ba7c22d6d804ec23c3dba1587ffb91736d0e90df0b8aa659d6f55efe49552a83271d9e115bd1d706b865dc4200852493400819628a7be0fc2\x2F2009-09-23_143101\ x2Fxml,1253765882755 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:950) at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:843) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushSomeRegions(MemStoreFlusher.java:352) at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:321) at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1809) at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) Caused by: java.io.IOException: Added a key not lexically larger than previous key=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4^D11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206^Ejson:^@^@^A#?M?)^D, lastkey=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea^Ejson:^@^@^A#?M?#^D at org.apache.hadoop.hbase.io.hfile.HFile$Writer.checkKey(HFile.java:517) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:479) at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:447) at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:525) at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:489) 2009-09-25 07:05:55,608 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=15, stores=30, storefiles=43, storefileIndexSize=3, memstoreSize=396, usedHeap=540, maxHeap=992, blockCacheSize=73256760, blo ckCacheFree=967258312, blockCacheCount=1128, blockCacheHitRatio=90 2009-09-25 07:05:55,609 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush failed 2009-09-25 07:05:55,684 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020 Here are the log entries after restarting: 2009-09-28 11:36:16,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: starting 2009-09-28 11:36:16,620 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.NotServingRegionException: [B@1a001ff at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionS erver.java:2263) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer. java:1767) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) {code} .... and {code} stack wrote: > On Mon, Sep 28, 2009 at 3:27 PM, elsif wrote: > > >> Our HBase system ended up in a looping situation trying to continuously >> re-assign a damaged region across the HBase cluster. We could not properly >> scan or store data in the affected table. >> >> The triggering event that caused this cascade of errors was an >> java.io.IOException: Added a key not lexically larger than previous >> >> > > > Here are the offending keys purportedly: > > key=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4^D11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206^Ejson:^@^@^A#?M?)^D, > lastkey=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea^Ejson:^@^@^A#?M?#^D > > > > It seems like keys are fine till we get to '^D'. Can you make these keys > or comment on them? The '^D' is a printable version of whatever the bit of > binary was here. Do you have an idea what it was? Can you remanufacture > this condition? Something in our comparator is messing up? Is that > possible? > > The keys are all plain text strings with no special characters. Not sure where the '^D' would come from since the same processes is used to generate all the keys. > This is in .META. table? > This is from a regular table. > > > > > > >From the HBase shell "scan '.META.' command we confirmed the name of the > >> damaged encoded >> region stored in hdfs. In an attempt to fix this, the data directory for >> the impacted region >> was moved off hdfs and the region was able to be restarted with a blank >> slate. >> >> Is there a better way to handle this type of failure? >> >> >> > > There is a script that will repair the broke files rewriting them removing > the offending edit. I'd point you at the script only its up in an Apache > JIRA and thats sick at the moment. > > You could try running: > > ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile > > It has diagnostic and outputting facility. Pass it the bad files. > > > I scanned each of the files with the -k option, no warnings were generated. I also extracted all the key values from each file - none of them appear to contain the key with the '^D'. The 'key' and 'lastkey' listed above were contained in the oldlogfile.log. I opened the oldlogfile.log with a hex editor and verified that the key does not contain any binary characters where the '^D' is shown in the error log. The character is actually a lowercase 'd': /data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206 It would seem this was a read error of some kind. > > >> Is there a way to generate an hlog to re-import the data files we moved >> away? >> >> >> > > Above mentioned script is probably the better way to go. > > > > >> HBase Version: 0.20.0, r805538 >> Hadoop Version: 0.20.0-plus4681, r767961 >> >> >> > Are these release 0.20.0? > The hadoop is release 0.20.0 - the hbase is a pre-release svn checkout. > St.Ack > > > > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.