Mailing-List: contact hbase-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-dev@hadoop.apache.org
Message-ID: <1082491813.1254493523495.JavaMail.jira@brutus>
Date: Fri, 2 Oct 2009 07:25:23 -0700 (PDT)
From: "stack (JIRA)" <jira@apache.org>
To: hbase-dev@hadoop.apache.org
Subject: [jira] Created: (HBASE-1882) Put bad edits into a failed edits log
 rather than puke and crash
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Put bad edits into a failed edits log rather than puke and crash
----------------------------------------------------------------

                 Key: HBASE-1882
                 URL: https://issues.apache.org/jira/browse/HBASE-1882
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: stack


Bits got flipped (it seems) on a read so a key going into hfile was < the previous.  We throw an exception that causes HRS restart.  Just put bad edits aside rather than die.

>From elsif up on list:

{code}
Our HBase system ended up in a looping situation trying to continuously
re-assign a damaged region across the HBase cluster. We could not properly
scan or store data in the affected table.

The triggering event that caused this cascade of errors was an
java.io.IOException: Added a key not lexically larger than previous

>From the HBase shell "scan '.META.' command we confirmed the name of the
damaged encoded
region stored in hdfs. In an attempt to fix this, the data directory for
the impacted region
was moved off hdfs and the region was able to be restarted with a blank
slate.

Is there a better way to handle this type of failure?

Is there a way to generate an hlog to re-import the data files we moved
away?

HBase Version: 0.20.0, r805538
Hadoop Version: 0.20.0-plus4681, r767961

Here are the log entries leading up to the event:

2009-09-25 06:57:53,836 INFO org.apache.hadoop.hbase.regionserver.HLog:
Roll /hbase/.logs/hfs-030035,60020,1253122636703/hlog.dat.1253883473798,
entries=1479, calcsize=49659443, filesize=49589196. New hlog
/hbase/.logs/hfs-030035,60020,1253122636703/hlog.dat.1253887073833
2009-09-25 07:05:50,089 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
fc_test,\x2Fdata\x2Fmalware\x2F873fc5b61af807827381f120ad7d746984d49426eb3c8b5523b6142285ee0844027f6b45eb8f18aff21b52071a9664e05ceca112a9ff5949c16cda3f27c9c7c2,1253728360248

because global memstore limit of 396.9m exceeded; currently 397.0m and
flushing till 24
8.1m


2009-09-25 07:05:53,294 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
fc_test,\x2Fdata\x2Fdir\x2Fdfd2e9615461d03ba7c22d6d804ec23c3dba1587ffb91736d0e90df0b8aa659d6f55efe49552a83271d9e115bd1d706b865dc42008

52493400819628a7be0fc2\x2F2009-09-23_143101\x2Fxml,1253765882755 because
global memstore limit of 396.9m exceeded; currently 357.1m and flushing
till
248.1m

2009-09-25 07:05:55,583 FATAL
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required. Forcing serve
r
shutdown

org.apache.hadoop.hbase.DroppedSnapshotException: region:
fc_test,\x2Fdata\x2Fdir\x2Fdfd2e9615461d03ba7c22d6d804ec23c3dba1587ffb91736d0e90df0b8aa659d6f55efe49552a83271d9e115bd1d706b865dc4200852493400819628a7be0fc2\x2F2009-09-23_143101\

x2Fxml,1253765882755


at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:950)

at
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:843)
   at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)

at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushSomeRegions(MemStoreFlusher.java:352)

at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:321)

   at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1809)


   at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
   at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
Caused by: java.io.IOException: Added a key not lexically larger than
previous
key=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4^D11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206^Ejson:^@^@^A#?M?)^D,

lastkey=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea^Ejson:^@^@^A#?M?#^D

   at
org.apache.hadoop.hbase.io.hfile.HFile$Writer.checkKey(HFile.java:517)
   at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:479)
   at org.apache.hadoop.hbase.io.hfile.HFile$Writer.append(HFile.java:447)
   at
org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:525)

   at
org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:489)
2009-09-25 07:05:55,608 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=15,
 stores=30, storefiles=43, storefileIndexSize=3, memstoreSize=396,
usedHeap=540, maxHeap=992, blockCacheSize=73256760, blo
ckCacheFree=967258312, blockCacheCount=1128, blockCacheHitRatio=90
2009-09-25 07:05:55,609 WARN
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush failed
2009-09-25 07:05:55,684 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020


Here are the log entries after restarting:

2009-09-28 11:36:16,608 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 0 on 60020: starting
2009-09-28 11:36:16,620 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.NotServingRegionException:
[B@1a001ff               at
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionS
erver.java:2263)

at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.
java:1767)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown
Source)                  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
       at java.lang.reflect.Method.invoke(Unknown
Source)                              at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:650)
       at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

{code}

....

and

{code}
stack wrote:
> On Mon, Sep 28, 2009 at 3:27 PM, elsif <elsif.then@gmail.com> wrote:
>
>
>> Our HBase system ended up in a looping situation trying to continuously
>> re-assign a damaged region across the HBase cluster. We could not properly
>> scan or store data in the affected table.
>>
>> The triggering event that caused this cascade of errors was an
>> java.io.IOException: Added a key not lexically larger than previous
>>
>>
>
>
> Here are the offending keys purportedly:
>
> key=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4^D11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206^Ejson:^@^@^A#?M?)^D,
> lastkey=^@?/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea^Ejson:^@^@^A#?M?#^D
>
>
>
> It seems like keys are fine till we get to '^D'.    Can you make these keys
> or comment on them?  The '^D' is a printable version of whatever the bit of
> binary was here.  Do you have an idea what it was?  Can you remanufacture
> this condition?  Something in our comparator is messing up?  Is that
> possible?
>
>

The keys are all plain text strings with no special characters.  Not
sure where the '^D' would come from since the same processes is used to
generate all the keys.
> This is in .META. table?
>

This is from a regular table.
>
>
>
>
>
> >From the HBase shell "scan '.META.' command we confirmed the name of the
>
>> damaged encoded
>> region stored in hdfs. In an attempt to fix this, the data directory for
>> the impacted region
>> was moved off hdfs and the region was able to be restarted with a blank
>> slate.
>>
>> Is there a better way to handle this type of failure?
>>
>>
>>
>
> There is a script that will repair the broke files rewriting them removing
> the offending edit.  I'd point you at the script only its up in an Apache
> JIRA and thats sick at the moment.
>
> You could try running:
>
> ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
>
> It has diagnostic and outputting facility.  Pass it the bad files.
>
>
>
I scanned each of the files with the -k option, no warnings were generated.

I also extracted all the key values from each file - none of them appear
to contain the key with the '^D'.

The 'key' and 'lastkey' listed above were contained in the
oldlogfile.log.  I opened the oldlogfile.log with a hex editor and
verified that the key does not contain any binary characters where the
'^D' is shown in the error log.  The character is actually a lowercase 'd':

/data/dir/e22677afdb73cc17ed82a974058859e5e74b78ca5a7917cceaa500d2dd3198094ecf91792e8ddafc4768cfb4d11ff79bb955c50b99c8f80fdff0b4beb413d8ea/2009-09-25_034206

It would seem this was a read error of some kind.

>
>
>> Is there a way to generate an hlog to re-import the data files we moved
>> away?
>>
>>
>>
>
> Above mentioned script is probably the better way to go.
>
>
>
>
>> HBase Version: 0.20.0, r805538
>> Hadoop Version: 0.20.0-plus4681, r767961
>>
>>
>>
> Are these release 0.20.0?
>
The hadoop is release 0.20.0 - the hbase is a pre-release svn checkout.
> St.Ack
>
>
>
>
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.