hbase-issues mailing list archives

From "Cosmin Lehene (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table
Date Wed, 28 Mar 2012 18:47:29 GMT
Repeated split causes HRegionServer failures and breaks table 
--------------------------------------------------------------

                 Key: HBASE-5665
                 URL: https://issues.apache.org/jira/browse/HBASE-5665
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.92.1, 0.92.0
            Reporter: Cosmin Lehene
            Priority: Blocker


Repeated splits on large tables (two consecutive splits suffice) will essentially "break" the
table (and the cluster) beyond recovery.
The regionserver doing the split dies, and the master gets into an infinite loop trying
to assign regions whose files appear to be missing from HDFS.

The table can be disabled once. Upon trying to re-enable it, it remains in an intermediate
state forever.

I was able to reproduce this on a smaller table consistently.

{code}
hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
{code}

Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) will reproduce
the issue almost instantly and consistently. 
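
As a rough sketch of the overlapping pattern described above, the helpers below build one key list per worker ("#{x*10+1}", "#{x*10+2}", ...) and run the lists concurrently. The names `overlapping_split_keys` and `run_parallel_splits` are hypothetical, and using threads is an assumption: the report does not say how the splits were run in parallel, and running each key list from a separate shell session is equally plausible.

```ruby
# Hypothetical helper: build one list of split keys per worker.
# Worker k (1..workers) gets the keys "#{x*10+k}" for x in 0..max_x,
# following the "#{x*10+1}", "#{x*10+2}" pattern from the report.
def overlapping_split_keys(workers, max_x)
  (1..workers).map do |k|
    (0..max_x).map { |x| (x * 10 + k).to_s }
  end
end

# Hypothetical helper: run each key list in its own thread and hand
# every key to the caller's block. Threads are an assumption here;
# the report only says the splits ran "in parallel".
def run_parallel_splits(key_lists, &action)
  threads = key_lists.map do |keys|
    Thread.new { keys.each { |key| action.call(key) } }
  end
  threads.each(&:join)
end
```

Inside the HBase shell, assuming the `split` command can be called from a block in the same way as in the loop above, the call might look like `run_parallel_splits(overlapping_split_keys(2, 1000)) { |key| split 't1', key }`.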

{code}
2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
        at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
        at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
        at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
        at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
        at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
        at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
        ... 1 more
2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
{code}


http://hastebin.com/diqinibajo.avrasm

