Date: Thu, 29 Mar 2012 17:10:28 +0000 (UTC)
From: "Cosmin Lehene (Updated) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <327614761.33518.1333041028115.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <135070037.29540.1332960449814.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated
HBASE-5665:
---------------------------------

    Affects Version/s: 0.94.1
                       0.96.0
                       0.94.0

0.94 and trunk seem to suffer from this as well: they do not check whether the parent has references.

> Repeated split causes HRegionServer failures and breaks table
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>
>
> Repeated splits on large tables (2 consecutive would suffice) will essentially "break" the table (and the cluster) unrecoverably.
> The regionserver doing the split dies, and the master gets into an infinite loop trying to assign regions whose files appear to be missing from HDFS.
> The table can be disabled once. Upon trying to re-enable it, it remains in an intermediate state forever.
> I was able to reproduce this consistently on a smaller table.
> {code}
> hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
> hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
> {code}
> Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"...) will reproduce the issue almost instantly and consistently.
> {code}
> 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
> 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1.. compaction_queue=(0:1), split_queue=10
> 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
> java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
>     at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
>     at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
>     at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>     at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
>     at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
>     at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
>     at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
>     at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
>     at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
>     at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
>     at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
>     at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
>     at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
>     at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
>     ... 1 more
> 2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
> {code}
> http://hastebin.com/diqinibajo.avrasm
> later edit:
> (I'm using the last 4 characters from each string)
> Region 94e3 has storefile 7237
> Region 94e3 gets split into daughters a: ffa1 and b: eee1
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77
> ffa1 has a reference: 7237.94e3 for its store file
> When ffa1 gets split, it creates another reference: 7237.94e3.ffa1
> When SplitTransaction runs execute(), it tries to open that reference (openDaughters above) and matches it from left to right as [storefile].[region]
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> and will attempt to go to /hbase/t1/[region], which resolves to
> /hbase/t1/94e3.ffa1/f1/7237 - which obviously doesn't exist, so the open fails.
> This seems like a design problem: we should either refuse to split when the path is a reference, or be able to recursively resolve reference paths (e.g. parse right to left 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/f1/7237)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
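
The reference-name matching described in the "later edit" of the issue can be illustrated with a small standalone sketch. It uses the regex quoted in the issue and the shortened names from the example (7237, 94e3, ffa1). The class name `ReferenceNameSketch` and both helper methods are hypothetical and written only for illustration; this is not HBase's actual code and not the attached patch, just one way to show why the left-to-right match produces a nonexistent path and what a right-to-left resolution might look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of HBase.
class ReferenceNameSketch {

    // Regex quoted in the issue: group 1 is the store file name,
    // group 2 is everything after the FIRST dot.
    static final Pattern LEFT_TO_RIGHT =
            Pattern.compile("^([0-9a-f]+)(?:\\.(.+))?$");

    // Mimics the left-to-right match: everything after the first dot
    // is treated as a single region name.
    static String resolveLeftToRight(String tableDir, String family, String name) {
        Matcher m = LEFT_TO_RIGHT.matcher(name);
        if (!m.matches() || m.group(2) == null) {
            throw new IllegalArgumentException("not a reference name: " + name);
        }
        return tableDir + "/" + m.group(2) + "/" + family + "/" + m.group(1);
    }

    // Assumed alternative from the issue text: peel off the LAST dot-separated
    // component as the region, then repeat on what remains.
    // Returns the paths that would be opened, in order.
    static List<String> resolveRightToLeft(String tableDir, String family, String name) {
        List<String> hops = new ArrayList<>();
        String current = name;
        int dot;
        while ((dot = current.lastIndexOf('.')) >= 0) {
            String region = current.substring(dot + 1);
            current = current.substring(0, dot);
            hops.add(tableDir + "/" + region + "/" + family + "/" + current);
        }
        return hops;
    }

    public static void main(String[] args) {
        String name = "7237.94e3.ffa1";
        // prints /hbase/t1/94e3.ffa1/f1/7237 (the path that does not exist)
        System.out.println(resolveLeftToRight("/hbase/t1", "f1", name));
        // prints /hbase/t1/ffa1/f1/7237.94e3 then /hbase/t1/94e3/f1/7237
        for (String hop : resolveRightToLeft("/hbase/t1", "f1", name)) {
            System.out.println(hop);
        }
    }
}
```

The sketch only builds path strings; a real fix would also have to decide whether to follow such chains at all or, as the issue suggests as the other option, refuse to split a region that still holds references.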