hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-2244) META gets inconsistent in a number of crash scenarios
Date Thu, 25 Feb 2010 12:35:28 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

stack updated HBASE-2244:
-------------------------

    Attachment: 2244-v5.patch

Can't add unit test for this explicit case.  The actors are the BaseScanner up in Master and
a failing split over in a RegionServer.  The split region needs to be completely coherent;
that is, after split it has to have references in place else fixup won't cut in (that there
are references and there is no entry for a daughter is how we identify this pathology).  Then
we need to have regionserver fail at an explicit location in a cluster test, after offlining
parent but but somewhere in the
middle of adding the daughters to meta.  Our current testing vocabulary doesn't allow us do
this kind of detailed fail (Needs to be fixed but not as part of this patch).

So, I can see the repair cutting in if I doctor the RegionServer with the following patch
to CompactSplitThread:

{code}
Index: src/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java
===================================================================
--- src/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java       (revision
915869)
+++ src/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java       (working copy)
@@ -199,6 +199,13 @@
         Writables.getBytes(newRegions[1].getRegionInfo()));
     t.put(put);
     
+    // If we crash here, then the daughters will not be added and we'll have
+    // and offlined parent but no daughters to take up the slack.  hbase-2244
+    // adds fixup to the metascanners.
+
+    // REMOVE
+    if (true) throw new IOException("Fail to add other regions and to send split to master");
+    
     // Add new regions to META
     for (int i = 0; i < newRegions.length; i++) {
       put = new Put(newRegions[i].getRegionName());

{code}

Notice the throw IOE after parent has been offlined.  This makes it so meta does not get daughter
entries and the master does not get split message.

In TestForceSplit, you'll see the following:

{code}
2010-02-24 19:45:16,685 ERROR [RegionServer:0.compactor] regionserver.CompactSplitThread(104):
Compaction/Split failed for region test,,1267069500020
java.io.IOException: Fail to add other regions and to send split to master
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:207)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:95)
2010-02-24 19:45:17,101 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:18,105 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:19,111 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:20,117 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:21,122 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:22,127 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:23,133 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:24,140 DEBUG [WaitOnSplit] client.TestForceSplit$WaitOnSplit(72): Cycle waiting
on split
2010-02-24 19:45:24,655 INFO  [Thread-103] master.ServerManager$ServerMonitor(129): 1 region
servers, 0 dead, average load 2.0
2010-02-24 19:45:24,920 INFO  [RegionManager.metaScanner] master.BaseScanner(163): RegionManager.metaScanner
scanning meta region {server: 127.0.0.1:59090, regionname: .META.,,1, startKey: <>}
2010-02-24 19:45:24,921 INFO  [RegionManager.rootScanner] master.BaseScanner(163): RegionManager.rootScanner
scanning meta region {server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey: <>}
2010-02-24 19:45:24,941 INFO  [RegionManager.rootScanner] master.BaseScanner(242): RegionManager.rootScanner
scan of 1 row(s) of meta region {server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey:
<>} complete
2010-02-24 19:45:24,942 WARN  [RegionManager.metaScanner] master.BaseScanner(379): Fixup broke
split: Add missing split daughter to meta, daughter=REGION => {NAME => 'test,,1267069516539',
STARTKEY => '', ENDKEY => 'lqg', ENCODED => 904299994, TABLE => {{NAME => 'test',
FAMILIES => [{NAME => 'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, parent=REGION
=> {NAME => 'test,,1267069500020', STARTKEY => '', ENDKEY => '', ENCODED =>
970247455, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test', FAMILIES
=> [{NAME => 'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
2010-02-24 19:45:24,947 WARN  [RegionManager.metaScanner] master.BaseScanner(379): Fixup broke
split: Add missing split daughter to meta, daughter=REGION => {NAME => 'test,lqg,1267069516539',
STARTKEY => 'lqg', ENDKEY => '', ENCODED => 1813617937, TABLE => {{NAME =>
'test', FAMILIES => [{NAME => 'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL
=> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}},
parent=REGION => {NAME => 'test,,1267069500020', STARTKEY => '', ENDKEY => '',
ENCODED => 970247455, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test',
FAMILIES => [{NAME => 'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
2010-02-24 19:45:24,948 INFO  [RegionManager.metaScanner] master.BaseScanner(242): RegionManager.metaScanner
scan of 1 row(s) of meta region {server: 127.0.0.1:59090, regionname: .META.,,1, startKey:
<>} complete
2010-02-24 19:45:24,949 INFO  [RegionManager.metaScanner] master.MetaScanner(132): All 1 .META.
region(s) scanned
2010-02-24 19:45:25,145 DEBUG [WaitOnSplit] client.HConnectionManager$TableServers(780): Cache
hit for row <> in tableName .META.: location server 127.0.0.1:59090, location region
name .META.,,1
2010-02-24 19:45:25,149 DEBUG [WaitOnSplit] client.HTable$ClientScanner(1832): Creating scanner
over .META. starting at key ''
2010-02-24 19:45:25,151 DEBUG [WaitOnSplit] client.HTable$ClientScanner(1928): Advancing internal
scanner to startKey at ''
2010-02-24 19:45:25,151 DEBUG [WaitOnSplit] client.HConnectionManager$TableServers(780): Cache
hit for row <> in tableName .META.: location server 127.0.0.1:59090, location region
name .META.,,1
2010-02-24 19:45:25,157 INFO  [WaitOnSplit] client.TestForceSplit$WaitOnSplit(79): keyvalues={test,,1267069500020/info:regioninfo/1267069516682/Put/vlen=254,
test,,1267069500020/info:server/1267069501165/Put/vlen=15, test,,1267069500020/info:serverstartcode/1267069501165/Put/vlen=8,
test,,1267069500020/info:splitA/1267069516682/Put/vlen=257, test,,1267069500020/info:splitA_checked/1267069524944/Put/vlen=1,
test,,1267069500020/info:splitB/1267069516682/Put/vlen=260, test,,1267069500020/info:splitB_checked/1267069524948/Put/vlen=1}
2010-02-24 19:45:25,158 INFO  [WaitOnSplit] client.TestForceSplit$WaitOnSplit(79): keyvalues={test,,1267069516539/info:regioninfo/1267069524943/Put/vlen=257}
2010-02-24 19:45:25,158 INFO  [WaitOnSplit] client.TestForceSplit$WaitOnSplit(79): keyvalues={test,lqg,1267069516539/info:regioninfo/1267069524948/Put/vlen=260}
2010-02-24 19:45:25,159 DEBUG [WaitOnSplit] client.HTable$ClientScanner(1915): Finished with
scanning at REGION => {NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED
=> 1028785192, TABLE => {{NAME => '.META.', IS_META => 'true', MEMSTORE_FLUSHSIZE
=> '16384', FAMILIES => [{NAME => 'historian', VERSIONS => '2147483647', COMPRESSION
=> 'NONE', TTL => '604800', BLOCKSIZE => '8192', IN_MEMORY => 'false', BLOCKCACHE
=> 'false'}, {NAME => 'info', VERSIONS => '10', COMPRESSION => 'NONE', TTL =>
'2147483647', BLOCKSIZE => '8192', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
2010-02-24 19:45:25,160 ERROR [IPC Server handler 4 on 59090] regionserver.HRegionServer(862):

org.apache.hadoop.hbase.NotServingRegionException: test,,1267069500020 is closing=true or
closed=true
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1787)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1907)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2010-02-24 19:45:25,175 DEBUG [main] client.HTable$ClientScanner(1928): Advancing internal
scanner to startKey at 'aaa'
2010-02-24 19:45:25,177 ERROR [IPC Server handler 2 on 59090] regionserver.HRegionServer(864):
Failed openScanner
org.apache.hadoop.hbase.NotServingRegionException: test,,1267069500020
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2279)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1858)
	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2010-02-24 19:45:30,186 DEBUG [main] client.HConnectionManager$TableServers(857): Removed
test,,1267069500020 for tableName=test from cache because of aaa
2010-02-24 19:45:30,188 DEBUG [main] client.HConnectionManager$TableServers(734): locateRegionInMeta
attempt 0 of 3 failed; retrying after sleep of 5000 because: No server address listed in .META.
for region test,,1267069516539
2010-02-24 19:45:34,656 INFO  [Thread-103] master.ServerManager$ServerMonitor(129): 1 region
servers, 0 dead, average load 2.0
2010-02-24 19:45:34,920 INFO  [RegionManager.rootScanner] master.BaseScanner(163): RegionManager.rootScanner
scanning meta region {server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey: <>}
2010-02-24 19:45:34,920 INFO  [RegionManager.metaScanner] master.BaseScanner(163): RegionManager.metaScanner
scanning meta region {server: 127.0.0.1:59090, regionname: .META.,,1, startKey: <>}
2010-02-24 19:45:34,932 DEBUG [RegionManager.metaScanner] master.BaseScanner(572): Current
assignment of test,,1267069516539 is not valid;  serverAddress=, startCode=0 unknown.
2010-02-24 19:45:34,934 DEBUG [RegionManager.metaScanner] master.BaseScanner(572): Current
assignment of test,lqg,1267069516539 is not valid;  serverAddress=, startCode=0 unknown.
2010-02-24 19:45:34,940 INFO  [RegionManager.rootScanner] master.BaseScanner(242): RegionManager.rootScanner
scan of 1 row(s) of meta region {server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey:
<>} complete
2010-02-24 19:45:34,941 INFO  [RegionManager.metaScanner] master.BaseScanner(242): RegionManager.metaScanner
scan of 3 row(s) of meta region {server: 127.0.0.1:59090, regionname: .META.,,1, startKey:
<>} complete
2010-02-24 19:45:34,941 INFO  [RegionManager.metaScanner] master.MetaScanner(132): All 1 .META.
region(s) scanned
2010-02-24 19:45:35,189 DEBUG [main] client.HConnectionManager$TableServers(734): locateRegionInMeta
attempt 1 of 3 failed; retrying after sleep of 5000 because: No server address listed in .META.
for region test,,1267069516539
2010-02-24 19:45:35,315 INFO  [IPC Server handler 4 on 60000] master.RegionManager(337): Assigning
region test,,1267069516539 to localhost,59090,1267069494948
2010-02-24 19:45:35,315 INFO  [IPC Server handler 4 on 60000] master.RegionManager(337): Assigning
region test,lqg,1267069516539 to localhost,59090,1267069494948
2010-02-24 19:45:35,316 INFO  [RegionServer:0] regionserver.HRegionServer(499): MSG_REGION_OPEN:
test,,1267069516539
2010-02-24 19:45:35,317 INFO  [RegionServer:0] regionserver.HRegionServer(499): MSG_REGION_OPEN:
test,lqg,1267069516539
{code}

2010-02-24 19:26:50,203 and 2010-02-24 19:26:50,209 are the new stuff.  See how subsequently
the daughter rows added by the fixup are assigned and the verification scan -- the test is
to prove scans can ride across splits -- completes and the unit tests fails.

I'm going to commit the patch unless objection.

This patch adds some load on .META.: two new Gets after every split.  Shouldn't be too bad.
 Splits are rare enough.

> META gets inconsistent in a number of crash scenarios
> -----------------------------------------------------
>
>                 Key: HBASE-2244
>                 URL: https://issues.apache.org/jira/browse/HBASE-2244
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.20.4
>
>         Attachments: 2244-v3.patch, 2244-v5.patch, 2244.patch
>
>
> (Forking this issue off from HBASE-2235).
> During load testing, in a number of failure scenarios (unexpected region server deaths)
etc., we notice that META can get inconsistent. This primarily happens for regions which are
in the process of being split. Manually running add_table.rb seems to fix the tables meta
data just fine. 
> But it would be good to do automatic cleansing (as part of META scanners work) and/or
avoid these inconsistent states altogether.
> For example, for a particular startkey, I see all these entries:
> {code}
> test1,1204765,1266569946560 column=info:regioninfo, timestamp=1266581302018, value=REGION
=> {NAME => 'test1,
>                              1204765,1266569946560', STARTKEY => '1204765', ENDKEY
=> '1441091', ENCODED => 18
>                              19368969, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', COMPRESSION
=> 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}]}}
>  test1,1204765,1266569946560 column=info:server, timestamp=1266570029133, value=10.129.68.212:60020
>  test1,1204765,1266569946560 column=info:serverstartcode, timestamp=1266570029133, value=1266562597546
>  test1,1204765,1266569946560 column=info:splitB, timestamp=1266581302018, value=\x00\x071441091\x00\x00\x00\x0
>                              1\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x
>                              05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x
>                              00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00
>                              \x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSI
>                              ON\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TT
>                              L\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00
>                              \x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04t
>                              rueh\x0FQ\xCF
>  test1,1204765,1266581233447 column=info:regioninfo, timestamp=1266609172177, value=REGION
=> {NAME => 'test1,
>                              1204765,1266581233447', STARTKEY => '1204765', ENDKEY
=> '1290703', ENCODED => 13
>                              73493090, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', COMPRESSION
=> 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}]}}
>  test1,1204765,1266581233447 column=info:server, timestamp=1266604768670, value=10.129.68.213:60020
>  test1,1204765,1266581233447 column=info:serverstartcode, timestamp=1266604768670, value=1266562597511
>  test1,1204765,1266581233447 column=info:splitA, timestamp=1266609172177, value=\x00\x071226169\x00\x00\x00\x0
>                              1\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\
>                              x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xB9\xBD\xFEO
>  test1,1204765,1266581233447 column=info:splitB, timestamp=1266609172177, value=\x00\x071290703\x00\x00\x00\x0
>                              1\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\
>                              x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xE1\xDF\xF8p
>  test1,1204765,1266609171581 column=info:regioninfo, timestamp=1266609172212, value=REGION
=> {NAME => 'test1,
>                              1204765,1266609171581', STARTKEY => '1204765', ENDKEY
=> '1226169', ENCODED => 21
>                              34878372, TABLE => {{NAME => 'test1', FAMILIES =>
[{NAME => 'actions', VERSIONS =
>                              > '3', COMPRESSION => 'NONE', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMOR
>                              Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message