hbase-dev mailing list archives

From "Kannan Muthukkaruppan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2244) META gets inconsistent in a number of crash scenarios
Date Sun, 21 Feb 2010 18:55:27 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836422#action_12836422 ]

Kannan Muthukkaruppan commented on HBASE-2244:
----------------------------------------------

Stack wrote: <<<< In the .META. listing posted above, there are some interesting
issues. We still have a reference to a daughter, splitB, in the first offlined (row) region,
yet the next row is a daughter that has been offlined itself. There may be a race in here
if we're splitting fast. Let me check it out and see if a fix.>>>

Yes, I have seen several cases where nested splits happen before the offlined parent row
has been reaped. But perhaps that in itself isn't an issue. For example, corresponding to my
first .META. snippet in this JIRA:

The split of test1,1204765,1266569946560 was announced @4:08:

{code}
2010-02-19 04:08:23,764 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_SPLIT:
test1,1204765,1266569946560: Daughters; test1,1204765,1266581233447, test1,1290703,1266581233447
from test013.abcxyz.com,60020,1266562597546; 1 of 3
{code}
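
As an aside, the region names in these log lines can be pulled apart mechanically. The sketch below assumes the name layout is table,startkey,regionid and that the region id is the region's creation time in epoch milliseconds; both are assumptions about this HBase version rather than something stated in the logs, and the function names are made up for illustration.

```python
from datetime import datetime, timezone


def parse_region_name(name):
    """Split 'table,startkey,regionid' into (table, start_key, region_id)."""
    # Split the table off the front and the id off the back, so that a
    # start key containing commas would stay in one piece.
    table, rest = name.split(",", 1)
    start_key, region_id = rest.rsplit(",", 1)
    return table, start_key, int(region_id)


def region_created_utc(name):
    """Interpret the region id as epoch milliseconds (assumption)."""
    _, _, region_id = parse_region_name(name)
    # Integer seconds only; the millisecond part is dropped.
    return datetime.fromtimestamp(region_id // 1000, tz=timezone.utc)


daughter = "test1,1204765,1266581233447"
table, start_key, region_id = parse_region_name(daughter)
created = region_created_utc(daughter)
```

Under that assumption the daughter's id decodes to 12:07:13 UTC on 2010-02-19, about a minute before the 04:08 report if the master's log clock is UTC-8 (the log does not state its timezone).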

But reclaiming the offlined parent row from .META. took time. First, at about 11:53, we
detected that one of the daughters no longer referenced it:
{code}
2010-02-19 11:53:46,673 DEBUG org.apache.hadoop.hbase.master.BaseScanner: test1,1204765,1266581233447/1373493090
no longer has references to test1,1204765,1266569946560
{code}

The second daughter followed at about 14:01. Only at this point did we delete the offlined
parent row:
{code}
2010-02-19 14:01:48,283 DEBUG org.apache.hadoop.hbase.master.BaseScanner: test1,1290703,1266581233447/580635726
no longer has references to test1,1204765,1266569946560
2010-02-19 14:01:48,299 INFO org.apache.hadoop.hbase.master.BaseScanner: Deleting region test1,1204765,1266569946560
(encoded=1819368969) because daughter splits no longer hold references
{code}
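
The rule these log lines suggest is that the parent row is deleted only once neither daughter holds a reference to it. Here is a minimal model of that condition; it is a simplification for illustration, not the actual BaseScanner code, and the function and variable names are hypothetical.

```python
def parent_is_reapable(daughter_refs):
    """Return True once no daughter still references the parent.

    daughter_refs maps a daughter region name to True while that
    daughter still holds a reference to the parent.
    """
    return not any(daughter_refs.values())


# Daughters named in the MSG_REPORT_SPLIT line for test1,1204765,1266569946560.
daughter_refs = {
    "test1,1204765,1266581233447": True,
    "test1,1290703,1266581233447": True,
}

# About 11:53: the first daughter no longer has references.
daughter_refs["test1,1204765,1266581233447"] = False
after_first = parent_is_reapable(daughter_refs)

# About 14:01: the second daughter is clear as well.
daughter_refs["test1,1290703,1266581233447"] = False
after_second = parent_is_reapable(daughter_refs)
```

In this model the parent stays in .META. after the first daughter clears and becomes deletable only after the second one does, which matches the order of the log lines.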

Naturally, given this wide window (almost ten hours in this case), it is not uncommon to see
rows corresponding to nested splits in .META. In most of these cases, the .META. eventually
seems to fix itself. But it still seems odd to me that it takes so much time.
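
To put a number on the delay, the timestamps at the start of the master log lines can be subtracted directly. This is a small sketch that assumes every line begins with a 23-character 'YYYY-MM-DD HH:MM:SS,mmm' timestamp, as the lines quoted in this comment do; the helper name is made up.

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


split_reported = parse_log_timestamp(
    "2010-02-19 04:08:23,764 INFO org.apache.hadoop.hbase.master.ServerManager: "
    "Processing MSG_REPORT_SPLIT"
)
first_daughter_clear = parse_log_timestamp(
    "2010-02-19 11:53:46,673 DEBUG org.apache.hadoop.hbase.master.BaseScanner"
)
parent_deleted = parse_log_timestamp(
    "2010-02-19 14:01:48,299 INFO org.apache.hadoop.hbase.master.BaseScanner: "
    "Deleting region"
)

until_first_clear = first_daughter_clear - split_reported
until_deleted = parent_deleted - split_reported
print(until_first_clear, until_deleted)
```

For the three lines quoted above, this gives roughly 7 hours 45 minutes until the first daughter is clear and 9 hours 53 minutes until the parent row is deleted.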

During one of these situations, I saw the client get errors of the form:

10/02/19 09:09:37 INFO tests.MultiThreadedWriter: [22] Users = 1052116, mails = 1M, time = 10:10:53
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 10.129.68.212:60020 for region test1,1204765,1266581233447, row '1232785', but failed after 10 attempts.

and assumed that this was related to the .META. being in a weird state (i.e. the offlined
parent not being deleted). But looking at the logs, these client errors happened during a
shorter period (8:49 to 9:09) and were likely due to other load issues on that particular
region server. I will post any findings from that RS's logs shortly.






> META gets inconsistent in a number of crash scenarios
> -----------------------------------------------------
>
>                 Key: HBASE-2244
>                 URL: https://issues.apache.org/jira/browse/HBASE-2244
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.20.4
>
>
> (Forking this issue off from HBASE-2235).
> During load testing, in a number of failure scenarios (unexpected region server deaths) etc., we notice that META can get inconsistent. This primarily happens for regions which are in the process of being split. Manually running add_table.rb seems to fix the tables meta data just fine.
> But it would be good to do automatic cleansing (as part of META scanners work) and/or avoid these inconsistent states altogether.
> For example, for a particular startkey, I see all these entries:
> {code}
> test1,1204765,1266569946560 column=info:regioninfo, timestamp=1266581302018, value=REGION
=> {NAME => 'test1,
>                              1204765,1266569946560', STARTKEY => '1204765', ENDKEY
=> '1441091', ENCODED => 18
>                              19368969, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', COMPRESSION
=> 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}]}}
>  test1,1204765,1266569946560 column=info:server, timestamp=1266570029133, value=10.129.68.212:60020
>  test1,1204765,1266569946560 column=info:serverstartcode, timestamp=1266570029133, value=1266562597546
>  test1,1204765,1266569946560 column=info:splitB, timestamp=1266581302018, value=\x00\x071441091\x00\x00\x00\x0
>                              1\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x
>                              05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x
>                              00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00
>                              \x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSI
>                              ON\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TT
>                              L\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00
>                              \x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04t
>                              rueh\x0FQ\xCF
>  test1,1204765,1266581233447 column=info:regioninfo, timestamp=1266609172177, value=REGION
=> {NAME => 'test1,
>                              1204765,1266581233447', STARTKEY => '1204765', ENDKEY
=> '1290703', ENCODED => 13
>                              73493090, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', COMPRESSION
=> 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
=> 'true'}]}}
>  test1,1204765,1266581233447 column=info:server, timestamp=1266604768670, value=10.129.68.213:60020
>  test1,1204765,1266581233447 column=info:serverstartcode, timestamp=1266604768670, value=1266562597511
>  test1,1204765,1266581233447 column=info:splitA, timestamp=1266609172177, value=\x00\x071226169\x00\x00\x00\x0
>                              1\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\
>                              x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xB9\xBD\xFEO
>  test1,1204765,1266581233447 column=info:splitB, timestamp=1266609172177, value=\x00\x071290703\x00\x00\x00\x0
>                              1\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\
>                              x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xE1\xDF\xF8p
>  test1,1204765,1266609171581 column=info:regioninfo, timestamp=1266609172212, value=REGION
=> {NAME => 'test1,
>                              1204765,1266609171581', STARTKEY => '1204765', ENDKEY
=> '1226169', ENCODED => 21
>                              34878372, TABLE => {{NAME => 'test1', FAMILIES =>
[{NAME => 'actions', VERSIONS =
>                              > '3', COMPRESSION => 'NONE', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMOR
>                              Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

