hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4015) Refactor the TimeoutMonitor to make it less racy
Date Thu, 08 Sep 2011 14:55:08 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100367#comment-13100367 ]

ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------

@J-D

bq. You could also try doing a worst case cold startup by killing -9 all HBase components at
the same time (more or less) and then restarting them all (also after data was added). Finally
you could try setting a super low timeout setting, like 5 seconds, to trigger RIT timeouts
by the hundreds.

I ran the tests again, this time specifically with a 5-second timeout. I killed the cluster,
restarted it, then randomly killed RSs and also invoked the balancer command.
All the regions (4003 regions) came back, spread among the 3 RSs.
hbck also reported no inconsistencies.
{noformat}
***** The number of timed out regions **** 938
***** The number of timed out regions **** 270
***** The number of timed out regions **** 673
***** The number of timed out regions **** 269
***** The number of timed out regions **** 941
***** The number of timed out regions **** 942
***** The number of timed out regions **** 941
{noformat}

{noformat}
Summary:
  -ROOT- is okay.
    Number of regions: 1
    Deployed on:  HOST-10-18-52-253,60020,1315480076091
  .META. is okay.
    Number of regions: 1
    Deployed on:  HOST-10-18-52-253,60020,1315480076091
  testram2 is okay.
    Number of regions: 4001
    Deployed on:  HOST-10-18-52-108,60020,1315480229321 HOST-10-18-52-253,60020,1315480076091
0 inconsistencies detected.
Status: OK
{noformat}

> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
>                 Key: HBASE-4015
>                 URL: https://issues.apache.org/jira/browse/HBASE-4015
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: HBASE-4015_1_trunk.patch, HBASE-4015_2_trunk.patch, HBASE-4015_reprepared_trunk_2.patch, Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition generator, mostly making things worse rather than better. It does its own thing for a while without caring about what's happening in the rest of the master.
> The first thing that needs to happen is that the regions should not be processed in one big batch, because that can sometimes take minutes (meanwhile, a region that timed out opening might have opened, in which case the TimeoutMonitor will reassign it anyway, generating the never-ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure how to do that in a scalable way in this case.
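
To illustrate the check-then-act race described in the issue, here is a minimal sketch in Java. It is not the HBase implementation and not the attached patch: the class TimeoutMonitorSketch, its RegionState type, and the methods findTimedOut and reassignIfUnchanged are hypothetical names invented for this example. The assumption is that the monitor may collect timed-out candidates in one scan, but acts on a region only if that region's entry is still the exact one it examined, which here is done with ConcurrentHashMap.replace(key, oldValue, newValue).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not HBase's AssignmentManager or TimeoutMonitor code.
public class TimeoutMonitorSketch {
    enum State { PENDING_OPEN, OPEN }

    // Immutable record of a region's state and when it last changed (ms).
    static final class RegionState {
        final State state;
        final long stamp;

        RegionState(State state, long stamp) {
            this.state = state;
            this.stamp = stamp;
        }
    }

    private final ConcurrentHashMap<String, RegionState> regions =
        new ConcurrentHashMap<>();
    private final long timeoutMs;

    TimeoutMonitorSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called by the rest of the "master" whenever a region changes state.
    void update(String region, State state, long now) {
        regions.put(region, new RegionState(state, now));
    }

    State stateOf(String region) {
        return regions.get(region).state;
    }

    // Scan step: snapshot the regions stuck in PENDING_OPEN past the timeout.
    Map<String, RegionState> findTimedOut(long now) {
        Map<String, RegionState> timedOut = new HashMap<>();
        for (Map.Entry<String, RegionState> e : regions.entrySet()) {
            RegionState rs = e.getValue();
            if (rs.state == State.PENDING_OPEN && now - rs.stamp >= timeoutMs) {
                timedOut.put(e.getKey(), rs);
            }
        }
        return timedOut;
    }

    // Act step: reassign only if the entry is still the object seen in the scan.
    // RegionState does not override equals, so replace() compares by identity.
    boolean reassignIfUnchanged(String region, RegionState seen, long now) {
        return regions.replace(region, seen,
            new RegionState(State.PENDING_OPEN, now));
    }

    public static void main(String[] args) {
        TimeoutMonitorSketch monitor = new TimeoutMonitorSketch(5000);
        monitor.update("r1", State.PENDING_OPEN, 0);
        monitor.update("r2", State.PENDING_OPEN, 0);

        Map<String, RegionState> candidates = monitor.findTimedOut(6000);

        // r1 finishes opening after the scan but before the monitor acts.
        monitor.update("r1", State.OPEN, 6001);

        for (Map.Entry<String, RegionState> e : candidates.entrySet()) {
            boolean reassigned =
                monitor.reassignIfUnchanged(e.getKey(), e.getValue(), 6002);
            System.out.println(e.getKey() + " reassigned: " + reassigned);
        }
    }
}
```

In this sketch, a region that opens between the scan and the reassignment is skipped instead of being pushed back into PENDING_OPEN. A real master would also have to coordinate with ZooKeeper and the region servers, which this example leaves out.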

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
