hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4015) Refactor the TimeoutMonitor to make it less racy
Date Fri, 12 Aug 2011 08:52:27 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084007#comment-13084007
] 

ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------

bq. You are working on TRUNK Ram?
Yes Stack

bq. Won't your code have to check for both REALLOCATE and OFFLINE and the presence of either
mean it's ok to proceed to OPENING (and then aren't REALLOCATE and OFFLINE the 'same' state
because the presence of either will mean proceed to OPENING?).

Yes, this is what my patch does.  But why do we perform the same operation for both states?
Previously, if we found the node in any state other than OFFLINE while moving to OPENING,
we aborted. RE_ALLOCATE is an additional state which says it is ok to go to OPENING
if you find me in RE_ALLOCATE and the server name in me is the same as your RS address. This
avoids the problem of a region unnecessarily getting hijacked even though the RS was doing
its work correctly.
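
The check described above can be sketched roughly as follows. This is only an illustration,
not code from the patch: the class OpeningCheck, the enum NodeState, the UnassignedNode
holder and the method canMoveToOpening are hypothetical names, and the real check works
against the unassigned znode in ZooKeeper.

```java
import java.util.Objects;

public class OpeningCheck {
    // Hypothetical subset of the node states discussed in this thread.
    enum NodeState { OFFLINE, RE_ALLOCATE, OPENING, OPENED }

    // Hypothetical view of the unassigned node: its state and the server name stored in it.
    static final class UnassignedNode {
        final NodeState state;
        final String serverName;

        UnassignedNode(NodeState state, String serverName) {
            this.state = state;
            this.serverName = serverName;
        }
    }

    // Returns true when the region server may transition the node to OPENING.
    static boolean canMoveToOpening(UnassignedNode node, String myServerName) {
        if (node.state == NodeState.OFFLINE) {
            return true; // the only state accepted before the patch
        }
        // Additional case: the master re-allocated the region to this same RS.
        return node.state == NodeState.RE_ALLOCATE
            && Objects.equals(node.serverName, myServerName);
    }

    public static void main(String[] args) {
        UnassignedNode sameRs = new UnassignedNode(NodeState.RE_ALLOCATE, "rs1");
        UnassignedNode otherRs = new UnassignedNode(NodeState.RE_ALLOCATE, "rs2");
        System.out.println(canMoveToOpening(sameRs, "rs1"));
        System.out.println(canMoveToOpening(otherRs, "rs1"));
    }
}
```

Any other state, or RE_ALLOCATE carrying a different server name, still makes the RS give up
on the region in this sketch.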

bq. So, why not just add machine name to OFFLINE? Then we don't need REALLOCATE state?
As you have already pointed out, currently no version is passed from the master to the
RS. That is why a new state is needed.  If that had been possible, OFFLINE with the version
passed by the master would have been sufficient.

bq. So, figuring how to deal with timeout of regions in PENDING_OPEN is one aspect of this
issue, right? The verification of state over in timeout monitor before acting is another aspect?
Yes, Stack. We have covered both these aspects and also the points raised by JD: taking action
on timeout immediately, and a mechanism for both master and RS to know what happened as part
of the timeout, so that whoever wins the race succeeds.
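
The "whoever wins the race succeeds" idea can be illustrated with a compare-and-set on the
node state. This is only a sketch under assumptions: RaceSketch, State and transition are
hypothetical names, and an in-memory AtomicReference stands in for the unassigned znode,
which the real code updates through ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicReference;

public class RaceSketch {
    // Hypothetical subset of node states.
    enum State { OFFLINE, RE_ALLOCATE, OPENING }

    // In-memory stand-in for the unassigned node.
    private final AtomicReference<State> node = new AtomicReference<>(State.OFFLINE);

    // Attempts the transition; returns false if the other party changed the node first.
    boolean transition(State expected, State next) {
        return node.compareAndSet(expected, next);
    }

    State current() {
        return node.get();
    }

    public static void main(String[] args) {
        RaceSketch r = new RaceSketch();
        // The RS starts opening just before the timeout fires on the master.
        boolean rsWon = r.transition(State.OFFLINE, State.OPENING);
        boolean masterWon = r.transition(State.OFFLINE, State.RE_ALLOCATE);
        System.out.println("rs=" + rsWon + " master=" + masterWon + " node=" + r.current());
        // prints "rs=true master=false node=OPENING"
    }
}
```

The loser learns from the failed transition that the other side acted first and can re-read
the node before deciding what to do.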

bq.(I believe it acts a little differently from 0.90 because of recent work done in here).

Regarding the timeout monitor, the one major change is that the CLOSING state node is now
created by the master itself, whereas in 0.90 it was created by the RS.  Apart from this I
have not found any big difference so far. As part of HBASE-4083 we have introduced return
types from OpenRegionHandler which take care of scenarios where a race condition happens
because the master changes the node to RE_ALLOCATE by the time the RS has moved to OPENED.



> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
>                 Key: HBASE-4015
>                 URL: https://issues.apache.org/jira/browse/HBASE-4015
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: HBASE-4015_1_trunk.patch, Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition generator,
mostly making things worse rather than better. It does its own thing for a while without
caring for what's happening in the rest of the master.
> The first thing that needs to happen is that the regions should not be processed in one
big batch, because that sometimes can take minutes to process (meanwhile a region that timed
out opening might have opened, then what happens is it will be reassigned by the TimeoutMonitor
generating the never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure how to do
it in a scalable way in this case.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
