hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19515) Region server left in online servers list forever if it went down after registering to master and before creating ephemeral node
Date Fri, 15 Dec 2017 06:03:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292074#comment-16292074
] 

ramkrishna.s.vasudevan commented on HBASE-19515:
------------------------------------------------

bq.After HBASE-18946, where we do round robin on table create – a desirable attribute –
via the balancer so all is kosher, the test TestRSKilledWhenInitializing now starts to fail
because we chose the hobbled server most of the time
Even before HBASE-18946 this was happening the same way, correct? Only the place where we do round
robin changed? I have not dug into this the way you have, so I am just asking.
Your point may well be right; I just want to understand.

> Region server left in online servers list forever if it went down after registering to
master and before creating ephemeral node
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-19515
>                 URL: https://issues.apache.org/jira/browse/HBASE-19515
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: stack
>            Priority: Critical
>             Fix For: 2.0.0
>
>
> This one is interesting. It was supposedly fixed a long time ago back in HBASE-9593 (The
issue has same subject as this one) but there was a problem w/ the fix reported later, post-commit,
long after the issue was closed. The 'fix' was registering ephemeral node in ZK BEFORE reporting
in to the Master for the first time. The problem w/ this approach is that the Master tells
the RS what name it should use reporting in. If we register in ZK before we talk to the Master,
the name in ZK and the one the RS ends up using could deviate.
> In hbase2, we do the right thing, registering the ephemeral node after we report to the
Master. So the issue reported in HBASE-9593 is back: an RS that dies between reporting to the Master
and registering up in ZK stays registered at the Master forever, and we'll keep trying
to assign it regions. It's a real problem.
> That hbase2 has this issue has been masked up until now. The test that was written
for HBASE-9593, TestRSKilledWhenInitializing, is a good test but a little sloppy. It puts
up two RSs, aborting one after it registers at the Master but before it posts to ZK. That leaves
one healthy server up. It is hosting hbase:meta. This is enough for the test to bluster through.
The only assign it does is of the namespace table, which goes to the hbase:meta server. If the test
created a new table and did roundrobin, it'd fail.
> After HBASE-18946, where we do round robin on table create -- a desirable attribute --
via the balancer so all is kosher, the test TestRSKilledWhenInitializing now starts to fail
because we chose the hobbled server most of the time.
> So, this issue is about fixing the original issue properly for hbase2. We don't yet have
a timeout on assign in AMv2; that might be the fix. Or perhaps a double report before
we online a server, with the second report coming in after the ZK node goes up (or we stop doing
ephemeral nodes for RSs up in ZK and just rely on heartbeats....).
> Making this a critical issue.
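
The "double report" idea floated in the description above can be sketched as a toy model. This is not HBase code: ToyMaster, reportForDuty, confirmZkRegistration, and the two-state enum are hypothetical names made up for illustration. The sketch assumes the master simply withholds assignments from a server until a second report arrives, sent only after the ephemeral node exists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical, not the HBase API): a server is assignable only after two reports. */
class ToyMaster {
    enum State { REPORTED, ONLINE }

    private final Map<String, State> servers = new LinkedHashMap<>();

    /** First report: the master learns about the server but does not assign to it yet. */
    void reportForDuty(String serverName) {
        servers.putIfAbsent(serverName, State.REPORTED);
    }

    /** Second report, assumed to be sent only after the ephemeral node was created in ZK. */
    void confirmZkRegistration(String serverName) {
        if (servers.containsKey(serverName)) {
            servers.put(serverName, State.ONLINE);
        }
    }

    /** Only servers that completed both reports are assignment targets. */
    List<String> assignableServers() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, State> e : servers.entrySet()) {
            if (e.getValue() == State.ONLINE) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Round-robin the given regions over the assignable servers only. */
    Map<String, String> roundRobinAssign(List<String> regions) {
        List<String> targets = assignableServers();
        if (targets.isEmpty()) {
            throw new IllegalStateException("no online servers to assign to");
        }
        Map<String, String> plan = new LinkedHashMap<>();
        for (int i = 0; i < regions.size(); i++) {
            plan.put(regions.get(i), targets.get(i % targets.size()));
        }
        return plan;
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();
        master.reportForDuty("rs1");
        master.confirmZkRegistration("rs1"); // healthy server: both reports arrive
        master.reportForDuty("rs2");         // hobbled server: dies before its ZK node goes up
        // rs2 never reaches ONLINE, so every region in the plan lands on rs1.
        System.out.println(master.roundRobinAssign(List.of("r1", "r2", "r3")));
    }
}
```

In this model the hobbled server never becomes a round-robin target, which is the property TestRSKilledWhenInitializing would need. It leaves open how a server stuck in the REPORTED state would eventually be expired; a timeout is one option.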



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
