hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-19515) Region server left in online servers list forever if it went down after registering to master and before creating ephemeral node
Date Thu, 14 Dec 2017 19:14:00 GMT
stack created HBASE-19515:

             Summary: Region server left in online servers list forever if it went down after registering to master and before creating ephemeral node
                 Key: HBASE-19515
                 URL: https://issues.apache.org/jira/browse/HBASE-19515
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
            Reporter: stack
            Priority: Critical
             Fix For: 2.0.0

This one is interesting. It was supposedly fixed a long time ago in HBASE-9593 (that issue
has the same subject as this one), but a problem with the fix was reported later, post-commit,
long after the issue was closed. The 'fix' was to register the ephemeral node in ZK BEFORE reporting
in to the Master for the first time. The problem with this approach is that the Master tells
the RS what name it should use when reporting in. If we register in ZK before we talk to the Master,
the name in ZK and the one the RS ends up using could differ.

In hbase2, we do the right thing and register the ephemeral node after we report to the Master.
So the issue reported in HBASE-9593 is back: an RS that dies between reporting to the Master and
registering in ZK stays registered at the Master forever, and we'll keep trying to
assign it regions. It's a real problem.
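
To make the window concrete, here is a toy model of the ordering described above. It is not HBase code: the class, the two sets, and the rule that the Master only drops a server when its ephemeral node goes away are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only; every name here is hypothetical, not an HBase class or method.
class StaleRegistrationDemo {
    // Servers the master believes are online (added on the first report).
    static final Set<String> masterOnline = new HashSet<>();
    // Ephemeral nodes currently present in the coordination service.
    static final Set<String> ephemeralNodes = new HashSet<>();

    // Step 1: the server reports in and the master puts it on the online list.
    static void reportToMaster(String server) {
        masterOnline.add(server);
    }

    // Step 2: the server creates its ephemeral node.
    static void createEphemeralNode(String server) {
        ephemeralNodes.add(server);
    }

    // A crash removes the ephemeral node if one exists. In this model the
    // master expires a server only in reaction to that node deletion.
    static void crash(String server) {
        if (ephemeralNodes.remove(server)) {
            masterOnline.remove(server);
        }
    }

    // Runs the startup sequence, crashes the server, and returns true if the
    // master still lists the dead server as online afterwards.
    static boolean startThenCrash(String server, boolean crashBeforeNode) {
        reportToMaster(server);
        if (!crashBeforeNode) {
            createEphemeralNode(server);
        }
        crash(server);
        return masterOnline.contains(server);
    }

    public static void main(String[] args) {
        System.out.println("stale after clean startup: " + startThenCrash("rs1", false));
        System.out.println("stale after early crash:   " + startThenCrash("rs2", true));
    }
}
```

In this sketch the server that dies before step 2 never produces a node deletion, so nothing ever triggers its removal from the online list.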

That hbase2 has this issue has been masked up until now. The test that was written for
HBASE-9593, TestRSKilledWhenInitializing, is a good test but a little sloppy. It puts up two
RSs and aborts one of them after it registers at the Master but before it posts to ZK. That leaves one
healthy server up, and it is hosting hbase:meta. This is enough for the test to bluster through:
the only assign it does is the namespace table, which goes to the hbase:meta server. If the test
created a new table and did round robin, it'd fail.

After HBASE-18946, where we do round robin on table create -- a desirable attribute -- via
the balancer so all is kosher, the test TestRSKilledWhenInitializing now starts to fail because
we choose the hobbled server most of the time.
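
A rough illustration of why round robin exposes the stale entry: once a dead server sits in the candidate list, a plain rotation hands it regions. This is a simplified sketch, not the balancer's actual logic; the class, method names, and server names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch; names are hypothetical and this is not the HBase balancer.
class RoundRobinDemo {
    // Assigns regions 0..regionCount-1 to servers in strict rotation.
    static List<String> assign(List<String> servers, int regionCount) {
        List<String> plan = new ArrayList<>();
        for (int i = 0; i < regionCount; i++) {
            plan.add(servers.get(i % servers.size()));
        }
        return plan;
    }

    // Counts how many regions in the plan landed on the given server.
    static int countOn(List<String> plan, String server) {
        int n = 0;
        for (String s : plan) {
            if (s.equals(server)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // "stale-rs" stands in for the server that died before creating its node.
        List<String> online = Arrays.asList("healthy-rs", "stale-rs");
        List<String> plan = assign(online, 4);
        System.out.println("plan: " + plan);
        System.out.println("regions sent to the dead server: " + countOn(plan, "stale-rs"));
    }
}
```

Any region routed to the stale entry can never open, which is the kind of failure the test now runs into.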

So, this issue is about fixing the original issue properly for hbase2. We don't have a timeout
on assign in AMv2, not yet; that might be the fix. Or perhaps a double report before we online
a server, with the second report coming in after the ephemeral node goes up in ZK (or we stop doing
ephemeral nodes for RSs in ZK and just rely on heartbeats...).
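
One way the double-report idea could look, purely as a sketch: the Master parks a server in a pending set on its first report and only treats it as assignable after a second report that the RS sends once its ephemeral node exists. The class and method names below are hypothetical and do not come from the HBase code base.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a two-step registration; all names are hypothetical.
class TwoPhaseRegistration {
    private final Set<String> pending = new HashSet<>();
    private final Set<String> online = new HashSet<>();

    // First report: remember the server, but do not hand it regions yet.
    void firstReport(String server) {
        pending.add(server);
    }

    // Second report, sent only after the ephemeral node has been created:
    // promote the server to online. Returns false for an unknown server.
    boolean secondReport(String server) {
        if (!pending.remove(server)) {
            return false;
        }
        online.add(server);
        return true;
    }

    // Only fully registered servers are candidates for assignment.
    boolean isAssignable(String server) {
        return online.contains(server);
    }

    public static void main(String[] args) {
        TwoPhaseRegistration master = new TwoPhaseRegistration();

        master.firstReport("rs1");
        master.secondReport("rs1"); // rs1 created its node and reported again

        master.firstReport("rs2");  // rs2 dies here, before creating its node

        System.out.println("rs1 assignable: " + master.isAssignable("rs1"));
        System.out.println("rs2 assignable: " + master.isAssignable("rs2"));
    }
}
```

A server that dies between the two reports stays in the pending set and never receives regions. The pending set would still need its own cleanup, for example a timeout, which this sketch leaves out.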

Making this a critical issue.

This message was sent by Atlassian JIRA