hbase-issues mailing list archives

From "Umesh Agashe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-18261) [AMv2] Create new RecoverMetaProcedure and use it from ServerCrashProcedure and HMaster.finishActiveMasterInitialization()
Date Mon, 26 Jun 2017 22:38:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063911#comment-16063911
] 

Umesh Agashe edited comment on HBASE-18261 at 6/26/17 10:37 PM:
----------------------------------------------------------------

Looking deeper into the code, I found that meta recovery from
* ServerCrashProcedure and
* HMaster.finishActiveMasterInitialization()

is synchronized on event serverCrashProcessingEnabled in HMaster and not on event metaInitEvent
in AssignmentManager.

Every step in ServerCrashProcedure waits on event serverCrashProcessingEnabled, making sure
meta is recovered before proceeding further. There is a bug in the test framework; patch HBASE-18261.master.001.patch
fixes it by resetting the serverCrashProcessingEnabled event to false in the simulated master stop().
Restarting HMaster sets it back to true.
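
To illustrate the gating described above, here is a minimal sketch in plain Java. It is not the HBase code: GateEvent, SimulatedMaster, tryRunCrashStep() and stepsRun() are hypothetical names made up for this example, and the real event and procedure classes differ. It only shows why a simulated stop() that leaves the event set to true lets crash-procedure steps run before meta is recovered again.

```java
/** Hypothetical stand-in for the serverCrashProcessingEnabled event (not the HBase class). */
final class GateEvent {
    private boolean ready;

    synchronized boolean isReady() { return ready; }

    synchronized void set(boolean value) { ready = value; }
}

/** Hypothetical model of a master whose crash-procedure steps are gated on the event. */
final class SimulatedMaster {
    private final GateEvent serverCrashProcessingEnabled = new GateEvent();
    private int stepsRun;

    /** Assumption: by the time start() returns, meta recovery has finished. */
    void start() {
        serverCrashProcessingEnabled.set(true);
    }

    /** Models the reset in the simulated stop(): clear the event so steps wait again. */
    void stop() {
        serverCrashProcessingEnabled.set(false);
    }

    /** Returns true if the step ran, false if it has to wait for the event. */
    boolean tryRunCrashStep(String stepName) {
        if (!serverCrashProcessingEnabled.isReady()) {
            return false; // meta not recovered yet, step must wait
        }
        stepsRun++;
        return true;
    }

    int stepsRun() { return stepsRun; }
}

final class EventGateDemo {
    public static void main(String[] args) {
        SimulatedMaster master = new SimulatedMaster();
        System.out.println("before start: " + master.tryRunCrashStep("processMeta"));
        master.start();
        System.out.println("after start:  " + master.tryRunCrashStep("processMeta"));
        master.stop();
        System.out.println("after stop:   " + master.tryRunCrashStep("processMeta"));
    }
}
```

Without the reset in stop(), the third call in the sketch would run the step even though the master has not been restarted.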

I've verified the fix by running the test 10 times in a loop in my dev environment. Without
the fix, the test fails 4/20 times. With the fix, it passes all 20 times.

There is still a problem with the code: meta is reassigned by HMaster.finishActiveMasterInitialization()
when assignMeta() is called, but when ServerCrashProcedure is resumed after restart, it will
try to process and reassign meta based on the in-memory state of meta at a point when meta is
not fully loaded.

De-duplicating this logic and creating RecoverMetaProcedure is the long-term solution, and it also improves
code readability. For now, patch HBASE-18261.master.001.patch fixes and enables test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta().

Going forward, we can either:
* Keep this JIRA open for tracking the RecoverMetaProcedure work, reduce its priority
(improvement), and create a new JIRA for the test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta()
fix; or
* Re-purpose this JIRA by changing the title to enabling unit test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta().

Let me know your thoughts. Thanks!


> [AMv2] Create new RecoverMetaProcedure and use it from ServerCrashProcedure and HMaster.finishActiveMasterInitialization()
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18261
>                 URL: https://issues.apache.org/jira/browse/HBASE-18261
>             Project: HBase
>          Issue Type: Improvement
>          Components: amv2
>    Affects Versions: 2.0.0-alpha-1
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>             Fix For: 2.0.0-alpha-2
>
>         Attachments: HBASE-18261.master.001.patch
>
>
> When unit test hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta()
> is enabled and run several times, it fails intermittently. The cause is that meta recovery is done
> in two different places:
> * ServerCrashProcedure.processMeta()
> * HMaster.finishActiveMasterInitialization()
> and it's not coordinated.
> When HMaster.finishActiveMasterInitialization() submits splitMetaLog() first, a call from
> ServerCrashProcedure.processMeta() made while it is still running fails, causing the step to be
> retried in a loop.
> When ServerCrashProcedure.processMeta() submits splitMetaLog after the splitMetaLog from
> HMaster.finishActiveMasterInitialization() has finished, success is returned without doing
> any work.
> But if ServerCrashProcedure.processMeta() submits a splitMetaLog request and, while it is
> in progress, HMaster.finishActiveMasterInitialization() submits one too, the test fails with an exception.
> [~stack] and I discussed a possible solution:
> Create RecoverMetaProcedure and call it where required. The procedure framework provides
> mutual exclusion and requires idempotence, which should fix the problem.
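
The two properties named in the description, mutual exclusion and idempotence, can be sketched in plain Java as below. This is only an illustration under stated assumptions: MetaRecoverySketch, recoverMeta() and splitLogRuns() are hypothetical names, the counter stands in for the real split-log and assign work, and nothing here reflects how RecoverMetaProcedure or the procedure framework is actually implemented.

```java
/** Hypothetical sketch of a single, shared meta-recovery entry point (not HBase code). */
final class MetaRecoverySketch {
    private boolean metaRecovered;
    private int splitLogRuns;

    /**
     * Mutual exclusion: synchronized lets only one caller recover at a time.
     * Idempotence: a caller that arrives after recovery is done does no work.
     * Returns true if this call performed the recovery.
     */
    synchronized boolean recoverMeta() {
        if (metaRecovered) {
            return false; // already recovered, nothing to do
        }
        splitLogRuns++;       // stands in for splitting the meta log and assigning meta
        metaRecovered = true;
        return true;
    }

    synchronized int splitLogRuns() { return splitLogRuns; }
}

final class MetaRecoveryDemo {
    public static void main(String[] args) throws InterruptedException {
        MetaRecoverySketch recovery = new MetaRecoverySketch();
        // Two callers race, standing in for the crash procedure and master initialization.
        Thread crashProcedure = new Thread(() -> recovery.recoverMeta());
        Thread masterInit = new Thread(() -> recovery.recoverMeta());
        crashProcedure.start();
        masterInit.start();
        crashProcedure.join();
        masterInit.join();
        System.out.println("split log runs: " + recovery.splitLogRuns());
    }
}
```

Whichever caller gets in first does the work; the other returns without touching the log, so neither the retry loop nor the exception case described above arises in this model.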



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
