hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14012) Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
Date Thu, 02 Jul 2015 20:10:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612446#comment-14612446 ]

stack commented on HBASE-14012:
-------------------------------

Here is a bit of log:

{code}
2015-06-09 20:06:20,270 INFO  [c2020:16000.activeMasterManager] master.ServerManager: AssignmentManager
hasn't finished failover cleanup; waiting
2015-06-09 20:06:20,272 INFO  [c2020:16000.activeMasterManager] master.HMaster: hbase:meta
with replicaId 0 assigned=0, rit=false, location=c2025.halxg.cloudera.com,16020,1433892619022
2015-06-09 20:06:20,295 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/c607a47967fd4873135f38e883156e4d/big
2015-06-09 20:06:20,295 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/1a5a90047a76da6dddebb5aff0acb275/big
2015-06-09 20:06:20,342 DEBUG [hconnection-0x680c3bc0-shared--pool3-t1] ipc.RpcClientImpl:
Use SIMPLE authentication for service ClientService, sasl=false
2015-06-09 20:06:20,342 DEBUG [hconnection-0x680c3bc0-shared--pool3-t1] ipc.RpcClientImpl:
Connecting to c2025.halxg.cloudera.com/10.20.84.31:16020
2015-06-09 20:06:20,376 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/c607a47967fd4873135f38e883156e4d/tiny
2015-06-09 20:06:20,379 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/1a5a90047a76da6dddebb5aff0acb275/tiny
2015-06-09 20:06:20,383 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/d586e9037f683384411ab2663e31f97b/big
2015-06-09 20:06:20,383 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/ce4ebb9a375a1fe4b5777d2d960c940c/big
2015-06-09 20:06:20,420 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/d586e9037f683384411ab2663e31f97b/tiny
2015-06-09 20:06:20,421 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/6dc837d3ec4e2afd05314472ee17ca80/big
2015-06-09 20:06:20,422 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/ce4ebb9a375a1fe4b5777d2d960c940c/tiny
2015-06-09 20:06:20,423 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/6fbe22ff15c2e5f2b207f79eaf8f382a/big
2015-06-09 20:06:20,453 DEBUG [ProcedureExecutorThread-10] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/6fbe22ff15c2e5f2b207f79eaf8f382a/tiny

...

2015-06-09 20:06:20,795 DEBUG [ProcedureExecutorThread-4] regionserver.HRegionFileSystem:
No StoreFiles for: hdfs://c2020.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/0983b02ec079ea8ac2fb2901dbe2a6fb/tiny
2015-06-09 20:06:20,797 INFO  [ProcedureExecutorThread-4] master.AssignmentManager: Bulk assigning
9 region(s) across 5 server(s), round-robin=true
....
2015-06-09 20:06:20,909 INFO  [c2020:16000.activeMasterManager] master.AssignmentManager:
Found regions out on cluster or in RIT; presuming failover
{code}

It's the bulk assign at the end there that is assigning regions already out on the cluster.
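
To make the ordering in the log explicit, here is a minimal sketch that compares the timestamps of two log lines. It assumes the log4j-style `yyyy-MM-dd HH:mm:ss,SSS` prefix seen above; the class and method names (`LogOrderSketch`, `timestampOf`, `assignedBeforeFailoverDecision`) are made up for illustration and are not part of HBase.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for reading the log excerpt; not HBase code. */
final class LogOrderSketch {
  // Assumption: every log line starts with a 23-character timestamp
  // such as "2015-06-09 20:06:20,797".
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

  /** Parses the leading timestamp of a log line. */
  static LocalDateTime timestampOf(String logLine) {
    return LocalDateTime.parse(logLine.substring(0, 23), FMT);
  }

  /**
   * Returns true if the bulk-assign line was logged before the line
   * in which the master decides that it is in failover.
   */
  static boolean assignedBeforeFailoverDecision(String bulkAssignLine, String failoverLine) {
    return timestampOf(bulkAssignLine).isBefore(timestampOf(failoverLine));
  }
}
```

Fed the "Bulk assigning" line (20:06:20,797) and the "presuming failover" line (20:06:20,909) from the excerpt, the check reports that the procedure thread did its bulk assign before the active master manager had decided it was in failover.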

> Double Assignment and Dataloss when ServerCrashProcedure runs during Master failover
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-14012
>                 URL: https://issues.apache.org/jira/browse/HBASE-14012
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 2.0.0, 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>
> ITBLL. Master comes up. It is joining a running cluster (all servers up except Master,
> with most regions assigned out on cluster). ProcedureStore has two unfinished ServerCrashProcedures
> (RUNNABLE state). In SCP, we only check for failover in the first step, not in every step, which
> means a ServerCrashProcedure will run if, on reload, it is already beyond the first step.
> {code}
>     // Is master fully online? If not, yield. No processing of servers unless master is up
>     if (!services.getAssignmentManager().isFailoverCleanupDone()) {
>       throwProcedureYieldException("Waiting on master failover to complete");
>     }
> {code}
> There is no definitive logging, but it looks like we start running at the assign step.
> The regions to assign were persisted before the master crash. The regions to assign may not make
> sense post-crash: i.e. here we double-assign. Checking. We shouldn't run until master is fully
> up regardless.
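
The description's point that the failover check should apply to every step, not only the first, can be sketched roughly as below. This is a toy state machine under stated assumptions, not the actual ServerCrashProcedure: the class `StepGuardSketch`, its `Step` enum, and `YieldException` are all hypothetical stand-ins, and the `BooleanSupplier` plays the role of `isFailoverCleanupDone()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a multi-step crash procedure; not HBase code. */
final class StepGuardSketch {
  /** Illustrative step names only; the real procedure has its own states. */
  enum Step { START, SPLIT_LOGS, ASSIGN, FINISH, DONE }

  /** Signals that the current step should be retried later. */
  static final class YieldException extends RuntimeException {
    YieldException(String message) { super(message); }
  }

  private Step current;
  private final BooleanSupplier failoverCleanupDone;
  /** Steps that actually ran, recorded for inspection. */
  final List<Step> executed = new ArrayList<>();

  /** resumeAt models a procedure reloaded from the store mid-flight. */
  StepGuardSketch(Step resumeAt, BooleanSupplier failoverCleanupDone) {
    this.current = resumeAt;
    this.failoverCleanupDone = failoverCleanupDone;
  }

  /** Runs one step. The guard is evaluated on every call, whatever the step. */
  void executeOneStep() {
    if (current == Step.DONE) {
      return;
    }
    if (!failoverCleanupDone.getAsBoolean()) {
      throw new YieldException("Waiting on master failover to complete");
    }
    executed.add(current);
    current = Step.values()[current.ordinal() + 1];
  }

  Step current() { return current; }
}
```

A procedure reloaded at `ASSIGN` while the supplier still returns false yields without doing any work; once the supplier returns true, the same call runs the step and advances.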



