From: "Enis Soztutar (JIRA)"
To: dev@hbase.apache.org
Date: Wed, 21 May 2014 02:01:41 +0000 (UTC)
Subject: [jira] [Created] (HBASE-11217) Race between SplitLogManager task creation + TimeoutMonitor

Enis Soztutar created HBASE-11217:
-------------------------------------

Summary: Race between SplitLogManager task creation + TimeoutMonitor
Key: HBASE-11217
URL: https://issues.apache.org/jira/browse/HBASE-11217
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Priority: Critical
Fix For: 0.99.0

Some time ago, we reported a test failure in HBASE-11036 in which already-split and already-merged regions came back to life, causing a split-brain over region boundaries and resulting in data loss. It turns out that the root cause was not concurrent online schema change plus region split/merge, but meta log splitting failing and the meta updates getting lost. This in turn caused the region split/merge information and the assignments to be lost, leading to large-scale data loss.

The logs below show that the split task for the meta log is created, but before the znode is created, the timeout thread kicks in and sees the unassigned task. It then does a get on the znode, which fails with NoNode (because the znode has not been created yet). This causes the task to be marked complete (setDone(path, SUCCESS)), which means the logs are lost. Meta is then assigned elsewhere (and opened with the same seqId as before), confirming data loss in meta.
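To make the interleaving concrete, here is a minimal, self-contained Java sketch of the race. All names are hypothetical and a plain map stands in for ZooKeeper; this illustrates only the ordering of events, it is not the actual SplitLogManager code:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race described above -- NOT the
// actual SplitLogManager source. Only the event ordering matters here.
public class SplitLogRaceSketch {
    static final Map<String, byte[]> znodes = new ConcurrentHashMap<>(); // fake ZK
    static final Map<String, String> tasks = new ConcurrentHashMap<>();  // in-memory tasks

    // Master side: register the task in memory, then create the znode
    // asynchronously (models the async ZK create whose callback lands late).
    static void enqueueSplitTask(String path) {
        tasks.put(path, "UNASSIGNED");     // corresponds to last_update = -1
        new Thread(() -> {
            sleep(50);                     // ZK round trip: this is the race window
            znodes.put(path, new byte[0]); // znode finally appears
        }).start();
    }

    // TimeoutMonitor side: an unassigned task triggers a get on the znode.
    // NoNode is taken to mean "already finished and cleaned up", so the task
    // is wrongly marked done and the WAL is never actually split.
    static void timeoutMonitorChore() {
        for (Map.Entry<String, String> e : tasks.entrySet()) {
            if ("UNASSIGNED".equals(e.getValue()) && !znodes.containsKey(e.getKey())) {
                System.out.println("task znode " + e.getKey() + " vanished.");
                tasks.put(e.getKey(), "SUCCESS"); // i.e. setDone(path, SUCCESS)
            }
        }
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        String path = "/hbase/splitWAL/example-wal.meta"; // hypothetical path
        enqueueSplitTask(path);
        timeoutMonitorChore(); // fires before the async create completes
        sleep(100);
        System.out.println("task=" + tasks.get(path) + ", znode exists=" + znodes.containsKey(path));
        // Prints task=SUCCESS even though the znode now exists and no log was split.
    }
}
{code}

Note that from the monitor's point of view a NoNode result is genuinely ambiguous: it cannot distinguish "not created yet" from "completed and deleted", and the logs below fall exactly into that window.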
{code}
2014-04-16 18:31:26,267 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] handler.MetaServerShutdownHandler: Splitting hbase:meta logs for hor13n03.gq1.ygridcore.net,60020,1397672668647
2014-04-16 18:31:26,274 DEBUG [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.MasterFileSystem: Renamed region directory: hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting
2014-04-16 18:31:26,274 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: dead splitlog workers [hor13n03.gq1.ygridcore.net,60020,1397672668647]
2014-04-16 18:31:26,276 DEBUG [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: Scheduling batch of logs to split
2014-04-16 18:31:26,276 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: started splitting 1 logs in [hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting]
2014-04-16 18:31:26,276 INFO [hor13n02.gq1.ygridcore.net,60000,1397672191204.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 1 tasks={/hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta=last_update = -1 last_version = -
2014-04-16 18:31:26,276 DEBUG [hor13n02.gq1.ygridcore.net,60000,1397672191204.splitLogManagerTimeoutMonitor] master.SplitLogManager: resubmitting unassigned task(s) after timeout
2014-04-16 18:31:26,277 WARN [main-EventThread] master.SplitLogManager$GetDataAsyncCallback: task znode /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta vanished.
2014-04-16 18:31:26,277 INFO [main-EventThread] master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta
2014-04-16 18:31:26,282 DEBUG [main-EventThread] master.SplitLogManager: put up splitlog task at znode /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta
2014-04-16 18:31:26,286 WARN [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting
2014-04-16 18:31:26,286 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/WALs/hor13n03.gq1.ygridcore.net,60020,1397672668647-splitting] in 10ms
2014-04-16 18:31:26,290 DEBUG [main-EventThread] master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta
2014-04-16 18:31:26,290 WARN [main-EventThread] master.SplitLogManager$GetDataAsyncCallback: task znode /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta vanished.
2014-04-16 18:31:26,290 DEBUG [main-EventThread] master.SplitLogManager: unacquired orphan task is done /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta
2014-04-16 18:31:26,291 INFO [main-EventThread] master.SplitLogManager: task /hbase/splitWAL/RESCAN0000000383 entered state: DONE hor13n02.gq1.ygridcore.net,60000,1397672191204
2014-04-16 18:31:26,291 DEBUG [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.AssignmentManager: based on AM, current region=hbase:meta,,1.1588230740 is on server=hor13n03.gq1.ygridcore.net,60020,1397672668647 server being checked: hor13n03.gq1.ygridcore.net,60020,1397672668647
2014-04-16 18:31:26,291 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] handler.MetaServerShutdownHandler: Server hor13n03.gq1.ygridcore.net,60020,1397672668647 was carrying META. Trying to assign.
2014-04-16 18:31:26,291 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.RegionStates: Transitioned {1588230740 state=OPEN, ts=1397672681933, server=hor13n03.gq1.ygridcore.net,60020,1397672668647} to {1588230740 state=OFFLINE, ts=1397673086291, server=hor13n03.gq1.ygridcore.net,60020,1397672668647}
2014-04-16 18:31:26,291 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.RegionStates: Offlined 1588230740 from hor13n03.gq1.ygridcore.net,60020,1397672668647
2014-04-16 18:31:26,299 INFO [main-EventThread] master.SplitLogManager$DeleteAsyncCallback: /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta does not exist. Either was created but deleted behind our back by another pending delete OR was deleted in earl
2014-04-16 18:31:26,299 DEBUG [main-EventThread] master.SplitLogManager: deleted task without in memory state /hbase/splitWAL/WALs%2Fhor13n03.gq1.ygridcore.net%2C60020%2C1397672668647-splitting%2Fhor13n03.gq1.ygridcore.net%252C60020%252C1397672668647.1397672681632.meta
2014-04-16 18:31:26,299 DEBUG [main-EventThread] master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitWAL/RESCAN0000000383
2014-04-16 18:31:26,299 DEBUG [main-EventThread] master.SplitLogManager: deleted task without in memory state /hbase/splitWAL/RESCAN0000000383
2014-04-16 18:31:26,301 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] catalog.CatalogTracker: Failed verification of hbase:meta,,1 at address=hor13n03.gq1.ygridcore.net,60020,1397672668647, exception=java.net.ConnectException: Connection refused
2014-04-16 18:31:26,301 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in ZooKeeper
2014-04-16 18:31:26,315 DEBUG [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.AssignmentManager: No previous transition plan found (or ignoring an existing plan) for hbase:meta,,1.1588230740; generated random plan=hri=hbase:meta,,1.1588230740, src=, dest=hor13n04.gq1.ygridcore.net,60020,1397672685370; 3 (online=3, available=3) available servers, forceNew
2014-04-16 18:31:26,315 DEBUG [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] zookeeper.ZKAssign: master:60000-0x3456a48dd7d0223, quorum=hor13n04.gq1.ygridcore.net:2181,hor13n03.gq1.ygridcore.net:2181,hor13n20.gq1.ygridcore.net:2181, baseZNode=/hbase Creating (or updating) unassigned node 1588230740 with OFFLINE state
2014-04-16 18:31:26,323 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to hor13n04.gq1.ygridcore.net,60020,1397672685370
2014-04-16 18:31:26,323 INFO [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.RegionStates: Transitioned {1588230740 state=OFFLINE, ts=1397673086315, server=hor13n03.gq1.ygridcore.net,60020,1397672668647} to {1588230740 state=PENDING_OPEN, ts=1397673086323, server=hor13n04.gq1.ygridcore.net,60020,1397672685370}
2014-04-16 18:31:28,337 DEBUG [MASTER_META_SERVER_OPERATIONS-hor13n02:60000-2] master.DeadServer: Finished processing hor13n03.gq1.ygridcore.net,60020,1397672668647
{code}
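For completeness, one conceivable way to close this window, shown purely as an illustration (hypothetical names; not necessarily the fix that will land for this issue), is to let the monitor distinguish a znode that was confirmed created and later deleted from one that was never created at all:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Purely illustrative guard, NOT the committed fix for HBASE-11217: only
// treat NoNode as task completion once the znode create has been confirmed,
// so "never created yet" and "deleted after completion" are distinguishable.
public class NoNodeGuardSketch {
    static final Map<String, Boolean> createAcked = new ConcurrentHashMap<>();

    // Called from the (hypothetical) create callback once ZK acks the create.
    static void onCreateAcked(String path) { createAcked.put(path, Boolean.TRUE); }

    // Called when a get on the task znode returns NoNode.
    static void onGetDataNoNode(String path) {
        if (Boolean.TRUE.equals(createAcked.get(path))) {
            System.out.println("Done splitting " + path);  // really completed and cleaned up
        } else {
            System.out.println(path + " not created yet; re-check later"); // don't mark SUCCESS
        }
    }

    public static void main(String[] args) {
        String path = "/hbase/splitWAL/example-wal.meta"; // hypothetical path
        onGetDataNoNode(path); // before the create ack: does NOT declare success
        onCreateAcked(path);
        onGetDataNoNode(path); // after ack (znode since deleted): safe to mark done
    }
}
{code}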