Return-Path: X-Original-To: apmail-hbase-issues-archive@www.apache.org Delivered-To: apmail-hbase-issues-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 28A9BE049 for ; Fri, 28 Dec 2012 09:58:15 +0000 (UTC) Received: (qmail 49861 invoked by uid 500); 28 Dec 2012 09:58:13 -0000 Delivered-To: apmail-hbase-issues-archive@hbase.apache.org Received: (qmail 49560 invoked by uid 500); 28 Dec 2012 09:58:13 -0000 Mailing-List: contact issues-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@hbase.apache.org Received: (qmail 49395 invoked by uid 99); 28 Dec 2012 09:58:13 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 28 Dec 2012 09:58:13 +0000 Date: Fri, 28 Dec 2012 09:58:13 +0000 (UTC) From: "terry zhang (JIRA)" To: issues@hbase.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HBASE-7451?page=3Dcom.atlassian= .jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D1354= 0418#comment-13540418 ]=20 terry zhang commented on HBASE-7451: ------------------------------------ My solution is simple, Just notify snapshotErrorListener to threw a excepti= on to let all GloballyConsistentRegionLockTask to quit and release the reg= ion lock. {code:title=3DSnapshotOperation.java|borderStyle=3Dsolid} @Override public void cleanup(Exception e) { LOG.debug("Cleanup snapshot - handled in sub-tasks on error"); + snapshotErrorListener.receiveError("Cleanup snapshot", new HBaseSnapsh= otException(e.getMessage())); } {code}=20 =20 > [snapshot] regionserver will be deadlock when GlobalSnapshotOperation tim= eout happen > -------------------------------------------------------------------------= ----------- > > Key: HBASE-7451 > URL: https://issues.apache.org/jira/browse/HBASE-7451 > Project: HBase > Issue Type: Bug > Components: snapshots > Reporter: terry zhang > Assignee: terry zhang > > Hi Matteo Bertozzi and Jesse Yates, My observation is base on code in git= hub =EF=BC=9A https://github.com/matteobertozzi/hbase/ > If we create a snapshot and meet regionserver timeout. Rs will be lock an= d can not put any data. Please take a look at log below : > // regionserver snapshot timeout > org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitExcept= ion: org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemp= tTimeoutException: Timeout elapsed! Start:1356518666984, End:1356518667584,= diff:600, max:600 ms > at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhas= eCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java= :135) > at org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhas= eCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDis= patcher.java:71) > at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(= ThreePhaseCommit.java:92) > at org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(= ThreePhaseCommit.java:89) > at org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.r= un(OperationAttemptTimer.java:71) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > Caused by: org.apache.hadoop.hbase.server.errorhandling.exception.Operati= onAttemptTimeoutException: Timeout elapsed! Start:1356518666984, End:135651= 8667584, diff:600, max:600 ms > ... 3 more > 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPh= aseCommit: Running cleanup phase. > 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapsh= ot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on = error > 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv > //Waiting for 'commit allowed' latch and do not exist > 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.server.commit.TwoPh= aseCommit: Running cleanup phase. > 2012-12-26 18:44:57,211 DEBUG org.apache.hadoop.hbase.regionserver.snapsh= ot.operation.SnapshotOperation: Cleanup snapshot - handled in sub-tasks on = error > 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.server.commit.TwoPh= aseCommit: Running finish phase. > 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.regionserver.snapsh= ot.operation.SnapshotOperation: Finish snapshot - handling in subtasks on e= rror > 2012-12-26 18:44:57,212 WARN org.apache.hadoop.hbase.server.errorhandling= .OperationAttemptTimer: Timer already marked completed, ignoring! > 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distri= buted.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children c= hanged event:/hbase-TERRY-73/online-snapshot/prepare > 2012-12-26 18:45:17,002 INFO org.apache.hadoop.hbase.server.commit.distri= buted.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved start even= t. > 2012-12-26 18:45:17,002 DEBUG org.apache.hadoop.hbase.server.commit.distr= ibuted.zookeeper.ZKTwoPhaseCommitCohortMemberController: Looking for new op= erations under znode:/hbase-TERRY-73/online-snapshot/prepare > 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distri= buted.zookeeper.ZKTwoPhaseCommitCohortMemberController: Received children c= hanged event:/hbase-TERRY-73/online-snapshot/abort > 2012-12-26 18:45:17,003 INFO org.apache.hadoop.hbase.server.commit.distri= buted.zookeeper.ZKTwoPhaseCommitCohortMemberController: Recieved abort even= t. > 2012-12-26 18:45:17,003 DEBUG org.apache.hadoop.hbase.server.commit.distr= ibuted.zookeeper.ZKTwoPhaseCommitCohortMemberController: Checking for abort= ed operations on node:/hbase-TERRY-73/online-snapshot/abort > 2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:41,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:43,481 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCa= che: LRU Stats: total=3D11.77 MB, free=3D1.39 GB, max=3D1.4 GB, blocks=3D5,= accesses=3D96, hits=3D91, hitRatio=3D94.79%, cachingAccesses=3D96, caching= Hits=3D91, cachingHitsRatio=3D94.79%, evictions=3D0, evicted=3D0, evictedPe= rRun=3DNaN > 2012-12-26 18:45:46,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:51,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:45:56,993 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:01,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:06,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:11,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:16,994 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:21,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:26,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:31,995 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:36,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) > 2012-12-26 18:46:41,996 DEBUG org.apache.hadoop.hbase.util.Threads: Waiti= ng for 'commit allowed' latch. (sleep:5000 ms) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrato= rs For more information on JIRA, see: http://www.atlassian.com/software/jira