Date: Sat, 8 Aug 2015 04:11:46 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-12865) WALs may be deleted before they are replicated to peers

    [ https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662792#comment-14662792 ]

Hudson commented on HBASE-12865:
--------------------------------

FAILURE: Integrated in HBase-1.3-IT #77 (See [https://builds.apache.org/job/HBase-1.3-IT/77/])
HBASE-12865 WALs may be deleted before they are replicated to peers (He Liangliang) (apurtell: rev 68cb53d1512411e91c864b29da0a4f9fb1c3e69a)
* hbase-server/src/main/java/org/apache/hadoop/hbase/replication/master/ReplicationLogCleaner.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesClientZKImpl.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesZKImpl.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceManager.java
* hbase-client/src/main/java/org/apache/hadoop/hbase/replication/ReplicationQueuesClient.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationStateBasic.java


> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>            Assignee: He Liangliang
>            Priority: Critical
>             Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2, 1.3.0
>
>         Attachments: HBASE-12865-V1.diff, HBASE-12865-V2.diff
>
>
> By design, ReplicationLogCleaner guarantees that WALs still in a replication queue cannot be deleted by the HMaster. The ReplicationLogCleaner gets the WAL set from ZooKeeper by scanning the replication znodes. Because the scan is not atomic, it may get an incomplete WAL set during replication failover.
> For example: there are three region servers, rs1, rs2, and rs3, and peer id 10. The layout of the replication ZooKeeper nodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner finishes scanning the replication queue of rs1 and starts to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication queue. The new layout is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and starts to scan the node of rs3. But the queue has already been moved to "replication/rs1/10-rs3/WALS".
> So the ReplicationLogCleaner will miss the WALs of rs3 for peer 10, and the HMaster may delete these WALs before they are replicated to peer clusters.
> We encountered this problem in our cluster, and I think it is a serious bug for replication.
> Suggestions for fixing this bug are welcome. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
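
The t1/t2/t3 interleaving in the quoted description can be reproduced with a small in-memory model. The sketch below is only an illustration, not HBase or ZooKeeper code: the nested dict standing in for the znode tree, the WAL names, and the helpers `make_tree`, `failover`, `scan`, and `demo` are all hypothetical. It assumes the scanner lists the region servers once and then reads their queues one at a time, which is what makes the scan non-atomic.

```python
# Toy in-memory model of the replication znodes; not HBase or ZooKeeper code.
# Maps region server -> {queue id -> set of WAL names}. WAL names are made up.


def make_tree():
    return {
        "rs1": {"10": {"rs1-wal-1"}},
        "rs2": {"10": {"rs2-wal-1"}},
        "rs3": {"10": {"rs3-wal-1"}},
    }


def failover(tree, dead, survivor):
    """Move every queue of `dead` to `survivor`, renamed '<queue>-<dead>'."""
    for queue_id, wals in tree[dead].items():
        tree[survivor][f"{queue_id}-{dead}"] = wals
    tree[dead] = {}


def scan(tree, before_visit=None):
    """Non-atomic scan: list the servers once, then read them one at a time."""
    seen = set()
    for rs in sorted(tree):          # snapshot of the server list
        if before_visit:
            before_visit(rs)         # lets the demo inject a concurrent change
        for wals in tree[rs].values():
            seen |= wals
    return seen


def demo():
    tree = make_tree()

    def hook(rs):
        # t2: rs3 goes down after rs1 was scanned, as the cleaner reaches rs2.
        if rs == "rs2":
            failover(tree, "rs3", "rs1")

    seen = scan(tree, hook)          # scan that overlaps with the failover
    all_wals = scan(tree)            # undisturbed scan of the final layout
    return seen, all_wals


if __name__ == "__main__":
    seen, all_wals = demo()
    print("missed by the overlapping scan:", sorted(all_wals - seen))
```

In this model the overlapping scan never sees the WAL that moved from rs3 to rs1, because rs1 had already been visited when the queue arrived and rs3 is empty by the time it is visited. A cleaner that trusted this result would treat that WAL as safe to delete.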