hbase-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12865) WALs may be deleted before they are replicated to peers
Date Wed, 01 Jul 2015 14:11:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610294#comment-14610294 ]

Hadoop QA commented on HBASE-12865:
-----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12743047/HBASE-12865-V2.diff
  against master branch at commit 85c278a6a8b25ff86e22c254ffec35e945cd7c66.
  ATTACHMENT ID: 12743047

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 6 new or modified
tests.

    {color:green}+1 hadoop versions{color}. The patch compiles with all supported hadoop versions
(2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.0 2.7.0)

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:green}+1 protoc{color}.  The applied patch does not increase the total number of
protoc compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 checkstyle{color}.  The applied patch does not increase the total number
of checkstyle errors

    {color:green}+1 findbugs{color}.  The patch does not introduce any  new Findbugs (version
2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines longer than 100

  {color:green}+1 site{color}.  The mvn post-site goal succeeds with this patch.

    {color:red}-1 core tests{color}.  The patch failed these unit tests:
                  org.apache.hadoop.hbase.util.TestProcessBasedCluster
                  org.apache.hadoop.hbase.mapreduce.TestImportExport
                  org.apache.hadoop.hbase.TestRegionRebalancing

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/14639//testReport/
Release Findbugs (version 2.0.3) warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/14639//artifact/patchprocess/newFindbugsWarnings.html
Checkstyle Errors: https://builds.apache.org/job/PreCommit-HBASE-Build/14639//artifact/patchprocess/checkstyle-aggregate.html

  Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/14639//console

This message is automatically generated.

> WALs may be deleted before they are replicated to peers
> -------------------------------------------------------
>
>                 Key: HBASE-12865
>                 URL: https://issues.apache.org/jira/browse/HBASE-12865
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Liu Shaohui
>            Assignee: He Liangliang
>            Priority: Critical
>         Attachments: HBASE-12865-V1.diff, HBASE-12865-V2.diff
>
>
> By design, ReplicationLogCleaner guarantees that WALs still in a replication queue
can't be deleted by the HMaster. The ReplicationLogCleaner gets the WAL set from ZooKeeper
by scanning the replication zk node. But it may get an incomplete WAL set during replication
failover, because the scan operation is not atomic.
> For example: There are three region servers: rs1, rs2, rs3, and peer id 10.  The layout
of replication zookeeper nodes is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs2/10/wals
>                      /rs3/10/wals
> {code}
> - t1: the ReplicationLogCleaner finishes scanning the replication queue of rs1 and starts
to scan the queue of rs2.
> - t2: region server rs3 goes down, and rs1 takes over rs3's replication queue. The new layout
is:
> {code}
> /hbase/replication/rs/rs1/10/wals
>                      /rs1/10-rs3/wals
>                      /rs2/10/wals
>                      /rs3
> {code}
> - t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and starts to scan
the node of rs3. But the queue has been moved to "replication/rs1/10-rs3/WALS".
> So the ReplicationLogCleaner will miss the WALs of rs3 in peer 10, and the HMaster may
delete these WALs before they are replicated to peer clusters.
> We encountered this problem in our cluster and I think it's a serious bug for replication.
> Suggestions to fix this bug are welcome. thx~
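
The t1/t2/t3 interleaving above can be reproduced with a small in-memory model. This is only an illustrative sketch in Python, not HBase or ZooKeeper code: the `layout` dict stands in for the replication znodes, and `scan_queues`, `failover`, and the WAL names are hypothetical. It assumes the cleaner reads the list of region server nodes once and then visits them one at a time.

```python
def scan_queues(layout, order, on_step=None):
    """Non-atomic scan: visit each region server node in turn and collect WAL names."""
    seen = set()
    for step, rs in enumerate(order):
        if on_step:
            on_step(step, layout)  # lets a concurrent event happen between two reads
        for _queue, wals in layout.get(rs, {}).items():
            seen.update(wals)
    return seen

# Hypothetical stand-in for /hbase/replication/rs/<rs>/<queue>/wals
layout = {
    "rs1": {"10": ["wal-rs1-a"]},
    "rs2": {"10": ["wal-rs2-a"]},
    "rs3": {"10": ["wal-rs3-a"]},
}

def failover(step, layout):
    # t2: after rs1 has been scanned and before rs2 is scanned,
    # rs3 goes down and rs1 takes over its queue as "10-rs3".
    if step == 1 and "10" in layout["rs3"]:
        layout["rs1"]["10-rs3"] = layout["rs3"].pop("10")

# Every WAL that is still waiting to be replicated.
all_wals = {w for queues in layout.values() for wals in queues.values() for w in wals}

seen = scan_queues(layout, ["rs1", "rs2", "rs3"], on_step=failover)
missed = all_wals - seen
print(missed)
```

In this model the WAL from rs3 is still queued under rs1 as "10-rs3" after the failover, yet the scan never sees it, because rs1 was read before the move and rs3 is empty by the time it is read. A cleaner that trusts `seen` would treat that WAL as deletable.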



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
