hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10482) ReplicationSyncUp doesn't clean up its ZK, needed for tests
Date Mon, 10 Feb 2014 23:38:19 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897285#comment-13897285
] 

Jean-Daniel Cryans commented on HBASE-10482:
--------------------------------------------

[~lhofhansl] I just pushed it to 0.94.

bq. Or are you just debugging in 0.96+ for now?

Nothing like that, I think Stack just committed it to 0.96 and above since the patch I posted
could directly be applied to those branches.

> ReplicationSyncUp doesn't clean up its ZK, needed for tests
> -----------------------------------------------------------
>
>                 Key: HBASE-10482
>                 URL: https://issues.apache.org/jira/browse/HBASE-10482
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.96.1, 0.94.16
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.98.1, 0.99.0, 0.94.17
>
>         Attachments: HBASE-10249.patch
>
>
> TestReplicationSyncUpTool failed again:
> https://builds.apache.org/job/HBase-TRUNK/4895/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationSyncUpTool/testSyncUpTool/
> It's not super obvious why only one of the two tables is replicated, the test could use
some more logging, but I understand it this way:
> The first ReplicationSyncUp gets started and for some reason it cannot replicate the
data:
> {noformat}
> 2014-02-06 21:32:19,811 INFO  [Thread-1372] regionserver.ReplicationSourceManager(203):
Current list of replicators: [1391722339091.SyncUpTool.replication.org,1234,1, quirinus.apache.org,37045,1391722237951,
quirinus.apache.org,33502,1391722238125] other RSs: []
> 2014-02-06 21:32:19,811 INFO  [Thread-1372.replicationSource,1] regionserver.ReplicationSource(231):
Replicating db42e7fc-7f29-4038-9292-d85ea8b9994b -> 783c0ab2-4ff9-4dc0-bb38-86bf31d1d817
> 2014-02-06 21:32:19,892 TRACE [Thread-1372.replicationSource,2] regionserver.ReplicationSource(596):
No log to process, sleeping 100 times 1
> 2014-02-06 21:32:19,911 TRACE [Thread-1372.replicationSource,1] regionserver.ReplicationSource(596):
No log to process, sleeping 100 times 1
> 2014-02-06 21:32:20,094 TRACE [Thread-1372.replicationSource,2] regionserver.ReplicationSource(596):
No log to process, sleeping 100 times 2
> ...
> 2014-02-06 21:32:23,414 TRACE [Thread-1372.replicationSource,1] regionserver.ReplicationSource(596):
No log to process, sleeping 100 times 8
> 2014-02-06 21:32:23,673 INFO  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl(169):
Moving quirinus.apache.org,37045,1391722237951's hlogs to my queue
> 2014-02-06 21:32:23,768 DEBUG [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl(396):
Creating quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
> 2014-02-06 21:32:23,842 DEBUG [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl(396):
Creating quirinus.apache.org%2C37045%2C1391722237951.1391722243779 with data 10803
> 2014-02-06 21:32:24,297 TRACE [Thread-1372.replicationSource,2] regionserver.ReplicationSource(596):
No log to process, sleeping 100 times 9
> 2014-02-06 21:32:24,314 TRACE [Thread-1372.replicationSource,1] regionserver.ReplicationSource(596):
No log to process, sleeping 100 times 9
> {noformat}
> Finally it gives up:
> {noformat}
> 2014-02-06 21:32:30,873 DEBUG [Thread-1372] replication.TestReplicationSyncUpTool(323):
SyncUpAfterDelete failed at retry = 0, with rowCount_ht1TargetPeer1 =100 and rowCount_ht2TargetAtPeer1
=200
> {noformat}
> The syncUp tool has an ID you can follow, grep for syncupReplication1391722338885 or
just the timestamp, and you can see it doing things after that. The reason is that the tool
closes the ReplicationSourceManager but not the ZK connection, so events _still_ come in and
NodeFailoverWorker _still_ tries to recover queues but then there's nothing to process them.
> Later in the logs you can see:
> {noformat}
> 2014-02-06 21:32:37,381 INFO  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl(169):
Moving quirinus.apache.org,33502,1391722238125's hlogs to my queue
> 2014-02-06 21:32:37,567 INFO  [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl(239):
Won't transfer the queue, another RS took care of it because of: KeeperErrorCode = NoNode
for /1/replication/rs/quirinus.apache.org,33502,1391722238125/lock
> {noformat}
> There shouldn't' be any racing, but now someone already moved "quirinus.apache.org,33502,1391722238125"
away.
> FWIW I can't even make the test fail on my machine so I'm not 100% sure closing the ZK
connection fixes the issue, but at least it's the right thing to do.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message