activemq-dev mailing list archives

From "Scott Feldstein (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
Date Fri, 11 Apr 2014 20:38:22 GMT

    [ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967073#comment-13967073 ]

Scott Feldstein commented on AMQ-5082:
--------------------------------------

This can be reproduced by causing a network disruption in which none of the ActiveMQ nodes
can reach any member of the ZooKeeper cluster.  The disruption needs to last longer than the
ZooKeeper timeout.  The quickest way I've found to do this is with my local firewall.  The
state of the objects isn't always exactly the same as in the logs that I posted, but the
overall symptoms are.
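
To confirm the symptom after such a disruption, a small probe can check whether any broker is still accepting TCP connections. The following is only a minimal sketch: the host names are hypothetical placeholders, and port 61616 assumes the brokers use ActiveMQ's default OpenWire transport port, which this report does not state.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def listening_nodes(nodes, timeout=2.0):
    """Return the subset of (host, port) pairs that accept connections."""
    return [(host, port) for host, port in nodes if is_listening(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical broker host names; 61616 is an assumed transport port.
    brokers = [("mq-node1", 61616), ("mq-node2", 61616), ("mq-node3", 61616)]
    up = listening_nodes(brokers)
    if up:
        print("still listening:", up)
    else:
        print("no broker is listening")
```

In a healthy replicatedLevelDB cluster only the current master is expected to accept client connections, so the failure described here corresponds to the probe finding no listening broker at all.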

In the code, I think the problem is in change_listener in
org.apache.activemq.leveldb.replicated.MasterElector.scala.  As Kevin said, the node thinks
that it is the master, but it does not correctly initialize the listener when it gets into
this state.

> ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
> -------------------------------------------------------------------
>
>                 Key: AMQ-5082
>                 URL: https://issues.apache.org/jira/browse/AMQ-5082
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: activemq-leveldb-store
>    Affects Versions: 5.9.0, 5.10.0
>            Reporter: Scott Feldstein
>            Priority: Critical
>         Attachments: 03-07.tgz, amq_5082_threads.tar.gz, mq-node1-cluster.failure, mq-node2-cluster.failure, mq-node3-cluster.failure, zookeeper.out-cluster.failure
>
>
> I have a 3-node ActiveMQ cluster and one ZooKeeper node, using a replicatedLevelDB persistence adapter.
> {code}
>         <persistenceAdapter>
>             <replicatedLevelDB
>               directory="${activemq.data}/leveldb"
>               replicas="3"
>               bind="tcp://0.0.0.0:0"
>               zkAddress="zookeep0:2181"
>               zkPath="/activemq/leveldb-stores"/>
>         </persistenceAdapter>
> {code}
> After about a day of sitting idle, there are cascading failures and the cluster stops listening altogether.
> I can reproduce this consistently on 5.9 and the latest 5.10 (commit 2360fb859694bacac1e48092e53a56b388e1d2f0). I am going to attach logs from the three MQ nodes and the ZooKeeper logs covering the time when the cluster starts having issues.
> The cluster stops listening at Mar 4, 2014 4:56:50 AM (within 5 seconds).
> The OSs are all CentOS 5.9 on one ESX server, so I doubt networking is an issue.
> If you need more data, it should be easy to get whatever is needed, since the failure is consistently reproducible.
> This bug may be related to AMQ-5026, but looks different enough to file a separate issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)