activemq-issues mailing list archives

From "Gregor Stephen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
Date Fri, 18 Dec 2015 15:36:46 GMT

    [ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15064099#comment-15064099
] 

Gregor Stephen commented on AMQ-5082:
-------------------------------------

We are seeing something very similar to this in our development environment.

We have a 3-node ActiveMQ cluster where each node runs ActiveMQ 5.12.0 and ZooKeeper 3.4.6.
(Note: we have also done some testing with ZooKeeper 3.4.7, but this failed to resolve the
issue. Time constraints have so far prevented us from testing ActiveMQ 5.13.)

What we have found is that when we stop the master ZooKeeper process (via the "End process
tree" command in Task Manager), the remaining two ZooKeeper nodes continue to function as
normal. Sometimes the ActiveMQ cluster handles this, but sometimes it does not.

When the cluster fails, we typically see this in the ActiveMQ log:

2015-12-18 09:08:45,157 | WARN  | Too many cluster members are connected.  Expected at most
3 members but there are 4 connected. | org.apache.activemq.leveldb.replicated.MasterElector
| WrapperSimpleAppMain-EventThread
...
...
2015-12-18 09:27:09,722 | WARN  | Session 0x351b43b4a560016 for server null, unexpected error,
closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | WrapperSimpleAppMain-SendThread(192.168.0.10:2181)
java.net.ConnectException: Connection refused: no further information
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)[:1.7.0_79]
	at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)[:1.7.0_79]
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)[zookeeper-3.4.6.jar:3.4.6-1569965]
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)[zookeeper-3.4.6.jar:3.4.6-1569965]
	
We were immediately concerned by the fact that (A) ActiveMQ seems to think there are four
members in the cluster when it is only configured with three, and (B) when the exception is
raised, the server appears to be null. We then increased ActiveMQ's logging level to DEBUG in
order to display the list of members:

2015-12-18 09:33:04,236 | DEBUG | ZooKeeper group changed: Map(localhost -> ListBuffer((0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}),
(0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}),
(0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}),
(0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null})))
| org.apache.activemq.leveldb.replicated.MasterElector | ActiveMQ BrokerService[localhost]
Task-14
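
For anyone comparing this against their own logs, here is a rough Python sketch (not part of
ActiveMQ or ZooKeeper) that counts the entries in the group listing above and checks them
against the expected cluster size. The names GROUP_LOG, EXPECTED_REPLICAS and parse_members
are made up for illustration, and the regular expression assumes the entry layout shown in
the DEBUG line, which may differ in other versions.

```python
import json
import re

# Member entries copied verbatim from the DEBUG line above.
GROUP_LOG = """
(0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}),
(0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}),
(0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}),
(0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null})
"""

# Assumption: the cluster is configured for three members, matching the
# "Expected at most 3 members" warning.
EXPECTED_REPLICAS = 3

# Assumption: each entry looks like (<sequence number>,{<flat JSON object>}).
ENTRY = re.compile(r"\((\d+),(\{[^{}]*\})\)")


def parse_members(text):
    """Return a list of (sequence_number, member_dict) pairs found in text."""
    return [(seq, json.loads(body)) for seq, body in ENTRY.findall(text)]


members = parse_members(GROUP_LOG)
with_address = [seq for seq, member in members if member["address"] is not None]

print(len(members), "members registered; expected at most", EXPECTED_REPLICAS)
print("entries advertising an address:", with_address)
```

Run against the listing above, this reports four registered entries, only one of which
(0000000158) advertises an address.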

> ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
> -------------------------------------------------------------------
>
>                 Key: AMQ-5082
>                 URL: https://issues.apache.org/jira/browse/AMQ-5082
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: activemq-leveldb-store
>    Affects Versions: 5.9.0, 5.10.0
>            Reporter: Scott Feldstein
>            Assignee: Christian Posta
>            Priority: Critical
>             Fix For: 5.14.0
>
>         Attachments: 03-07.tgz, amq_5082_threads.tar.gz, mq-node1-cluster.failure, mq-node2-cluster.failure,
mq-node3-cluster.failure, zookeeper.out-cluster.failure
>
>
> I have a 3 node amq cluster and one zookeeper node using a replicatedLevelDB persistence
adapter.
> {code}
>         <persistenceAdapter>
>             <replicatedLevelDB
>               directory="${activemq.data}/leveldb"
>               replicas="3"
>               bind="tcp://0.0.0.0:0"
>               zkAddress="zookeep0:2181"
>               zkPath="/activemq/leveldb-stores"/>
>         </persistenceAdapter>
> {code}
> After about a day or so of sitting idle there are cascading failures and the cluster
stops listening altogether.
> I can reproduce this consistently on 5.9 and the latest 5.10 (commit 2360fb859694bacac1e48092e53a56b388e1d2f0).
 I am going to attach logs from the three mq nodes and the zookeeper logs that reflect the
time where the cluster starts having issues.
> The cluster stops listening Mar 4, 2014 4:56:50 AM (within 5 seconds).
> The OSs are all centos 5.9 on one esx server, so I doubt networking is an issue.
> If you need more data it should be pretty easy to get whatever is needed since it is
consistently reproducible.
> This bug may be related to AMQ-5026, but looks different enough to file a separate issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
