activemq-dev mailing list archives

From "Scott Feldstein (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AMQ-5082) ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
Date Sat, 08 Mar 2014 03:14:44 GMT

     [ https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Feldstein updated AMQ-5082:
---------------------------------

    Attachment: 03-07.tgz

I've been doing some more digging on this.  It turns out that there is a job on another VM
on the same ESX box that runs exactly when things fail.  The job backs up roughly 60 GB to S3.  At
that point there is a big hiccup in the system, all the ActiveMQ nodes stop listening, and they
never recover.  I tried increasing the ZooKeeper timeout from 2s to 10s on the ActiveMQ nodes
and that still doesn't help.
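
In case it helps anyone reproducing this, the timeout change amounts to something like the
following (assuming the zkSessionTimeout attribute on replicatedLevelDB is the right knob;
the other values are taken from the config in the description below):

{code}
        <persistenceAdapter>
            <replicatedLevelDB
              directory="${activemq.data}/leveldb"
              replicas="3"
              bind="tcp://0.0.0.0:0"
              zkAddress="zookeep0:2181"
              zkPath="/activemq/leveldb-stores"
              zkSessionTimeout="10s"/>
            <!-- zkSessionTimeout is my assumption for the knob; bumped from the 2s default -->
        </persistenceAdapter>
{code}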

I think the bug here is that the nodes never recover from this outage.  It seems like there
should be some type of recovery logic for the case where all of the nodes lose their connections
and come back to life at some later point in time.

I am adding more logs with DEBUG enabled from 03/07/14.

This set of logs shows the behavior that I am talking about.  Starting at 04:52:22,366 the
nodes go into a state where no master is ever elected:

{code}
$ egrep '"elected":' mq-node1-activemq.log.bug 
2014-03-07 04:50:49,325 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:58506","position":-1,"weight":1,"elected":null})))
2014-03-07 04:50:54,453 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:58506","position":-1,"weight":1,"elected":null}),
(0000000173,{"id":"localhost","container":null,"address":null,"position":161531,"weight":1,"elected":null})))
2014-03-07 04:51:06,008 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:58506","position":-1,"weight":1,"elected":null})))
2014-03-07 04:51:42,386 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:52:22,366 DEBUG [main] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null})))
2014-03-07 04:52:24,160 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:52:24,167 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000176,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:52:58,259 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000176,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000177,{"id":"localhost","container":null,"address":null,"position":161531,"weight":1,"elected":null})))
2014-03-07 04:55:12,679 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000176,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:55:12,681 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:56:02,014 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000178,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 05:00:20,550 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 05:01:32,446 DEBUG [main-EventThread] [org.apache.activemq.leveldb.replicated.MasterElector@112]
ZooKeeper group changed: Map(localhost -> ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000179,{"id":"localhost","container":null,"address":null,"position":161531,"weight":1,"elected":null})))
{code}
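
To double-check that no master ever comes back after the hiccup (assuming the full attached log
matches the excerpt above), a similar grep can filter for entries where "elected" actually carries
a value; nothing after 04:52:22 should match:

{code}
# sanity check against the attached log (hypothetical command, not from the attachment):
# print the last group-change entry in which some member is actually elected
$ egrep '"elected":"[0-9]+"' mq-node1-activemq.log.bug | tail -1
{code}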

> ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
> -------------------------------------------------------------------
>
>                 Key: AMQ-5082
>                 URL: https://issues.apache.org/jira/browse/AMQ-5082
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: activemq-leveldb-store
>    Affects Versions: 5.9.0, 5.10.0
>            Reporter: Scott Feldstein
>            Priority: Critical
>         Attachments: 03-07.tgz, mq-node1-cluster.failure, mq-node2-cluster.failure, mq-node3-cluster.failure, zookeeper.out-cluster.failure
>
>
> I have a three-node ActiveMQ cluster and one ZooKeeper node using a replicatedLevelDB persistence adapter.
> {code}
>         <persistenceAdapter>
>             <replicatedLevelDB
>               directory="${activemq.data}/leveldb"
>               replicas="3"
>               bind="tcp://0.0.0.0:0"
>               zkAddress="zookeep0:2181"
>               zkPath="/activemq/leveldb-stores"/>
>         </persistenceAdapter>
> {code}
> After about a day or so of sitting idle there are cascading failures and the cluster completely stops listening altogether.
> I can reproduce this consistently on 5.9 and the latest 5.10 (commit 2360fb859694bacac1e48092e53a56b388e1d2f0).  I am going to attach logs from the three MQ nodes and the ZooKeeper logs covering the time when the cluster starts having issues.
> The cluster stops listening at Mar 4, 2014 4:56:50 AM (within 5 seconds).
> The OSs are all CentOS 5.9 on one ESX server, so I doubt networking is an issue.
> If you need more data it should be pretty easy to get whatever is needed since it is consistently reproducible.
> This bug may be related to AMQ-5026, but looks different enough to file a separate issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
