Date: Tue, 1 Oct 2019 08:02:00 +0000 (UTC)
From: "Alex Rudyy (Jira)"
To: dev@qpid.apache.org
Subject: [jira] [Created] (QPID-8366) [Broker-J] The loss of BDB HA majority on invocation of house keeping operations can crash the broker

Alex Rudyy created QPID-8366:
--------------------------------

             Summary: [Broker-J] The loss of BDB HA majority on invocation of house keeping operations can crash the broker
                 Key: QPID-8366
                 URL: https://issues.apache.org/jira/browse/QPID-8366
             Project: Qpid
          Issue Type: Task
          Components: Broker-J
    Affects Versions: qpid-java-broker-7.1.4, qpid-java-broker-7.1.3, qpid-java-broker-7.0.8, qpid-java-broker-7.1.2, qpid-java-broker-7.1.1, qpid-java-broker-7.0.7, qpid-java-broker-7.0.6, qpid-java-broker-7.0.5, qpid-java-broker-7.0.4, qpid-java-broker-7.1.0
            Reporter: Alex Rudyy


A {{ConnectionScopedRuntimeException}} thrown from the {{VirtualHost}} {{House Keeping}} thread when House Keeping operations such as {{checkMessageStatus}} invoke the {{MessageStore}} can crash the broker. An example stack trace for such an exception (from Qpid Broker version 7.0.6) is provided below:
{noformat}
2019-09-27 07:53:38,168 ERROR [virtualhost-test-pool-1] (o.a.q.s.Main) - Uncaught exception, shutting down.
org.apache.qpid.server.util.ConnectionScopedRuntimeException: Required number of nodes not reachable
    at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.handleDatabaseException(ReplicatedEnvironmentFacade.java:495)
    at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:332)
    at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.removeMessage(AbstractBDBMessageStore.java:288)
    at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore$StoredBDBMessage.remove(AbstractBDBMessageStore.java:1090)
    at org.apache.qpid.server.message.AbstractServerMessageImpl.decrementReference(AbstractServerMessageImpl.java:118)
    at org.apache.qpid.server.message.AbstractServerMessageImpl.access$500(AbstractServerMessageImpl.java:37)
    at org.apache.qpid.server.message.AbstractServerMessageImpl$Reference.release(AbstractServerMessageImpl.java:309)
    at org.apache.qpid.server.queue.QueueEntryImpl.dispose(QueueEntryImpl.java:557)
    at org.apache.qpid.server.queue.QueueEntryImpl.delete(QueueEntryImpl.java:572)
    at org.apache.qpid.server.queue.AbstractQueue$11.postCommit(AbstractQueue.java:1729)
    at org.apache.qpid.server.txn.AutoCommitTransaction.dequeue(AutoCommitTransaction.java:92)
    at org.apache.qpid.server.queue.AbstractQueue.dequeueEntry(AbstractQueue.java:1722)
    at org.apache.qpid.server.queue.AbstractQueue.dequeueEntry(AbstractQueue.java:1717)
    at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1761)
    at org.apache.qpid.server.queue.AbstractQueue.checkMessageStatus(AbstractQueue.java:2165)
    at org.apache.qpid.server.virtualhost.AbstractVirtualHost$VirtualHostHouseKeepingTask.execute(AbstractVirtualHost.java:1965)
    at org.apache.qpid.server.virtualhost.HouseKeepingTask$1.run(HouseKeepingTask.java:56)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.qpid.server.virtualhost.HouseKeepingTask.run(HouseKeepingTask.java:51)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.sleepycat.je.rep.InsufficientAcksException: (JE 7.4.5) Transaction: -3459038252 VLSN: 10,380,435,448, initiated at: 07:53:20. Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 2. Missing replica acks: 2. Timeout: 15000ms. FeederState=acc3_2(3)[MASTER]
 Current feeds:
 acc3_1: feederVLSN=10,380,435,456 replicaTxnEndVLSN=10,380,435,396
 acc3: feederVLSN=10,380,435,456 replicaTxnEndVLSN=10,380,435,396
    at com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205)
    at com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189)
    at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426)
    at com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385)
    at com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:228)
    at com.sleepycat.je.txn.Txn.commit(Txn.java:772)
    at com.sleepycat.je.Transaction.doCommit(Transaction.java:621)
    at com.sleepycat.je.Transaction.commit(Transaction.java:401)
    at org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:328)
    ... 25 common frames omitted
{noformat}
The failure shown in the stack trace above occurred while a BDB HA {{VirtualHost}} was deleting an expired message: its BDB HA group lost the majority just as the {{VirtualHost}} tried to commit the BDB HA transaction for the message deletion. The majority loss is reported to the caller as a {{ConnectionScopedRuntimeException}}.

It seems we need to catch and handle {{ConnectionScopedRuntimeException}} in House Keeping operations.
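
The suggestion to catch and handle the exception in House Keeping operations could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Broker-J change: {{GuardedHouseKeepingTask}} and {{HouseKeepingSketch}} are hypothetical names, the exception class is a local stand-in for {{org.apache.qpid.server.util.ConnectionScopedRuntimeException}} so that the example compiles on its own, and skipping the current run (rather than, say, retrying) is an assumed policy.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for org.apache.qpid.server.util.ConnectionScopedRuntimeException,
// declared locally only so the sketch compiles without Broker-J on the classpath.
class ConnectionScopedRuntimeException extends RuntimeException {
    ConnectionScopedRuntimeException(String message) {
        super(message);
    }
}

// Hypothetical wrapper (not a Broker-J class): runs a house keeping action and
// absorbs ConnectionScopedRuntimeException so that it never reaches the
// thread's uncaught-exception handling.
final class GuardedHouseKeepingTask implements Runnable {
    private final Runnable delegate;
    private final AtomicInteger skippedRuns = new AtomicInteger();

    GuardedHouseKeepingTask(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (ConnectionScopedRuntimeException e) {
            // Assumed policy: the store is temporarily unusable (for example the
            // BDB HA majority was lost), so skip this run and leave the work for
            // a later house keeping cycle.
            skippedRuns.incrementAndGet();
            System.err.println("House keeping run skipped: " + e.getMessage());
        }
        // Any other exception type is deliberately left to propagate.
    }

    int skippedRuns() {
        return skippedRuns.get();
    }
}

class HouseKeepingSketch {
    public static void main(String[] args) {
        // Simulates a house keeping action whose store commit fails on majority loss.
        GuardedHouseKeepingTask task = new GuardedHouseKeepingTask(() -> {
            throw new ConnectionScopedRuntimeException("Required number of nodes not reachable");
        });
        task.run();
        System.out.println("skipped runs: " + task.skippedRuns());
    }
}
```

In the broker itself, the equivalent catch would presumably sit around the body of the house keeping task seen in the stack trace ({{HouseKeepingTask.run}}); where exactly it belongs, and whether the failed deletion should be retried, are open questions for the fix.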