activemq-users mailing list archives

From Anindya Haldar <anindya.hal...@oracle.com>
Subject Artemis 2.4.0 message loss in durability tests upon system power-off
Date Tue, 06 Feb 2018 01:11:25 GMT
We are in the process of qualifying Artemis 2.4.0 for our stack. We ran some message durability tests in the face of a power failure. The broker runs in a VirtualBox VM on a system where disk caching is disabled. The VM runs OEL Linux 7, and the VirtualBox Manager itself runs under Windows 7 Enterprise.

 

We use the JMS API and persistent messaging. The transaction batch size in the producers is 1, and the message size for the tests is 1024 bytes. No consumers are running at this point, and we let the queues build up. Then, about 5 minutes into the run, the VirtualBox VM running the broker is 'powered off' (using VirtualBox facilities). The producers detect the broker's absence and stop.
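
For reference, each producer is essentially the following loop. This is a simplified, hypothetical sketch using the Artemis JMS client, with a placeholder URL and a single queue; the real harness writes to all four test queues and does its own logging:

import javax.jms.*;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class DurabilityProducer {
    public static void main(String[] args) {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://broker-vm:61616");
        try (Connection conn = cf.createConnection()) {
            conn.start();
            // Transacted session; a batch size of 1 means we commit after every send
            Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
            MessageProducer producer = session.createProducer(session.createQueue("testQueue1"));
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            byte[] payload = new byte[1024];   // 1024-byte message body
            while (true) {
                BytesMessage msg = session.createBytesMessage();
                msg.writeBytes(payload);
                producer.send(msg);
                session.commit();              // transaction batch size = 1
            }
        } catch (JMSException e) {
            // Broker has gone away (e.g. the VM was powered off); the producer stops here
            System.out.println("Producer stopped: " + e.getMessage());
        }
    }
}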

 

Then we start the VM and the broker again. The broker comes up, and we get the queue stats from it before anything else:

 

|NAME                     |ADDRESS                  |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
|testQueue1               |testQueue1               |0              |106988        |106988         |0                |0              |
|testQueue2               |testQueue2               |0              |107077        |107077         |0                |0              |
|testQueue3               |testQueue3               |0              |106996        |106996         |0                |0              |
|testQueue4               |testQueue4               |0              |107076        |107076         |0                |0              |

 

The total message count across the queues is 428,137 (106,988 + 107,077 + 106,996 + 107,076).

Now we start the consumers (no producers this time). Finally, when the consumers finish, we get the stats again. The consumers claim they received and acknowledged 428,126 messages, which is corroborated by the broker's MESSAGES_ACKED column.
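
Each consumer is essentially the following loop; again a simplified, hypothetical sketch with a placeholder URL and a single queue, and a transacted commit standing in for whatever acknowledgement mode the real test code uses:

import javax.jms.*;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class DurabilityConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://broker-vm:61616");
        try (Connection conn = cf.createConnection()) {
            conn.start();
            // Transacted session; the commit after each receive acts as the acknowledgement
            Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("testQueue1"));
            long received = 0;
            // Stop after 30 seconds of silence, i.e. once the queue is drained
            while (consumer.receive(30_000) != null) {
                received++;
                session.commit();
            }
            System.out.println("Received and acknowledged " + received + " messages");
        }
    }
}

The broker's stats after the consumers finished: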

 

|NAME                     |ADDRESS                  |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
|testQueue1               |testQueue1               |0              |0             |106988         |0                |106984         |
|testQueue2               |testQueue2               |0              |0             |107077         |0                |107074         |
|testQueue3               |testQueue3               |0              |0             |106996         |0                |106992         |
|testQueue4               |testQueue4               |0              |0             |107076         |0                |107076         |

 

You can clearly see some apparent anomalies:

1) Post failure, and upon resumption, the broker said it had 428,137 messages across the test queues, all combined (column MESSAGES_ADDED).

2) When the consumers ran, they received 428,126 messages and acknowledged all of them. That is 11 short of 428,137.

3) Upon the consumers' completion, the broker reported a queue depth of 0, but also said it received acknowledgements for only 428,126 messages (column MESSAGES_ACKED).

 

Questions:

1) If we assume the 'MESSAGES_ADDED' column is accurate, then what happened to the additional 11 messages that the consumers never received and, as a result, never acknowledged?

2) If, according to the broker, the number of acknowledged messages is 11 less than the number of messages added to the queues, why did it declare the queues empty when 11 of the messages were never acknowledged?

3) If we trust the 'MESSAGES_ADDED' stat as the baseline, then the system lost messages. And if we do not trust that statistic, then what do we trust, and how do we know whether messages were lost?

 

The system ran into this issue in 3 out of the 4 times I ran the VM power-failure test (with slightly different statistics each time, of course). We are very concerned that this is a symptom of message loss in the system, and we are also unsure how to explain the anomalies. We would greatly appreciate any pointers that can help us understand and address the underlying issue.

 

Thanks,

Anindya Haldar

Oracle Marketing Cloud

 
