qpid-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tom M <td.hom...@gmail.com>
Subject Re: problem with qpid heartbeats when sending msgs with size over 1KB
Date Sat, 07 Jan 2012 01:26:21 GMT
I’ve created a simpler test client (based on our deployed application) to
test this problem.

It opens a connection (with connSettings.heartbeat = 8) and session and
then starts sending messages (once a second), using
MessageReplayTracker::send(
).  The size of the message is selectable by the user (messages of  6 bytes
or 1600bytes).

After about 30 seconds, I pulled the network cable.

* If sending msgs of only a 6 bytes, a qpid::Exception is thrown indicating
the connection is closed (about 16 seconds after the cable was pulled).

* If sending msgs of 1600 bytes, the client continued sending without
detecting the connection closed (continued sending for about another
88seconds, then apparently stopped due to an out queue full, but never
reporting an exception during this time).   From previous testing, it does
eventually timeout (probably from the TCP timeout) and throws an exception,
but not until 6 to 20mins later.



I’ve attached the results from 4 runs (for each one, the network connection
pulled at 30seconds into the xfers).

Two runs sending msgs of only a 6bytes (good cases) and 2 runs sending msgs
of 1600bytes (problem cases).

For each, I’m including the logged output and pstack dumps.

Note in particular trace logging for the problem case: 01_03d.   It may be
notable that the Frame SEND’s stopped before the heartbeat timeout should
have occurred (see details below).



To help look for differences, I took pstack dumps at several points in time
during the runs (included sequentially in the pstack.*.log files for each
run, time is printed just above each block).  Primarily:

   * pstack before connection pulled (20 seconds into xfers)

   * pstack 10 seconds after connection pulled (a few seconds prior to 2x
heartbeat rate)

   * for cases sending 1600byte msgs, several more pstack’s after the point
that heartbeat timeout should have occurred (after pulled connection, by:  20,
30, 40 seconds)

        Then one more pstack after the sends stopped (due to output full).



For each of the msg sizes, I did the first run with default logging (to
show app logging) and then the second with QPID_LOG set to “trace+”



Details about the 4 runs:

* 01_03a:   sent msg’s size =   6 bytes  (successful case,  app only
logging)

       about 06:20:29 cable pulled at (about at sent msg:  30)

       06:20:45 see qpid::Exception  Connection closed

       Only app logging:     qpidConnTst.msgSmall.01_03a.log

       pstack.msgSmall.01_03a.log



* 01_03b:   sent msg’s size =   6 bytes  (successful case,  trace logging)

          06:55:55  last  Heartbeat RECV & SENT

       about 06:56:01  cable pulled at (about at sent msg:  30)

       06:56:17  see qpid::Exception  Connection closed

       Trace  logging:     qpidConnTst.msgSmall.01_03b.log

       pstack.msgSmall.01_03b.log



* 01_03c:   sent msg’s size = 1600 bytes  (problem case,  app only logging)

       about 07:06:15  cable pulled at (about at sent msg:  30)

         Note:  client continued sending without receiving an exception

            Continued to send until  msg:  118  (additional 88 seconds
after cable pulled),

                then blocked.

            I terminated more than 4 mins later, still had not received
exception



* 01_03d:   sent msg’s size = 1600 bytes  (problem case,  app only logging)

          07:18:27 last  Heartbeat RECV & SENT

       about 07:18:28  cable pulled at (about at sent msg:  30)

         Note:  client continued sending 11 more msgs (11 more seconds).

            Then the client app continued to make calls
MessageReplayTracker::send(
)

                 without receiving an exception, but the traces stopped for
Frame SENT’s

                 NOTE:  this is before the  heartbeat timeout should have
occurred.

         07:18:43 debug Traffic timeout, but still no exception caught by
app.

         The app continued to make calls  MessageReplayTracker::send( )

                 without receiving an exception.

            Continued to send until  msg:  118  (additional 88 seconds
after cable pulled),

                then app was blocked.

            I terminated more than 4 mins later, still had not received
exception





Some additional notes about this client:

* This client (as with some of the instances in our deployed application)
is a message sender only.

       * since not intending on receiving messages on this client, we
did NOT derive this client from qpid::client::MessageListener

   * sending using:   qpid::client::MessageReplayTracker::send( )

   * using async sending (for speed), not calling  Session::sync( ).

        However, I tried including adding  sync’s, and found that, when
sending the 1600 byte msgs, the client would only block on the sync and
still not get an exception for the heartbeat timeouts.

    * I added a call to  Connection::isOpen( ) in the loop to see if this
would trigger the exception, but it did not.

+ we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on both
client and broker hosts) .   In our deployed systems, we will not be able
to upgrade qpid versions for at least several months.

I will post the test code in another message.

Please let us know what you can find.

Thanks,

Tom



On Thu, Dec 8, 2011 at 7:37 AM, Tom M <td.home21@gmail.com> wrote:

> Gordon,
> thank you for your response.
>
> Yes, the test with larger message has failed every time.
> I will look at sending the trace logs and a thread dump from the client
> from our system.
>
> thanks,
> Tom Maggio
> Raytheon Co.
> Dallas, TX
> 972/205-4377
>
>   On Wed, Dec 7, 2011 at 8:37 AM, Gordon Sim <gsim@redhat.com> wrote:
>
>> On 12/07/2011 01:55 AM, Tom M wrote:
>>
>>> Hello,
>>>
>>> we are having a problem with our MRG (qpid) system:
>>>
>>> * when sending messages with size of 1600bytes, a connection (used for
>>> sending from client) does not detect the host connection is lost via
>>> heartbeat timeout.
>>>
>>> + we are using C++ qpid client 0.7 and qpidd 0.7 (linux 2.6 x86_64 on
>>> both
>>> client and broker hosts)
>>>
>>> and Ethernet connection (TCP/IP) between hosts
>>>
>>>     + for this connection we have: ConnectionSettings
>>> connectionSettings.heartbeat = 8
>>>
>>>     + simulating a system failure by pulling the ethernet cable to the
>>> broker host
>>>
>>>     + the connection close Exception is caught by the client after many
>>> minutes (6 to 20mins), I'm guessing this is due to the TCP timeout and
>>> not
>>> the missed heartbeats.
>>>
>>>     + with the same exact application (for our client), if sending
>>> messages
>>> of 200bytes, we do get the qpid exception indicating the Connection
>>> closed
>>> (catch TransportFailure Exception: connection closed) within 16 seconds.
>>> For this testing, there were no other changes between the 2 cases, other
>>> than the size of the messages sent from the client (only expanded the
>>> size
>>> of the string in the body of the message) (1 message sent per second in
>>> both cases).
>>>
>>> * is this a known problem with qpid 0.7?
>>>
>>
>> No, i don't think this is a known issue.
>>
>>
>> * is there patch to fix this for qpid 0.7?
>>>
>>> * has this problem already been fixed in later releases?
>>>
>>> NOTE: we have already deployed qpid 0.7 in our system, and we will not be
>>> able to upgrade to a newer full release for many months.
>>>
>>> I'm wondering if the problem is that the connection gets blocked with the
>>> first TCP packet of a multiple packet message, such that the heartbeat
>>> detection is disabled until the full message is sent. But, if the
>>> multi-packet message can not complete (since socket is broken), the
>>> heartbeat logic is held disabled until the multi-packet message can
>>> complete (which in this case it can not).
>>>
>>
>> There is nothing that directly (intentionally) does anything like this.
>> However it may be possible that there is some deadlock or liveness issue
>> that prevents correct function in some cases.
>>
>> Is the test always failing with the larger message size? There is
>> actually no difference in the AMQP framing for a 200 byte v a 1600 byte
>> message. It may just be that the different timing of the larger write
>> somehow triggers the issue.
>>
>> Can you get trace level logs and a thread dump from the client for a
>> failed case?
>>
>> ------------------------------**------------------------------**---------
>> Apache Qpid - AMQP Messaging Implementation
>> Project:      http://qpid.apache.org
>> Use/Interact: mailto:users-subscribe@qpid.**apache.org<users-subscribe@qpid.apache.org>
>>
>>
>

Mime
View raw message