It sounds like your switch fabric might be the issue?
Those types of hangs should show pretty frequent kernel alarms.
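
For reference, here is a minimal sketch of what the failover transport Christian mentions below looks like on the client side. It only assembles the broker URL. The host names, the port, and the class and method names (`FailoverUrl`, `build`) are made up for illustration, and `initialReconnectDelay` / `maxReconnectDelay` are failover options as I remember them from the ActiveMQ docs, so please check them against the version you run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class FailoverUrl {

    // Builds "failover:(uri1,uri2,...)?initialReconnectDelay=..&maxReconnectDelay=.."
    // from a list of broker URIs. Delays are in milliseconds.
    static String build(List<String> brokerUris, long initialDelayMs, long maxDelayMs) {
        StringJoiner uris = new StringJoiner(",", "failover:(", ")");
        for (String uri : brokerUris) {
            uris.add(uri);
        }
        return uris.toString()
                + "?initialReconnectDelay=" + initialDelayMs
                + "&maxReconnectDelay=" + maxDelayMs;
    }

    public static void main(String[] args) {
        // Hypothetical broker hosts; substitute your own.
        String url = build(
                Arrays.asList("tcp://broker1:61616", "tcp://broker2:61616"),
                1000, 30000);
        System.out.println(url);
        // With the ActiveMQ client on the classpath, this string would be
        // handed to the connection factory in place of the plain tcp:// URL.
    }
}
```

With a URL like this the client library should retry the connection itself, for consumers as well as producers, rather than leaving a consumer waiting on a dead socket. If you stay on a plain tcp:// URL, the standard JMS hook for noticing a lost connection asynchronously is `Connection.setExceptionListener(...)`.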
On Jun 2, 2013, at 21:10, Christian Posta <christian.posta@gmail.com> wrote:
> You should check out the failover transport to handle reconnecting.
>
> On Sunday, June 2, 2013, fenbers wrote:
>
>>
>> I don't know how to determine the NFS version, but we are running on
>> RHEL 5.5.
>>
>> I have not checked the syslog. Thanks for the tip. I will do that
>> after our morning Operations.
>>
>> We are also very inclined to believe this is an NFS issue, based on
>> behaviors network-wide which have nothing to do with ActiveMQ, e.g.,
>> often taking 10 seconds to list just 5 files in an NFS-mounted
>> directory.
>>
>> So, we are creating an action plan this weekend to eliminate as many
>> NFS mount points as possible, and seeing how that helps the
>> situation. The plan needs approval/buy-in from key people, so it may
>> be a couple of weeks before it is implemented.
>>
>> In the meantime, ActiveMQ either shuts itself down or behaves in
>> rather despondent ways, so we find we are having to restart ActiveMQ
>> every 3 or 4 hours (and this frequency is slowly increasing).
>>
>> Once ActiveMQ is restarted, we find that our producers and our
>> consumers have to be shut down and relaunched in order to
>> reestablish the connection with ActiveMQ. This is a royal pain!
>>
>> However, a producer will throw an exception whenever it tries to
>> send a message through a lost connection, so I catch the exception,
>> close the connection, and reopen it. Thus, my producers are able to
>> reconnect automatically in the event ActiveMQ is restarted.
>>
>> But with the consumers, no exception is thrown while they wait for
>> notifications. A consumer simply waits for a notification that never
>> arrives after the connection with ActiveMQ is lost. So what is your
>> recommended method for a consumer to check for a disconnection?
>> (Maybe I should post this question as a separate thread...)
>>
>> Mark
>>
>>
>> On 5/29/2013 3:21 AM, rajdavies [via ActiveMQ] wrote:
>>
>> Ultimately I'm pretty confident this problem is an NFS problem, and
>> as Johan has already let the cat out of the bag ;) let me ask the
>> following:
>>
>> Which version of NFS 4 are you using and which environment?
>>
>> Have you checked the system logs for NFS errors on all the
>> machines running ActiveMQ brokers?
>>
>> thanks,
>>
>> Rob
>>
>> On 29 May 2013, at 00:46, Christian Posta < [hidden email] > wrote:
>>
>>
>> > I can make two recommendations.
>> >
>> > #1, being the preferred, create a test case that shows this... that
>> > will give us the best chance of finding out what's going on... take
>> > a look at the following test cases in the activemq source code to
>> > give you an idea about how to go about doing it...
>> >
>> > http://svn.apache.org/viewvc/activemq/trunk/activemqunittests/src/test/java/org/apache/activemq/usecases/
>> >
>> > http://svn.apache.org/viewvc/activemq/trunk/activemqunittests/src/test/java/org/apache/activemq/bugs/
>> >
>> > http://svn.apache.org/viewvc/activemq/trunk/activemqunittests/src/test/java/org/apache/activemq/test/JmsTopicSendReceiveTest.java?view=markup
>> >
>> > #2, if creating a test case doesn't sound like something you want
>> > to get into... I guess, give us the exact configs of broker,
>> > clients, number of consumers, number of topics, message sizes, etc.,
>> > all details, and if one of us gets the urge we can try it out on our
>> > boxes. This will not be nearly as good as #1, and will provide a
>> > higher barrier to entry because we spend our spare time doing this
>> > and like to spend that time debugging and fixing, and not setting
>> > up environments and use cases which may not even show a bug :)
>> >
>> > On Tue, May 28, 2013 at 4:34 PM, fenbers < [hidden email] > wrote:
>>
>> >
>> >>
>> >> I'm getting the Sync exception on both, local and NFS.
>> >> Originally, I was only using a local disk, but there wasn't much
>> >> disk space for the ever growing list of 33MB enumerated .log files
>> >> that weren't cleaned up. So I reconfigured ActiveMQ to put these
>> >> db files on an NFS mount. But the sync exceptions occurred either
>> >> way.
>> >>
>> >> I've changed *all* my consumers to AUTO_ACKNOWLEDGE, thinking that
>> >> maybe an ACKNOWLEDGEment leak was causing the undeleted files.
>> >> That didn't help... The TRACE level logging points to only two of
>> >> my 5 topics that accumulate these undeleted db files. So I've
>> >> concentrated my scrutiny on consumers of these two topics, but
>> >> have not found anything out of the ordinary.
>> >>
>> >> What is puzzling me still is that the frequency of the log file
>> >> buildup and the frequency of exceptions continue to increase even
>> >> though the number of messages sent per day by the producers
>> >> remains nearly constant...
>> >>
>> >> Mark
>>
>> >>
>> >> On 5/28/2013 6:06 PM, ceposta [via ActiveMQ] wrote:
>> >>
>> >> Sounds like there are multiple issues...
>> >>
>> >> Your journal files aren't being cleaned up, AND you're getting
>> >> the Sync exception?
>> >>
>> >> You get the sync exception on local disk mount? Or just NFS?
>> >>
>> >> If the journals aren't being cleaned up, are your consumers
>> >> properly ack'ing messages?
>> >>
>> >> On Tue, May 28, 2013 at 2:42 PM, fenbers < [hidden email] > wrote:
>>
>> >>
>> >> > I would LOVE to help you help me! But I have no idea how to go
>> >> > about making a test case. If you could drop some hints in this
>> >> > regard, I might be able to produce one.
>> >> >
>> >> > My ActiveMQ issues seem to be related to network slowness, which
>> >> > we are diagnosing separately. Or maybe it is the other way
>> >> > around, where ActiveMQ problems are causing network
>> >> > sluggishness. Either way, there seems to be a correlation,
>> >> > except that when network responsiveness improves, ActiveMQ does
>> >> > not.
>> >> >
>> >> > The problem I'm having with AMQ is progressive, which is even
>> >> > more puzzling, because we are not adding to the number of
>> >> > messages that AMQ has to handle. Today, we were up to 191
>> >> > undeleted dbNNN.log files in the database directory before I
>> >> > stopped AMQ and deleted them. NNN was up to 451, so 260 files
>> >> > had been cleaned up by AMQ's automatic processes...
>> >> >
>> >> > Will log files assist you in helping me? I have TRACE level
>> >> > messages turned on, so they are quite large.
>> >>
>
>
>
> 
> *Christian Posta*
> http://www.christianposta.com/blog
> twitter: @christianposta
