qpid-users mailing list archives

From Ramayan Tiwari <ramayan.tiw...@gmail.com>
Subject Re: Java broker OOM due to DirectMemory
Date Thu, 20 Apr 2017 18:20:32 GMT
Hi Keith,

Thanks so much for your response and digging into the issue. Below are the
answers to your questions:

1) Yes, we are using the QPID-7462 patch with 6.0.5. We couldn't move to 6.1,
where it was released, because we need JMX support. Here is the destination
format:
"%s ; {node : { type : queue }, link : { x-subscribes : { arguments : {
x-multiqueue : [%s], x-pull-only : true }}}}"
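For context, here is a minimal sketch of how such an address string gets used
with the 0.16 client (the connection URL, client id, subscription and queue
names below are illustrative placeholders, not our actual code):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import org.apache.qpid.client.AMQAnyDestination;
    import org.apache.qpid.client.AMQConnectionFactory;

    public class MultiQueuePullConsumerSketch {
        public static void main(String[] args) throws Exception {
            // Illustrative connection URL for the 0.16 (AMQP 0-10) client.
            ConnectionFactory factory = new AMQConnectionFactory(
                    "amqp://guest:guest@clientid/test?brokerlist='tcp://broker-host:5672'");
            Connection connection = factory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Fill the two %s placeholders: subscription name and the queue list.
            String address = String.format(
                    "%s ; {node : { type : queue }, link : { x-subscribes : "
                            + "{ arguments : { x-multiqueue : [%s], x-pull-only : true }}}}",
                    "multi-consumer-1", "'queue-a', 'queue-b'");

            MessageConsumer consumer = session.createConsumer(new AMQAnyDestination(address));
            connection.start();
        }
    }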

2) Our machines have 40 cores, which makes the pool size 80 threads. This is
probably not the issue, because it would show up in the baseline DM
allocation, which is only about 6% (of 4GB) right after we bring up the
broker (see the rough arithmetic after point 6 below).

3) The only DM-related setting we have tuned is flowToDiskThreshold, which is
currently set at 80%.
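(If I understand the configuration right, this maps to the
broker.flowToDiskThreshold context variable, expressed in bytes: 80% of our
4GB direct memory limit is 3435973836 bytes, i.e. the 3355443KB threshold that
shows up in the BRK-1014/BRK-1015 log lines further down, and it could be
overridden with something like -Dbroker.flowToDiskThreshold=<bytes>.)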

4) Only one virtual host in the broker.

5) Most of our queues (99%) are priority queues; we also have 8-10 sorted
queues.

6) Yes, we are using the standard 0.16 client and no AMQP 1.0 clients. The
connection log line looks like:
CON-1001 : Open : Destination : AMQP(IP:5672) : Protocol Version : 0-10 :
Client ID : test : Client Version : 0.16 : Client Product : qpid
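As a rough cross-check on point 2: using the 256KB per virtualhost IO thread
you mention, plus the 512KB per plain connection Alex quoted earlier, we get
80 threads x 256KB ≈ 20MB and ~410 connections x 512KB ≈ 205MB, i.e. roughly
225MB in total, which lines up with the ~230MB (~6% of 4GB) baseline we see
right after startup.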

We had another broker crash about an hour ago, and we see the same patterns:
1) There is a queue which is constantly growing; enqueues have been outpacing
dequeues on that queue for a long period of time.
2) Flow to disk didn't kick in at all.

This graph shows memory growth (red line - heap, blue - DM allocated,
yellow - DM used)
https://drive.google.com/file/d/0Bwi0MEV3srPRdVhXdTBncHJLY2c/view?usp=sharing

The graph below shows growth on a single queue (there are 10-12 other queues
with traffic as well, some larger than this queue):
https://drive.google.com/file/d/0Bwi0MEV3srPRWmNGbDNGUkJhQ0U/view?usp=sharing

A couple of questions:
1) Is there any developer-level doc/design spec on how Qpid uses DM?
2) We are not getting heap dumps automatically when the broker crashes due to
DM OOM (-XX:+HeapDumpOnOutOfMemoryError is not respected). Has anyone found a
way to work around this?
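One workaround we are looking at (a sketch only, not something we run in
production): since the direct-memory OOM is thrown from Java code in
java.nio.Bits rather than by the VM itself, my understanding is that
-XX:+HeapDumpOnOutOfMemoryError never fires for it, but a dump can still be
requested programmatically through the HotSpot diagnostic MXBean, e.g. from an
external JMX client or an uncaught-exception handler (the file path below is a
placeholder):

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class DirectOomHeapDump {
        // Requests an .hprof heap dump on demand via the HotSpot diagnostic MXBean.
        public static void dumpHeap(String outputFile) throws Exception {
            HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // live = false so unreachable objects are included as well.
            diagnostic.dumpHeap(outputFile, false);
        }

        public static void main(String[] args) throws Exception {
            dumpHeap("/tmp/broker-direct-oom.hprof");
        }
    }

The dump would only show the heap-side DirectByteBuffer/QpidByteBuffer objects
rather than the native memory itself, but that should be enough to see what is
holding references to them.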

Thanks
Ramayan

On Thu, Apr 20, 2017 at 9:08 AM, Keith W <keith.wall@gmail.com> wrote:

> Hi Ramayan
>
> We have been discussing your problem here and have a couple of questions.
>
> I have been experimenting with use-cases based on your descriptions
> above, but so far, have been unsuccessful in reproducing a
> "java.lang.OutOfMemoryError: Direct buffer memory"  condition. The
> direct memory usage reflects the expected model: it levels off when
> the flow to disk threshold is reached, and direct memory is released as
> messages are consumed until the minimum size for caching of direct
> memory is reached.
>
> 1] For clarity let me check: we believe when you say "patch to use
> MultiQueueConsumer" you are referring to the patch attached to
> QPID-7462 "Add experimental "pull" consumers to the broker"  and you
> are using a combination of this "x-pull-only"  with the standard
> "x-multiqueue" feature.  Is this correct?
>
> 2] One idea we had here relates to the size of the virtualhost IO
> pool.   As you know from the documentation, the Broker caches/reuses
> direct memory internally but the documentation fails to mention that
> each pooled virtualhost IO thread also grabs a chunk (256K) of direct
> memory from this cache.  By default the virtual host IO pool is sized
> Math.max(Runtime.getRuntime().availableProcessors() * 2, 64), so if
> you have a machine with a very large number of cores, you may have a
> surprisingly large amount of direct memory assigned to virtualhost IO
> threads.   Check the value of connectionThreadPoolSize on the
> virtualhost (http://<server>:<port>/api/latest/virtualhost/<virtualhostn
> odename>/<virtualhostname>)
> to see what value is in force.  What is it?  It is possible to tune
> the pool size using context variable
> virtualhost.connectionThreadPool.size.
>
> 3] Tell me if you are tuning the Broker in any way beyond the direct/heap
> memory settings you have told us about already.  For instance, are you
> changing any of the direct memory pooling settings
> (broker.directByteBufferPoolSize), the default network buffer size
> (qpid.broker.networkBufferSize) or applying any other non-standard
> settings?
>
> 4] How many virtual hosts do you have on the Broker?
>
> 5] What is the consumption pattern of the messages?  Do you consume in
> a strictly FIFO fashion, or are you making use of message selectors
> and/or any of the out-of-order queue types (LVQs, priority queues or
> sorted queues)?
>
> 6] Is it just the 0.16 client involved in the application?   Can I
> check that you are not using any of the AMQP 1.0 clients
> (org.apache.qpid:qpid-jms-client or
> org.apache.qpid:qpid-amqp-1-0-client) in the software stack (as either
> consumers or producers)?
>
> Hopefully the answers to these questions will get us closer to a
> reproduction.   If you are able to reliably reproduce it, please share
> the steps with us.
>
> Kind regards, Keith.
>
>
> On 20 April 2017 at 10:21, Ramayan Tiwari <ramayan.tiwari@gmail.com>
> wrote:
> > After a lot of log mining, we might have a way to explain the sustained
> > increase in DirectMemory allocation; the correlation seems to be with the
> > growth in the size of a queue that is getting consumed, but at a much
> > slower rate than producers are putting messages on it.
> >
> > The pattern we see is that in each instance of a broker crash, there is
> > at least one queue (usually just one) whose size kept growing steadily.
> > It'd be of significant size but not the largest queue -- usually there are
> > multiple larger queues -- but it was different from the other queues in
> > that its size was growing steadily. The queue would also be moving, but
> > its processing rate was not keeping up with the enqueue rate.
> >
> > Our theory, which might be totally wrong: if a queue is moving the entire
> > time, maybe the broker keeps reusing the same buffer in direct memory for
> > the queue, and keeps appending to the end of it to accommodate new
> > messages. But because it's active all the time and we're pointing to the
> > same buffer, space allocated for messages at the head of the queue/buffer
> > doesn't get reclaimed, even long after those messages have been processed.
> > Just a theory.
> >
> > We are also trying to reproduce this using some perf tests that enqueue
> > with the same pattern; we will update with the findings.
> >
> > Thanks
> > Ramayan
> >
> > On Wed, Apr 19, 2017 at 6:52 PM, Ramayan Tiwari <ramayan.tiwari@gmail.com>
> > wrote:
> >
> >> Another issue that we noticed is that when the broker goes OOM due to
> >> direct memory, it doesn't create a heap dump (requested via
> >> "-XX:+HeapDumpOnOutOfMemoryError"), even though the OOM error is the same
> >> "java.lang.OutOfMemoryError" mentioned in the Oracle JVM docs.
> >>
> >> Has anyone been able to find a way to get a heap dump for a DM OOM?
> >>
> >> - Ramayan
> >>
> >> On Wed, Apr 19, 2017 at 11:21 AM, Ramayan Tiwari <ramayan.tiwari@gmail.com>
> >> wrote:
> >>
> >>> Alex,
> >>>
> >>> Below are the flow to disk logs from a broker having 3 million+ messages
> >>> at this time. We only have one virtual host. Time is in GMT. It looks
> >>> like flow to disk is active at the whole virtual host level and not at
> >>> the queue level.
> >>>
> >>> When the same broker went OOM yesterday, I did not see any flow to disk
> >>> logs from when it was started until it crashed (it crashed twice within
> >>> 4 hrs).
> >>>
> >>>
> >>> 4/19/17 4:17:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356539KB exceeds threshold 3355443KB
> >>> 4/19/17 2:31:13.502 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3354866KB within threshold 3355443KB
> >>> 4/19/17 2:28:43.511 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358509KB exceeds threshold 3355443KB
> >>> 4/19/17 2:20:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353501KB within threshold 3355443KB
> >>> 4/19/17 2:18:13.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357544KB exceeds threshold 3355443KB
> >>> 4/19/17 2:08:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353236KB within threshold 3355443KB
> >>> 4/19/17 2:08:13.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3356704KB exceeds threshold 3355443KB
> >>> 4/19/17 2:00:43.500 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3353511KB within threshold 3355443KB
> >>> 4/19/17 2:00:13.504 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3357948KB exceeds threshold 3355443KB
> >>> 4/19/17 1:50:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355310KB within threshold 3355443KB
> >>> 4/19/17 1:47:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3365624KB exceeds threshold 3355443KB
> >>> 4/19/17 1:43:43.501 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1015 : Message flow to disk inactive : Message memory use 3355136KB within threshold 3355443KB
> >>> 4/19/17 1:31:43.509 AM INFO [Housekeeping[test]] - [Housekeeping[test]] BRK-1014 : Message flow to disk active : Message memory use 3358683KB exceeds threshold 3355443KB
> >>>
> >>>
> >>> After the production release (2 days back), we have seen 4 crashes in 3
> >>> different brokers; this is the most pressing concern for us in deciding
> >>> whether we should roll back to 0.32. Any help is greatly appreciated.
> >>>
> >>> Thanks
> >>> Ramayan
> >>>
> >>> On Wed, Apr 19, 2017 at 9:36 AM, Oleksandr Rudyy <orudyy@gmail.com>
> >>> wrote:
> >>>
> >>>> Ramayan,
> >>>> Thanks for the details. I would like to clarify: was flow to disk
> >>>> triggered today for the 3 million messages?
> >>>>
> >>>> The following logs are issued for flow to disk:
> >>>> BRK-1014 : Message flow to disk active :  Message memory use
> >>>> {0,number,#}KB
> >>>> exceeds threshold {1,number,#.##}KB
> >>>> BRK-1015 : Message flow to disk inactive : Message memory use
> >>>> {0,number,#}KB within threshold {1,number,#.##}KB
> >>>>
> >>>> Kind Regards,
> >>>> Alex
> >>>>
> >>>>
> >>>> On 19 April 2017 at 17:10, Ramayan Tiwari <ramayan.tiwari@gmail.com>
> >>>> wrote:
> >>>>
> >>>> > Hi Alex,
> >>>> >
> >>>> > Thanks for your response, here are the details:
> >>>> >
> >>>> > We use "direct" exchange, without persistence (we specify
> >>>> NON_PERSISTENT
> >>>> > that while sending from client) and use BDB store. We use JSON
> virtual
> >>>> host
> >>>> > type. We are not using SSL.
> >>>> >
> >>>> > When the broker went OOM, we had around 1.3 million messages with a
> >>>> > 100-byte average message size. Direct memory allocation (value read
> >>>> > from the MBean) kept going up, even though it shouldn't need more DM
> >>>> > to store that many messages. DM allocated stayed at 99% for about
> >>>> > three and a half hours before crashing.
> >>>> >
> >>>> > Today, on the same broker, we have 3 million messages (same message
> >>>> > size) and DM allocated is only at 8%. It seems like there is some
> >>>> > issue with de-allocation, or a leak.
> >>>> >
> >>>> > I have uploaded the memory utilization graph here:
> >>>> > https://drive.google.com/file/d/0Bwi0MEV3srPRVHFEbDlIYUpLaUE/view?usp=sharing
> >>>> > The blue line is DM allocated, yellow is DM used (sum of queue
> >>>> > payload) and red is heap usage.
> >>>> >
> >>>> > Thanks
> >>>> > Ramayan
> >>>> >
> >>>> > On Wed, Apr 19, 2017 at 4:10 AM, Oleksandr Rudyy <orudyy@gmail.com>
> >>>> wrote:
> >>>> >
> >>>> > > Hi Ramayan,
> >>>> > >
> >>>> > > Could you please share with us the details of the messaging use
> >>>> > > case(s) which ended up in OOM on the broker side? I would like to
> >>>> > > reproduce the issue on my local broker in order to fix it. I would
> >>>> > > appreciate it if you could provide as many details as possible,
> >>>> > > including messaging topology, message persistence type, message
> >>>> > > sizes, volumes, etc.
> >>>> > >
> >>>> > > Qpid Broker 6.0.x uses direct memory for keeping message content and
> >>>> > > receiving/sending data. Each plain connection utilizes 512K of
> >>>> > > direct memory. Each SSL connection uses 1M of direct memory. Your
> >>>> > > memory settings look OK to me.
> >>>> > >
> >>>> > > Kind Regards,
> >>>> > > Alex
> >>>> > >
> >>>> > >
> >>>> > > On 18 April 2017 at 23:39, Ramayan Tiwari <ramayan.tiwari@gmail.com>
> >>>> > > wrote:
> >>>> > >
> >>>> > > > Hi All,
> >>>> > > >
> >>>> > > > We are using Java broker 6.0.5, with a patch to use the
> >>>> > > > MultiQueueConsumer feature. We just finished deploying to
> >>>> > > > production and saw a couple of instances of broker OOM due to
> >>>> > > > running out of DirectMemory buffers (exceptions at the end of
> >>>> > > > this email).
> >>>> > > >
> >>>> > > > Here is our setup:
> >>>> > > > 1. Max heap 12g, max direct memory 4g (this is the opposite of
> >>>> > > > the recommendation; however, for our use case the message payload
> >>>> > > > is really small, ~400 bytes, which is way less than the
> >>>> > > > per-message overhead of 1KB). In perf testing, we were able to
> >>>> > > > put 2 million messages without any issues.
> >>>> > > > 2. ~400 connections to the broker.
> >>>> > > > 3. Each connection has 20 sessions and there is one multi-queue
> >>>> > > > consumer attached to each session, listening to around 1000
> >>>> > > > queues.
> >>>> > > > 4. We are still using the 0.16 client (I know).
> >>>> > > >
> >>>> > > > With the above setup, the baseline utilization (without any
> >>>> > > > messages) for direct memory was around 230mb (with 410
> >>>> > > > connections each taking 500KB).
> >>>> > > >
> >>>> > > > Based on our understanding of broker memory allocation, message
> >>>> > > > payload should be the only thing adding to direct memory
> >>>> > > > utilization (on top of the baseline); however, we are
> >>>> > > > experiencing something completely different. In our last broker
> >>>> > > > crash, we saw that the broker was constantly running with 90%+
> >>>> > > > direct memory allocated, even when the message payload sum across
> >>>> > > > all the queues was only 6-8% (these percentages are against the
> >>>> > > > available DM of 4gb). During these high DM usage periods, heap
> >>>> > > > usage was around 60% (of 12gb).
> >>>> > > >
> >>>> > > > We would like some help in understanding what could be the reason
> >>>> > > > for these high DM allocations. Are there things other than
> >>>> > > > message payload and AMQP connections which use DM and could be
> >>>> > > > contributing to this high usage?
> >>>> > > >
> >>>> > > > Another thing that puzzles us is the de-allocation of DM byte
> >>>> > > > buffers. From log mining of heap and DM utilization,
> >>>> > > > de-allocation of DM doesn't correlate with heap GC. If anyone has
> >>>> > > > seen any documentation related to this, it would be very helpful
> >>>> > > > if you could share it.
> >>>> > > >
> >>>> > > > Thanks
> >>>> > > > Ramayan
> >>>> > > >
> >>>> > > >
> >>>> > > > *Exceptions*
> >>>> > > >
> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> >>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.restoreApplicationBufferForWrite(NonBlockingConnectionPlainDelegate.java:93) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:60) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:506) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:285) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:124) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:504) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:337) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:87) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> >>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> >>>> > > >
> >>>> > > >
> >>>> > > > *Second exception*
> >>>> > > > java.lang.OutOfMemoryError: Direct buffer memory
> >>>> > > > at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.8.0_40]
> >>>> > > > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_40]
> >>>> > > > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_40]
> >>>> > > > at org.apache.qpid.bytebuffer.QpidByteBuffer.allocateDirect(QpidByteBuffer.java:474) ~[qpid-common-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.<init>(NonBlockingConnectionPlainDelegate.java:45) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.setTransportEncryption(NonBlockingConnection.java:625) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingConnection.<init>(NonBlockingConnection.java:117) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.NonBlockingNetworkTransport.acceptSocketChannel(NonBlockingNetworkTransport.java:158) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.SelectorThread$SelectionTask$1.run(SelectorThread.java:191) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:462) ~[qpid-broker-core-6.0.5.jar:6.0.5]
> >>>> > > > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_40]
> >>>> > > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_40]
> >>>> > > > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> >>>> > > >
> >>>> > >
> >>>> >
> >>>>
> >>>
> >>>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
> For additional commands, e-mail: users-help@qpid.apache.org
>
>
