incubator-cassandra-user mailing list archives

From Joel Samuelsson <samuelsson.j...@gmail.com>
Subject Re: Intermittent long application pauses on nodes
Date Thu, 20 Feb 2014 15:29:01 GMT
Hi Frank,

We got a (quite) long GC pause today on 2.0.5:
 INFO [ScheduledTasks:1] 2014-02-20 13:51:14,528 GCInspector.java (line 116) GC for ParNew: 1627 ms for 1 collections, 425562984 used; max is 4253024256
 INFO [ScheduledTasks:1] 2014-02-20 13:51:14,542 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 3703 ms for 2 collections, 434394920 used; max is 4253024256

Unfortunately it's a production cluster, so I don't have any additional GC
logging enabled. This may be an indication that upgrading is not the
(complete) solution.
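
For reference, a minimal sketch of the GC logging I would enable if this
weren't production, assuming the stock cassandra-env.sh (the log path is
just an example):

    # log every collection plus the total stopped time per pause
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"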

Regards,
Joel


2014-02-17 13:41 GMT+01:00 Benedict Elliott Smith <
belliottsmith@datastax.com>:

> Hi Ondrej,
>
> It's possible you were hit by the problems in this thread before, but it
> looks like you may have other issues as well. Of course it may be that on
> G1 you have one issue and on CMS another, but 27s is extreme even for G1,
> so it seems unlikely. If you're hitting these pause times on CMS and you
> get some more output from the safepoint tracing, please do contribute it,
> as I would love to get to the bottom of that. However, is it possible
> you're experiencing paging activity? Have you made certain the VM memory
> is locked (and preferably that paging is entirely disabled, as the bloom
> filters and other memory won't be locked, although that shouldn't cause
> pauses during GC)?
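>
> A minimal sketch of what I mean, assuming a Linux host and the stock
> packaging (names and paths are illustrative):
>
>     # disable paging entirely
>     sudo swapoff --all            # and remove swap entries from /etc/fstab
>
>     # let the JVM mlock its memory; picked up via JNA at startup
>     # in /etc/security/limits.conf:
>     #     cassandra - memlock unlimited
>
> After a restart, the Cassandra system log should contain the JNA mlockall
> line confirming the heap was actually locked.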
>
> Note that mmapped file accesses and other native work shouldn't in any way
> inhibit GC activity or other safepoint pause times, unless there's a bug in
> the VM. These threads will simply enter a safepoint as they return to the
> VM execution context, and are considered safe for the duration they are
> outside.
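>
> For the safepoint tracing, a sketch of the flags I have in mind (standard
> HotSpot options; the safepoint statistics go to stdout):
>
>     -XX:+PrintGCApplicationStoppedTime     # total stopped time per pause
>     -XX:+PrintGCApplicationConcurrentTime
>     -XX:+PrintSafepointStatistics          # spin/block/sync breakdown
>     -XX:PrintSafepointStatisticsCount=1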
>
>
>
>
> On 17 February 2014 12:30, Ondřej Černoš <cernoso@gmail.com> wrote:
>
>> Hi,
>>
>> we tried to switch to G1 because we observed this behaviour on CMS too
>> (a 27 second pause on G1 is quite a strong hint not to use it). Pauses
>> with CMS were not easily traceable - the JVM stopped even without a
>> stop-the-world pause scheduled (defragmentation, remarking). We thought
>> the go-to-safepoint waiting time might have been involved (we saw waiting
>> for safepoint resolution) - especially because access to mmapped files is
>> not preemptive, afaik - but that doesn't explain tens of seconds of
>> waiting; even slow IO should read our sstables into memory in much less
>> time. We switched to G1 out of desperation - and to try different code
>> paths - not that we thought it was a great idea. So I think we were hit
>> by the problem discussed in this thread; the G1 report just wasn't very
>> clear, sorry.
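>>
>> In case it helps anyone chasing the same thing: to identify threads that
>> are slow to reach a safepoint, flags along these lines should make the VM
>> report the offenders (standard HotSpot options; the timeout is in ms):
>>
>>     -XX:+SafepointTimeout
>>     -XX:SafepointTimeoutDelay=10000    # print threads still running after 10s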
>>
>> regards,
>> ondrej
>>
>>
>>
>> On Mon, Feb 17, 2014 at 11:45 AM, Benedict Elliott Smith <
>> belliottsmith@datastax.com> wrote:
>>
>>> Ondrej,
>>>
>>> It seems like your issue is much less difficult to diagnose: your
>>> collection times are long. At least, the pause you printed the time for is
>>> all attributable to the G1 pause.
>>>
>>> Note that G1 has not generally performed well with Cassandra in our
>>> testing. There are a number of changes going in soon that may change that,
>>> but for the time being it is advisable to stick with CMS. With tuning you
>>> can no doubt bring your pauses down considerably.
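>>>
>>> As a starting point, the stock cassandra-env.sh CMS settings are along
>>> these lines; tuning typically means adjusting the occupancy fraction and
>>> new-gen sizing from there (values illustrative, not a prescription):
>>>
>>>     -XX:+UseParNewGC
>>>     -XX:+UseConcMarkSweepGC
>>>     -XX:+CMSParallelRemarkEnabled
>>>     -XX:SurvivorRatio=8
>>>     -XX:MaxTenuringThreshold=1
>>>     -XX:CMSInitiatingOccupancyFraction=75
>>>     -XX:+UseCMSInitiatingOccupancyOnly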
>>>
>>>
>>> On 17 February 2014 10:17, Ondřej Černoš <cernoso@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> we are seeing the same kind of long pauses in Cassandra. We tried to
>>>> switch from CMS to G1 without positive results. The stress test is read
>>>> heavy: 2 datacenters, 6 nodes, 400 reqs/sec on one datacenter. We see
>>>> latency spikes at the 99.99th percentile and higher, caused by threads
>>>> being stopped in the JVM.
>>>>
>>>> The GC in G1 looks like this:
>>>>
>>>> {Heap before GC invocations=4073 (full 1):
>>>> garbage-first heap   total 8388608K, used 3602914K [0x00000005f5c00000, 0x00000007f5c00000, 0x00000007f5c00000)
>>>>  region size 4096K, 142 young (581632K), 11 survivors (45056K)
>>>> compacting perm gen  total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)
>>>>   the space 28672K,  95% used [0x00000007f5c00000, 0x00000007f76c9108, 0x00000007f76c9200, 0x00000007f7800000)
>>>> No shared spaces configured.
>>>> 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation Pause) (young)
>>>> Desired survivor size 37748736 bytes, new threshold 15 (max 15)
>>>> - age   1:   17213632 bytes,   17213632 total
>>>> - age   2:   19391208 bytes,   36604840 total
>>>> , 0.1664300 secs]
>>>>   [Parallel Time: 163.9 ms, GC Workers: 2]
>>>>      [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max: 222346218.3, Diff: 0.0]
>>>>      [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7, Sum: 13.7]
>>>>      [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum: 42.6]
>>>>         [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum: 120]
>>>>      [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum: 46.5]
>>>>      [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1, Sum: 224.6]
>>>>      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
>>>>      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
>>>>      [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff: 0.0, Sum: 327.6]
>>>>      [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max: 222346382.1, Diff: 0.0]
>>>>   [Code Root Fixup: 0.0 ms]
>>>>   [Clear CT: 0.4 ms]
>>>>   [Other: 2.1 ms]
>>>>      [Choose CSet: 0.0 ms]
>>>>      [Ref Proc: 1.1 ms]
>>>>      [Ref Enq: 0.0 ms]
>>>>      [Free CSet: 0.4 ms]
>>>>   [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap: 3518.5M(8192.0M)->3018.5M(8192.0M)]
>>>> Heap after GC invocations=4074 (full 1):
>>>> garbage-first heap   total 8388608K, used 3090914K [0x00000005f5c00000, 0x00000007f5c00000, 0x00000007f5c00000)
>>>>  region size 4096K, 17 young (69632K), 17 survivors (69632K)
>>>> compacting perm gen  total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)
>>>>   the space 28672K,  95% used [0x00000007f5c00000, 0x00000007f76c9108, 0x00000007f76c9200, 0x00000007f7800000)
>>>> No shared spaces configured.
>>>> }
>>>> [Times: user=0.35 sys=0.00, real=27.58 secs]
>>>> 222346.219: G1IncCollectionPause             [     111          0        0    ]      [     0     0     0     0 27586    ]  0
>>>>
>>>> And the total time for which application threads were stopped is 27.58
>>>> seconds.
>>>>
>>>> CMS behaves in a similar manner. We thought it might be the GC waiting
>>>> for mmapped files to be read from disk (a thread cannot reach a safepoint
>>>> during this operation), but that doesn't explain the huge time.
>>>>
>>>> We'll try jHiccup to see if it provides any additional information. The
>>>> test was done on a mixed aws/openstack environment, openjdk 1.7.0_45,
>>>> cassandra 1.2.11. Upgrading to 2.0.x is not an option for us.
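>>>>
>>>> (For reference, the plan is to attach jHiccup as a java agent, roughly
>>>> as below - the jar path and log location are illustrative:
>>>>
>>>>     JVM_OPTS="$JVM_OPTS -javaagent:/opt/jHiccup/jHiccup.jar=\"-i 1000 -l /var/log/cassandra/hiccup\""
>>>>
>>>> so it records process-visible pauses independently of the GC log.)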
>>>>
>>>> regards,
>>>>
>>>> ondrej cernos
>>>>
>>>>
>>>> On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng <fntemk@gmail.com> wrote:
>>>>
>>>>> Sorry, I have not had a chance to file a JIRA ticket.  We have not
>>>>> been able to resolve the issue.  But since Joel mentioned that upgrading
>>>>> to Cassandra 2.0.X solved it for them, we may need to upgrade.  We are
>>>>> currently on Java 1.7 and Cassandra 1.2.8.
>>>>>
>>>>>
>>>>>
>>>>>> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright <kwright@nanigans.com> wrote:
>>>>>
>>>>>> You’re running 2.0.* in production?  May I ask what C* version and
>>>>>> OS?  Any hardware details would be appreciated as well.  Thx!
>>>>>>
>>>>>> From: Joel Samuelsson <samuelsson.joel@gmail.com>
>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>> Date: Thursday, February 13, 2014 at 11:39 AM
>>>>>>
>>>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>
>>>>>> We have had similar issues, and upgrading C* to 2.0.x and Java to 1.7
>>>>>> seems to have helped.
>>>>>>
>>>>>>
>>>>>> 2014-02-13 Keith Wright <kwright@nanigans.com>:
>>>>>>
>>>>>>> Frank did you ever file a ticket for this issue or find the root
>>>>>>> cause?  I believe we are seeing the same issues when attempting to
>>>>>>> bootstrap.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> From: Robert Coli <rcoli@eventbrite.com>
>>>>>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>> Date: Monday, February 3, 2014 at 6:10 PM
>>>>>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>>>>>> Subject: Re: Intermittent long application pauses on nodes
>>>>>>>
>>>>>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith <
>>>>>>> belliottsmith@datastax.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> It's possible that this is a JVM issue, but if so there may be some
>>>>>>>> remedial action we can take anyway. There are some more flags we
>>>>>>>> should add, but we can discuss that once you open a ticket. If you
>>>>>>>> could include the strange JMX error as well, that might be helpful.
>>>>>>>>
>>>>>>>
>>>>>>> It would be appreciated if you could inform this thread of the JIRA
>>>>>>> ticket number, for the benefit of the community and google searchers. :)
>>>>>>>
>>>>>>> =Rob
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
