Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 180F610ACB for ; Mon, 17 Feb 2014 10:45:39 +0000 (UTC) Received: (qmail 37396 invoked by uid 500); 17 Feb 2014 10:45:35 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 37365 invoked by uid 500); 17 Feb 2014 10:45:33 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 37355 invoked by uid 99); 17 Feb 2014 10:45:32 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 17 Feb 2014 10:45:32 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of belliottsmith@datastax.com designates 209.85.213.177 as permitted sender) Received: from [209.85.213.177] (HELO mail-ig0-f177.google.com) (209.85.213.177) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 17 Feb 2014 10:45:26 +0000 Received: by mail-ig0-f177.google.com with SMTP id k19so4899353igc.4 for ; Mon, 17 Feb 2014 02:45:05 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=PX/515qMwU7UXnOhBxhgnttB9lOVD942V81smDDcrTc=; b=EKEL8EoGJCW9gR+nFWiDa2McKdbe6ADwjsc6Xa3U7BdFPG/CnynVebLdcj+XGlYFhe VDwjAbJI46czpXgYFKJp1nTxpwBbealUbgxcZFyFfKAVbDwu6mKC1kGCqeWsWZ65ol7Z yPu2j4U6I2NzHbp1FcxX3XRTSib30sOfnSrFI4UmD+mFmHCclrO9/q5cUuISTgNgpJ8X ELKACdI9Z42iZq0Kfve5hPIN8Xy4HRq9h2oBEbgy5zMJlBvcVUgOqjFyA3RtG6ojlHdh PSW9msmkFQNC7AfxFyUM93E2g7DMYCpFuCcgvs3CD7V0qEopj4ACmfoYqz8942bHBijw ubHw== X-Gm-Message-State: ALoCoQm4PSpgRsrOWjxao7gcR18wfxJ3YskXkLr85a/pcC8WWDZ7BepOI05uBH8sAte7b2PaI10m MIME-Version: 1.0 X-Received: by 10.42.173.68 with SMTP id q4mr911533icz.41.1392633905067; Mon, 17 Feb 2014 02:45:05 -0800 (PST) Received: by 10.43.98.202 with HTTP; Mon, 17 Feb 2014 02:45:04 -0800 (PST) In-Reply-To: References: Date: Mon, 17 Feb 2014 10:45:04 +0000 Message-ID: Subject: Re: Intermittent long application pauses on nodes From: Benedict Elliott Smith To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=90e6ba6e863a36d36304f297d940 X-Virus-Checked: Checked by ClamAV on apache.org --90e6ba6e863a36d36304f297d940 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Ondrej, It seems like your issue is much less difficult to diagnose: your collection times are long. At least, the pause you printed the time for is all attributable to the G1 pause. Note that G1 has not generally performed well with Cassandra in our testing. There are a number of changes going in soon that may change that, but for the time being it is advisable to stick with CMS. With tuning you can no doubt bring your pauses down considerably. On 17 February 2014 10:17, Ond=C5=99ej =C4=8Cerno=C5=A1 = wrote: > Hi all, > > we are seeing the same kind of long pauses in Cassandra. We tried to > switch CMS to G1 without positive result. The stress test is read heavy, = 2 > datacenters, 6 nodes, 400reqs/sec on one datacenter. We see spikes in > latency on 99.99 percentil and higher, caused by threads being stopped in > JVM. > > The GC in G1 looks like this: > > {Heap before GC invocations=3D4073 (full 1): > garbage-first heap total 8388608K, used 3602914K [0x00000005f5c00000, > 0x00000007f5c00000, 0x00000007f5c00000) > region size 4096K, 142 young (581632K), 11 survivors (45056K) > compacting perm gen total 28672K, used 27428K [0x00000007f5c00000, > 0x00000007f7800000, 0x0000000800000000) > the space 28672K, 95% used [0x00000007f5c00000, 0x00000007f76c9108, > 0x00000007f76c9200, 0x00000007f7800000) > No shared spaces configured. > 2014-02-17T04:44:16.385+0100: 222346.218: [GC pause (G1 Evacuation Pause) > (young) > Desired survivor size 37748736 bytes, new threshold 15 (max 15) > - age 1: 17213632 bytes, 17213632 total > - age 2: 19391208 bytes, 36604840 total > , 0.1664300 secs] > [Parallel Time: 163.9 ms, GC Workers: 2] > [GC Worker Start (ms): Min: 222346218.3, Avg: 222346218.3, Max: > 222346218.3, Diff: 0.0] > [Ext Root Scanning (ms): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7, > Sum: 13.7] > [Update RS (ms): Min: 20.4, Avg: 21.3, Max: 22.1, Diff: 1.7, Sum: > 42.6] > [Processed Buffers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum: > 120] > [Scan RS (ms): Min: 23.2, Avg: 23.2, Max: 23.3, Diff: 0.1, Sum: 46.5= ] > [Object Copy (ms): Min: 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1, > Sum: 224.6] > [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1= ] > [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: > 0.1] > [GC Worker Total (ms): Min: 163.8, Avg: 163.8, Max: 163.8, Diff: 0.0= , > Sum: 327.6] > [GC Worker End (ms): Min: 222346382.1, Avg: 222346382.1, Max: > 222346382.1, Diff: 0.0] > [Code Root Fixup: 0.0 ms] > [Clear CT: 0.4 ms] > [Other: 2.1 ms] > [Choose CSet: 0.0 ms] > [Ref Proc: 1.1 ms] > [Ref Enq: 0.0 ms] > [Free CSet: 0.4 ms] > [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68.0M Heap: > 3518.5M(8192.0M)->3018.5M(8192.0M)] > Heap after GC invocations=3D4074 (full 1): > garbage-first heap total 8388608K, used 3090914K [0x00000005f5c00000, > 0x00000007f5c00000, 0x00000007f5c00000) > region size 4096K, 17 young (69632K), 17 survivors (69632K) > compacting perm gen total 28672K, used 27428K [0x00000007f5c00000, > 0x00000007f7800000, 0x0000000800000000) > the space 28672K, 95% used [0x00000007f5c00000, 0x00000007f76c9108, > 0x00000007f76c9200, 0x00000007f7800000) > No shared spaces configured. > } > [Times: user=3D0.35 sys=3D0.00, real=3D27.58 secs] > 222346.219: G1IncCollectionPause [ 111 0 > 0 ] [ 0 0 0 0 27586 ] 0 > > And the total thime for which application threads were stopped is 27.58 > seconds. > > CMS behaves in a similar manner. We thought it would be GC, waiting for > mmaped files being read from disk (the thread cannot reach safepoint duri= ng > this operation), but it doesn't explain the huge time. > > We'll try jhiccup to see if it provides any additional information. The > test was done on mixed aws/openstack environment, openjdk 1.7.0_45, > cassandra 1.2.11. Upgrading to 2.0.x is no option for us. > > regards, > > ondrej cernos > > > On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng wrote: > >> Sorry, I have not had a chance to file a JIRA ticket. We have not been >> able to resolve the issue. But since Joel mentioned that upgrading to >> Cassandra 2.0.X solved it for them, we may need to upgrade. We are >> currently on Java 1.7 and Cassandra 1.2.8 >> >> >> >> On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright wro= te: >> >>> You=E2=80=99re running 2.0.* in production? May I ask what C* version = and OS? >>> Any hardware details would be appreciated as well. Thx! >>> >>> From: Joel Samuelsson >>> Reply-To: "user@cassandra.apache.org" >>> Date: Thursday, February 13, 2014 at 11:39 AM >>> >>> To: "user@cassandra.apache.org" >>> Subject: Re: Intermittent long application pauses on nodes >>> >>> We have had similar issues and upgrading C* to 2.0.x and Java to 1.7 >>> seems to have helped our issues. >>> >>> >>> 2014-02-13 Keith Wright : >>> >>>> Frank did you ever file a ticket for this issue or find the root cause= ? >>>> I believe we are seeing the same issues when attempting to bootstrap. >>>> >>>> Thanks >>>> >>>> From: Robert Coli >>>> Reply-To: "user@cassandra.apache.org" >>>> Date: Monday, February 3, 2014 at 6:10 PM >>>> To: "user@cassandra.apache.org" >>>> Subject: Re: Intermittent long application pauses on nodes >>>> >>>> On Mon, Feb 3, 2014 at 8:52 AM, Benedict Elliott Smith < >>>> belliottsmith@datastax.com> wrote: >>>> >>>>> >>>>> It's possible that this is a JVM issue, but if so there may be some >>>>> remedial action we can take anyway. There are some more flags we shou= ld >>>>> add, but we can discuss that once you open a ticket. If you could inc= lude >>>>> the strange JMX error as well, that might be helpful. >>>>> >>>> >>>> It would be appreciated if you could inform this thread of the JIRA >>>> ticket number, for the benefit of the community and google searchers. = :) >>>> >>>> =3DRob >>>> >>> >>> >> > --90e6ba6e863a36d36304f297d940 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Ondrej,

It seems like your issue is muc= h less difficult to diagnose: your collection times are long. At least, the= pause you printed the time for is all attributable to the G1 pause.

Note that G1 has not generally performed well with Cass= andra in our testing. There are a number of changes going in soon that may = change that, but for the time being it is advisable to stick with CMS. With= tuning you can no doubt bring your pauses down considerably.


On 17 F= ebruary 2014 10:17, Ond=C5=99ej =C4=8Cerno=C5=A1 <cernoso@gmail.com>= ; wrote:
Hi all,

we are seeing the same kind of = long pauses in Cassandra. We tried to switch CMS to G1 without positive res= ult. The stress test is read heavy, 2 datacenters, 6 nodes, 400reqs/sec on = one datacenter. We see spikes in latency on 99.99 percentil and higher, cau= sed by threads being stopped in JVM.

The GC in G1 looks like this:

=
{Heap before GC invocations=3D4073 (full 1):
garbage-first heap =C2=A0 total 8388608K, used 3602914K [0x00000005f5c00000= , 0x00000007f5c00000, 0x00000007f5c00000)
=C2=A0region size 4096K, 142 y= oung (581632K), 11 survivors (45056K)
compacting perm gen =C2=A0total 28= 672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x00000008000000= 00)
=C2=A0 the space 28672K, =C2=A095% used [0x00000007f5c00000, 0x00000007f76c= 9108, 0x00000007f76c9200, 0x00000007f7800000)
No shared spaces configure= d.
= 2014-02-17T04:44:16.385+0100: 22234= 6.218: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 37748736 bytes, new = threshold 15 (max 15)
- age =C2= =A0 1: =C2=A0 17213632 bytes, =C2=A0 17213632 total
- age =C2=A0 2: =C2=A0 19391208 bytes, =C2=A0 36604840 = total
, 0.1664300 secs]
=C2=A0 [Parallel Time: 163.9 ms, GC Workers: 2]=C2=A0 =C2=A0 =C2=A0[GC Worker Start (ms= ): Min: 222346218.3, Avg: 222346218.3, Max: 222346218.3, Diff: 0.0]<= br> =C2=A0 =C2=A0 =C2=A0[Ext Root Scanning (ms= ): Min: 6.0, Avg: 6.9, Max: 7.7, Diff: 1.7, Sum: 13.7]
=C2=A0 =C2=A0 =C2=A0[Update RS (ms): Min: 20.4, Avg:= 21.3, Max: 22.1, Diff: 1.7, Sum: 42.6]
=C2=A0 =C2=A0 =C2=A0 =C2=A0 [Processed Buf= fers: Min: 49, Avg: 60.0, Max: 71, Diff: 22, Sum: 120]
=C2=A0 =C2=A0 =C2=A0[Scan RS (ms): Min: 23.2, Avg: 2= 3.2, Max: 23.3, Diff: 0.1, Sum: 46.5]
=C2=A0 =C2=A0 =C2=A0[Object Copy (ms): Min= : 112.3, Avg: 112.3, Max: 112.4, Diff: 0.1, Sum: 224.6]
=C2=A0 =C2=A0 =C2=A0[Termination (ms): Min: 0.0, Av= g: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
=C2=A0 =C2=A0 =C2=A0[GC Worker Other (ms):= Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
=C2=A0 =C2=A0 =C2=A0[GC Worker Total (ms): Min: 163.8, = Avg: 163.8, Max: 163.8, Diff: 0.0, Sum: 327.6]
=C2=A0 =C2=A0 =C2=A0[GC Worker End (ms): M= in: 222346382.1, Avg: 222346382.1, Max: 222346382.1, Diff: 0.0]
<= font face=3D"arial, sans-serif">=C2=A0 [Code Root Fixup: 0.0 ms]
= =C2=A0 [Clear CT: 0.4 ms]
=C2=A0 [Other: 2.1 ms]
=C2=A0 =C2=A0 =C2=A0[Choose CSet: 0.0 ms]=C2=A0 =C2=A0 =C2=A0[Ref Proc: 1.1 ms]
=C2=A0 =C2=A0 =C2=A0[Ref Enq: 0.0 = ms]
=C2=A0 =C2=A0 =C2=A0[Free CSet: 0.4 ms]
=C2=A0 [Eden: 524.0M(524.0M)->0.0B(476.0M) Survivors: 44.0M->68= .0M Heap: 3518.5M(8192.0M)->3018= .5M(8192.0M)]
Heap after GC invocations=3D4074 (full 1):

garbage-first heap =C2=A0 total 8388608K, used 3090914K [0x0000000= 5f5c00000, 0x00000007f5c00000, 0x00000007f5c00
000)
=C2=A0region size 4096K, 17 young (69632K)= , 17 survivors (69632K)
compacti= ng perm gen =C2=A0total 28672K, used 27428K [0x00000007f5c00000, 0x00000007f7800000, 0x0000000800000000)<= /font>
=C2=A0 the space 28672K, =C2=A095% used [0= x00000007f5c00000, 0x00000007f76c9<= /font>108, 0x00000007f76c9200, 0x00000007f7800000)
No shared spaces configured.
}

= [Times: user=3D0.35 sys=3D0.00, real=3D27.58 secs]=C2=A0
<= /div>
222346.219: G1IncCollectionPause =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 [ =C2=A0 =C2=A0 111 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A00 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A00 =C2=A0 =C2=A0] =C2=A0 =C2= =A0 =C2=A0[ =C2=A0 =C2=A0 0 =C2=A0 =C2=A0 0 =C2=A0 =C2=A0 0 =C2=A0 =C2=A0 0= 27586 =C2=A0 =C2=A0] =C2=A00

And the total thime for which application threads were stopp= ed is 27.58 seconds.

CMS behaves in a similar manner. We th= ought it would be GC, waiting for mmaped files being read from disk (the th= read cannot reach safepoint during this operation), but it doesn't expl= ain the huge time.

We'll try jhiccup to see if it provides any additi= onal information. The test was done on mixed aws/openstack environment, ope= njdk=C2=A01.7.0_45, cassandra 1.2.11. Upgrading to 2.0.x is no option for u= s.

regards,

ondrej cer= nos
<= /div>


On Fri, Feb 14, 2014 at 8:53 PM, Frank Ng <fntemk@gmail.com> = wrote:
Sorry, I have not had a chance to file a JIRA ticket.=C2= =A0 We have not been able to resolve the issue.=C2=A0 But since Joel mentio= ned that upgrading to Cassandra 2.0.X solved it for them, we may need to up= grade.=C2=A0 We are currently on Java 1.7 and Cassandra 1.2.8



On Thu, Feb 13, 2014 at 12:40 PM, Keith Wright <= ;kwright@nanigans= .com> wrote:
You=E2=80=99re running 2.0.* in production? =C2=A0May I ask wha= t C* version and OS? =C2=A0Any hardware details would be appreciated as wel= l. =C2=A0Thx!

From: Joel Samuelsson <samuelsson.joel@gma= il.com>
Reply-To: "= user@cassand= ra.apache.org" <user@cassandra.apache.org>
Date: Thursday, February 13, 2014 = at 11:39 AM

To: "user@cassandra.a= pache.org" <user@cassandra.apache.org>
Subject: Re: Intermittent long app= lication pauses on nodes

We have had similar issues and upgrading C* to 2.0.x and Java t= o 1.7 seems to have helped our issues.


2014-02-13 Ke= ith Wright <kwright@nanigans.com>:
Frank did you ever file a ticket for this issue or find the roo= t cause? =C2=A0I believe we are seeing the same issues when attempting to b= ootstrap.

Thanks

From: Robert Coli <rcoli@eventbrite.com>Reply-To: "user@cassandra.apache.org= " <u= ser@cassandra.apache.org>
Date: Monday, February 3, 2014 at 6= :10 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Intermittent long appl= ication pauses on nodes

On Mon, Feb= 3, 2014 at 8:52 AM, Benedict Elliott Smith <belliot= tsmith@datastax.com> wrote:

It's possible that this is a JVM issue, but if so t= here may be some remedial action we can take anyway. There are some more fl= ags we should add, but we can discuss that once you open a ticket. If you c= ould include the strange JMX error as well, that might be helpful.

It would be apprec= iated if you could inform this thread of the JIRA ticket number, for the be= nefit of the community and google searchers. :)

=3DRob=C2=A0




--90e6ba6e863a36d36304f297d940--