From: aaron morton
To: user@cassandra.apache.org
Subject: Re: frequent node up/downs
Date: Sat, 7 Jul 2012 07:09:18 +1200

> It looks like this happens when there is a promotion failure.
Java Heap is full.
Memory is fragmented.
Use C for web scale.
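
To see how often this is happening, the GC log can be scanned for the two markers the CMS collector typically prints when promotion fails. This is a minimal Python sketch, assuming gc logging is enabled as described in this thread; the function name is hypothetical and the sample lines are made up for illustration, not taken from the poster's logs.

```python
# Markers that HotSpot's CMS collector writes to the GC log when the old
# generation cannot absorb objects promoted from the young generation.
FAILURE_MARKERS = ("promotion failed", "concurrent mode failure")


def count_gc_failures(log_text):
    """Return how many log lines mention each failure marker."""
    counts = {marker: 0 for marker in FAILURE_MARKERS}
    for line in log_text.splitlines():
        for marker in FAILURE_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Illustrative lines only (hypothetical sizes and timings).
SAMPLE_LOG = """\
[GC [ParNew: 1228800K->136320K(1228800K), 0.0801 secs]
[GC [ParNew (promotion failed): 1228800K->1228800K(1228800K), 0.9 secs]
[CMS (concurrent mode failure): 6815744K->2097152K(7159808K), 48.1 secs]
"""

if __name__ == "__main__":
    print(count_gc_failures(SAMPLE_LOG))
```

If the counts line up with the times the node was marked dead, that supports the promotion failure explanation.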

> Also is it normal to see the "Heap is xx full. You may need to reduce memtable and/or cache sizes" message quite often? I haven't turned on row caches or changed any default memtable size settings so I am wondering why the old gen fills up.

It's odd to get that out of the box with an 8GB heap on a 1.1.X install.

What sort of workload? Is it under heavy inserts?
Do you have a lot of CFs? A lot of secondary indexes?
After the messages, is it able to reduce heap usage?
Does it seem to correlate with compactions?
Is the node able to get back to a healthy state?
If this is testing, are you able to pull back to a workload where the issues do not appear?


Cheers

http://www.thelastpickle.com

On 7/07/2012, at 4:33 AM, feedly team wrote:

I reduced the load and the problem hasn't been happening as much. After enabling gc logging, I see messages mentioning promotion failed when the pauses happen. It looks like this happens when there is a promotion failure. From reading on the web it looks like I could try reducing the CMSInitiatingOccupancyFraction value and/or decreasing the young gen size to try to avoid this scenario.
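
As rough arithmetic for why those two settings matter: CMS is asked to start once the old generation reaches the occupancy fraction, so whatever is left above that mark is the room available for promoted objects while it runs. This is a simplified Python sketch under stated assumptions. The 2048 MB young generation is an assumed figure (the thread does not state it), 75 is used only as an example starting fraction, and the model ignores fragmentation and the fact that CMS runs concurrently with the application.

```python
def cms_headroom_mb(heap_mb, young_mb, occupancy_fraction):
    """Old-gen space still free at the moment CMS is asked to start."""
    old_mb = heap_mb - young_mb
    return old_mb * (100 - occupancy_fraction) / 100.0


def worst_case_promotion_fits(heap_mb, young_mb, occupancy_fraction):
    """True if an entire young generation could be promoted into the headroom."""
    return cms_headroom_mb(heap_mb, young_mb, occupancy_fraction) >= young_mb


if __name__ == "__main__":
    # 8GB heap from the thread; young gen size and fractions are assumptions.
    for young_mb, fraction in [(2048, 75), (2048, 65), (800, 75)]:
        headroom = cms_headroom_mb(8192, young_mb, fraction)
        fits = worst_case_promotion_fits(8192, young_mb, fraction)
        print(young_mb, fraction, headroom, fits)
```

In this model, either lowering the fraction or shrinking the young generation increases the headroom relative to the worst-case promotion, which is the intuition behind both suggestions.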

Also is it normal to see the "Heap is xx full. You may need to reduce memtable and/or cache sizes" message quite often? I haven't turned on row caches or changed any default memtable size settings so I am wondering why the old gen fills up.


On Wed, Jul 4, 2012 at 6:28 AM, aaron morton <aaron@thelastpickle.com> wrote:
> What accounts for the much larger virtual number? Some kind of off-heap memory?
http://wiki.apache.org/cassandra/FAQ#mmap
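
As an analogy for what that FAQ entry describes, a memory-mapped file takes up address space in the process (so it counts toward the virtual size) without being allocated on the heap. This is a minimal Python sketch using the standard mmap module; it is not Cassandra code, and the function names are hypothetical.

```python
import mmap
import os
import tempfile


def map_file_readonly(path):
    """Memory-map a file; the mapping occupies address space, not heap."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file. The mapping stays valid after the
        # file object is closed.
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


def demo(size_bytes=1024 * 1024):
    """Create a scratch file, map it, and return (mapped_length, first_byte)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"\x01" * size_bytes)
        mapped = map_file_readonly(path)
        try:
            return len(mapped), mapped[0]
        finally:
            mapped.close()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Pages of the mapping are only brought into resident memory when they are touched, which is why virtual size can be much larger than resident size.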

> I'm a little puzzled as to why I would get such long pauses without swapping.
The two are not related. On startup the JVM memory is locked so it will not swap; from then on, memory management is pretty much up to the JVM.

Getting a lot of ParNew activity does not mean the JVM is low on memory; it means there is a lot of activity in the new heap (the young generation).

If you have a lot of insert activity (typically in a load test) you can generate a lot of GC activity. Try reducing the load to a point where it does not hit GC and then increase it to find the cause. Also, if you can connect JConsole to the JVM you may get a better view of the heap usage.
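
While stepping the load up and down, it helps to pull the long pauses out of the log so they can be lined up against the load level. This is a rough Python sketch; the line format is an assumption about what Cassandra's GC logging in system.log looks like, the sample lines (including the IP address) are made up, and the pattern should be adjusted to whatever your log actually shows.

```python
import re

# Assumed shape of a GC pause line, e.g.
#   "... GC for ParNew: 241 ms for 1 collections, ..."
GC_LINE = re.compile(r"GC for (?P<collector>\w+): (?P<millis>\d+) ms")


def long_pauses(log_lines, threshold_ms=1000):
    """Return (collector, pause_ms) for every pause at or above the threshold."""
    pauses = []
    for line in log_lines:
        match = GC_LINE.search(line)
        if match is None:
            continue
        millis = int(match.group("millis"))
        if millis >= threshold_ms:
            pauses.append((match.group("collector"), millis))
    return pauses


# Made-up lines for illustration, not copied from the poster's logs.
SAMPLE_LINES = [
    "INFO GCInspector.java GC for ParNew: 241 ms for 1 collections",
    "INFO GCInspector.java GC for ParNew: 50123 ms for 1 collections",
    "INFO GCInspector.java GC for ConcurrentMarkSweep: 812 ms for 1 collections",
    "INFO Gossiper.java InetAddress /10.0.0.2 is now dead.",
]

if __name__ == "__main__":
    print(long_pauses(SAMPLE_LINES))
```

Running it at each load level gives a quick count of pauses over the threshold, which makes it easier to spot the point where GC starts to hurt.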

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 3/07/2012, at 3:41 PM, feedly team wrote:

Couple more details. I confirmed that swap space is not being used (free -m shows 0 swap) and cassandra.log has a message like "JNA mlockall successful". top shows the process having 9g in resident memory but 21.6g in virtual. What accounts for the much larger virtual number? Some kind of off-heap memory?

I'm a little puzzled as to why I would get such long pauses without swapping. I uncommented all the gc logging options in cassandra-env.sh to try to see what is going on when the node freezes.

Thanks
Kireet

On Mon, Jul 2, 2012 at 9:51 PM, feedly team <feedlydev@gmail.com> wrote:
Yeah, I noticed the leap second problem and ran the suggested fix, but I had been facing these problems before Saturday and still see the occasional failures after running the fix.

Thanks.


On Mon, Jul 2, 2012 at 11:17 AM, Marcus Both <mboth@terra.com.br> wrote:
Yeah! Look at that.
http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/
I had the same problem. The solution was rebooting.

On Mon, 2 Jul 2012 11:08:57 -0400
feedly team <feedlydev@gmail.com> wrote:

> Hello,
>    I recently set up a 2 node cassandra cluster on dedicated hardware. In
> the logs there have been a lot of "InetAddress xxx is now dead" or "UP"
> messages. Comparing the log messages between the 2 nodes, they seem to
> coincide with extremely long ParNew collections. I have seen some of up to
> 50 seconds. The installation is pretty vanilla, I didn't change any
> settings and the machines don't seem particularly busy - cassandra is the
> only thing running on the machine with an 8GB heap. The machine has 64GB of
> RAM and CPU/IO usage looks pretty light. I do see a lot of 'Heap is xxx
> full. You may need to reduce memtable and/or cache sizes' messages. Would
> reducing those sizes help with the long ParNew collections? That message
> seems to be triggered on a full collection.

--
Marcus Both


