Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Tue, 31 Mar 2015 13:24:53 +0000 (UTC)
From: "Adam Hattrell (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12779530.1425492454000.80216.1427808293677@Atlassian.JIRA>
In-Reply-To: <JIRA.12779530.1425492454000@Atlassian.JIRA>
References: <JIRA.12779530.1425492454000@Atlassian.JIRA>
 <JIRA.12779530.1425492454095@arcas>
Subject: [jira] [Commented] (CASSANDRA-8907) Raise GCInspector alerts to
 WARN
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388510#comment-14388510 ] 

Adam Hattrell commented on CASSANDRA-8907:
------------------------------------------

Sorry, I have only just come back to this.

I was only talking about those 200ms pauses.  Anything that large will seriously disrupt any enterprise-level user.
To my mind these should be WARN level regardless of your environment.  They are "bad" by default.

I see some users that aim for a 5 ms SLA.  For them a 200ms pause is probably way too high - they would actually like to know about much smaller pauses, so a tunable threshold would actually be awesome.

With regard to users reading their logs - most admins set alerts to fire when they get WARN and ERROR.  Having to write manual greps to try and pull out GCInspector lines (or dropped mutations, which is another bugbear) is a pita - and really shouldn't be necessary.
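
To illustrate the tunable threshold idea, here is a minimal Python sketch of the kind of manual log filtering admins currently end up writing. It assumes the GCInspector line format quoted in the issue description below. The names `classify_gc_line` and `warn_threshold_ms` are hypothetical and are not existing Cassandra settings, and the threshold is compared against the total time reported on the line, which may span several collections.

```python
import re

# Matches the GCInspector line format quoted in the issue description.
GC_LINE = re.compile(
    r"GCInspector\.java \(line \d+\) GC for (?P<collector>\w+): "
    r"(?P<total_ms>\d+) ms for (?P<collections>\d+) collections"
)


def classify_gc_line(line, warn_threshold_ms=200):
    """Return (collector, total_ms, level) for a GCInspector line, else None.

    warn_threshold_ms is a hypothetical tunable used for illustration only.
    """
    match = GC_LINE.search(line)
    if match is None:
        return None
    # Total GC time reported on the line (it can cover several collections).
    total_ms = int(match.group("total_ms"))
    level = "WARN" if total_ms >= warn_threshold_ms else "INFO"
    return match.group("collector"), total_ms, level


sample = (
    "INFO [ScheduledTasks:1] 2013-03-07 18:44:46,795 GCInspector.java (line 122) "
    "GC for ConcurrentMarkSweep: 1835 ms for 3 collections, "
    "2606015656 used; max is 10611589120"
)
print(classify_gc_line(sample))
```

A user with a tighter SLA could pass a smaller `warn_threshold_ms`; if WARN were emitted at the source, none of this filtering would be needed.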


> Raise GCInspector alerts to WARN
> --------------------------------
>
>                 Key: CASSANDRA-8907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8907
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Adam Hattrell
>
> I'm fairly regularly running into folks wondering why their applications are reporting down nodes.  Yet they report that when they grepped the logs, there were no WARNs or ERRORs listed.
> Nine times out of ten, when I look through the logs I see a ton of ParNew or CMS GC pauses, similar to the following:
> INFO [ScheduledTasks:1] 2013-03-07 18:44:46,795 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 1835 ms for 3 collections, 2606015656 used; max is 10611589120
> INFO [ScheduledTasks:1] 2013-03-07 19:45:08,029 GCInspector.java (line 122) GC for ParNew: 9866 ms for 8 collections, 2910124308 used; max is 6358564864
> To my mind these should be WARNs, as they have the potential to significantly impact the cluster's performance as a whole.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)