Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Sat, 5 Sep 2015 13:25:46 +0000 (UTC)
From: "Anuj Wadehra (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12779530.1425492454000.264429.1441459546416@Atlassian.JIRA>
In-Reply-To: <JIRA.12779530.1425492454000@Atlassian.JIRA>
References: <JIRA.12779530.1425492454000@Atlassian.JIRA>
 <JIRA.12779530.1425492454095@arcas>
Subject: [jira] [Comment Edited] (CASSANDRA-8907) Raise GCInspector alerts
 to WARN
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731797#comment-14731797 ] 

Anuj Wadehra edited comment on CASSANDRA-8907 at 9/5/15 1:25 PM:
-----------------------------------------------------------------

[~johnny15676] Got your point! I think there are two scenarios:

1. Suppose we enable the GC pause warn limit by default and set it to some value, say 200 ms. Our guess may be wrong, as an application could be comfortable with 200 ms GC pauses. If someone then upgrades to a minor version, they will 'break' their log monitoring/warning system (as you said), because the new Warnings for GC pauses greater than 200 ms are undesirable.

2. Suppose a user's nodes are going down intermittently due to long GC pauses (20+ secs), but their log monitoring/warning system is comfortable and not reporting any issue. This is a BUG. If such users upgrade with the GC pause warn limit enabled by default and set to a much higher value, say 20000 ms, they will start getting Warnings whenever they hit ad hoc huge GC pauses of over 20 sec. I won't call that 'breaking' their log monitoring system, as it is a serious issue; otherwise their nodes will go DOWN intermittently without raising any Warnings.

So, I advocate targeting scenario 2, where this property is enabled by default and set to a very high value (20000+ ms). This way we are not trying to guess which GC pauses are OK for a user application, i.e. whether 100 ms, 200 ms or 1000 ms is comfortable. But at the same time we raise the warning when there is a serious GC pause which may cause a node to be marked down. A user must get Warnings if such huge GC pauses are happening.

Any user upgrading to a minor version will have the option to decrease the value based on their application requirements or leave it as it is.

Moreover, if you agree with the above, I would suggest that tpstats be logged at min(1000 ms, GC warn threshold), where the GC warn threshold is e.g. 20000 ms. If a user's application is sensitive to GC pauses, they will reduce the GC warn threshold to a lower value, e.g. 100 ms, and will then see diagnostic tpstats info every time a GC pause over 100 ms occurs. If the user doesn't change the HUGE default GC warn limit (20000+ ms), we stick to the existing behaviour, i.e. dump tpstats at GC pauses of more than 1000 ms, to avoid breaking the existing way of dumping tpstats.
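
To make the two thresholds concrete, here is a minimal Java sketch of the rule described above. It is only an illustration under assumptions: the class, method and constant names are hypothetical and are not taken from the attached patch or from GCInspector, and the 20000 ms default is just the example value used in this comment.

```java
// Illustrative sketch only; none of these names come from Cassandra itself.
final class GcThresholdSketch {
    // Existing behaviour described above: tpstats are dumped for pauses over 1000 ms.
    static final long LEGACY_TPSTATS_THRESHOLD_MS = 1000;
    // Example high default for the WARN limit (hypothetical constant).
    static final long DEFAULT_GC_WARN_THRESHOLD_MS = 20000;

    // tpstats threshold = min(1000 ms, GC warn threshold).
    static long tpstatsThresholdMs(long gcWarnThresholdMs) {
        return Math.min(LEGACY_TPSTATS_THRESHOLD_MS, gcWarnThresholdMs);
    }

    // A pause is logged at WARN only when it exceeds the configured warn limit.
    static boolean shouldWarn(long pauseMs, long gcWarnThresholdMs) {
        return pauseMs > gcWarnThresholdMs;
    }

    // tpstats are dumped when the pause exceeds the derived tpstats threshold.
    static boolean shouldDumpTpstats(long pauseMs, long gcWarnThresholdMs) {
        return pauseMs > tpstatsThresholdMs(gcWarnThresholdMs);
    }

    public static void main(String[] args) {
        long[] warnLimits = { DEFAULT_GC_WARN_THRESHOLD_MS, 100 };
        long[] pauses = { 150, 5000, 25000 };
        for (long limit : warnLimits) {
            for (long pause : pauses) {
                System.out.printf("warnLimit=%d ms pause=%d ms warn=%b tpstats=%b%n",
                        limit, pause,
                        shouldWarn(pause, limit),
                        shouldDumpTpstats(pause, limit));
            }
        }
    }
}
```

With the high default, a 5000 ms pause would still dump tpstats as it does today but would not raise a WARN; with the limit lowered to 100 ms, a 150 ms pause would trigger both.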

Small concern so we can quickly discuss and close it :)


was (Author: eanujwa):
[~johnny15676] Got your point !!! I think there are 2 scenarios: 

1. Suppose I set it to any value, say 200 ms, and my application was comfortable with it. Now, if I upgrade to a minor version, then I will 'break' my warning system (as you said), as Warnings were undesirable.

2. My nodes are going down intermittently due to long GC pauses (20+ secs), but my Warning system is comfortable and not reporting any issue. This is a BUG. Now, if I upgrade with a default value of this property set to 20000 ms, I start getting these Warnings. I won't call it 'breaking' my warning system, as it is a serious issue; otherwise my nodes will go DOWN intermittently without raising Warnings.

So, I advocate targeting scenario 2, where this property is enabled by default and set to an unreasonably high value (20000+ ms) so that I don't break existing warning systems (as I can't guess whether 100 ms, 200 ms or 1000 ms is comfortable for an application). But at the same time I raise the warning when there is a serious chance of a node being marked down.

Any user upgrading to a minor version will have the option to decrease the value based on their application requirements or leave it as it is.

Moreover, if you agree with the above, I would suggest that tpstats be logged at min(1000 ms, GC warn threshold). If a user is sensitive to GC pauses, they will reduce the GC warn threshold to a lower value, e.g. 100 ms, and would then like to see diagnostic tpstats info every time a GC pause over 100 ms occurs. If the user doesn't change the HUGE default GC warn limit (20000+ ms), we stick to the existing behaviour, i.e. dump tpstats at GC pauses of more than 1000 ms, to avoid breaking the existing way of dumping tpstats.

Small concern so we can quickly discuss and close it :)

> Raise GCInspector alerts to WARN
> --------------------------------
>
>                 Key: CASSANDRA-8907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8907
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Adam Hattrell
>            Assignee: Amit Singh Chowdhery
>              Labels: patch
>         Attachments: cassnadra-8907.patch
>
>
> I'm fairly regularly running into folks wondering why their applications are reporting down nodes. Yet they report that when they grep the logs, there are no WARNs or ERRORs listed.
> Nine times out of ten, when I look through the logs we see a ton of ParNew or CMS gc pauses occurring similar to the following:
> INFO [ScheduledTasks:1] 2013-03-07 18:44:46,795 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 1835 ms for 3 collections, 2606015656 used; max is 10611589120
> INFO [ScheduledTasks:1] 2013-03-07 19:45:08,029 GCInspector.java (line 122) GC for ParNew: 9866 ms for 8 collections, 2910124308 used; max is 6358564864
> To my mind these should be WARNs, as they have the potential to significantly impact the cluster's performance as a whole.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)