cassandra-commits mailing list archives

From "Anuj Wadehra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-8907) Raise GCInspector alerts to WARN
Date Sat, 05 Sep 2015 13:25:46 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731797#comment-14731797 ]

Anuj Wadehra edited comment on CASSANDRA-8907 at 9/5/15 1:25 PM:
-----------------------------------------------------------------

[~johnny15676] Got your point! I think there are 2 scenarios:

1. Suppose we enable the gc pause warn limit by default and set it to some value, say 200ms. Our
guess may be wrong, as an application could be comfortable with 200ms gc pauses. If someone then
upgrades to a minor version, they will 'break' their log monitoring/warning system (as you
said), because the new Warnings for gc pauses greater than 200ms are undesirable.

2. Suppose a user's nodes are going down intermittently due to long GC pauses (20+ secs),
but their log monitoring/warning system is quiet and not reporting any issue. This
is a BUG. If such users upgrade with the gc pause warn limit enabled by default and set to
a much higher value, say 20000ms, they will start getting Warnings whenever they hit ad hoc
huge gc pauses over 20 sec. I won't call that 'breaking' their log monitoring system, as it is
a serious issue: otherwise their nodes will go DOWN intermittently without raising Warnings.

So, I advocate targeting scenario 2, where this property is enabled by default and set to
a very high value (20000+ ms). This way we are not trying to guess which gc pauses are OK for
a user application, i.e. whether 100ms, 200ms or 1000ms is comfortable for it. But
at the same time we raise a warning when there is a serious gc pause which may cause a
node to be marked down. A user must get Warnings if such huge gc pauses are happening.

Any user upgrading to a minor version will have the option to decrease the value based on their
application requirements or leave it as it is.
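
To make the proposal concrete, here is a minimal sketch (in Python for brevity, not the actual GCInspector code) of picking the log level from a configurable warn threshold. The names DEFAULT_GC_WARN_THRESHOLD_MS and gc_log_level are hypothetical, and the 20000ms default is just the value suggested above.

```python
import logging

# Hypothetical default, taken from the value proposed in this comment.
# This is not an actual Cassandra setting name.
DEFAULT_GC_WARN_THRESHOLD_MS = 20_000


def gc_log_level(pause_ms, warn_threshold_ms=DEFAULT_GC_WARN_THRESHOLD_MS):
    """Return logging.WARNING if the pause exceeds the threshold, else logging.INFO."""
    if pause_ms > warn_threshold_ms:
        return logging.WARNING
    return logging.INFO
```

With the high default, a 9866ms pause would still be logged at INFO, while a user who lowers the threshold to 200ms would see the same pause at WARN.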

Moreover, if you agree with the above, I would suggest that tpstats should
be logged at min(1000ms, gc warn threshold), where the threshold is e.g. 20000ms by default. If a
user's application is sensitive to gc pauses, they will reduce the gc warn threshold to a lower
value, e.g. 100ms, and then they will see diagnostic tpstats info every time a gc pause over
100ms occurs. If the user doesn't change the HUGE default gc warn limit (20000+ ms), we would
stick to the existing behaviour, i.e. dump tpstats at gc pauses of more than 1000ms, to avoid
breaking the existing way of dumping tpstats.
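
A rough sketch of the min() rule for tpstats, again in Python and only as an illustration. The function names are hypothetical; the 1000ms constant is the existing tpstats cut-off as described in this comment.

```python
# Pause length above which tpstats are dumped today, per this comment.
EXISTING_TPSTATS_THRESHOLD_MS = 1_000


def tpstats_threshold_ms(gc_warn_threshold_ms):
    """Dump tpstats at the lower of the existing cut-off and the warn threshold."""
    return min(EXISTING_TPSTATS_THRESHOLD_MS, gc_warn_threshold_ms)


def should_dump_tpstats(pause_ms, gc_warn_threshold_ms):
    """True when a gc pause is long enough to warrant a tpstats dump."""
    return pause_ms > tpstats_threshold_ms(gc_warn_threshold_ms)
```

So with the default threshold of 20000ms the cut-off stays at 1000ms, and with a threshold lowered to 100ms a 150ms pause would already trigger the dump.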

Small concern so we can quickly discuss and close it :)


was (Author: eanujwa):
[~johnny15676] Got your point !!! I think there are 2 scenarios: 

1. Suppose I set it to any value, say 200ms, and my application was comfortable with it. Now,
if I upgrade to a minor version then I will 'break' my warning system (as you said), as Warnings
were undesirable.

2. My nodes are going down intermittently due to long GC pauses (20+ secs) but my Warning
system is comfortable and not reporting any issue. This is a BUG. Now, if I upgrade with
the default value of this property set to 20000ms, I start getting these Warnings. I won't
call it 'breaking' my warning system, as it is a serious issue; otherwise my nodes will go DOWN
intermittently without raising Warnings.

So, I advocate targeting scenario 2, where this property is enabled by default and set to
an unreasonably high value (20000+ ms) so that I don't break existing warning systems (as I can't
guess whether 100ms or 200ms or 1000ms is comfortable for an application). But, at the same
time I raise the warning when there is a serious chance of a node being marked down.

Any user upgrading to minor version will have the option to decrease the value based on his
application requirements or leave it as it is. 

Moreover, if you agree with my above mentioned opinion, I would suggest that tpstats should
be logged at min(1000ms, gc warn threshold). If a user is sensitive to gc pauses, he will reduce
the gc warn threshold to a lower value, e.g. 100ms, and then he would like to see diagnostic
tpstats info every time a gc pause over 100ms occurs. If the user doesn't change the HUGE default
gc warn limit (20000+), we would stick to the existing way, i.e. dump tpstats at gc pauses of more
than 1000ms, to avoid breaking the existing way of dumping tpstats.

Small concern so we can quickly discuss and close it :)

> Raise GCInspector alerts to WARN
> --------------------------------
>
>                 Key: CASSANDRA-8907
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8907
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Adam Hattrell
>            Assignee: Amit Singh Chowdhery
>              Labels: patch
>         Attachments: cassnadra-8907.patch
>
>
> I'm fairly regularly running into folks wondering why their applications are reporting down nodes.  Yet, they report, when they grepped the logs they have no WARN or ERRORs listed.
> Nine times out of ten, when I look through the logs we see a ton of ParNew or CMS gc pauses occurring similar to the following:
> INFO [ScheduledTasks:1] 2013-03-07 18:44:46,795 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 1835 ms for 3 collections, 2606015656 used; max is 10611589120
> INFO [ScheduledTasks:1] 2013-03-07 19:45:08,029 GCInspector.java (line 122) GC for ParNew: 9866 ms for 8 collections, 2910124308 used; max is 6358564864
> To my mind these should be WARN's as they have the potential to be significantly impacting the clusters performance as a whole.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
