cassandra-user mailing list archives

From Aiman Parvaiz <ai...@flipagram.com>
Subject Re: Cassandra compaction appears to stall, node becomes partially unresponsive
Date Wed, 22 Jul 2015 23:56:48 GMT
I faced something similar in the past, and the reason for the nodes becoming
intermittently unresponsive was long GC pauses. That's why I wanted to bring this to
your attention, in case GC pauses are a potential cause.

Sent from my iPhone
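
(For anyone else digging into this: one cheap way to confirm the long-pause theory,
besides watching for GCInspector lines in the Cassandra system.log, is a small
watchdog thread that sleeps for a short interval and flags whenever the observed gap
is much longer than requested. A rough sketch in plain Java; the interval and
threshold are illustrative values, not tuned recommendations.)

// Minimal stop-the-world pause detector: sleep briefly and report whenever the
// actual gap between wake-ups is much larger than the requested sleep.
public class PauseWatchdog {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100;   // how often we wake up
        final long thresholdMs = 500;  // report gaps longer than this
        long last = System.nanoTime();
        while (true) {
            Thread.sleep(intervalMs);
            long now = System.nanoTime();
            long gapMs = (now - last) / 1_000_000L;
            if (gapMs > intervalMs + thresholdMs) {
                System.out.printf("Possible GC/scheduling pause: ~%d ms%n",
                        gapMs - intervalMs);
            }
            last = now;
        }
    }
}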

> On Jul 22, 2015, at 4:32 PM, Bryan Cheng <bryan@blockcypher.com> wrote:
> 
> Aiman,
> 
> Your post made me look back at our data a bit. The most recent occurrence of this
> incident was not preceded by any abnormal GC activity; however, the previous
> occurrence (which took place a few days ago) did correspond to a massive,
> order-of-magnitude increase in both ParNew and CMS collection times which lasted
> ~17 hours.
> 
> Was there something in particular that links GC to these stalls? At this point in
> time, we cannot identify any particular reason for either that GC spike or the
> subsequent apparent compaction stall, although it did not seem to have any effect
> on our usage of the cluster.
> 
>> On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng <bryan@blockcypher.com> wrote:
>> Hi Aiman,
>> 
>> We previously had issues with GC, but since upgrading to 2.1.7 things seem a lot
>> healthier.
>> 
>> We collect GC statistics through collectd via the garbage collector mbean. ParNew
>> GCs report sub-500ms collection time on average (I believe accumulated per
>> minute?), and CMS peaks at about 300ms collection time when it runs.
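
(Side note for anyone wiring up similar monitoring without collectd: the same numbers
come from the JVM's standard garbage collector mbeans, which report cumulative
totals, so you diff successive samples to get a per-interval figure. The rough sketch
below polls the in-process beans; against a running Cassandra node you would read the
same attributes remotely over JMX instead.)

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Prints the collection time each collector (e.g. ParNew, ConcurrentMarkSweep)
// accumulated during the last interval. The mbeans expose running totals, so we
// keep the previous sample and subtract.
public class GcPoller {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> lastTotal = new HashMap<>();
        while (true) {
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                long total = gc.getCollectionTime(); // ms since JVM start
                long prev = lastTotal.getOrDefault(gc.getName(), total);
                System.out.printf("%s: %d ms collection time this interval%n",
                        gc.getName(), total - prev);
                lastTotal.put(gc.getName(), total);
            }
            Thread.sleep(60_000); // sample once per minute
        }
    }
}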
>> 
>>> On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <aiman@flipagram.com> wrote:
>>> Hi Bryan
>>> How's GC behaving on these boxes?
>>> 
>>>> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <bryan@blockcypher.com> wrote:
>>>> Hi there,
>>>> 
>>>> Within our Cassandra cluster, we're observing, on occasion, one or two nodes at
>>>> a time becoming partially unresponsive.
>>>> 
>>>> We're running 2.1.7 across the entire cluster.
>>>> 
>>>> nodetool still reports the node as being healthy, and it does respond to some
>>>> local queries; however, the CPU is pegged at 100%. One common thread (heh) each
>>>> time this happens is that there always seems to be one or more compaction
>>>> threads running (via nodetool tpstats), and some appear to be stuck (the active
>>>> count doesn't change, the pending count doesn't decrease). A request for
>>>> compactionstats hangs with no response.
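
(One thing worth trying the next time compactionstats hangs: read the compaction
thread pool attributes directly over JMX, since the mbean read sometimes answers even
when the nodetool call doesn't, and if it also hangs that's useful data in itself. A
rough sketch, assuming the default JMX port 7199 and the 2.1-era internal thread pool
mbean; the ObjectName and attribute names are from memory and may differ by version.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Reads compaction thread pool attributes straight over JMX, bypassing nodetool.
// The ObjectName and attribute names below are assumptions based on Cassandra 2.1
// and may need adjusting for other versions.
public class CompactionPeek {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            ObjectName pool = new ObjectName(
                    "org.apache.cassandra.internal:type=CompactionExecutor");
            for (String attr : new String[] {
                    "ActiveCount", "PendingTasks", "CompletedTasks"}) {
                System.out.println(attr + " = " + conn.getAttribute(pool, attr));
            }
        } finally {
            jmxc.close();
        }
    }
}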
>>>> 
>>>> Each time we've seen this, the only thing that appears to resolve the issue is
>>>> a restart of the Cassandra process; the restart does not appear to be clean,
>>>> and requires one or more attempts (or a -9 on occasion).
>>>> 
>>>> There does not seem to be any pattern to what machines are affected; the nodes
>>>> thus far have been different instances on different physical machines and on
>>>> different racks.
>>>> 
>>>> Has anyone seen this before? Alternatively, when this happens again, what data
>>>> can we collect that would help with the debugging process (in addition to
>>>> tpstats)?
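
(On the "what data can we collect" question: a thread dump of the stuck node taken
while the CPU is pegged is probably the most useful single artifact; jstack <pid> or
kill -3 <pid> will produce one. The same information is also available
programmatically from the standard ThreadMXBean, for example filtered to the
compaction threads, assuming they keep the usual "CompactionExecutor" naming.)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dumps stack traces of threads whose names mention compaction. The name filter
// assumes Cassandra's usual "CompactionExecutor:N" thread naming; jstack gives the
// same information for the whole JVM.
public class CompactionThreadDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            if (info.getThreadName().contains("Compaction")) {
                System.out.println(info.toString());
            }
        }
    }
}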
>>>> 
>>>> Thanks in advance,
>>>> 
>>>> Bryan
>>> 
>>> 
>>> 
>>> -- 
>>> Aiman Parvaiz
>>> Lead Systems Architect
>>> aiman@flipagram.com
>>> cell: 213-300-6377
>>> http://flipagram.com/apz
> 
