cassandra-user mailing list archives

From Bryan Cheng <br...@blockcypher.com>
Subject Re: Cassandra compaction appears to stall, node becomes partially unresponsive
Date Wed, 22 Jul 2015 22:35:49 GMT
Hi Aiman,

We previously had issues with GC, but since upgrading to 2.1.7 things seem
a lot healthier.

We collect GC statistics through collectd via the garbage collector MBean.
ParNew GCs report sub-500ms collection time on average (I believe this is
accumulated per minute?), and CMS peaks at about 300ms collection time when
it runs.
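
For reference, a minimal sketch of how such numbers can be derived from the
garbage collector MBean. The MBean exposes cumulative CollectionCount and
CollectionTime (in ms), so a per-interval figure is the difference between
two samples. The class name GcStats, the helper avgPauseMs, and the
one-second sampling interval are hypothetical choices for illustration, and
the sketch reads the local JVM's beans rather than connecting to a remote
Cassandra node over JMX. It is not necessarily how collectd computes its
values.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: samples the cumulative GC counters twice and
// reports the time spent in GC during the interval.
final class GcStats {

    // Average pause per collection between two samples, in milliseconds.
    // Returns 0.0 when no collection ran in the interval.
    static double avgPauseMs(long count0, long timeMs0, long count1, long timeMs1) {
        long collections = count1 - count0;
        if (collections <= 0) {
            return 0.0;
        }
        return (double) (timeMs1 - timeMs0) / collections;
    }

    public static void main(String[] args) throws InterruptedException {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both counters are cumulative since JVM start.
            long count0 = gc.getCollectionCount();
            long time0 = gc.getCollectionTime();

            Thread.sleep(1000); // assumed sampling interval

            long count1 = gc.getCollectionCount();
            long time1 = gc.getCollectionTime();

            System.out.printf("%s: %d collections, %d ms total, %.1f ms avg%n",
                    gc.getName(), count1 - count0, time1 - time0,
                    avgPauseMs(count0, time0, count1, time1));
        }
    }
}
```

If the reported value is the total time accumulated over the interval rather
than the per-collection average, "sub-500ms" per minute would mean well under
1% of wall-clock time spent in ParNew.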

On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <aiman@flipagram.com> wrote:

> Hi Bryan
> How's GC behaving on these boxes?
>
> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <bryan@blockcypher.com>
> wrote:
>
>> Hi there,
>>
>> Within our Cassandra cluster, we're observing, on occasion, one or two
>> nodes at a time becoming partially unresponsive.
>>
>> We're running 2.1.7 across the entire cluster.
>>
>> nodetool still reports the node as being healthy, and it does respond to
>> some local queries; however, the CPU is pegged at 100%. One common thread
>> (heh) each time this happens is that there always seems to be one or more
>> compaction threads running (via nodetool tpstats), and some appear to be
>> stuck (active count doesn't change, pending count doesn't decrease). A
>> request for compactionstats hangs with no response.
>>
>> Each time we've seen this, the only thing that appears to resolve the
>> issue is a restart of the Cassandra process; the restart does not appear to
>> be clean, and requires one or more attempts (or a -9 on occasion).
>>
>> There does not seem to be any pattern to what machines are affected; the
>> nodes thus far have been different instances on different physical machines
>> and on different racks.
>>
>> Has anyone seen this before? Alternatively, when this happens again, what
>> data can we collect that would help with the debugging process (in addition
>> to tpstats)?
>>
>> Thanks in advance,
>>
>> Bryan
>>
>
>
>
> --
> *Aiman Parvaiz*
> Lead Systems Architect
> aiman@flipagram.com
> cell: 213-300-6377
> http://flipagram.com/apz
>
