cassandra-commits mailing list archives

From "Edward Capriolo (JIRA)" <>
Subject [jira] Reopened: (CASSANDRA-1776) Untrapped exceptions in ThreadPool have a variety of ill effects
Date Mon, 29 Nov 2010 17:19:11 GMT


Edward Capriolo reopened CASSANDRA-1776:

I may have explained this poorly. On two occasions, about 20 minutes after this appeared in
the logs, Cassandra on this node was at 100% user+system CPU on all cores. The entire cluster
then quickly degraded: many pending messages in the Gossip stage, and every node at 100% CPU
on all cores. The only course of action is to bring down the entire cluster or, if you catch
the problem early enough, to bring down multiple nodes at a time.

> Untrapped exceptions in ThreadPool have a variety of ill effects
> ----------------------------------------------------------------
>                 Key: CASSANDRA-1776
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.6.5
>            Reporter: Edward Capriolo
>         Attachments: logs
> I have seen a variety of conditions that keep the Cassandra process running even though
> it has mostly failed. At times the node stays up and keeps sending gossip messages, so other
> nodes think it is up. In the worst case, a node gets into a tight loop, fully utilizing all
> 16 cores of the system and sending gossip messages that cause cascading issues across the
> cluster.

> I have seen untrapped OOM errors. The interesting part of the attached log is that we
> are not using super columns. I also have machines that come up out of a 40-second garbage
> collection (I assume they send gossip messages announcing themselves as UP), then go back
> into garbage collection, and repeat the cycle.
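
For illustration only, here is a minimal Java sketch of one general way to trap failures that would otherwise escape a thread pool, using the standard `ThreadPoolExecutor.afterExecute` hook. This is not Cassandra's code and not the fix applied for this issue; the class name `TrappingExecutor` and the `onFailure` callback are hypothetical names introduced for the example. What the callback should do (log, mark the node down, exit the process) is a policy decision left open here, and handling an `OutOfMemoryError` this way may not be reliable in practice.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical example class: a fixed-size pool that reports every task failure
// to a caller-supplied callback instead of letting it go unnoticed.
public class TrappingExecutor extends ThreadPoolExecutor {
    private final Consumer<Throwable> onFailure; // hypothetical failure policy hook

    public TrappingExecutor(int threads, Consumer<Throwable> onFailure) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        this.onFailure = onFailure;
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);
        // Tasks passed to submit() are wrapped in a Future that swallows the
        // throwable, so t is null here; unwrap the Future to recover the cause.
        if (t == null && r instanceof Future<?> && ((Future<?>) r).isDone()) {
            try {
                ((Future<?>) r).get();
            } catch (ExecutionException e) {
                t = e.getCause();
            } catch (CancellationException e) {
                // A cancelled task is not treated as a failure in this sketch.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        if (t != null) {
            onFailure.accept(t);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        TrappingExecutor pool = new TrappingExecutor(1, seen::set);
        Runnable failing = () -> { throw new IllegalStateException("simulated task failure"); };
        pool.submit(failing);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("trapped: " + seen.get());
    }
}
```

The hook runs on the worker thread after each task, so a failure is observed even when nobody calls `get()` on the returned `Future`. For tasks started with `execute()` the throwable arrives directly in `t`.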

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
