Date: Wed, 20 Mar 2013 16:27:16 +0000 (UTC)
From: "Brooke Bryan (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-5367) Hints stuck on compaction

     [ https://issues.apache.org/jira/browse/CASSANDRA-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brooke Bryan updated CASSANDRA-5367:
------------------------------------
    Attachment: thread.log

> Hints stuck on compaction
> -------------------------
>
>                 Key: CASSANDRA-5367
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5367
>             Project: Cassandra
>          Issue Type: Bug
>   Affects Versions: 1.2.2
>         Environment: 80 Node cluster on 1.2.2 (problem has been around since before 1.0)
>            Reporter: Brooke Bryan
>         Attachments: thread.log
>
>
> When our cluster is handling hints, we very often see hints get stuck on nodes that are unable to communicate with another node. The problem is not that the other node is down; it is busy running compactions or is running out of memory. While that node is a problem in its own right and needs to be fixed, every other node in the cluster gets stuck waiting to handle the hints between itself and that node.
> This has a major knock-on effect throughout the entire cluster, causing hints to back up. We are seeing some nodes backed up with 14GB of hints after the hints have been stuck for 2 days.
> Also, while hints are "stuck", compactionstats shows a compaction on the system hints column family whose completed bytes amount does not change.
> In my experience, this is the only thing that causes the entire cluster to get badly bogged down, and it requires a lot of manual intervention to get everything back online.
> After putting a node into debug mode, I have narrowed the issue down to:
> startColumn = hint.name(); (line ~361 HintedHandoffManager) and line 390
> based on the log output and on pausing handoffs etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira