From: "Ryan King (JIRA)"
To: cassandra-commits@incubator.apache.org
Reply-To: cassandra-dev@incubator.apache.org
Date: Fri, 19 Feb 2010 03:20:27 +0000 (UTC)
Message-ID: <1708016763.377231266549627897.JavaMail.jira@brutus.apache.org>
Subject: [jira] Created: (CASSANDRA-809) Full disk can result in being marked down

Full disk can result in being marked down
-----------------------------------------

                 Key: CASSANDRA-809
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-809
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.5, 0.6, 0.7
            Reporter: Ryan King

We had a node fill up the disk under one of two data directories. The result was that the node stopped making progress. The problem appears to be this (I'll update with more details as we find them):

When new tasks are put onto most queues in Cassandra, if there isn't a thread in the pool to handle the task immediately, the task is run in the caller's thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets the caller-runs policy).

The queue in question here is the one that manages flushes, which is enqueued to from various places in our code (and therefore likely from multiple threads). Assuming that the full disk meant that no flushing thread could make progress (it appears that way), eventually any thread that calls the flush code would become stalled.

Assuming our analysis is right (and we're still looking into it), we need to make a change. Here's a proposal so far:

SHORT TERM:
* Change the ThreadPoolExecutor policy to not be caller-runs. This will let other threads make progress in the event that one pool is stalled.

LONG TERM:
* It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have one thread per data directory, dedicated to that directory.
* Perhaps we could use the failure detector on disks?
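
To make the suspected mechanism concrete, here is a minimal sketch of how a caller-runs policy pulls work onto the submitting thread once a pool is saturated. It is not Cassandra's code: the class name, the single-worker sizing, and the SynchronousQueue are assumptions chosen only so the pool saturates after one task. The first task stands in for a flush that cannot make progress on a full disk.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of Cassandra.
final class CallerRunsDemo {

    /** Returns the name of the thread that ended up running the second task. */
    static String threadThatRanSecondTask() {
        // Assumed sizing: one worker and no queue capacity, so the pool is
        // saturated as soon as a single task is in flight.
        ThreadPoolExecutor flushPool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch releaseWorker = new CountDownLatch(1);
        AtomicReference<String> ranOn = new AtomicReference<>();

        try {
            // Stand-in for a flush that is stuck because the disk is full.
            flushPool.execute(() -> {
                workerBusy.countDown();
                try {
                    releaseWorker.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workerBusy.await();

            // The pool is saturated, so this task is rejected and
            // CallerRunsPolicy runs it synchronously on the submitting thread.
            flushPool.execute(() -> ranOn.set(Thread.currentThread().getName()));
            return ranOn.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the worker", e);
        } finally {
            // Unblock the stand-in flush so the pool can shut down cleanly.
            releaseWorker.countDown();
            flushPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("caller thread:      " + Thread.currentThread().getName());
        System.out.println("second task ran on: " + threadThatRanSecondTask());
    }
}
```

In this sketch the second task is trivial, so the caller returns right away. If it were another flush against the same full disk, the submitting thread would block inside execute(), which is the stall described above. Swapping in a different RejectedExecutionHandler (for example ThreadPoolExecutor.AbortPolicy, which throws RejectedExecutionException instead) keeps the caller free, at the cost of having to handle the rejection.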