cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] Updated: (CASSANDRA-809) Full disk can result in being marked down
Date Fri, 02 Apr 2010 02:00:27 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-809:
-------------------------------------

    Fix Version/s:     (was: 0.7)
                   0.8

> Full disk can result in being marked down
> -----------------------------------------
>
>                 Key: CASSANDRA-809
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-809
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan King
>             Fix For: 0.8
>
>
> We had a node fill up the disk under one of two data directories. As a result, the node stopped making progress. The problem appears to be this (I'll update with more details as we find them):
> When new tasks are put onto most queues in Cassandra, if there isn't a thread in the pool to handle the task immediately, the task is run in the caller's thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets the caller-runs policy). The queue in question here is the one that manages flushes, which is enqueued to from various places in our code (and therefore likely from multiple threads). Assuming that the full disk meant that no flushing thread could make progress (it appears that way), eventually any thread that calls the flush code would become stalled.
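
[Editor's illustration, not part of the original report.] The caller-runs behaviour described above can be reproduced with plain java.util.concurrent. This is a minimal sketch under stated assumptions: the class name CallerRunsDemo, the pool size of one, and the single-slot queue are all hypothetical and chosen only to force a rejection; Cassandra's DebuggableThreadPoolExecutor is not used here. The blocked first task stands in for a flush that cannot finish because the disk is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not Cassandra code.
class CallerRunsDemo {
    /** Returns the name of the thread that ends up running the overflow task. */
    static String overflowThreadName() {
        // One worker, one queue slot, caller-runs on rejection.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<String> ranOn = new AtomicReference<>();
        try {
            // Task 1 occupies the only worker (stand-in for a stalled flush).
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            // Task 2 fills the single queue slot.
            pool.execute(() -> { });
            // Task 3 is rejected, so CallerRunsPolicy runs it on the submitting thread.
            pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
            return ranOn.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // unblock the worker so the pool can shut down
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("caller thread: " + Thread.currentThread().getName());
        System.out.println("task ran on:   " + overflowThreadName());
    }
}
```

In this sketch the third task is harmless, so the caller returns at once. If that task were itself a flush blocked on the full disk, the submitting thread would block with it, which is the stall the report describes.
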
> Assuming our analysis is right (and we're still looking into it), we need to make a change. Here's a proposal so far:
> SHORT TERM:
> * Change the ThreadPoolExecutor policy to not be caller-runs. This will let other threads make progress in the event that one pool is stalled.
> LONG TERM:
> * It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have one thread per data directory, dedicated to that directory.
> * Perhaps we could use the failure detector on disks?
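
[Editor's illustration, not part of the original report.] The proposal above (no caller-runs policy, one flush thread dedicated to each data directory) might be sketched as follows. Everything here is an assumption made for illustration: the class PerDirectoryFlushers, its methods, and the directory names /data/a and /data/b are hypothetical and do not come from the Cassandra code base. Each directory gets its own single-thread executor with an unbounded queue, so submitting a flush never runs on, or blocks, the caller.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not Cassandra code.
class PerDirectoryFlushers {
    private final Map<String, ExecutorService> flushers = new LinkedHashMap<>();

    PerDirectoryFlushers(List<String> dataDirectories) {
        for (String dir : dataDirectories) {
            // One dedicated daemon thread per data directory.
            flushers.put(dir, Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "flush-" + dir);
                t.setDaemon(true);
                return t;
            }));
        }
    }

    /** Queues a flush for one directory; the caller never runs the task itself. */
    Future<?> submitFlush(String dir, Runnable flush) {
        ExecutorService executor = flushers.get(dir);
        if (executor == null) {
            throw new IllegalArgumentException("unknown data directory: " + dir);
        }
        return executor.submit(flush);
    }

    void shutdown() {
        flushers.values().forEach(ExecutorService::shutdown);
    }

    /** Stalls the flusher for /data/a and checks that /data/b still completes a flush. */
    static boolean healthyDirectoryStillFlushes() {
        PerDirectoryFlushers f = new PerDirectoryFlushers(Arrays.asList("/data/a", "/data/b"));
        CountDownLatch stuck = new CountDownLatch(1);
        try {
            // Stand-in for a flush that cannot finish because its disk is full.
            f.submitFlush("/data/a", () -> {
                try {
                    stuck.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Future<?> ok = f.submitFlush("/data/b", () -> { });
            ok.get(5, TimeUnit.SECONDS); // times out if /data/b were blocked too
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            stuck.countDown();
            f.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy directory still flushes: " + healthyDirectoryStillFlushes());
    }
}
```

The trade-off in this sketch is that an unbounded queue lets pending flushes for a stalled directory pile up in memory; a real change would presumably need a bound or some form of disk health check, such as the failure-detector idea in the last bullet.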

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

