cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-7275) Errors in FlushRunnable may leave threads hung
Date Wed, 17 Dec 2014 11:17:14 GMT


Sylvain Lebresne commented on CASSANDRA-7275:

The current behavior is that an unexpected flush error blocks every subsequent flush. It does seem
to me that changing it so that it blocks only flushes for the column family on which there
was a problem (which is not exactly what the patch does, and I do agree with Benedict that
it does need to do that) is an improvement: if the problem happens for every CF then we're
no worse off than currently, but if it's a one-time event it might leave time for operators to
take proper action (of course, we should log a scary error; it's not something that should
be ignored). So maybe we can start there, since we don't seem to agree on whether crashing
the node is an even better improvement?

As far as my own opinion goes, I am not in favor of crashing in that case because, again,
if you hold enough memtables in memory that your node becomes unresponsive, you're not really
worse off than if you had crashed it right away, but if the problem ends up impacting a low
traffic table (for instance a system table), you might be able to fix the problem in a way
that is less impactful for your cluster.

I'll note however that I would agree that if the error is an IO one, we should respect the
disk_failure_policy. And I don't know, maybe we need another failure policy (best_effort/crash)
for unexpected errors (aka bugs) that have the potential to destabilize a node (I would
agree that adding this is pushing the problem to our users, but it appears not everyone has
the same idea of what the best strategy is, and there may not be a single good answer).
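
To make the two options above concrete, here is a rough sketch of blocking flushes only for the column family that failed, with a switch between the two behaviors. Everything in it is hypothetical: the class FlushFailureSketch, the enum UnexpectedErrorPolicy (named after the best_effort/crash idea) and the methods onFlushFailure and canFlush are illustrative only and are not Cassandra code or configuration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Cassandra code: tracks which column families
// have had an unexpected flush error and decides what may still flush.
class FlushFailureSketch {
    // Hypothetical policy values, named after the "best_effort/crash" idea.
    enum UnexpectedErrorPolicy { BEST_EFFORT, CRASH }

    private final UnexpectedErrorPolicy policy;
    private final Set<String> blockedColumnFamilies = ConcurrentHashMap.newKeySet();
    private volatile boolean stopped = false;

    FlushFailureSketch(UnexpectedErrorPolicy policy) {
        this.policy = policy;
    }

    // Called when a flush for the given column family fails unexpectedly.
    void onFlushFailure(String columnFamily, Throwable error) {
        // Always log loudly; the error must not be silently ignored.
        System.err.println("ERROR: flush failed for " + columnFamily + ": " + error);
        if (policy == UnexpectedErrorPolicy.CRASH) {
            stopped = true; // stand-in for shutting the node down
        } else {
            blockedColumnFamilies.add(columnFamily); // block only this column family
        }
    }

    // True if a flush for the given column family may still proceed.
    boolean canFlush(String columnFamily) {
        return !stopped && !blockedColumnFamilies.contains(columnFamily);
    }

    public static void main(String[] args) {
        FlushFailureSketch s = new FlushFailureSketch(UnexpectedErrorPolicy.BEST_EFFORT);
        s.onFlushFailure("cf_a", new IllegalArgumentException("simulated"));
        System.out.println("cf_a can flush: " + s.canFlush("cf_a"));
        System.out.println("cf_b can flush: " + s.canFlush("cf_b"));
    }
}
```

With BEST_EFFORT only the failing column family stops flushing, so a problem confined to a low traffic table leaves the others unaffected; with CRASH nothing flushes after the first failure.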

> Errors in FlushRunnable may leave threads hung
> ----------------------------------------------
>                 Key: CASSANDRA-7275
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Tyler Hobbs
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 2.0.12
>         Attachments: 0001-Move-latch.countDown-into-finally-block.patch, 7252-2.0-v2.txt,
> In Memtable.FlushRunnable, the CountDownLatch will never be counted down if there are
> errors, which results in hanging any threads that are waiting for the flush to complete.
> For example, an error like this causes the problem:
> {noformat}
> ERROR [FlushWriter:474] 2014-05-20 12:10:31,137 (line 198) Exception in thread Thread[FlushWriter:474,5,main]
> java.lang.IllegalArgumentException
>     at java.nio.Buffer.position(Unknown Source)
>     at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(
>     at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(
>     at org.apache.cassandra.db.marshal.AbstractCompositeType.split(
>     at
>     at org.apache.cassandra.db.ColumnFamily.getColumnStats(
>     at
>     at org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(
>     at org.apache.cassandra.db.Memtable$FlushRunnable.runWith(
>     at
>     at
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$ Source)
>     at Source)
> {noformat}
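
The hang described above, and the approach suggested by the name of the first attachment (moving latch.countDown into a finally block), can be illustrated with a small standalone sketch. This is not the actual patch or the real Memtable.FlushRunnable: the names FlushLatchSketch, FlushTask and waiterReleasedAfterFailure are made up for the example, and the flush work is simulated by a lambda that throws.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Cassandra code: shows why counting the latch
// down in a finally block releases waiters even when the flush throws.
class FlushLatchSketch {
    // Stand-in for a flush task; the real FlushRunnable is more involved.
    static final class FlushTask implements Runnable {
        private final CountDownLatch latch;
        private final Runnable work;

        FlushTask(CountDownLatch latch, Runnable work) {
            this.latch = latch;
            this.work = work;
        }

        @Override
        public void run() {
            try {
                work.run(); // may throw an unexpected error
            } finally {
                latch.countDown(); // always runs, so waiters are never left hung
            }
        }
    }

    // Runs a failing flush on its own thread; returns whether the waiter was released.
    static boolean waiterReleasedAfterFailure() {
        CountDownLatch latch = new CountDownLatch(1);
        Thread t = new Thread(new FlushTask(latch, () -> {
            throw new IllegalArgumentException("simulated flush error");
        }));
        // Swallow the simulated error so the demo output stays clean.
        t.setUncaughtExceptionHandler((thread, e) -> { });
        t.start();
        try {
            // Bounded wait: without the finally block this would time out.
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter released: " + waiterReleasedAfterFailure());
    }
}
```

Releasing the latch only unblocks the waiting threads. It does not report the failure to them, so whether waiters should also learn that the flush failed is a separate question.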
