cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-2118) Provide failure modes if issues with the underlying filesystem of a node
Date Tue, 14 Aug 2012 22:15:38 GMT


Jonathan Ellis updated CASSANDRA-2118:

    Attachment: 2118-tweaked.txt

Looks reasonable.  Tweaked version attached w/ some minor cleanup.

Other things worth addressing:
- Is there a reason for the FSError.Op enum?  Looks like we don't need it if we just use instanceof
instead in handleFSError.
- Instead of trying to catch all the places we iterate sstables, what about either (1) removing
unreadable sstables in DataTracker.get[Uncompacting]SSTables or (2) ripping them out of DataTracker
when we handle the error?  Either of those seems more foolproof to me.
- Would be nice to persist the blacklisted sstables somehow.  Maybe write a copy of the blacklist
to each (other) data directory, so that after a restart we don't try to read sstables we've already
blacklisted?
- May be worth adding another option: best_effort_with_repair, where on detecting an unreadable
disk we kick off a repair to rebuild that data automatically.
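
To make the first two suggestions concrete, here is a minimal sketch of dispatching on the exception type with instanceof (instead of an Op enum) and of option (2), ripping the unreadable sstable out of the tracker when the error is handled. Everything in it is an assumption for illustration: the FSReadError/FSWriteError subclasses, the Tracker stand-in for DataTracker, and the String return value are hypothetical and not taken from the attached patch.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these classes come from the attached patch.
class FsErrorSketch {
    // One subclass per kind of operation, so no separate Op enum is needed.
    static class FSError extends RuntimeException {
        final File path;
        FSError(File path, Throwable cause) { super(cause); this.path = path; }
    }
    static class FSReadError extends FSError {
        FSReadError(File path, Throwable cause) { super(path, cause); }
    }
    static class FSWriteError extends FSError {
        FSWriteError(File path, Throwable cause) { super(path, cause); }
    }

    // Simplified stand-in for DataTracker: sstables are represented by their files.
    static class Tracker {
        private final Set<File> sstables = new LinkedHashSet<>();
        private final Set<File> blacklisted = new HashSet<>();

        void add(File sstable) { sstables.add(sstable); }

        // Option (2): remove the sstable from the live set when the error is handled.
        void markUnreadable(File sstable) {
            sstables.remove(sstable);
            blacklisted.add(sstable);
        }

        // Callers that iterate sstables never see blacklisted ones.
        List<File> getSSTables() { return new ArrayList<>(sstables); }

        boolean isBlacklisted(File sstable) { return blacklisted.contains(sstable); }
    }

    // Dispatch on the exception type; returns a label for the kind of error handled.
    static String handleFSError(FSError e, Tracker tracker) {
        if (e instanceof FSReadError) {
            tracker.markUnreadable(e.path);
            return "read";
        }
        if (e instanceof FSWriteError) {
            return "write";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        File bad = new File("ks-cf-1-Data.db");
        tracker.add(bad);
        tracker.add(new File("ks-cf-2-Data.db"));
        handleFSError(new FSReadError(bad, new IOException("simulated read failure")), tracker);
        System.out.println(tracker.getSSTables());
    }
}
```

Because the tracker itself drops the sstable, callers that iterate getSSTables() do not each need their own error handling, which is the "more foolproof" property described above.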
> Provide failure modes if issues with the underlying filesystem of a node
> ------------------------------------------------------------------------
>                 Key: CASSANDRA-2118
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Chris Goffinet
>            Assignee: Aleksey Yeschenko
>             Fix For: 1.2
>         Attachments: 0001-Provide-failure-modes-if-issues-with-the-underlying-.patch,
>                      0001-Provide-failure-modes-if-issues-with-the-underlying-v2.patch,
>                      0001-Provide-failure-modes-if-issues-with-the-underlying-v3.patch,
>                      2118-tweaked.txt, CASSANDRA-2118-part1.patch, CASSANDRA-2118-v1.patch
>
> CASSANDRA-2116 introduces the ability to detect FS errors. Let's provide a mode in cassandra.yaml
> so operators can decide what to do in the event of a failure:
> 1) standard - continue on all errors (default)
> 2) read - stop the gossip/rpc server only if reads from the drive fail; writes can fail
> without killing the gossip/rpc server
> 3) readwrite - stop the gossip/rpc server on any read or write error
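
The three modes listed above amount to a small decision table. The sketch below encodes it, assuming a hypothetical Mode enum and shouldStopTransports helper; the names, and the way the mode string would be read from cassandra.yaml, are illustrative and not taken from any of the attached patches.

```java
import java.util.Locale;

// Hypothetical sketch of the failure modes described in the issue.
class FailureModeSketch {
    enum Mode { STANDARD, READ, READWRITE }

    // Parse the mode string as it might appear in a config file (assumed format).
    static Mode parse(String value) {
        return Mode.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Decide whether gossip/rpc should be stopped for a given failure.
    // isReadFailure is true for a failed read, false for a failed write.
    static boolean shouldStopTransports(Mode mode, boolean isReadFailure) {
        switch (mode) {
            case READ:
                return isReadFailure;   // only read failures stop the servers
            case READWRITE:
                return true;            // any read or write failure stops them
            case STANDARD:
            default:
                return false;           // continue on all errors
        }
    }

    public static void main(String[] args) {
        Mode mode = parse("read");
        System.out.println(shouldStopTransports(mode, true));
        System.out.println(shouldStopTransports(mode, false));
    }
}
```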
