cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring
Date Mon, 18 Apr 2011 17:18:06 GMT


Peter Schuller commented on CASSANDRA-2405:

A further complication: since the intent here is to let people set up alarms that trigger
whenever the time since the last successful repair is outside an acceptable range, this raises
the question of whether to persist the information in system tables or keep it only in memory.
Keeping in mind that:

(1) For large amounts of data, the cost of running another round of AES "just in case" after
a node restart is significant.
(2) If the alarm were to be triggered when the information is not available, every node
restart would immediately produce false positives, rendering the alarm useless to operations.
(3) If the alarm were to ignore the case where the information is not yet available, that
would be a very dangerous silent failure; it effectively means the alarm is not functioning.

... I get the feeling one wants this information persistent.
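
To make the trade-off above concrete, here is a minimal sketch of the alarm logic on the
monitoring side. It is not taken from the attached patch: the JSON file merely stands in for
"persistent in system tables", and the function names, the file layout and the
`last_successful_repair` key are all hypothetical. The point it illustrates is that once the
timestamp survives restarts, "no information" can safely be treated as an alarm rather than
being ignored.

```python
import json
import os
import tempfile
import time
from typing import Optional

OK, CRITICAL = "OK", "CRITICAL"


def save_last_repair(path: str, epoch_seconds: float) -> None:
    """Persist the time of the last successful repair (hypothetical store).

    Writes to a temp file and renames it, so a crash mid-write cannot
    leave a truncated record behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"last_successful_repair": epoch_seconds}, handle)
    os.replace(tmp_path, path)


def load_last_repair(path: str) -> Optional[float]:
    """Return the persisted timestamp, or None if missing or unreadable."""
    try:
        with open(path) as handle:
            return float(json.load(handle)["last_successful_repair"])
    except (OSError, ValueError, KeyError, TypeError):
        return None


def check_repair_age(last_repair: Optional[float],
                     max_age_seconds: float,
                     now: Optional[float] = None) -> str:
    """Classify the repair state for an alarm.

    Missing information is CRITICAL, never a silent pass (case 3 above).
    Because the timestamp is persisted, a plain restart does not erase it,
    so this does not cause the false positives described in case 2.
    """
    current = time.time() if now is None else now
    if last_repair is None:
        return CRITICAL
    return OK if current - last_repair <= max_age_seconds else CRITICAL
```

With an in-memory-only value, `load_last_repair` would effectively return `None` after every
restart, and the check would be forced to choose between cases (2) and (3).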

I guess this all makes the ticket non-trivial, but I think having an "easy" way for
operators to ensure sufficient AES frequency is important.

(I'm actually kind of surprised issues with this do not crop up more often on the mailing
lists... am I missing something that mitigates the impact here, or are people just using sufficiently
long grace periods relative to repair frequency that they're not hitting these things in practice?)
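
As a rough illustration of "grace period relative to repair frequency", the sketch below
computes how much headroom remains before the time since the last successful repair exceeds
the grace period. It assumes the grace period in question is the column family's
gc_grace_seconds setting; the function names, the warning threshold and the repair-duration
allowance are hypothetical and chosen only for the example.

```python
def repair_margin_seconds(gc_grace_seconds: int,
                          seconds_since_last_repair: int,
                          expected_repair_duration_seconds: int = 0) -> int:
    """Seconds of headroom left before the next repair must have finished.

    A negative result means the grace period has already been exceeded.
    """
    return (gc_grace_seconds
            - seconds_since_last_repair
            - expected_repair_duration_seconds)


def repair_status(margin_seconds: int, warn_below_seconds: int) -> str:
    """Map the remaining headroom to an alarm level."""
    if margin_seconds <= 0:
        return "CRITICAL"
    if margin_seconds < warn_below_seconds:
        return "WARNING"
    return "OK"


# Example: 10-day grace period, last repair 7 days ago, repair takes ~6 hours.
margin = repair_margin_seconds(10 * 86400, 7 * 86400, 6 * 3600)
status = repair_status(margin, warn_below_seconds=86400)
```

A long grace period relative to the repair interval keeps this margin comfortably positive,
which may be why the problem is rarely noticed in practice.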

> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>                 Key: CASSANDRA-2405
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 0.7.5
>         Attachments: CASSANDRA-2405.patch
> The practical implementation issues of actually ensuring repair runs is somewhat of an
> undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since last successful
> repair for a particular column family, to make it easier to write a correct script to monitor
> for lack of repair in a non-buggy fashion.

