Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Mon, 18 Apr 2011 17:18:06 +0000 (UTC)
From: "Peter Schuller (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: 
 <1500178936.64875.1303147086509.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <536685644.21759.1301506985705.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CASSANDRA-2405) should expose 'time since last
 successful repair' for easier aes monitoring
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021122#comment-13021122 ] 

Peter Schuller commented on CASSANDRA-2405:
-------------------------------------------

A further complication: Since the intent here is to enable people to set up alarms to trigger whenever the time-since-last is not within an acceptable range, it raises the issue of whether to keep this information persistent in system tables or just in-memory. Keeping in mind that:

(1) For large amounts of data, running another round of AES "just in case" after a node restart is a significant cost.
(2) If the alarm were to trigger on the information not being available, every node restart would immediately cause false positives, rendering the alarm useless to operations.
(3) If the alarm were to ignore the case where the information is not yet available, that is a very dangerous silent failure and effectively means the alarm is not functioning properly.

... I get the feeling one wants this information persistent.
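
To make the trade-off concrete, here is a rough sketch of the alarm semantics discussed above. It is not Cassandra code and does not reflect the attached patch: a small JSON file stands in for a system table, and the function names (record_successful_repair, check_repair_age), the file layout, and the OK/STALE/UNKNOWN states are all hypothetical. The point is only that a persisted timestamp survives a restart, so "UNKNOWN" is left to mean "never repaired" rather than "node was restarted".

```python
import json
import os
import tempfile
import time
from typing import Dict, Optional


def load_state(path: str) -> Dict[str, float]:
    """Load persisted repair timestamps; a missing file means nothing recorded yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def record_successful_repair(path: str, column_family: str,
                             now: Optional[float] = None) -> None:
    """Persist the completion time of a successful repair (hypothetical helper)."""
    state = load_state(path)
    state[column_family] = time.time() if now is None else now
    # Write to a temp file and rename, so a crash cannot leave a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def check_repair_age(path: str, column_family: str, max_age_seconds: float,
                     now: Optional[float] = None) -> str:
    """Return "OK", "STALE", or "UNKNOWN" for the given column family."""
    now = time.time() if now is None else now
    last = load_state(path).get(column_family)
    if last is None:
        # No record at all. With persistent state this should not happen merely
        # because of a restart, so it can be alarmed on instead of ignored.
        return "UNKNOWN"
    return "OK" if now - last <= max_age_seconds else "STALE"
```

With in-memory state, the "UNKNOWN" branch would be hit after every restart, which is exactly the choice between (2) and (3) above. With persisted state, a monitoring script can treat both "STALE" and "UNKNOWN" as alarm conditions.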

I guess this all makes the ticket non-trivial, but I think it is important to give operators an "easy" way to ensure sufficient AES frequency.

(I'm actually kind of surprised issues with this do not crop up more often on the mailing lists... am I missing something that mitigates the impact here, or are people just using sufficiently long grace periods relative to repair frequency that they're not hitting these things in practice?)

> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 0.7.5
>
>         Attachments: CASSANDRA-2405.patch
>
>
> The practical implementation issues of actually ensuring repair runs is somewhat of an undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since last successful repair for a particular column family, to make it easier to write a correct script to monitor for lack of repair in a non-buggy fashion.
