cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring
Date Mon, 27 Jun 2011 12:44:47 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055507#comment-13055507 ]

Sylvain Lebresne commented on CASSANDRA-2405:
---------------------------------------------

I'm sorry, but I still think we are returning the wrong number to the user. To be clear,
this is nothing against the code of the patch itself; I just think that, given the way repair
works, it is not so simple to define a "time since last successful repair".

The "unit" of a repair is a given keyspace, column family and range. Because of that,
I don't think we can return a single "time since last successful repair" for a given keyspace
and column family. It has to include the range somehow. Granted, so far a nodetool repair
repairs all the ranges of the node you launch it on, but I don't think this should be the
case (CASSANDRA-2610). Moreover, even now, one of the ranges can fail without the others. So
returning only one number for all ranges is wrong.
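
To illustrate what "including the range" could look like, here is a minimal sketch in Java.
It is not taken from any of the attached patches: RepairTracker, RepairUnit and their methods
are hypothetical names, and the range is represented as a plain string purely for brevity.
The only point is that the last-success timestamp is keyed by (keyspace, column family, range)
rather than by (keyspace, column family).

```java
import java.util.Map;
import java.util.Objects;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names exist in Cassandra.
final class RepairTracker {

    /** One repair "unit": keyspace, column family and range (range kept as a string here). */
    static final class RepairUnit {
        final String keyspace;
        final String columnFamily;
        final String range;

        RepairUnit(String keyspace, String columnFamily, String range) {
            this.keyspace = Objects.requireNonNull(keyspace);
            this.columnFamily = Objects.requireNonNull(columnFamily);
            this.range = Objects.requireNonNull(range);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof RepairUnit)) return false;
            RepairUnit other = (RepairUnit) o;
            return keyspace.equals(other.keyspace)
                && columnFamily.equals(other.columnFamily)
                && range.equals(other.range);
        }

        @Override
        public int hashCode() {
            return Objects.hash(keyspace, columnFamily, range);
        }
    }

    // Completion time (epoch millis) of the last successful repair, per unit.
    private final Map<RepairUnit, Long> lastSuccessMillis = new ConcurrentHashMap<>();

    /** Record a successful repair of one unit; an older timestamp never overwrites a newer one. */
    void recordSuccess(RepairUnit unit, long finishedAtMillis) {
        lastSuccessMillis.merge(unit, finishedAtMillis, Long::max);
    }

    /** Time since the last successful repair of this unit, or empty if none was ever recorded. */
    OptionalLong millisSinceLastSuccess(RepairUnit unit, long nowMillis) {
        Long last = lastSuccessMillis.get(unit);
        return last == null ? OptionalLong.empty() : OptionalLong.of(nowMillis - last);
    }
}
```

With this shape, a repair in which one range succeeds and another fails leaves the two ranges
with different answers (or no answer at all for the failed one), which a single per-column-family
number cannot express.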

The other problem is: I'm not convinced that recording the information only on the node coordinating
the repair is particularly helpful. When you start a repair on a node, you will also repair
its neighbor (only for the range they share), so recording the time only on the node
that the nodetool command was connected to is arbitrary, and will convey the idea that repair
should be started for every range on every node (while I strongly think that the short-term
goal should be to make it easy to NOT do that -- CASSANDRA-2610 again).

Imho, we should hold back on this issue for now and at least wait for CASSANDRA-2610, CASSANDRA-2606
and CASSANDRA-2816 before committing to anything. I agree that having information to help
people plan repair is nice, but it is at most a very minor improvement, and exposing a misleading
number is more harmful than exposing no number.


> should expose 'time since last successful repair' for easier aes monitoring
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>             Fix For: 0.8.2
>
>         Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405-v3.patch, CASSANDRA-2405-v4.patch, CASSANDRA-2405.patch
>
>
> The practical implementation issues of actually ensuring repair runs is somewhat of an
> undocumented/untreated issue.
> One hopefully low hanging fruit would be to at least expose the time since last successful
> repair for a particular column family, to make it easier to write a correct script to monitor
> for lack of repair in a non-buggy fashion.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
