cassandra-commits mailing list archives

From "Yuki Morishita (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5426) Redesign repair messages
Date Wed, 03 Apr 2013 23:43:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621523#comment-13621523 ]

Yuki Morishita commented on CASSANDRA-5426:
-------------------------------------------

Work in progress is pushed to: https://github.com/yukim/cassandra/commits/5426-1

Only the normal case (where everything works) is implemented so far.

--

First of all, ActiveRepairService has been broken down into several classes, which are placed in o.a.c.repair
to make my work easier.

The main design change around messages is that every repair-related message is packed into a
RepairMessage and handled in RepairMessageVerbHandler, which is executed in the ANTI_ENTROPY stage.
A RepairMessage carries a RepairMessageHeader and its content (if any). The RepairMessageHeader
indicates which repair job the message belongs to and specifies the content
type. There are currently 6 content types defined in RepairMessageType: VALIDATION_REQUEST,
VALIDATION_COMPLETE, VALIDATION_FAILED, SYNC_REQUEST, SYNC_COMPLETE, and SYNC_FAILED.
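
To make the envelope structure concrete, here is a minimal sketch of how a header and message could fit together. The class and enum names come from the description above, but the fields (jobId, content), constructors, and the hasContent helper are assumptions for illustration and are not taken from the 5426-1 branch.

```java
import java.util.UUID;

// The six content types named in the description above.
enum RepairMessageType {
    VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED,
    SYNC_REQUEST, SYNC_COMPLETE, SYNC_FAILED
}

// Hypothetical header: which repair job the message belongs to, and what it carries.
final class RepairMessageHeader {
    final UUID jobId;              // assumed job identifier, not from the branch
    final RepairMessageType type;

    RepairMessageHeader(UUID jobId, RepairMessageType type) {
        this.jobId = jobId;
        this.type = type;
    }
}

// Hypothetical envelope: a header plus optional content (null when there is none).
final class RepairMessage {
    final RepairMessageHeader header;
    final Object content;          // e.g. a Merkle tree for VALIDATION_COMPLETE

    RepairMessage(RepairMessageHeader header, Object content) {
        this.header = header;
        this.content = content;
    }

    boolean hasContent() {
        return content != null;
    }
}

class RepairMessageSketch {
    public static void main(String[] args) {
        UUID job = UUID.randomUUID();
        // A validation request carries no content in this sketch, only the header.
        RepairMessage request = new RepairMessage(
                new RepairMessageHeader(job, RepairMessageType.VALIDATION_REQUEST), null);
        System.out.println(request.header.type + " hasContent=" + request.hasContent());
    }
}
```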

*VALIDATION_REQUEST*

VALIDATION_REQUEST is sent from the repair initiator (coordinator) to request a Merkle tree.

*VALIDATION_COMPLETE*/*VALIDATION_FAILED*

The calculated Merkle tree is sent back in a VALIDATION_COMPLETE message. A VALIDATION_FAILED message
is sent when something goes wrong on the remote node.

*SYNC_REQUEST*

SYNC_REQUEST is sent when we have to repair two remote nodes. This corresponds to the forwarded StreamingRepairTask
we have today.

*SYNC_COMPLETE*/*SYNC_FAILED*

When there is no need to exchange data, or when data had to be exchanged and streaming has completed, the node (this
includes the node that received SYNC_REQUEST) sends back SYNC_COMPLETE. If streaming
fails, it sends back SYNC_FAILED.
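
The request/reply pairing described above can be sketched roughly as follows. This is only an illustration: RepairFlowSketch, its Type enum, and replyFor are hypothetical names, not code from the branch. The point is that every request type has both a success reply and a failure reply, so the initiator always hears back as long as the reply is delivered.

```java
class RepairFlowSketch {
    // Local copy of the six message types, so the sketch is self-contained.
    enum Type {
        VALIDATION_REQUEST, VALIDATION_COMPLETE, VALIDATION_FAILED,
        SYNC_REQUEST, SYNC_COMPLETE, SYNC_FAILED
    }

    // Hypothetical helper: the reply a node sends for a request, given whether
    // its local work (validation or streaming) succeeded. "Succeeded" also
    // covers the case where no data needed to be exchanged.
    static Type replyFor(Type request, boolean succeeded) {
        switch (request) {
            case VALIDATION_REQUEST:
                return succeeded ? Type.VALIDATION_COMPLETE : Type.VALIDATION_FAILED;
            case SYNC_REQUEST:
                return succeeded ? Type.SYNC_COMPLETE : Type.SYNC_FAILED;
            default:
                throw new IllegalArgumentException(request + " is a reply, not a request");
        }
    }

    public static void main(String[] args) {
        System.out.println(replyFor(Type.VALIDATION_REQUEST, true));
        System.out.println(replyFor(Type.SYNC_REQUEST, false));
    }
}
```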

The whole repair process depends on asynchronous message exchange using MessagingService, so there
is still a chance of a hang when a node fails to deliver a message (see CASSANDRA-5393).

Any feedback is appreciated.
                
> Redesign repair messages
> ------------------------
>
>                 Key: CASSANDRA-5426
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5426
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Yuki Morishita
>            Assignee: Yuki Morishita
>            Priority: Minor
>             Fix For: 2.0
>
>
> Many people have been reporting 'repair hang' when something goes wrong.
> Two major causes of hang are 1) validation failure and 2) streaming failure.
> Currently, when those failures happen, the failed node would not respond back to the repair initiator.
> The goal of this ticket is to redesign message flows around repair so that repair never hangs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
