cassandra-commits mailing list archives

From "Anuj Wadehra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7904) Repair hangs
Date Wed, 11 Nov 2015 20:52:11 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001053#comment-15001053 ]

Anuj Wadehra edited comment on CASSANDRA-7904 at 11/11/15 8:51 PM:
-------------------------------------------------------------------

[#Aleksey Yeschenko] I am sorry, but I think that this issue must be reopened. We are facing this
issue in 2.0.14. You have marked it as a duplicate of CASSANDRA-7909, which was fixed in 2.0.11,
so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10GBps connectivity. On only one node in DC2, we are
unable to complete repair (-par -pr) as it always hangs. The node sends Merkle tree requests,
but one or more nodes in DC1 (remote) never show that they sent the Merkle tree reply to the
requesting node. Repair hangs indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to successfully run
repair on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple possible issues
there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. No exception is
printed in the logs in such a case, tpstats also doesn't display repair messages as dropped, and
repair will hang indefinitely.
2. Only an IOException leads to a retry of a request. If some RuntimeException occurs, no
retry is done and the exception is written at DEBUG instead of ERROR. Repair would hang here
too.
3. If the isTimeOut method always returns false for a non-droppable message such as a Merkle tree
request (verb=REPAIR_MESSAGE), why does increasing the request timeout solve the problem for many
people: [#Duncan Sands], [#Razi Khaja] and me? Is the logic broken?
4. Increasing the request timeout can only be a temporary workaround, not a fix. A root cause
analysis of the problem and a permanent fix are needed.
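
To make points 1 to 3 concrete, here is a minimal Java sketch of the behaviour as described above.
It is NOT the actual OutboundTcpConnection code: the class RetrySketch, the Message and Transport
types, the send method and the attempt limit are all hypothetical, and only the name isTimeOut is
borrowed from the wording of point 3. It only models the claims made in this comment.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the behaviour described in points 1-3; not Cassandra source code. */
public class RetrySketch {

    /** Hypothetical message; "droppable" decides whether the timeout check applies at all. */
    static final class Message {
        final String verb;
        final boolean droppable;
        final long createdAtMillis;

        Message(String verb, boolean droppable, long createdAtMillis) {
            this.verb = verb;
            this.droppable = droppable;
            this.createdAtMillis = createdAtMillis;
        }
    }

    /** Hypothetical stand-in for the socket write. */
    interface Transport {
        void write(Message m) throws IOException;
    }

    /** Point 3: a non-droppable message never times out, whatever the timeout value is. */
    static boolean isTimeOut(Message m, long nowMillis, long timeoutMillis) {
        if (!m.droppable) {
            return false;
        }
        return nowMillis - m.createdAtMillis > timeoutMillis;
    }

    /** Returns the events that would be visible to an operator for one send. */
    static List<String> send(Message m, Transport t, int maxAttempts) {
        List<String> events = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                t.write(m);
                events.add("SENT attempt=" + attempt);
                return events;
            } catch (IOException e) {
                // Point 2: only an IOException leads to a retry.
                events.add("RETRY attempt=" + attempt);
            } catch (RuntimeException e) {
                // Point 2: no retry, and only a DEBUG-level trace is left behind.
                events.add("DEBUG " + e.getMessage());
                return events;
            }
        }
        // Point 1: consecutive failures end here without an error or a "dropped" count,
        // so the requester keeps waiting for a reply that will never come.
        events.add("GAVE_UP");
        return events;
    }

    public static void main(String[] args) {
        Message request = new Message("REPAIR_MESSAGE", false, 0L);
        System.out.println(send(request, m -> { throw new IOException("reset"); }, 2));
        System.out.println(send(request, m -> { throw new IllegalStateException("boom"); }, 2));
        System.out.println(isTimeOut(request, 60_000L, 10_000L));
    }
}
```

In this model neither failure path produces an ERROR entry or a dropped-message count, which is
why a hung repair would leave nothing to troubleshoot with.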
 


was (Author: eanujwa):
[#Aleksey Yeschenko] I am sorry, but I think that this issue must be reopened. We are facing this
issue in 2.0.14. You have marked it as a duplicate of CASSANDRA-7909, which was fixed in 2.0.11,
so the issue should not be present in 2.0.14.

We have 2 DCs at remote locations with 10GBps connectivity. On only one node in DC2, we are
unable to complete repair (-par -pr) as it always hangs. The node sends Merkle tree requests,
but one or more nodes in DC1 (remote) never show that they sent the Merkle tree reply to the
requesting node. Repair hangs indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to successfully run
repair on one of two occasions.

I analyzed some code in OutboundTcpConnection.java of 2.0.14 and see multiple possible issues
there:
1. The scenario where 2 consecutive Merkle tree requests fail is not handled. No exception is
printed in the logs in such a case, tpstats also doesn't display repair messages as dropped, and
repair will hang indefinitely.
2. Only an IOException leads to a retry of a request. If some RuntimeException occurs, no
retry is done and the exception is written at DEBUG instead of ERROR. Repair would hang here
too.
3. If the isTimeOut method always returns false for a non-droppable message such as a Merkle tree
request (verb=REPAIR_MESSAGE), why is increasing the request timeout solving the problem for many
people? Is the logic broken?

Exception handling must be improved. It's impossible to troubleshoot such an issue in PROD, as
no relevant error is logged.

> Repair hangs
> ------------
>
>                 Key: CASSANDRA-7904
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, java version "1.7.0_45"
>            Reporter: Duncan Sands
>         Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so repair is
> run on all nodes (in a staggered fashion) on the weekend.  Nodetool options: -par -pr.  There
> is usually some overlap in the repairs: repair on one node may well still be running when
> repair is started on the next node.  Repair hangs for some of the nodes almost every weekend.
> It hung last weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last restarted.  This
> node is 192.168.60.136 and the exception is harmless: a client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in AntiEntropySessions.  These
>   nodes all have Active => 1 and Pending => 1.  The nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
>     pending tasks: 0
>     Active compaction remaining time :        n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the nodes listed
>   in the tpstats section above) has (note the Responses Pending value of 1):
>     Mode: NORMAL
>     Not sending any streams.
>     Read Repair Statistics:
>     Attempted: 4233
>     Mismatch (Blocking): 0
>     Mismatch (Background): 243
>     Pool Name                    Active   Pending      Completed
>     Commands                        n/a         0       34785445
>     Responses                       n/a         1       38567167
> Repair sessions
>   I looked for repair sessions that failed to complete.  On 3 of the 4 nodes mentioned
>   in tpstats above I found that they had sent merkle tree requests and got responses from all
>   but one node.  In the log file for the node that failed to respond there is no sign that it
>   ever received the request.  On 1 node (172.18.68.138) it looks like responses were received
>   from every node, some streaming was done, and then... nothing.  Details:
>   Node 192.168.21.13 (data centre R):
>     Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, /172.18.68.139,
>     /172.18.68.138, /172.18.33.22, /192.168.21.13 for table brokers, never got a response from
>     /172.18.68.139.  On /172.18.68.139, just before this time it sent a response for the same
>     repair session but a different table, and there is no record of it receiving a request for
>     table brokers.
>   Node 192.168.60.134 (data centre A):
>     Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, /192.168.21.14,
>     /192.168.60.134 for table swxess_outbound, never got a response from /172.18.68.138.  On
>     /172.18.68.138, just before this time it sent a response for the same repair session but a
>     different table, and there is no record of it receiving a request for table swxess_outbound.
>   Node 192.168.60.136 (data centre A):
>     Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for table rollups7200,
>     never got a response from /172.18.68.139.  This repair session is never mentioned in the
>     /172.18.68.139 log.
>   Node 172.18.68.138 (data centre Z):
>     The issue here seems to be repair session #a55c16e1-35eb-11e4-8e7e-51c077eaf311.
>     It got responses for all its merkle tree requests, did some streaming, but seems to have
>     stopped after finishing with one table (rollups60).  I found it as follows: it is the only
>     repair for which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
