cassandra-user mailing list archives

From Bryan Cheng <br...@blockcypher.com>
Subject Re: Repair Hangs while requesting Merkle Trees
Date Tue, 17 Nov 2015 00:23:48 GMT
Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely
want that set. Even the best network connections will break occasionally,
and in Cassandra < 2.1.10 (I believe) this would leave those connections
hanging indefinitely on one end.
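
To illustrate the general idea (this is not Cassandra's streaming code), here
is a minimal Python sketch of why a read timeout matters: a peer that accepts
a connection and then goes silent would block a plain recv() forever, while a
socket with a timeout gives up and lets the caller clean up. The helper names
read_with_timeout and start_silent_peer are made up for this example.

```python
import socket
import threading


def read_with_timeout(host, port, timeout_s):
    """Attempt one read; return 'data', 'closed', or 'timed out'."""
    # The timeout passed here also applies to later recv() calls.
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        try:
            chunk = conn.recv(1024)
        except socket.timeout:
            # Without a timeout, this recv() would block indefinitely.
            return "timed out"
        return "data" if chunk else "closed"


def start_silent_peer():
    """Listen on localhost, accept one connection, and never send anything."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    accepted = []  # hold accepted sockets open so the peer stays silent

    def accept_one():
        conn, _ = server.accept()
        accepted.append(conn)

    threading.Thread(target=accept_one, daemon=True).start()
    return server, accepted


if __name__ == "__main__":
    server, _held = start_silent_peer()
    port = server.getsockname()[1]
    print(read_with_timeout("127.0.0.1", port, timeout_s=0.5))
    server.close()
```

The timeout value is the trade-off: too short and slow but healthy transfers
get cut off, too long (or unset) and a dead connection is never noticed.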

How far apart are your two DCs from a network perspective, out of
curiosity? You'll almost certainly want different TCP stack tuning for
cross-DC traffic, notably buffer sizes and window params, plus
Cassandra-specific settings like otc_coalescing_strategy and
inter_dc_tcp_nodelay.
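
As a rough guide to why buffer sizes matter across DCs, here is a small
Python sketch of the bandwidth-delay product: the amount of data that must be
in flight to keep a link full. The 50 ms round-trip time is a made-up figure
(the thread does not state the RTT), and the link speed assumes the "10GBps"
mentioned below means 10 Gbit/s. Function names are hypothetical.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes that must be in flight to fill the link for one round trip."""
    # bits/s * (ms / 1000) / 8 bits per byte, kept in integer arithmetic
    return bandwidth_bits_per_s * rtt_ms // 8000


def max_throughput_bits_per_s(window_bytes, rtt_ms):
    """Upper bound on one TCP connection's throughput for a given window."""
    # At most one window of data can be delivered per round trip.
    return window_bytes * 8 * 1000 // rtt_ms


if __name__ == "__main__":
    link = 10_000_000_000  # assumed 10 Gbit/s link
    rtt = 50               # assumed 50 ms round trip (hypothetical)
    print(bandwidth_delay_product_bytes(link, rtt))   # 62500000 bytes
    print(max_throughput_bits_per_s(64 * 1024, rtt))  # 10485760 bits/s
```

With those assumed numbers, a single connection limited to a 64 KiB window
tops out at roughly 10 Mbit/s regardless of link speed, which is why the
window and buffer settings deserve a look for cross-DC links.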

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2003@yahoo.co.in>
wrote:

> One more observation: there are a few TCP connections which one node
> shows as ESTABLISHED, but when we check the node at the other end, the
> connection is not there. They are called "phantom" connections, I guess.
> Can this be a possible cause?
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> <https://overview.mail.yahoo.com/mobile/?.src=Android>
> ------------------------------
> *From*:"Anuj Wadehra" <anujw_2003@yahoo.co.in>
> *Date*:Sat, 14 Nov, 2015 at 11:59 pm
>
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
>
> Thanks Daemeon !!
>
> I will capture the netstat output and share it in the next few days. We
> were thinking of taking tcp dumps also. If it's a network issue and
> increasing the request timeout worked, I am not sure how Cassandra is
> dropping messages based on timeout. Repair messages are non-droppable and
> not supposed to be timed out.
>
> 2 of the 3 nodes in the DC are able to complete repair without any issue.
> Just one node is problematic.
>
> I also observed frequent messages in the logs of other nodes which say
> that hints replay timed out, and the node where hints were being replayed
> is always a remote DC node. Is it related somehow?
>
> Thanks
> Anuj
>
> ------------------------------
> *From*:"daemeon reiydelle" <daemeonr@gmail.com>
> *Date*:Thu, 12 Nov, 2015 at 10:34 am
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
>
>
> Have you checked the network statistics on that machine (netstat -tas)
> while attempting to repair? If netstat shows ANY issues, you have a
> problem. Could you put the command in a loop running every 60 seconds for
> maybe 15 minutes and post back?
>
> Out of curiosity, how many remote DC nodes are getting successfully
> repaired?
>
> *.......*
>
> *“Life should not be a journey to the grave with the intention of arriving
> safely in a pretty and well preserved body, but rather to skid in broadside
> in a cloud of smoke, thoroughly used up, totally worn out, and loudly
> proclaiming ‘Wow! What a Ride!’” - Hunter Thompson*
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2003@yahoo.co.in>
> wrote:
>
>> Hi,
>>
>> We are using 2.0.14. We have 2 DCs at remote locations with 10GBps
>> connectivity. We are able to complete repair (-par -pr) on 5 nodes. On
>> only one node in DC2, we are unable to complete repair, as it always
>> hangs. The node sends Merkle tree requests, but one or more nodes in DC1
>> (remote) never show that they sent the Merkle tree reply to the
>> requesting node. Repair hangs indefinitely.
>>
>> After increasing request_timeout_in_ms on the affected node, we were able
>> to run repair successfully on one of two occasions.
>>
>> Any comments on why this is happening on just one node? In
>> OutboundTcpConnection.java, the isTimeOut method always returns false for
>> a non-droppable verb such as a Merkle tree request (verb=REPAIR_MESSAGE),
>> so why did increasing the request timeout solve the problem on one
>> occasion?
>>
>>
>> Thanks
>> Anuj Wadehra
>
