cassandra-user mailing list archives

From Anuj Wadehra <anujw_2...@yahoo.co.in>
Subject Re: Repair Hangs while requesting Merkle Trees
Date Sun, 29 Nov 2015 17:43:15 GMT
Hi All,

I am summarizing the setup, the problem and the key observations so far:

Setup: Cassandra 2.0.14. 2 DCs with 3 nodes each, connected via a 10 Gbps VPN. We run repair with
the -par and -pr options.
Problem: Repair hangs. Merkle tree responses are not received from one or more nodes in the remote
DC.

Observations so far:
1. Repair hangs intermittently on one node of DC2. Only on one occasion did repair hang on
one other node in DC2 too.
2. In most cases, the node from which the Merkle tree was not received does NOT have any "Sending
completed merkle tree .." message in its logs.
3. Hinted handoffs often get triggered across DCs, and the hint replays time out.
4. Often, when repair is run after a long gap, it FAILS initially. But if we restart Cassandra
and re-run repair, it SUCCEEDS.

Logs: DEBUG logs attached.

Observations from the logs:
1. When we started repair on 10.X.15.115, we got the error message "error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC: 10.X.14.113 and 10.X.14.111.
Merkle trees were received from these 2 nodes.

2. The Merkle tree response was not received from the 3rd node in the remote DC, 10.X.14.115 (for which
no error occurred).

3. Hinted handoff started for the 3rd node (10.X.14.115), but the hint replay timed out.

If it's a network issue, then why does it occur only in DC2, and mostly on one node?

Thanks
Anuj


    On Sunday, 29 November 2015 10:44 PM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:
 

 Yes, I think you are correct: the problem might have been resolved by the Cassandra restart rather than by
increasing the request timeout.

We are NOT on EC2. We have 2 interfaces on each node: one private and one public.
We have a strange configuration, and we need to correct it as per the recommendation at https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html.

AS-IS config:
We use broadcast_address = listen_address = PUBLIC IP address.
In seeds, we put the PUBLIC IP of the other nodes but the private IP for the local node. There were some
issues when we tried to access the local node via its public IP.
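
One way to spot the seed mismatch described above is to compare each node's seed list against the addresses the nodes actually broadcast. This is only a minimal sketch: the function name and every IP address below are hypothetical, and it assumes the seed list and broadcast addresses have already been copied out of each node's cassandra.yaml by hand.

```python
def find_mismatched_seeds(seeds, broadcast_addresses):
    """Return seed entries that do not match any node's broadcast address."""
    known = set(broadcast_addresses)
    return [seed for seed in seeds if seed not in known]


# Hypothetical addresses, for illustration only.
broadcast = ["203.0.113.11", "203.0.113.12", "203.0.113.13"]
# Node 1 lists itself by private IP and the other nodes by public IP.
seeds_on_node1 = ["10.0.0.11", "203.0.113.12", "203.0.113.13"]

print(find_mismatched_seeds(seeds_on_node1, broadcast))
```

An empty result on every node would suggest the seed lists are at least consistent with the broadcast addresses; it says nothing about whether those addresses are reachable.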


Thanks
Anuj
 
--------------------------------------------
On Tue, 24/11/15, Paulo Motta <pauloricardomg@gmail.com> wrote:

 Subject: Re: Repair Hangs while requesting Merkle Trees
 To: "user@cassandra.apache.org" <user@cassandra.apache.org>, "Anuj Wadehra" <anujw_2003@yahoo.co.in>
 Date: Tuesday, 24 November, 2015, 12:38 AM
 
 The issue might be related to the connections that are ESTABLISHED on just one end. I don't think
 it is related to the inter_dc_tcp_nodelay or request_timeout_in_ms options. Did you restart the process
 when you changed the request_timeout_in_ms option? That might be why the problem got fixed, rather
 than the option change.

 This seems like a network issue or a misconfiguration of this specific
 node. Are you using EC2? Is listen_address == broadcast_address? Are all nodes using the same
 configuration? What Java version are you using?

 You may want to enable TRACE on OutboundTcpConnection and IncomingTcpConnection and compare
 the output of the healthy nodes with that of the faulty node.
 
 2015-11-23 10:04 GMT-08:00 Anuj Wadehra <anujw_2003@yahoo.co.in>:

 Any comments on the connections that are ESTABLISHED at only one end?

 Moreover, inter_dc_tcp_nodelay is false. Can this be the
 reason that latency between the two DCs is higher and repair messages are getting dropped?

 Can increasing request_timeout_in_ms deal with the latency issue?

 I see some hinted handoffs being triggered for cross-DC nodes, and the hint replays timing out.
 Is that an indication of a network issue?

 I am getting in touch with the network team to capture netstat and tcpdump output too.

 Thanks
 Anuj
 
 
 
 
 
 --------------------------------------------
 On Wed, 18/11/15, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:

  Subject: Re: Repair Hangs while requesting Merkle Trees
  To: "user@cassandra.apache.org" <user@cassandra.apache.org>
  Date: Wednesday, 18 November, 2015, 7:57 AM

  Thanks Bryan!

  The connection is in the ESTABLISHED state on one end and completely missing at the
  other end (in the other DC).

  Yes, we can revisit TCP tuning. But the problem is node-specific,
  so I am not sure whether tuning is the culprit.

  Thanks
  Anuj
 
  From: "Bryan Cheng" <bryan@blockcypher.com>
  Date: Wed, 18 Nov, 2015 at 2:04 am
  Subject: Re: Repair Hangs while requesting Merkle Trees

  Ah OK, I might have misunderstood you. The streaming socket should not be in
  play during Merkle tree generation (validation compaction).
  It may come into play during Merkle tree exchange; that
  I'm not sure about. You can read a bit more here: https://issues.apache.org/jira/browse/CASSANDRA-8611.
  Regardless, you should have it set; 1 hr is usually a good conservative estimate, but you can go
  much lower safely.

  What state is the connection that only shows on one side in? Is it ESTABLISHED, or something like
  CLOSE_WAIT?

  Here's a good place to start for tuning, though it doesn't have
  as much about network tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html.

  More generally, TCP tuning usually revolves around a balance
  between latency and bandwidth. Over long connections
  (we're talking 10s of ms, instead of the sub-1 ms you
  usually see in a good DC network), your expectations will
  shift greatly. Stuff like NODELAY on TCP is very nice for
  cutting your latencies when you're inside a DC, but will
  generate lots of small packets that will hurt your bandwidth
  over longer connections due to the need to wait for acks.
  otc_coalescing_strategy is in a similar vein, bundling
  together nearby messages to trade latency for throughput.
  You'll also probably want to tune your TCP buffers and
  window sizes, since that determines how much data can be
  in flight between acknowledgements, and the default size is
  pitiful for any decent network size. Google
  around for TCP tuning/buffer tuning and you should find
  some good resources.
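
The point about buffers and window sizes can be made concrete with the bandwidth-delay product: the amount of data that has to be in flight to keep a link busy. The sketch below is only illustrative arithmetic. The 10 Gbps figure comes from this thread, but the 20 ms round-trip time and the 64 KiB window are assumed example values, not measurements from this cluster or known defaults of these hosts.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def max_throughput_bits_per_s(window_bytes, rtt_ms):
    """Upper bound on one TCP stream's throughput for a given window size."""
    return window_bytes * 8 * 1000 // rtt_ms


LINK_BITS_PER_S = 10_000_000_000  # 10 Gbps, as stated in the thread
ASSUMED_RTT_MS = 20               # assumption: not measured in the thread
ASSUMED_WINDOW_BYTES = 64 * 1024  # assumption: example window size only

bdp = bandwidth_delay_product_bytes(LINK_BITS_PER_S, ASSUMED_RTT_MS)
cap = max_throughput_bits_per_s(ASSUMED_WINDOW_BYTES, ASSUMED_RTT_MS)
print(f"window needed to fill the link: {bdp} bytes")  # 25000000 bytes
print(f"cap with a 64 KiB window: {cap} bits/s")       # 26214400 bits/s
```

Under these assumptions a single stream with a 64 KiB window tops out near 26 Mbps regardless of the 10 Gbps link, which is the kind of gap that buffer tuning is meant to close.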
 
  On Mon, Nov 16, 2015 at 5:23 PM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:

  Hi Bryan,

  Thanks for the reply! I didn't mean streaming_socket_timeout_in_ms. I meant that when you
  run netstat (the Linux command) on node A in DC1, you will
  notice a connection in the ESTABLISHED state with
  node B in DC2. But when you run netstat on node B, you won't
  find any connection with node A. We see such connections
  across DCs. Is it a problem?
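
The one-sided connections described above can be found mechanically by comparing netstat output captured on both nodes. The following is a rough sketch, assuming the output comes from `netstat -tan` on Linux with the usual six columns (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function names and all addresses are hypothetical.

```python
def established_pairs(netstat_output):
    """Parse `netstat -tan`-style text into a set of (local, remote) endpoints."""
    pairs = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "ESTABLISHED":
            pairs.add((fields[3], fields[4]))
    return pairs


def one_sided_connections(output_a, output_b, ip_b):
    """Connections node A holds to ip_b that node B does not list in reverse."""
    seen_on_b = established_pairs(output_b)
    return sorted(
        (local, remote)
        for local, remote in established_pairs(output_a)
        if remote.rsplit(":", 1)[0] == ip_b and (remote, local) not in seen_on_b
    )


# Hypothetical captures, for illustration only.
NODE_A = """\
tcp        0      0 10.1.0.1:45012     10.2.0.2:7000      ESTABLISHED
tcp        0      0 10.1.0.1:45999     10.2.0.2:7000      ESTABLISHED
"""
NODE_B = """\
tcp        0      0 10.2.0.2:7000      10.1.0.1:45012     ESTABLISHED
"""

print(one_sided_connections(NODE_A, NODE_B, "10.2.0.2"))
```

With the sample captures, only the connection from port 45999 is reported, because node B has no matching entry for it. If NAT or separate public and private addresses are involved, the endpoints will not mirror each other exactly and the comparison would need an address mapping first.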
 
  We haven't set streaming_socket_timeout_in_ms, which I know must be set. But
  I am not sure whether setting this property has any impact
  on Merkle tree requests. I thought it applies only to data
  streaming, when a mismatch is found and data needs to be streamed. Please confirm. What
  value do you use for the streaming socket timeout?

  Moreover, if the socket timeout were the issue, it should happen on other
  nodes too. Repair fails to run on just one node, as the
  Merkle tree request gets lost and is not transmitted to one
  or more nodes in the remote DC.

  I am not sure about the exact distance,
  but they are connected by a very high-speed 10 Gbps link.

  When you say different TCP stack tuning, do you have any document/blog/link
  describing recommendations for a multi-DC Cassandra setup?
  Can you elaborate on which settings need to be different?

  Thanks
  Anuj

  From: "Bryan Cheng" <bryan@blockcypher.com>
  Date: Tue, 17 Nov, 2015 at 5:54 am
  Subject: Re: Repair Hangs while requesting Merkle Trees

  Hi Anuj,

  Did you mean streaming_socket_timeout_in_ms? If not, then you definitely
  want that set. Even the best network connections will break
  occasionally, and in Cassandra < 2.1.10 (I believe) this
  would leave those connections hanging indefinitely on one end.

  How far away are your two DCs from a network perspective, out of
  curiosity? You'll almost certainly be doing different
  TCP stack tuning for cross-DC, notably your buffer sizes,
  window params, and Cassandra-specific stuff like
  otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.
 
  On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:

  One more observation. We observed
  that there are a few TCP connections which one node shows as
  ESTABLISHED, but when we go to the node at the other end, the connection
  is not there. They are called "phantom"
  connections, I guess. Can this be a possible cause?

  Thanks
  Anuj
 
 
 
  From: "Anuj Wadehra" <anujw_2003@yahoo.co.in>
  Date: Sat, 14 Nov, 2015 at 11:59 pm
  Subject: Re: Repair Hangs while requesting Merkle Trees

  Thanks Daemeon!

  I will capture the netstat output and share it in the next few days. We were thinking of
  taking tcpdumps also. If it's a network issue and increasing the
  request timeout worked, I am not sure how Cassandra is dropping
  messages based on a timeout. Repair messages are non-droppable
  and not supposed to time out.

  2 of the 3 nodes in the DC are able
  to complete repair without any issue. Just one node is
  problematic.

  I also observed frequent messages in the logs of the other
  nodes saying that hint replay timed out, and the node
  to which hints were being replayed is always a remote-DC
  node. Is it related somehow?

  Thanks
  Anuj

  From: "daemeon reiydelle" <daemeonr@gmail.com>
  Date: Thu, 12 Nov, 2015 at 10:34 am
  Subject: Re: Repair Hangs while requesting Merkle Trees

  Have you checked the network
  statistics on that machine (netstat -tas) while attempting
  to repair? If netstat shows ANY issues,
  you have a problem. Can you put the command in a loop
  running every 60 seconds for maybe 15 minutes and post
  back?
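
The sampling loop suggested above could look roughly like the sketch below. It assumes `netstat` is installed and on the PATH of the affected node. The function name and its parameters are made up for illustration, and the `run` and `sleep` hooks exist only so the loop can be exercised without actually waiting.

```python
import subprocess
import time


def capture_netstat(samples=15, interval_s=60, run=None, sleep=time.sleep):
    """Collect `netstat -tas` output `samples` times, `interval_s` seconds apart."""
    if run is None:
        def run():
            # Runs the command named in the thread; stderr is captured and ignored.
            result = subprocess.run(
                ["netstat", "-tas"], capture_output=True, text=True, check=False
            )
            return result.stdout
    snapshots = []
    for i in range(samples):
        snapshots.append((time.strftime("%Y-%m-%d %H:%M:%S"), run()))
        if i < samples - 1:
            sleep(interval_s)  # no wait after the final sample
    return snapshots


# Usage on the node while repair is running (takes about 14 minutes):
#     for stamp, output in capture_netstat():
#         print(f"===== {stamp} =====")
#         print(output)
```

Comparing counters such as retransmits and resets between consecutive snapshots shows whether they grow while repair is hung.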
 
 
 
  Out of curiosity, how many remote DC nodes are getting successfully repaired?

  .......
  “Life should not be a journey to the grave with the intention of arriving safely in a
  pretty and well preserved body, but rather to skid in broadside in a cloud of smoke,
  thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!”
  - Hunter Thompson

  Daemeon C.M. Reiydelle
  USA (+1) 415.501.0198
  London (+44) (0) 20 8144 9872
 
 
 
 
 
  On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2003@yahoo.co.in> wrote:

  Hi,

  We are using 2.0.14. We have 2 DCs at remote locations with 10 Gbps connectivity. We
  are able to complete repair (-par -pr) on 5 nodes. On only one node in
  DC2, we are unable to complete repair, as it always hangs. The node sends
  Merkle tree requests, but one or more nodes in DC1 (remote) never show
  that they sent the Merkle tree reply to the requesting node.
  Repair hangs indefinitely.

  After increasing request_timeout_in_ms on the
  affected node, we were able to successfully run repair on
  one of two occasions.

  Any comments on why this is happening on just one node? In
  OutboundTcpConnection.java, the isTimeOut method always
  returns false for a non-droppable verb such as the Merkle tree
  request (verb=REPAIR_MESSAGE), so why did increasing the request timeout
  solve the problem on one occasion?

  Thanks
  Anuj Wadehra
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

  