Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A3485D778 for ; Mon, 1 Oct 2012 17:47:08 +0000 (UTC) Received: (qmail 26276 invoked by uid 500); 1 Oct 2012 17:47:08 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 26231 invoked by uid 500); 1 Oct 2012 17:47:08 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 26128 invoked by uid 99); 1 Oct 2012 17:47:08 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 01 Oct 2012 17:47:08 +0000 Date: Tue, 2 Oct 2012 04:47:08 +1100 (NCT) From: "Mina Naguib (JIRA)" To: commits@cassandra.apache.org Message-ID: <109754993.148957.1349113628270.JavaMail.jiratomcat@arcas> Subject: [jira] [Created] (CASSANDRA-4740) Phantom TCP connections, failing hinted handoff MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Mina Naguib created CASSANDRA-4740: -------------------------------------- Summary: Phantom TCP connections, failing hinted handoff Key: CASSANDRA-4740 URL: https://issues.apache.org/jira/browse/CASSANDRA-4740 Project: Cassandra Issue Type: Bug Affects Versions: 1.1.2 Environment: Linux 3.4.9, java 1.6.0_35-b10 Reporter: Mina Naguib Priority: Minor IP addresses in report anonymized: Had a server running cassandra (1.1.1.10) reboot ungracefully. Reboot and startup was successful and uneventful. cassandra went back into service ok. >From that point onwards however, several (but not all) machines in the cassandra cluster started having difficulty with hinted handoff to that machine. This was despite nodetool ring showing Up across the board. Here's an example of an attempt, every 10 minutes, by a node (1.1.1.11) to replay hints to the node that was rebooted: {code} INFO [HintedHandoff:1] 2012-10-01 11:07:23,293 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:17:23,312 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:27:23,335 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:37:23,357 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:37:33,358 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries INFO [HintedHandoff:1] 2012-10-01 11:37:33,359 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:47:23,412 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10 INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10 {code} I started poking around, and discovered that several nodes held ESTABLISHED TCP connections that didn't have a live endpoint on the rebooted node. My guess is they were live prior to the reboot, and after the reboot the nodes still see them as live and unsuccessfully try to use them. Example, on the node that was rebooted: {code} .10 ~ # netstat -tn | grep 1.1.1.11 tcp 0 0 1.1.1.10:7000 1.1.1.11:40960 ESTABLISHED tcp 0 0 1.1.1.10:34370 1.1.1.11:7000 ESTABLISHED tcp 0 0 1.1.1.10:45518 1.1.1.11:7000 ESTABLISHED {code} While on the node that's failing to hint to it: {code} .11 ~ # netstat -tn | grep 1.1.1.10 tcp 0 0 1.1.1.11:7000 1.1.1.10:34370 ESTABLISHED tcp 0 0 1.1.1.11:7000 1.1.1.10:45518 ESTABLISHED tcp 0 0 1.1.1.11:7000 1.1.1.10:53316 ESTABLISHED tcp 0 0 1.1.1.11:7000 1.1.1.10:43239 ESTABLISHED tcp 0 0 1.1.1.11:40960 1.1.1.10:7000 ESTABLISHED {code} Notice the phantom connections on :53316 and :43239 which do not appear on the remote 1.1.1.10 On .11 I tried disabling and enabling gossip, but that did not restart :7000 nor clean up the 2 phantom connections. For good measure I also tried disabling and enabling thrift (long shot), and that didn't help either. The only thing that helped was to actually stop and start cassandra, in a rolling fashion, on each node that was having trouble hinting to the machine that was rebooted. The phantom connections naturally disappeared, write volume on 1.1.1.10 rose for a while, and all the hints were sent successfully. I'm unsure whether the phantom TCP connections are a cause, or just loosely correlated, to the hinted handoff failure every 10 minutes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira