Date: Fri, 28 Feb 2014 02:45:20 +0000 (UTC)
From: "Ananthkumar K S (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-6772) Cassandra inter data center communication broken

    [ https://issues.apache.org/jira/browse/CASSANDRA-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915386#comment-13915386 ]

Ananthkumar K S commented on CASSANDRA-6772:
--------------------------------------------

[~brandon.williams] Agreed. If that's so, can someone let me know whether it's a bug? Moreover, a firewall problem would not let TCP connections be established between the two nodes at all. But here, as I mentioned, Cassandra was retrying at the network layer, and this was visible in netstat on both servers. We cannot replicate such a scenario, as we have 60 other applications running on the same private link.
So, as a use case, it should be a normal scenario for Cassandra to detect and re-establish the connection once the link comes back up. When I reported a similar kind of problem earlier, an infinite loop was introduced to nullify these kinds of race conditions. It doesn't solve the problem, though; it only creates more load on TCP. Can you please review that part for such a scenario?

> Cassandra inter data center communication broken
> ------------------------------------------------
>
>                 Key: CASSANDRA-6772
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6772
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: CentOS 6.0
>            Reporter: Ananthkumar K S
>            Priority: Blocker
>
> I have two data centers, DC1 and DC2. Both communicate via a private link. Yesterday, we had a problem with the private link for 10 minutes. Since the problem was resolved, nodes in the two data centers have not been able to communicate with each other. When I run nodetool status on a node in DC1, the nodes in DC2 are reported as down. When I run it in DC2, the nodes in DC1 are shown as down.
> But in the Cassandra logs, we can clearly see that handshaking is failing every 5 seconds for communication between the data centers. At the TCP level, Cassandra generates too many connections in FIN_WAIT1, which is still a puzzle. The number of CLOSE_WAIT transitions due to this is very high. Because of this kind of TCP listen-drop problem, we moved from 2.0.1 to 2.0.3. In 2.0.1, it occurred within a single data center; here it occurs between data centers. If it has anything to do with the snitch configuration, I am using GossipingPropertyFileSnitch.
> This clearly started happening after the private link failure. Any idea on this?
> Cassandra version used is 2.0.3



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)