From: "Pavel Yaskevich (Commented) (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Mon, 27 Feb 2012 19:06:46 +0000 (UTC)
Message-ID: <2120981065.24030.1330369606780.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1744205682.1229.1317557194162.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CASSANDRA-3294) a node whose TCP connection is not up should be considered down for the purpose of reads and writes

    [ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217398#comment-13217398 ]

Pavel Yaskevich commented on CASSANDRA-3294:
--------------------------------------------

After reading CASSANDRA-3722, it seems we can implement the required logic at the snitch level, taking latency measurements into account. I think we can close this one in favor of CASSANDRA-3722 and continue the work/discussion there. What do you think, Brandon, Peter?

> a node whose TCP connection is not up should be considered down for the purpose of reads and writes
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3294
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3294
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> Cassandra fails to handle the simplest of cases intelligently - a process gets killed and the TCP connection dies. I cannot see a good reason to wait for a bunch of RPC timeouts and thousands of hung requests to realize that we shouldn't be sending messages to a node when the only possible means of communication is confirmed down. This is why one has to "disablegossip and wait for a while" to restart a node on a busy cluster (especially without CASSANDRA-2540, but that only helps under certain circumstances).
> A more generalized approach, whereby one e.g. weighs in the number of currently outstanding RPC requests to a node, would likely take care of this case as well. But until such a thing exists and works well, it seems prudent to have the very common and controlled form of "failure" be handled better.
> Are there difficulties I'm not seeing?
> I can see that one may want to distinguish between considering something "really down" (and e.g.
fail a repair because it's down) and what I'm talking about, so maybe there are different concepts (say one is "currently unreachable" rather than "down") being conflated. But in the specific case of sending reads/writes to a node we *know* we cannot talk to, it seems unnecessarily detrimental.
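
For illustration only, the two ideas in the issue description (skip nodes whose connection is known to be down, and prefer nodes with fewer outstanding RPC requests) might be sketched roughly as follows. This is Python rather than Cassandra's Java, and Replica, connected, outstanding and pick_targets are hypothetical names invented for the sketch, not Cassandra APIs.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical per-node state; not a Cassandra class."""
    name: str
    connected: bool        # is the outbound TCP connection currently up?
    outstanding: int = 0   # RPC requests sent but not yet answered


def pick_targets(replicas, needed):
    """Return up to `needed` replicas to send a read or write to.

    Replicas without a live connection are skipped outright; the rest are
    ordered by fewest outstanding requests first.
    """
    reachable = [r for r in replicas if r.connected]
    reachable.sort(key=lambda r: r.outstanding)
    return reachable[:needed]


replicas = [
    Replica("a", connected=True, outstanding=12),
    Replica("b", connected=False),   # e.g. process killed, socket closed
    Replica("c", connected=True, outstanding=3),
]
targets = [r.name for r in pick_targets(replicas, 2)]
print(targets)
```

Note that the sketch treats "currently unreachable" purely as a routing hint for reads and writes; it says nothing about whether the node should count as "down" for other purposes such as repair, which is the distinction raised above.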